I wish there was a web service that used this tool to scrape nicely-formatted plain text from any website, then archive it and serve it as a super basic web reader.
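For what it's worth, the scrape, archive, and serve pipeline can be sketched with nothing but the Python standard library. This is only a rough illustration and not the tool discussed here: `TextExtractor`, `extract_text`, and `Archive` are hypothetical names, the extraction is a naive tag-stripper rather than a real readability algorithm, and the archive is an in-memory dict standing in for proper storage.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Naive visible-text collector (hypothetical helper, not a real readability algorithm)."""

    SKIP = {"script", "style", "noscript", "template"}
    BLOCK = {"p", "div", "br", "li", "tr", "section", "article",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a tag whose text we ignore
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


class Archive:
    """In-memory stand-in for the archive; a real service would persist to disk or a database."""

    def __init__(self):
        self._pages = {}

    def save(self, url: str, html: str) -> str:
        # Short, stable key derived from the URL.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        self._pages[key] = extract_text(html)
        return key

    def read(self, key: str):
        return self._pages.get(key)
```

The "super basic web reader" part would then be a tiny HTTP handler (for example one built on `http.server`) that returns `archive.read(key)` with a `text/plain` content type. Fetching the HTML is left out on purpose, since that is where paywalls and JS-rendered pages make things complicated.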
You sort of, kind of, maybe just asked for roughly what RSS (Really Simple Syndication) provides...although your wish is more of a "pull", while RSS is more of a "push" in content access/distribution. :-) Don't get me wrong, I'm in agreement with you. I wish every website, web app, well, pretty much everything digital had an automated RSS feed available to consume and subscribe to!
With RSS you are at the mercy of the server, though. The content creator may only syndicate an excerpt of the whole article, remove pictures or formatting, yada yada. But yes the Web would be so much nicer if more websites provided at least some form of content syndication...
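To make the excerpt problem concrete, here is a small sketch that checks whether each item in an RSS 2.0 feed carries the full article or only a teaser. It assumes the common convention that full text, when offered, sits in a `content:encoded` element while `description` holds the summary; not every feed follows that. `summarize_feed` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Made-up sample feed: one item with full text, one with only an excerpt.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Full post</title>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>Whole article.</p>]]></content:encoded>
    </item>
    <item>
      <title>Teaser only</title>
      <description>Just a teaser...</description>
    </item>
  </channel>
</rss>"""


def summarize_feed(rss_xml: str):
    """Return, per item, the title, whether full text is present, and the best available body."""
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        full = item.findtext(CONTENT_NS + "encoded")
        has_full = full is not None and bool(full.strip())
        results.append({
            "title": item.findtext("title", default=""),
            "has_full_text": has_full,
            "body": full if has_full else description,
        })
    return results


for entry in summarize_feed(SAMPLE):
    print(entry["title"], "->", "full text" if entry["has_full_text"] else "excerpt only")
```

For the "excerpt only" items, a reader would still have to fetch and parse the page itself, which is exactly the mercy-of-the-server situation described above.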
Agreed, one would definitely be at the mercy of the author/content creator...but I often feel like someone who is willing to offer an RSS feed would likely also make their content easier to consume, even if one needs to actually visit the website. Of course, that is a very broad generalization I'm making. But you're certainly not wrong.
ArchiveBox (https://archivebox.io/) will create a local dump of any site in a multitude of formats, including raw HTML, printed PDF, and extracted body text. It also has an option to ask the Internet Archive to trigger a scrape of the page.
Not sure how it fares nowadays, but I used to employ Mercury Reader/API for this, now called Postlight Reader[1]. While not perfect, I found it to work for most daily reading needs.
Concerning tooling I'd say you have two different worlds, JavaScript and Python, each with a series of tools to tackle such tasks. It's not easy to compare them directly because of varying software environments and I haven't had a chance to test JS tools thoroughly.
For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.
In general I'd be curious to try this, but your homepage is not very convincing.
The "demo" doesn't look like typing, it's a fade right, and it's painfully slow. And then, there's no library, it's just 'import requests', so even the demo is extra long. (Why not show curl then?)
Also, are there any benchmarks? Why should I take the time to evaluate this myself against existing open-source tools? It seems like it should be your responsibility, not mine, to spend the time doing a detailed comparison and evaluation, in a way that feels open and trustworthy.
I respect what you are doing and share this feedback from the heart.
There have been a few such things over the years. I even built one for iOS/iPad that’s still in the store. I found that doing the parsing client side is preferable because so many sites have paywalls and render some of their content with JS. I never did much with the app because it’s hard to monetize, but I maintain it occasionally.