I wish there was a web service that used this tool to scrape nicely-formatted pl...

mxuribe · on Aug 14, 2023

You sort of, kind of, maybe just asked for roughly what RSS (Really Simple Syndication) provides...although your wish is more of a "pull", while RSS is more of a "push" in content access/distribution. :-) Don't get me wrong, I'm in agreement with you. I wish every website, web app, well, pretty much everything digital had an automated RSS feed available to consume and subscribe to!

lou1306 · on Aug 15, 2023

With RSS you are at the mercy of the server, though. The content creator may only syndicate an excerpt of the whole article, remove pictures or formatting, yada yada. But yes the Web would be so much nicer if more websites provided at least some form of content syndication...
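To make the excerpt-versus-full-article point concrete, here is a minimal sketch, assuming an RSS 2.0 feed, of how a reader could guess which items carry full text. It uses the presence of a non-empty `content:encoded` element as the signal, which is only a heuristic: a publisher can put the whole article in `description`, or put a teaser in `content:encoded`. The function name `classify_items` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def classify_items(feed_xml):
    """Return (title, "full" | "excerpt") for each <item> in an RSS 2.0 feed.

    Heuristic only: an item counts as "full" when it has a non-empty
    <content:encoded> element, and as "excerpt" otherwise.
    """
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        full_body = item.findtext(CONTENT_NS + "encoded")
        has_full = bool(full_body and full_body.strip())
        results.append((title, "full" if has_full else "excerpt"))
    return results
```

Running it over a feed you already subscribe to gives a rough idea of how much the server is actually willing to syndicate.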

mxuribe · on Aug 15, 2023

Agreed, one would definitely be at the mercy of the author/content creator...but I often feel like someone who is willing to offer an RSS feed would probably enable easier consumption of their content, even if one needs to actually visit the website. Of course, that is a very broad generalization I'm making. But you're certainly not wrong.

0cf8612b2e1e · on Aug 14, 2023

ArchiveBox (https://archivebox.io/) will create a local dump of any site in a multitude of formats, including raw HTML, printed PDF, and extracted body text. It also has an option to ask the Internet Archive to trigger a scrape of the page.

dleeftink · on Aug 14, 2023

Not sure how it fares nowadays, but I used to employ Mercury Reader/API for this, now called Postlight Reader[1]. While not perfect, I found it to work for most daily reading needs.

[1]: https://reader.postlight.com/

adbarba · on Aug 15, 2023

Concerning tooling I'd say you have two different worlds, JavaScript and Python, each with a series of tools to tackle such tasks. It's not easy to compare them directly because of varying software environments and I haven't had a chance to test JS tools thoroughly.

For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.

[1]: https://github.com/mozilla/readability

selcuka · on Aug 15, 2023

Sounds trivial to implement using this library with a bit of glue code for the web bits:

https://archiw.fly.dev/
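For anyone curious what "a bit of glue code" might look like, here is a rough sketch using only the Python standard library. It is not the code behind the link above. The names `extract_main_text`, `ExtractHandler`, and `serve`, as well as the `?url=` query parameter, are all made up for illustration, and the extractor is a deliberately naive stand-in (it just keeps text inside `<p>` tags) that you would swap for a real extraction library.

```python
import http.client
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen


class _ParagraphCollector(HTMLParser):
    """Naive stand-in extractor: collects text found inside <p> tags only."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_main_text(html):
    """Placeholder for a real extraction library (hypothetical helper)."""
    collector = _ParagraphCollector()
    collector.feed(html)
    collector.close()
    return "\n\n".join(p.strip() for p in collector.paragraphs if p.strip())


class ExtractHandler(BaseHTTPRequestHandler):
    """Handles GET /?url=<page address> and answers with plain text."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", [""])[0]
        if not target.startswith(("http://", "https://")):
            self.send_error(400, "pass an http(s) address as ?url=")
            return
        try:
            with urlopen(target, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError, http.client.HTTPException):
            self.send_error(502, "could not fetch page")
            return
        body = extract_main_text(html).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ExtractHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/?url=https://example.com` should return whatever the stand-in extractor finds. Note that a server-side fetch like this only sees the raw HTML, so it will miss content rendered by JavaScript or hidden behind a paywall.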

nghota · on Aug 14, 2023

Check out the nghota.com API. It is able to pull out the main text from most non-ecommerce web pages and return it to you as JSON.

bravura · on Aug 14, 2023

In general I'd be curious to try this, but your homepage is not very convincing.

The "demo" doesn't look like typing, it's a fade right, and it's painfully slow. And then, there's no library, it's just 'import requests', so even the demo is extra long. (Why not show curl then?)

Also, are there any benchmarks? Why should I take the time to evaluate this myself against existing open-source tools? It seems like it should be your responsibility, not mine, to spend the time doing a detailed comparison and evaluation, in a way that feels open and trustworthy.

I respect what you are doing and share this feedback from the heart.

jamil7 · on Aug 15, 2023

There’s been a few such things over the years. I even built one for iOS/iPad that’s still in the store. I found that doing the parsing client side is preferable because so many sites have paywalls and render some of their content with JS. I never did much with the app because it’s hard to monetize, but I maintain it occasionally.