
Maybe bs4 + newspaper3k rolled into one? But still, what's the gap?



For content extraction, it's more accurate than newspaper3k (especially for languages other than English) and it returns more information: metadata, text, and comments. It works out of the box in most cases, so there's no need to write a dedicated scraper for each website, which saves time. If you only care about 2-3 websites and are willing to write and maintain scraping scripts, then bs4/lxml/whatever is also fine.

It also features functions and a command-line interface to collect data on your own (say, finding recent news through feeds). So in the end it's not merely about text extraction but also about text discovery.
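To make the "discovery" part concrete, here is a rough stdlib-only sketch of what finding recent news through a feed can look like. It is not the tool's API: `recent_links` and `SAMPLE_FEED` are made-up names for illustration, and it assumes a plain RSS 2.0 feed with `link` and `pubDate` on each item.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical inline feed so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Older</title><link>https://example.org/older</link>
<pubDate>Mon, 01 Jan 2024 08:00:00 +0000</pubDate></item>
<item><title>Newer</title><link>https://example.org/newer</link>
<pubDate>Tue, 02 Jan 2024 08:00:00 +0000</pubDate></item>
</channel></rss>"""


def recent_links(rss_xml: str, limit: int = 10) -> list[str]:
    """Return item links from an RSS 2.0 document, newest first."""
    root = ET.fromstring(rss_xml)
    dated_links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        if not link or not pub_date:
            continue  # skip items missing a link or a date
        try:
            published = parsedate_to_datetime(pub_date.strip())
        except (TypeError, ValueError):
            continue  # skip items with an unparseable date
        if published.tzinfo is None:
            # Assumption: treat dates without a usable offset as UTC
            # so all entries can be compared with each other.
            published = published.replace(tzinfo=timezone.utc)
        dated_links.append((published, link.strip()))
    dated_links.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in dated_links[:limit]]


if __name__ == "__main__":
    for url in recent_links(SAMPLE_FEED):
        print(url)
```

The returned URLs would then be handed to whatever extractor you use. A real feed needs more care than this (Atom, missing dates, relative links), which is part of the maintenance burden mentioned above.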



