What's the point of crawling the Common Crawl archive? It's pointless; you can simply download it.
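
To make that concrete, here's a minimal sketch of the "simply download it" path, assuming the standard Common Crawl layout (a gzipped warc.paths.gz manifest per crawl on data.commoncrawl.org; the crawl ID below is just an example):

    import gzip
    import urllib.request

    # Example crawl ID -- check https://commoncrawl.org/get-started
    # for the current list of crawls.
    CRAWL = "CC-MAIN-2024-10"
    PATHS_URL = f"https://data.commoncrawl.org/crawl-data/{CRAWL}/warc.paths.gz"

    # Each crawl ships a gzipped manifest listing its WARC segments.
    with urllib.request.urlopen(PATHS_URL) as resp:
        paths = gzip.decompress(resp.read()).decode().splitlines()

    # Grab the first segment (roughly 1 GB compressed); a full crawl
    # is tens of thousands of these.
    urllib.request.urlretrieve(
        f"https://data.commoncrawl.org/{paths[0]}", "segment-00000.warc.gz"
    )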



I think crawling and indexing should be treated differently. Indexing is about extracting value from the data, while crawling is about gathering data.

Once a reference snapshot has been crawled, the indexing task is more easily verifiable.

The crawling task is harder to verify, because external websites can lie to the crawler. So the sensible thing is to have multiple people crawl the same site and compare their results: every crawler publishes its snapshots (which may or may not contain errors), and it's then the indexer's job to combine the snapshots from the various crawlers, filter out the errors, and de-duplicate (sketched below).
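
A rough sketch of that reconciliation step, with made-up names; a real system would compare normalized text rather than raw bytes, since pages legitimately differ between fetches (timestamps, ads, etc.):

    import hashlib
    from collections import Counter

    def content_key(body: bytes) -> str:
        # Hash each snapshot so independent crawls are cheap to compare.
        return hashlib.sha256(body).hexdigest()

    def reconcile(snapshots: dict[str, list[bytes]]) -> dict[str, bytes]:
        # snapshots: URL -> raw bodies fetched by independent crawlers.
        # Keep the version a strict majority of crawlers agree on; a
        # site lying to one crawler gets outvoted, and URLs with no
        # majority are dropped rather than indexed.
        accepted = {}
        for url, bodies in snapshots.items():
            votes = Counter(content_key(b) for b in bodies)
            key, count = votes.most_common(1)[0]
            if count * 2 > len(bodies):
                accepted[url] = next(b for b in bodies
                                     if content_key(b) == key)
        return accepted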

The crawling task is less necessary now than it was a few years ago, because there is already plenty of data available. Also, most of the valuable data is locked in walled gardens, and companies like Cloudflare make crawling difficult for the rest of the fat tail. So it's better to only have data submitted to you and to outsource the crawling.



