
> We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.

Is there a concern that volunteers could manipulate results through their crawler?

You already mentioned distributed search engines have their own set of issues. I'm wondering if a simple centralised non-profit à la Wikipedia could work better to fund crawling without these concerns. One anecdote: personally I would not install a crawler extension, not because I don't want to help, but because my internet connection is pitifully slow. I'd rather donate a small sum that would go way further in a datacenter... although I realise the broader community might be the other way around.

[edit]

Unless the crawler was clever enough to merely feed off the sites I'm already visiting and use minimal upload bandwidth. The only concern then would be privacy. Oh, the irony, but trust goes a long way.




There are already loads of browser extensions that do a lot of screen scraping of all the sites you visit, without you even realizing it.


I can imagine, but I only use one: uBlock.


Yes, that is a concern. I'd probably worry about it if and when it started happening, however.


If you wait until then, it may be too late to mitigate, unless you have a plan to remain in complete control of who contributes.


Famous last words


There could be legal issues if the crawler starts to crawl into regions that should be left alone. I don't mean things like the dark web, but, for example, if someone is a subscriber to an online magazine, it could start crawling paywalled content if third-party cookies enable this.


Google already seems to crawl paywalled content somehow; this doesn't seem to be much of a legal issue since you cannot click through - it's just annoying as a user.

This might even be intentional, allowed through robots.txt... A browser extension that passively crawls visited sites could easily fetch robots.txt as its only extra, and minimal, download.
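
To make that concrete, here's a rough sketch of what such a passive extension could look like: a TypeScript background script that only reacts to pages the user already loads and checks robots.txt before reporting anything. The webNavigation permission and Chrome extension types are assumed, and submitPage() is a made-up upstream call standing in for whatever the search engine would actually accept.

    // background.ts - sketch of a "passive" crawler extension: it only looks at
    // pages the user already visits and respects robots.txt before submitting.
    declare function submitPage(url: string): void; // hypothetical upstream call

    const robotsCache = new Map<string, string[]>(); // origin -> disallowed path prefixes

    async function disallowedPrefixes(origin: string): Promise<string[]> {
      const cached = robotsCache.get(origin);
      if (cached) return cached;
      const prefixes: string[] = [];
      try {
        const res = await fetch(`${origin}/robots.txt`);
        if (res.ok) {
          let appliesToAll = false;
          for (const line of (await res.text()).split('\n')) {
            const [field, ...rest] = line.split(':');
            const value = rest.join(':').trim();
            if (/^user-agent$/i.test(field.trim())) appliesToAll = value === '*';
            else if (appliesToAll && /^disallow$/i.test(field.trim()) && value) prefixes.push(value);
          }
        }
      } catch {
        prefixes.push('/'); // robots.txt unreachable: skip the whole site to be safe
      }
      robotsCache.set(origin, prefixes);
      return prefixes;
    }

    chrome.webNavigation.onCompleted.addListener(async ({ url, frameId }) => {
      if (frameId !== 0) return; // top-level page loads only, not iframes
      const u = new URL(url);
      if (u.protocol !== 'http:' && u.protocol !== 'https:') return;
      const disallowed = await disallowedPrefixes(u.origin);
      if (disallowed.some(p => u.pathname.startsWith(p))) return; // respect robots.txt
      submitPage(url); // send only the URL (or page text) upstream - the privacy question lives here
    });

Even this minimal version still has to decide what leaves the browser, which is exactly the trust problem raised above.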


Pretty sure most paywalled sites explicitly allow the Googlebot to enter. If you spoof your User-Agent to be that of the Googlebot, they check your IP address to make sure you really are Google.

The new fly in the crawler ointment is Cloudflare: if you're not the Googlebot and you hit a Cloudflare customer, you need to be running JavaScript so they can verify you're not a bot. It's a continual arms race.
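
For what it's worth, that IP check is the reverse-DNS verification Google itself describes: reverse-resolve the requesting IP, check the name ends in googlebot.com or google.com, then forward-resolve the name and make sure it comes back to the same address. A rough Node/TypeScript sketch of what a site might run (IPv4 only; the sample IP in the comment is purely illustrative):

    // verify-googlebot.ts - sketch of the reverse-then-forward DNS check
    import { promises as dns } from 'node:dns';

    async function isRealGooglebot(ip: string): Promise<boolean> {
      try {
        const names = await dns.reverse(ip); // e.g. crawl-66-249-66-1.googlebot.com
        for (const name of names) {
          if (!/\.(googlebot|google)\.com$/i.test(name)) continue;
          const addrs = await dns.resolve4(name); // forward-confirm the hostname
          if (addrs.includes(ip)) return true;
        }
      } catch {
        // unresolvable IP -> treat as not Googlebot
      }
      return false;
    }

    // isRealGooglebot('66.249.66.1').then(console.log); // true only for a genuine crawler IP

A spoofed User-Agent fails this because the attacker's IP won't reverse-resolve into Google's crawl domains.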


Google has instructions for paywalled sites to allow crawling. I suppose it brings them traffic when users click on a search result and arrive at the sign-up page.

https://developers.google.com/search/docs/advanced/structure...
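
If I remember the page right, those instructions boil down to schema.org paywalled-content markup: flag the page with isAccessibleForFree and point a CSS selector at the hidden part, so Google knows it's a paywall rather than cloaking. Sketched below as a TypeScript object a site would serialize into a <script type="application/ld+json"> tag; the headline and the ".paywalled" selector are placeholders:

    // paywall-markup.ts - sketch of the structured data for paywalled content
    const paywalledArticleLd = {
      '@context': 'https://schema.org',
      '@type': 'NewsArticle',
      headline: 'Example paywalled article',
      isAccessibleForFree: false, // the page is indexable but not freely readable
      hasPart: {
        '@type': 'WebPageElement',
        isAccessibleForFree: false,
        cssSelector: '.paywalled', // the section hidden behind the paywall
      },
    };

    // A site would emit JSON.stringify(paywalledArticleLd) inside a
    // <script type="application/ld+json"> element in the article's <head>.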



