
> We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.

Is there a concern that volunteers could manipulate results through their crawler?

You already mentioned distributed search engines have their own set of issues. I'm wondering if a simple centralised non-profit à la Wikipedia could work better to fund crawling without these concerns. One anecdote: personally I would not install a crawler extension, not because I don't want to help, but because my internet connection is pitifully slow. I'd rather donate a small sum that would go way further in a datacenter... although I realise the broader community might be the other way around.

[edit]

Unless the crawler was clever enough to merely feed off the sites I'm already visiting and use minimal upload bandwidth. The only concern then would be privacy. Oh, the irony, but trust goes a long way.




There are already loads of browser extensions that do a lot of screen scraping of all the sites you visit, without you even realizing it.


I can imagine, but I only use one: uBlock.


Yes, that is a concern. I'd probably worry about it if and when it started happening, however.


If you wait until then, it may be too late to mitigate, unless you have a plan to remain in complete control of who contributes.


Famous last words


There could be legal issues if the crawler starts to crawl into regions that should be left alone. I don't mean things like the dark web, but, for example, if someone is a subscriber to an online magazine, it could start crawling paywalled content if third-party cookies enable this.


Google already seems to crawl paywalled content somehow; this doesn't seem to be much of a legal issue since you cannot click through - it's just annoying as a user.

This might even be intentional, allowed through robots.txt... A browser extension that passively crawls visited sites could easily fetch robots.txt as its only extra, and minimal, download.
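
To make that concrete, here's a rough sketch of what such a passive extension could look like: a TypeScript background script that only reacts to pages the user already loads and checks robots.txt before reporting anything. The webNavigation permission and Chrome extension types are assumed, and submitPage() is a made-up upstream call standing in for whatever the search engine would actually accept.

    // background.ts - sketch of a "passive" crawler extension: it only looks at
    // pages the user already visits and respects robots.txt before submitting.
    declare function submitPage(url: string): void; // hypothetical upstream call

    const robotsCache = new Map<string, string[]>(); // origin -> disallowed path prefixes

    async function disallowedPrefixes(origin: string): Promise<string[]> {
      const cached = robotsCache.get(origin);
      if (cached) return cached;
      const prefixes: string[] = [];
      try {
        const res = await fetch(`${origin}/robots.txt`);
        if (res.ok) {
          let appliesToAll = false;
          for (const line of (await res.text()).split('\n')) {
            const [field, ...rest] = line.split(':');
            const value = rest.join(':').trim();
            if (/^user-agent$/i.test(field.trim())) appliesToAll = value === '*';
            else if (appliesToAll && /^disallow$/i.test(field.trim()) && value) prefixes.push(value);
          }
        }
      } catch {
        prefixes.push('/'); // robots.txt unreachable: skip the whole site to be safe
      }
      robotsCache.set(origin, prefixes);
      return prefixes;
    }

    chrome.webNavigation.onCompleted.addListener(async ({ url, frameId }) => {
      if (frameId !== 0) return; // top-level page loads only, not iframes
      const u = new URL(url);
      if (u.protocol !== 'http:' && u.protocol !== 'https:') return;
      const disallowed = await disallowedPrefixes(u.origin);
      if (disallowed.some(p => u.pathname.startsWith(p))) return; // respect robots.txt
      submitPage(url); // send only the URL (or page text) upstream - the privacy question lives here
    });

Even this minimal version still has to decide what leaves the browser, which is exactly the trust problem raised above.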


Pretty sure most paywalled sites explicitly allow the Googlebot to enter. If you spoof your User-Agent to be that of the Googlebot, they check your IP address to make sure you really are Google.

The new fly in the crawler ointment is Cloudflare: if you're not the Googlebot and you hit a Cloudflare customer, you need to be running JavaScript so they can verify you're not a bot. It's a continual arms race.
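
For what it's worth, that IP check is the reverse-DNS verification Google itself describes: reverse-resolve the requesting IP, check the name ends in googlebot.com or google.com, then forward-resolve the name and make sure it comes back to the same address. A rough Node/TypeScript sketch of what a site might run (IPv4 only; the sample IP in the comment is purely illustrative):

    // verify-googlebot.ts - sketch of the reverse-then-forward DNS check
    import { promises as dns } from 'node:dns';

    async function isRealGooglebot(ip: string): Promise<boolean> {
      try {
        const names = await dns.reverse(ip); // e.g. crawl-66-249-66-1.googlebot.com
        for (const name of names) {
          if (!/\.(googlebot|google)\.com$/i.test(name)) continue;
          const addrs = await dns.resolve4(name); // forward-confirm the hostname
          if (addrs.includes(ip)) return true;
        }
      } catch {
        // unresolvable IP -> treat as not Googlebot
      }
      return false;
    }

    // isRealGooglebot('66.249.66.1').then(console.log); // true only for a genuine crawler IP

A spoofed User-Agent fails this because the attacker's IP won't reverse-resolve into Google's crawl domains.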


Google has instructions for paywalled sites to allow crawling. I suppose it brings them traffic when users click on a search result and arrive at the sign-up page.

https://developers.google.com/search/docs/advanced/structure...
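
If I remember the page right, those instructions boil down to schema.org paywalled-content markup: flag the page with isAccessibleForFree and point a CSS selector at the hidden part, so Google knows it's a paywall rather than cloaking. Sketched below as a TypeScript object a site would serialize into a <script type="application/ld+json"> tag; the headline and the ".paywalled" selector are placeholders:

    // paywall-markup.ts - sketch of the structured data for paywalled content
    const paywalledArticleLd = {
      '@context': 'https://schema.org',
      '@type': 'NewsArticle',
      headline: 'Example paywalled article',
      isAccessibleForFree: false, // the page is indexable but not freely readable
      hasPart: {
        '@type': 'WebPageElement',
        isAccessibleForFree: false,
        cssSelector: '.paywalled', // the section hidden behind the paywall
      },
    };

    // A site would emit JSON.stringify(paywalledArticleLd) inside a
    // <script type="application/ld+json"> element in the article's <head>.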



