
How much compute, storage, and network speed would a minimal up-to-date web search engine need?



I'd estimate that Google uses ~1000 TB of fast storage, Bing 500 TB and Yandex 100 TB, so the most basic useful search engine would use at least... 10 TB?


"The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size."

https://www.google.com/intl/en_uk/search/howsearchworks/craw...


Actually, I doubt that this is a true statement rather than something meant to discourage others. Check out these queries:

https://www.google.com/search?q=1 (12B results)
https://www.google.com/search?q=an (9B results)
https://www.google.com/search?q=the (6B results)

If we estimate that about half of all English pages contain the article 'the' or 'an', doubling those counts puts the English web at roughly 15B pages. If half of all pages contain '1', the total number of pages is about 24B. If half of all pages are in English, the total is about 30B. So even the maximum estimate is far less than the "hundreds of billions". Similar numbers are at https://www.worldwidewebsize.com/
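A rough sketch of that back-of-envelope arithmetic in Python (the result counts are the approximate figures quoted above, and they change over time):

    # Approximate result counts quoted above, in billions of pages.
    results_bn = {"1": 12, "an": 9, "the": 6}

    # If roughly half of all English pages contain 'the' or 'an', doubling
    # either count bounds the English web at ~12-18B pages, call it ~15B.
    english_pages = (2 * results_bn["the"] + 2 * results_bn["an"]) / 2   # ~15

    # If half of all pages contain '1', the whole web is ~24B pages.
    total_from_digit = 2 * results_bn["1"]                               # 24

    # If half of all pages are English, the whole web is ~30B pages.
    total_from_english = 2 * english_pages                               # 30

    print(total_from_digit, total_from_english)   # 24 30.0 -- nowhere near hundreds of billions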


What is fast storage? Is that, for right now, the fastest SSDs available?


HDD is definitely not enough because of low IOPS; most likely Google keeps the index in RAM. I think NVMe should be good enough, though I don't know for sure.


Ah. I thought that fast storage meant a more specific type of storage that might be more expensive. I mean, 1,000 TB is expensive, but it's feasible to get to that scale with the right funding.


If you have to ask, you can't afford it


See, that's the problem with engineers today. AltaVista ran on 3x 300 MHz 64-bit processors with 512 MB of RAM back in 1995. Resources cost peanuts these days. It's just that we're so used to bloat and digital inflation that we can't even start considering unbloated implementations, since we perceive them as "not modern". Apparently we are also stuck in the centralized/ownership mentality. If you crowdsource search indexing and processing, it scales along with the number of users. Oh, and another bane of the internet is the need to monetize. FFS, if email were invented today, we would probably have to buy NFT postage stamps.


I'm sorry, what? Have you seen the rate at which data is being created today? I mean, if you want to index the 1995-sized web with your Raspberry Pi, go ahead, but it costs insane amounts of money to index the December 2021 web and keep the index up to date.

edit / full disclosure: I work at Google, but on nothing related to search


Is it really necessary to index EVERYTHING? It's true that we have much more data today than 26 years ago, but not all of these websites qualify or provide value (duplicate results, promotional content, outdated content).

The challenge then moves to the curation, but it's no longer infeasible.


I think the words "minimally viable" are being ignored here.

I think the OP's point is: assume you only have 10 or 100 terabytes of space and limited compute; how would you approach the problem? I assume 90% of Google's searches are probably served from less than 1% of its total index, not to mention that Google also keeps full cached versions of whole websites, including images.


I just eyeballed my browser history from the last 2-3 days and I'd estimate 15% is current/latest news related, some 25% is programming related, 15% is e-commerce stuff, the rest random crap. I'd imagine 10-100 TB can easily serve all of _my_ search space (even the links I didn't look at on page 10) from the past few years, but that's the thing -- it's just my search space. How do you serve the rest of the world? I wish I knew the answer :)


Well, Google doesn't know the answer either. Results are complete trash when you try to find something niche or in a local language.

The challenge isn't to index the entire web, it's to index the useful parts of it, and I think an index covering most of the useful web can be seeded quite easily with some community effort.


You might think that. My search engine runs off <$5k worth of consumer hardware on domestic broadband. It survived the Hacker News front page for a week and saw a sustained load of 8000 searches per hour for nearly a day.

It's got a fairly small index, but yeah, it's not particularly hardware-hungry.


Your project is fascinating, but 8000 queries/hr is just over 2 queries/sec. Even bursting up to 10x that (20 QPS), it doesn’t seem surprising that it can run on consumer hardware? Am I missing something?


Well, you've got to consider that a search engine isn't a blog that just fetches a file off disk and is done with it. A trivial search query might as well be that, but a non-trivial query may have to scan through dozens of megabytes of IDs to produce its response (given that these are the 800k documents that contain 'search', which of those contain 'engine', and which of those contain 'algorithm').
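That step is essentially an intersection of sorted posting lists. A minimal sketch of the idea in Python (a hypothetical in-memory index, not this engine's actual data structures):

    from bisect import bisect_left

    # Hypothetical index: term -> sorted list of document IDs.
    index = {
        "search":    [1, 4, 7, 9, 12, 15],
        "engine":    [2, 4, 9, 15, 20],
        "algorithm": [4, 9, 21],
    }

    def intersect(a, b):
        """Intersect two sorted doc-ID lists, skipping ahead in b with binary search."""
        out, i = [], 0
        for doc in a:
            i = bisect_left(b, doc, i)
            if i == len(b):
                break
            if b[i] == doc:
                out.append(doc)
        return out

    def query(terms):
        """AND query: start from the rarest term's list and keep narrowing it."""
        lists = sorted((index[t] for t in terms if t in index), key=len)
        result = lists[0] if lists else []
        for plist in lists[1:]:
            result = intersect(result, plist)
        return result

    print(query(["search", "engine", "algorithm"]))   # [4, 9]

Starting from the rarest term keeps the intermediate results small, but with hundreds of thousands of matches per common term there is still a lot of ID churn per query.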

I honestly don't know what the actual limit is; all I know is that it dealt with 2 QPS without affecting response times. But 2 QPS for a search engine is actually kind of a lot. Most people don't search that much; you get maybe a few queries per day per user. Put it this way: 2 QPS is what you'd expect if you had around a million regular users. That's not half bad for consumer hardware.


What's the size of your index in records?


I run three separate indices of about 10-20 million documents each, but I'm fairly far off any sort of limit (RAM- and disk-wise I'm at maybe 40%).

I'm confident 100 million is doable with the current code, maybe 0.5 billion if I did some additional space optimization. There is some low-hanging fruit that seems very promising: sorted integers are highly compressible, and right now I'm not compressing them at all.
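For reference, the standard trick alluded to is storing the gaps between consecutive sorted doc IDs as variable-length integers. A sketch (not this project's actual format):

    def encode(doc_ids):
        """Gap + varint encode a sorted doc-ID list (7 bits per byte, high bit = 'more')."""
        out, prev = bytearray(), 0
        for doc in doc_ids:
            gap, prev = doc - prev, doc
            while gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)
                gap >>= 7
            out.append(gap)
        return bytes(out)

    def decode(data):
        doc_ids, cur, shift, prev = [], 0, 0, 0
        for byte in data:
            cur |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                prev += cur
                doc_ids.append(prev)
                cur, shift = 0, 0
        return doc_ids

    ids = [1000, 1003, 1040, 1300, 9000]
    assert decode(encode(ids)) == ids
    print(len(encode(ids)), "bytes vs", len(ids) * 8, "for raw 64-bit IDs")   # 8 vs 40

The denser the posting list, the smaller the gaps and the better this compresses, which is exactly the common-terms case that costs the most space.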


Yes, doclist compression is a must: higher intersection throughput and less bandwidth stress. Are you loading your doclists from persistent storage? What is your current max RPS?


I'm loading the data off a memory-mapped SSD; trivial queries will probably be answered entirely from memory, although the disk-read performance doesn't seem terrible either.
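Roughly, that looks like the following (a sketch assuming a made-up file layout of raw little-endian uint32 doc IDs at a known offset; the real on-disk format is surely different):

    import mmap
    import struct

    def read_doclist(path, offset, count):
        """Read `count` uint32 doc IDs starting at byte `offset` from a memory-mapped file."""
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                # Hot ranges come straight from the OS page cache; cold ones hit the SSD.
                raw = mm[offset : offset + 4 * count]
                return list(struct.unpack(f"<{count}I", raw))

In practice you would keep the mapping open for the lifetime of the process instead of remapping per call; the point is that the OS page cache does the caching for you.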

> What is your current max RPS?

It depends on the complexity of the request, and repeated retrievals are cached, so I'm not even sure there is a good answer to this.


Fair point.

How is it to be funded?


Is the size of the common Web already way too large to play catch-up against Google/Bing at this point?

My dream is a distributed/P2P index: each browser contributes to storing part of the overall index and handles queries coming from other users, so that funding huge data centers never becomes a question.
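Purely as an illustration of how that could work (no existing protocol implied): shard the index by term, so a client can hash a query term onto a ring of peers and know whom to ask without any central coordinator.

    import hashlib
    from bisect import bisect

    def _h(s):
        """Stable 64-bit hash used to place both peers and terms on the ring."""
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def build_ring(peer_ids):
        return sorted((_h(p), p) for p in peer_ids)

    def peer_for(term, ring):
        """Consistent hashing: a term belongs to the first peer clockwise from its hash."""
        hashes = [h for h, _ in ring]
        i = bisect(hashes, _h(term)) % len(ring)
        return ring[i][1]

    ring = build_ring(["peer-a", "peer-b", "peer-c"])
    print(peer_for("kubernetes", ring), peer_for("pottery", ring))

Each browser would then store (and answer queries for) only the posting lists whose terms hash to it, and only a small fraction of terms has to move when a peer joins or leaves.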


  > Is the size of the common Web already way too large to play catch-up against Google/Bing at this point?
Probably. But I would prefer a search engine that didn't search the whole web. I would prefer a search engine that searched the sites related to fields that I'm interested in.

So I would pay for or donate to a search engine that provided me with good results in, e.g., software development. They could add additional fields as demand warrants, so long as quality is maintained. I would even like to see a faceting feature, so I could search for e.g. Matrix and get results on the mathematical concept when need be, without having to wade through movie reviews or fiddle with magic search keywords.


Not searching the whole Web makes sense, except that it isn't clear what belongs to your field of interest and what doesn't. Should indexing skip a blog page because the author usually writes about algorithms but here goes on and on about business while sporadically mentioning algorithmic technical aspects? And what if you want to search about pottery that Sunday morning? Turn to Google? I think segmenting search results by topic is a useful consumer query feature; I'm not sure segmenting what gets indexed would provide a useful service beyond covering niches, and hence not really fulfilling the role of a web search engine. I find the idea less ambitious, so maybe that's how an open search engine should approach the problem: a federation of hosts could cover the whole Web eventually.


  > Should indexing skip a blog page because the author usually writes about algorithms but here goes on and on about business while sporadically mentioning algorithmic technical aspects?
Yes, because that author's pages wouldn't even be fetched at the point of development that we are discussing.

  > And what if you want to search about pottery that Sunday morning? Turn to Google?
Yes, Google still exists. Why not?


Why not Google? What I find most beneficial in a search engine alternative is being able to fetch results on topics I'm less knowledgeable about.

I'm somewhat fine with very commercial search engines when searching for something within a field I'm familiar with: e.g. if I search for "kubernetes configmap best practice", skimming past infomercial sites, ads, and junk is trivial. If I search for "baby milk safety", I have no clue how to digest the results list. I'm totally unfamiliar with the media outlets catering to parenting, nutrition, or food health; I know absolutely no authors in the field, no renowned outlets for tips, not even brand reputations, so I can't make a somewhat informed decision on what to skim and what to read with attention.

But I don't disagree that an engine focusing on specific fields provides great value, and perhaps that's where it should start to have a chance to win.


Check out YaCy. It's great if you're happy with slow search!


I had problems with YaCy: slow search and slow crawling. Regarding search, I think it could be improved if a more efficient protocol (UDP) were used instead of HTTP requests to other peers. Regarding crawling, it's quite possible that doing this in Java and running it on a Rock64 isn't the best combination (: It started OOMing after some days.


YaCy's results weren't great, and the ranking was very bad, letting sites game it by just spamming a ton of keywords.


The plan is to fund it through donations


Can you make it a contributory database? I wouldn't mind "donating" my browsing history and page downloads to build the index and train the algorithm.

You'd have to find a way to verify reputation to make sure no bad actors could contribute.


How does the Internet Archive verify dumps submitted by Archive Team and other groups? This may already be a solved problem.

Not knowing their implementation details, I'm guessing it could be doable without reinventing much. An oracle could dispatch P2P archive jobs to a pool of clients randomly assigned tasks, with both the first to archive and the first to validate being recognized by the swarm somehow, plus periodic re-archiving and re-verification at a rate adjusted by the popularity of the site and of the search keywords.
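A very rough sketch of that dispatch-and-verify idea (all names hypothetical, and real-world dynamic pages would need fuzzier comparison than an exact content hash):

    import hashlib
    import random
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Peer:
        id: str
        fetch: Callable[[str], bytes]   # however this peer downloads a URL

    def dispatch(url, peers):
        """Hand the URL to one random peer to archive and a second to independently verify it."""
        archiver, verifier = random.sample(peers, 2)
        digest = hashlib.sha256(archiver.fetch(url)).hexdigest()
        if hashlib.sha256(verifier.fetch(url)).hexdigest() != digest:
            return None   # disagreement: re-dispatch, or flag one of the peers
        return {"url": url, "sha256": digest,
                "archived_by": archiver.id, "verified_by": verifier.id}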


Yes, I'm planning to do something like this.



