You might think that. My search engine runs on less than $5k worth of consumer hardware over domestic broadband. It survived the Hacker News front page for a week and saw a sustained load of 8,000 searches per hour for nearly a day.
It's got a fairly small index, but yeah, it's not particularly hardware-hungry.
Your project is fascinating, but 8000 queries/hr is just over 2 queries/sec. Even bursting up to 10x that (20 QPS), it doesn’t seem surprising that it can run on consumer hardware? Am I missing something?
Well, you have to consider that a search engine isn't a blog, where you just fetch a file off disk and are done with it. A trivial search query might as well be, but a non-trivial query may have to churn through dozens of megabytes of document ids to produce its response (given these are the 800k documents that contain 'search', which of those also contain 'engine', and which of those contain 'algorithm'?).
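Roughly, the core of that lookup is intersecting sorted lists of document ids, one list per term. Something like this two-pointer walk is the basic shape of it (a sketch of the general technique, not the engine's actual code):

```java
// Intersect two sorted doclists (arrays of document ids) with a two-pointer walk.
// Illustrative sketch only; the real on-disk format isn't described here.
static int[] intersect(int[] a, int[] b) {
    int[] out = new int[Math.min(a.length, b.length)];
    int i = 0, j = 0, n = 0;
    while (i < a.length && j < b.length) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { out[n++] = a[i]; i++; j++; }
    }
    return java.util.Arrays.copyOf(out, n);
}

// "search AND engine AND algorithm" is then a chain of intersections,
// ideally starting from the smallest list to keep intermediate results small:
// int[] hits = intersect(intersect(algorithmDocs, engineDocs), searchDocs);
```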
I honestly don't know what the actual limit is; all I know is that it handled 2 QPS without response times suffering. But 2 QPS is actually kind of a lot for a search engine. Most people don't search that much, maybe a few queries per day. Put it this way: 2 QPS is roughly what you'd expect with around a million regular users. That's not half bad for consumer hardware.
I run three separate indices at about 10-20mn documents each, but I'm fairly far off any sort of limit (RAM- and disk-wise I'm at maybe 40%).
I'm confident 100mn is doable with the current code, maybe 0.5bn if I did some additional space optimization. There is some low-hanging fruit that seems very promising: sorted integers are highly compressible, and right now I'm not compressing them at all.
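The usual trick is to store the gaps between consecutive ids (which are small) and varint-encode them. A rough illustration of what I mean, not the project's actual format:

```java
import java.io.ByteArrayOutputStream;

// Delta + varint encoding of a sorted doclist: store gaps between ids, then
// write each gap in as few bytes as it needs (7 bits per byte, high bit set
// means "more bytes follow"). Sketch only.
static byte[] compress(int[] sortedIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int id : sortedIds) {
        int gap = id - prev;
        prev = id;
        while ((gap & ~0x7F) != 0) {
            out.write((gap & 0x7F) | 0x80);
            gap >>>= 7;
        }
        out.write(gap);
    }
    return out.toByteArray();
}
```

For dense doclists the gaps are tiny, so you end up paying a byte or two per document instead of four.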
Yes, doclist compression is a must: higher intersection throughput and less bandwidth stress. Are you loading your doclists from persistent storage? What is your current max RPS?
I'm loading the data off a memory-mapped file on an SSD. Trivial queries will probably be answered entirely from memory, although the disk-read performance doesn't seem terrible either.
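In Java that's basically a one-liner (a sketch, not the actual index code); the OS page cache then keeps the hot parts of the index in RAM and falls back to SSD reads for the cold parts:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map an index file into memory; the file name is made up for the example.
// Reads through the buffer go via the page cache, so no explicit read() calls.
static MappedByteBuffer mapIndex(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
        return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
}
```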
> What is your current max RPS?
It depends on the complexity of the request, and repeated retrievals are cached, so I'm not even sure there is a good answer to this.
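The caching itself doesn't need to be anything fancy; something like a bounded LRU map over query results would do (hypothetical sketch, the actual caching strategy isn't described here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A simple LRU cache for query results: LinkedHashMap in access order,
// evicting the least recently used entry once the cache is full.
class QueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    QueryCache(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder = true
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```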