Hey, great project - the more competition in this space the better. To be honest, at the moment the algorithm doesn't return any sensible results for anything (at least that I can find), but I hope you can find a way past this, as it's a great space to have a project in.
I've included some search terms below that I've tried - I've not cherrypicked these and believe they are indicative of current performance. Some of these might be down to the size of the index - however I suspect it's actually how the search is being parsed/ranked (in particular, I think the top two examples show that).
> Search "best car brands"
Expected: Car Reviews
Returns a page showing the best mobile phone brands.
then...
> Then searching "Best Mobile Phone"
Expected: The article from the search above.
Returns a gizmodo page showing the best apps to buy... "App Deals: Discounted iOS iPhone, iPad, Android, Windows Phone Apps"
> Searching "What is a test?"
Expected result: Some page describing what a test is, maybe wikipedia?
Returns "Test could confirm if Brad Pitt does suffer from face blindness"
> Searching "Duck Duck Go"
Expected result: DDG.com
Returns "There be dragons? Why net neutrality groups won't go to Congress"
> Searching "Google"
Expected result: Google.com
Returns: An article from the independent, "Google has just created the world’s bluest jeans"
I guess that's the real problem. People like to wonder what would be the "ideal world" in a search engine. It may be wishful thinking, I don't know.
It seems really hard to produce quality search results. Takes a lot of investment. Makes it an expensive product. But no one wants to pay. So selling ads is the only way forward.
Maybe there's a way to convince people to pay what it takes? I dunno...
I would gladly pay $5 a month for a Google-quality search service that doesn’t track me. I’ve been using Duck Duck Go for most of this year, but frequently find myself falling back to !g because Google’s results really are much better.
I wonder how much money Google search makes from the average user. Is it more than $5/mo?
It is a bit above your price point, but I have been using Kagi.com (not affiliated, just impressed). They're in beta, but will charge ~$10 once they go GA. Like you, I tried DuckDuckGo for a while, but resorted to !g so often that I started using it for everything out of habit.
In contrast, Kagi provides Google-quality results most of the time, better-than-Google semi-often, and worse-than-Google rarely. They support !g, but I only use it a couple of times a week, usually for site-specific searches.
Additionally, I really like that I am their customer and not their product - incentives are aligned for them to continue respecting my privacy and preferences.
One only uses Google services if one absolutely needs to. Google, on the other hand, never needs you. You are absolutely unnecessary and super easy to replace. Whichever you choose, the next search engine really needs our queries. If we give them enough, they might be able to create a competitive product. If they do, Google will dramatically improve. I'm sure they have plenty of ideas; the incentive is just not there.
The problem with this approach is that the group most likely to pay to remove ads is also the same group that is most attractive to advertisers. So your pool of users to whom you serve ads is less attractive overall. That means you need to figure out not just how much you make serving those users ads, but also how much you lose by removing them from the ad viewer pool.
Likewise, I'd pay that as well (I'd like no ads as well, but tracking is the main issue for me).
Similarly, I do have DDG as my main search on all machines and devices just out of principle, but its region-aware searching (I'm in NZ, and often only want NZ results) is very close to useless in my experience (with NZ as the region ticked, it will still return results from .ca and .co.uk domains, which I would have hoped would be almost trivial to remove), and Google seems much better in this area (but not perfect).
Similarly, there's often technical/programming things I'll search for that DDG doesn't have indexed at all, and Google does.
Google also seems a lot better at ignoring spelling differences (color/colour, favorite/favourite) than DDG, which is often (but not always!) useful.
While search funds most of Google, most employees at Google don't work on search. A small proportion of Google's revenue would be plenty for good quality search.
I've been enjoying Neeva a lot - not affiliated at all, just a happy user :-)
I think they are $4.95/mo or something? I haven't paid a cent yet since they offer a few discounts to prompt you to learn how to use it (I really liked that, and it definitely made me more likely to stick with it!)
$5 per month is sadly out of budget for many third world citizens because that $5 can be urgently used elsewhere to plug a need. I think somewhere in the range of $0.10-$0.50 might be doable though.
If you have specific areas of interest that are unpopular enough, your local YaCy instance can index those [say] 100 or 10 000 websites and the results will blow you away. Google is still useful of course, but it's a joke by comparison.
Say for laughs you are only interested in yourself. You put every page by you and about you in the crawler. It will obviously render fantastic results. Using Google you would have to start every query with your full name OR user names, and whatever you type behind it doesn't even matter - it won't return pages with all keywords. With YaCy you just type the query and it will return EVERYTHING. To compare the two would be to compare useless with perfection.
The first thing I do to test a search engine is to search for my own username on various public sites to see if it can find me. It didn’t find me. But keep it up and I’m sure I’ll be in there eventually (or maybe I overestimate how interesting I am, hehe).
I get that this is your usual testing case for search engines, but if you'd read their README you'd have seen it's inappropriate for the project at the current stage.
I was curious and tried a bunch of other searches, with similarly disappointing results. My searches were a bit more esoteric than Closi's.
"langlands program" (pure mathematics thing): yup, top result is indeed related to the Langlands program, though it isn't obviously what anyone would want as their first result for that search. Not bad.
"asmodeus" (evil spirit in one of the deuterocanonical books of the Bible, features extensively in later demonology, name used for an evil god in Dungeons & Dragons, etc.): completely blank page, no results, no "sorry, we have no results" message, nothing. Not good.
"clerihew" (a kind of comic biographical short poem popular in the late 19th / early 20th century): completely blank page. Not good.
"marlon brando" (Hollywood actor): first few results are at least related to the actor -- good! -- but I'd have expected to see something like his Wikipedia or IMDB page near the top, rather than the tangentially related things I actually god.
"b minor mass" (one of J S Bach's major compositions): nothing to do with Bach anywhere in the results; putting quotation marks around the search string doesn't help.
"top quark" (fundamental particle): results -- of which there were only 7 -- do seem to be about particle physics, and in some cases about the top quark, but as with Marlon Brando they're not exactly the results one would expect.
"ferrucio busoni" (composer and pianist): blank page.
"dry brine goose" (a thing one might be interested in doing at this time of year): five results, none relevant; top two were about Untitled Goose Game.
"alphazero" (game-playing AI made by Google): blank page. Putting a space in results in lots of results related to the word "alpha", none of which has anything to do with AlphaZero.
OK, let's try some more mainstream things.
"harry potter": blank page. Wat. Tried again; did give some results this time. They are indeed relevant to Harry Potter, though the unexpected first-place hit is Eric Raymond's rave review of Eliezer Yudkowsky's "Harry Potter and the Methods of Rationality", which I am fairly sure is not what Google gives as its first result for "harry potter" :-).
"iphone 12" (confession: I couldn't remember what the current generation was, and actually this is last year's): top results are all iPhone-related, but first one is about the iPhone 6, second is from 2007, this is about the iPhone 6, fourth is from 2007, fifth is about the iPhone 4S, etc.
"pfizer vaccine": does give fairly relevant-looking results, yay.
Thanks for the detailed feedback! I think most of the problems here are because we have a really small index right now. Increasing the number of documents is our top priority. I agree that some kind of feedback when there are no results would be a good idea.
I actually think it’s probably the algorithm too - if I take one of the search items returned from a search that I know is in the index, but then search for it with slightly different terminology (or a different tense / pluralisation), the same item doesn’t come up.
> We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.
Is there a concern that volunteers could manipulate results through their crawler?
You already mentioned distributed search engines have their own set of issues. I'm wondering if a simple centralised non-profit fund a la Wikipedia could work better to fund crawling without these concerns. One anecdote: personally I would not install a crawler extension, not because I don't want to help, but because my internet connection is pitifully slow. I'd rather donate a small sum that would go way further in a datacenter... although I realise the broader community might be the other way around.
[edit]
Unless the crawler was clever enough to merely feed off the sites I'm already visiting and use minimal upload bandwidth. The only concern then would be privacy - oh, the irony - but trust goes a long way.
There could be legal issues if the crawler starts to crawl into regions which should be left alone. I don't mean things like the dark web, but for example if someone is a subscriber to an online magazine it could start crawling paywalled content if 3rd party cookies enable this.
Google already seems to crawl paywalled content somehow, this doesn't seem to be much of a legal issue since you cannot click through - it's just annoying as a user.
This might even be intentional, via robots.txt ... A browser extension that passively crawls visited sites could easily download robots.txt as its single extra, but minimal, download requirement.
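A minimal sketch of that check using Python's standard library (the user-agent name is made up, and this isn't the project's crawler - just the general shape of a passive extension deciding whether a visited page may be submitted):

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical user agent name; the submit logic itself is out of scope here.
USER_AGENT = "mwmbl-volunteer-crawler"

def allowed_to_submit(page_url: str) -> bool:
    parts = urlsplit(page_url)
    robots = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()                 # the single extra, but minimal, request
    except OSError:
        return False                  # be conservative if robots.txt is unreachable
    return robots.can_fetch(USER_AGENT, page_url)

if __name__ == "__main__":
    print(allowed_to_submit("https://en.wikipedia.org/wiki/Web_crawler"))
```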
Pretty sure most paywalled sites explicitly allow the googlebot to enter. If you spoof your UserAgent to be that of the googlebot they check your IP address to make sure you really are Google.
The new fly in the crawler ointment is Cloudflare: If you're not the googlebot and you hit a Cloudflare customer you need to be running javascript so they can verify you're not a bot. It's a continual arms race.
Google has instructions for paywall sites to allow crawling. I suppose it brings them traffic when users click on a search result and arrive at the sign up page.
The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.
That little thought experiment is true for many online services, from social networking to (marginally) publishing. But nowhere is it more true than for search results, which differ in two fundamental ways: being text-only, they don't bother me to anywhere near the degree of other ads. And, second, they are an order of magnitude more valuable than drive-by display ads, because people have indicated a need and a willingness to visit a website that isn't among their bookmarks. These two, combined, make this the worst possible case for replacing an ad-based business with a donation model.
The idea mentioned in this readme that "Google intentionally degrades search results to make you also view the second page" is also wrong, bordering on self-delusion. The typical answer to conspiracy theories works here: there are tens of thousands of people at Google. Such self-sabotage would be obvious to many people on the inside, far too many to keep something like this secret.
I would consider Google randomly excluding the most relevant words from my search query intentionally degrading results. It's incredibly frustrating. This shouldn't be the default behavior, maybe an optional link the user can click to try again with some of the terms excluded.
Yes, I know verbatim mode exists, but I always forget to enable it, and the setting eventually gets lost when my cookies are cleared or something.
Unfortunately I can't switch to another search engine because in my experience every other search engine has far inferior results, despite not having the annoying behaviors Google does. DuckDuckGo is only useful for !bangs for me.
> The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.
Is this relevant for non-profit project? Do you pay $30/year for Wikipedia?
DuckDuckGo is profitable despite not blanketing the first page with ads (just like Google once upon a time); you can also have no ads at all if you like. Do they make money in other ways? Sure, but not in a way that degrades the user experience.
Do you have a source for this figure? Maybe it's mostly true for the "average non-tech-savvy user in an English-speaking country", but I've found DuckDuckGo and everything other than Google inferior in many cases, especially when looking for Hungarian content.
Sorry - the "average non-tech-savvy user in an English-speaking country" is more or less what I meant. Although this would have been true for Google also in the early days.
1. What is the rationale behind choosing Python as an implementation language? Performance and efficiency are paramount in keeping operational costs low and ensuring a good user experience, even if the search engine will be used by many users. I guess Python is not the best choice for this, compared to C, Rust or Java.
2. What is the rationale behind implementing a search engine from scratch versus using existing Open Source search engine libraries like Apache Lucene, Apache Solr and Apache Nutch (crawler)?
Is keeping performance in mind and choosing the tech stack accordingly really a premature optimization?
This might be the most abused phrase in CS history. Perhaps we should add "Premature optimization fallacy" to the list of cognitive errors programmers use as an excuse to not seriously think about performance.
> Is keeping performance in mind and choosing the tech stack accordingly really a premature optimization? This might be the most abused phrase in CS history.
And it's often misquoted, and/or taken out of context. The full quote goes like this: "The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."
Back then - this was the early 1970s, remember - CPU cycles were costly.
But like all theories, it had its time in the sun. We have the same "problem" today, just in a different form. We call it "agile" now, though: make sure that the customer is happy before the programmer is happy. If the programmer is allowed to spend too much time trying to become happy, the customer is either gone, or someone else has come up with a better solution.
Regarding your specific "is keeping performance in mind and choosing the tech stack accordingly really a premature optimization" question, and keeping in mind OP's endeavour, you're on the right track. But the real question is how programmers _get there_.
By experience.
And in turn, by relating to clients more directly these days, programmers have to adhere to constraints that used to be kept away from them back in the day. And your business isn't worth shit without customers, even if you have the best programmers who can create the best code from day one.
Hence Knuth's quote, translated: "if you spend so much time planning your journey that you miss your flight, you are getting nowhere."
Agreed, this was my thinking - and since I'm better at Python, it's faster for me to get stuff done. I would like to rewrite it in Rust though, all help from Rustaceans gladly accepted!
In general, speed isn't the problem with search (at least the retrieval aspect), but memory efficiency is. Things like low per-object overhead and the ability to memory-map large data ranges are extremely beneficial in a language if you want to implement a search index.
But I agree, get it working first, then re-implement it in another language if it turns out to be necessary.
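To illustrate the memory-mapping point, here's a minimal sketch (the file layout is invented, not the project's index format): a term's postings stored as sorted 64-bit doc IDs in a flat file, binary-searched through mmap so a lookup only touches the pages it needs.

```python
import mmap
import struct

DOC_ID = struct.Struct("<Q")   # one unsigned 64-bit integer per posting

def write_postings(path, doc_ids):
    with open(path, "wb") as f:
        for doc_id in sorted(doc_ids):
            f.write(DOC_ID.pack(doc_id))

def contains(mm, doc_id):
    """Binary search over the memory-mapped posting list."""
    lo, hi = 0, len(mm) // DOC_ID.size
    while lo < hi:
        mid = (lo + hi) // 2
        value, = DOC_ID.unpack_from(mm, mid * DOC_ID.size)
        if value == doc_id:
            return True
        if value < doc_id:
            lo = mid + 1
        else:
            hi = mid
    return False

if __name__ == "__main__":
    write_postings("postings.bin", [3, 17, 42, 99, 1024])
    with open("postings.bin", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(contains(mm, 42), contains(mm, 43))   # True False
```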
"Ecosia is a search engine based in Berlin, Germany. It donates 80% of its profits to nonprofit organizations that focus on reforestation" [1]
"80% of profits will be distributed among charities and non-profit organizations. The remaining 20% will be put aside for a rainy day." [2]
"Ekoru.org is a search engine dedicated to saving the planet. The company donates 60% of revenue generated from clicks on sponsored search results to partner organizations who work on climate change issues" [3]
While Ecosia is not technically a non-profit, no one can sell Ecosia shares at a profit.
> Ecosia says that it was built on the premise that profits wouldn’t be taken out of the company. In 2018 this commitment was made legally binding when the company sold a 1% share to The Purpose Foundation, entering into a ‘steward-ownership’ relationship.
> The Purpose Foundation's steward-ownership of Ecosia legally binds Ecosia in the following ways:
- Shares can't be sold at a profit or owned by people outside of the company and
- No profits can be taken out of the company.
Also https://searchmysite.net/ for personal and independent websites (essentially a loss-leader for its open source self-hostable search as a service).
In the early days of Web 2.0, it was very in for things to be spelled unpronounceably. For the life of me I can only remember twttr (early Twitter), but I wanna say Spotify also had an unreadable name in the early days.
I have this feeling that most of the time I "search" for something I already know what I'm looking for, but Google, via Firefox's omnibox, is just the fastest way to get there, even though it's a bit indirect. Are they getting paid for that, or am I costing them money in the short term, but they get to build up a profile on me to provide more effective ads later?
I wonder if it's possible to take advantage of that type of search by putting a facade in front of the "search engine" and, based on the search term and the private local user history, either going direct to a known site or, if it seems a search is needed, going to a specific search engine. This may open up opportunities for, say, programming-language-specific search engines, searches specific to a program's error messages, or shopping-for-X sites.
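Something like this hypothetical sketch of the facade (the history entries, routing rules and URLs are all invented for illustration):

```python
from urllib.parse import quote_plus

# Invented examples: things my history says I navigate to directly.
LOCAL_HISTORY = {
    "hacker news": "https://news.ycombinator.com/",
    "python docs": "https://docs.python.org/3/",
}

# Invented topic hints: queries that look like error messages go to a
# programming-specific search instead of a general engine.
TOPIC_RULES = [
    ("traceback", "https://stackoverflow.com/search?q={q}"),
    ("error", "https://stackoverflow.com/search?q={q}"),
]

def route(query: str) -> str:
    q = query.strip().lower()
    if q in LOCAL_HISTORY:                         # navigational: skip search entirely
        return LOCAL_HISTORY[q]
    for keyword, template in TOPIC_RULES:          # topic-specific engine
        if keyword in q:
            return template.format(q=quote_plus(query))
    return "https://duckduckgo.com/?q=" + quote_plus(query)   # general fallback

print(route("hacker news"))
print(route("KeyError traceback when parsing json"))
```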
I bookmark every site I might possibly want to revisit - make a habit of Ctrl+D. They're totally unsorted, but the key is to wipe the regular history on exit, leaving only the bookmarks as source material for completion. That way I can type something in the url bar and get completion to interesting sites. The url bar (or omnibox) matches on page title as well as the actual address, so it's easy, and always faster than a search engine.
If you set DuckDuckGo as your default search provider, you can use bangs in the omnibox.
Also, you can toggle between local-area and global search. See https://duckduckgo.com/bang e.g. !yt !osm !gi
Most wikis or resource/documentation sites have a local search bar on their homepage, and Firefox has a feature that lets you add a search keyword for that specific site. So if you add, say, pydocs as a keyword for docs.python.org, you can type "@pydocs <query>" and it looks up the query on that site.
This is a business model I've been thinking about: what if users earned credits for running a crawler on their machine? In other words, as much as I hate crypto scams, a "tokenized" search engine where the "mining" power was put to good use, i.e. crawling and indexing.
Well they are in a sense: you can just do the task yourself. It’s expensive of course, so you can use methods applied to human labelling for ML: _periodically_ injecting tasks with known results and checking how trustworthy the party is, vending the task to multiple parties and aggregating results, blocking parties that make many “mistakes”, etc.
You build a system based on trust but verify.
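A minimal sketch of the "inject tasks with known answers" part (the tasks, thresholds and reward numbers are all made up for illustration):

```python
import random
from collections import defaultdict

GOLD_TASKS = {  # URLs whose correct content hash the coordinator already knows
    "https://example.com/known-page": "hash-abc123",
}

reputation = defaultdict(lambda: 1.0)      # every worker starts with neutral trust

def pick_task(real_tasks):
    """Occasionally hand out a gold task instead of a real one."""
    if real_tasks and random.random() > 0.1:
        return real_tasks.pop()
    return random.choice(list(GOLD_TASKS))

def record_result(worker, task, result):
    """Update trust on gold tasks; return whether the worker is still trusted."""
    if task in GOLD_TASKS:
        if result == GOLD_TASKS[task]:
            reputation[worker] = min(reputation[worker] + 0.05, 2.0)
        else:
            reputation[worker] *= 0.5      # wrong answer on a known task: halve trust
    return reputation[worker] > 0.25       # below threshold: stop using this worker

print(record_result("worker-1", "https://example.com/known-page", "hash-wrong"))
```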
If the output is the result of a known deterministic program on a known input, anyone can verify it.
So they just have to sign their work. If someone later finds that they have lied and provided a false result, they lose reputation/staked coins.
The second associated problem is how one would prevent them from appropriating the work of others by just re-signing it.
One way would be to allow the worker to introduce a few deliberate errors, but have a secret joker that allows them to pass the challenge of a failed verification.
An alternative way is based on data malleability. The worker picks a secret one-way function and computes F(data + secretFunction(data, epsilon)) ≈ F(data), then publishes the values of secretFunction(data, epsilon) but not the secretFunction itself. Only someone with knowledge of the secretFunction can make a claim on the work done. If there is a challenge, only the real worker will be able to publish the secret of the secretFunction (or use some zero-knowledge proof to convince you they know it).
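A simpler flavour of the same goal - only the holder of a secret can later claim the work - can be sketched with a keyed commitment. This is illustrative only, not the exact malleability scheme above:

```python
import hashlib
import hmac
import os

# The worker publishes the snapshot plus a tag; only whoever holds the secret
# key can later prove the tag (and hence the work) is theirs.

def commit(result, secret_key):
    return hmac.new(secret_key, result, hashlib.sha256).digest()

def prove(result, tag, revealed_key):
    return hmac.compare_digest(commit(result, revealed_key), tag)

secret = os.urandom(32)                       # kept private until challenged
snapshot = b"<html>snapshot of example.com</html>"
tag = commit(snapshot, secret)                # published alongside the snapshot

# During a dispute the worker reveals the key (or gives a zero-knowledge proof
# of knowing it), and anyone can check the tag really belongs to them.
assert prove(snapshot, tag, secret)
print("claim verified")
```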
That's why you don't index the webpage but a snapshot of it.
For example you index the Common Crawl archives, or some content-addressable storage like IPFS or a torrent file.
I think crawling and indexing should be treated differently.
Indexing is about extracting value from the data, while crawling is about gathering data.
Once a reference snapshot has been crawled, the indexing task is more easily verifiable.
The crawling task is harder to verify, because external websites could lie to the crawler. So the sensible thing to do is have multiple people crawl the same site and compare their results. Every crawler will publish its snapshots (which may or may not contain errors), and then it's the job of the indexer to combine the snapshots from the various crawlers, filter the errors out and do the de-duplication.
The crawling task is less necessary now than it was a few years ago, because there is already plenty of available data. Also, most of the valuable data is locked in walled gardens, and companies like Cloudflare make crawling difficult for the rest of the fat tail. So it's better to only have data submitted to you, and outsource the crawling.
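A toy sketch of the "multiple crawlers, compare snapshots" step (crawler IDs, quorum and page bodies are invented for illustration):

```python
import hashlib
from collections import Counter

def content_hash(body):
    return hashlib.sha256(body).hexdigest()

def pick_canonical(snapshots, quorum=2):
    """snapshots maps crawler id -> the page body that crawler reported."""
    counts = Counter(content_hash(body) for body in snapshots.values())
    best_hash, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None                       # no agreement yet; schedule a re-crawl
    for body in snapshots.values():
        if content_hash(body) == best_hash:
            return body

reports = {
    "crawler-a": b"<html>the real page</html>",
    "crawler-b": b"<html>the real page</html>",
    "crawler-c": b"<html>spam injected by a liar</html>",
}
print(pick_canonical(reports))            # the honest majority's snapshot wins
```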
One idea I have been kicking around is the idea of federating the indices, not the crawling.
If every contributor maintained their own index, then you could reward contributors based on how many hits their index generated.
This would open up the possibility of people maintaining indices for specialized topics that they were experts in, and give the federated search engine a shot at taking on Google.
You don’t want to reward people for quantity, but quality. The cost of creating a new webpage is effectively zero, so if you attach an incentive for creating them you are doomed.
If the miner chose what to crawl, they would be able to deliver any crap, so content deliveries would have to be judged in some way and rewarded differently.
If the pool provides addresses to crawl, the miner could be given crafted/dedicated URIs from time to time and lack of delivery of Proof Of Crawl could result in a penalty chosen in a way rendering "cheating" unprofitable. But then fresh URIs have to come from somewhere.
There would have to be some aspect of centralized moderation, I suppose. This is beyond my knowledge: Is there a way to accept output only from signed binaries, so that we assume if X cycles of work were performed by a signed binary, then it is legitimate output?
No, not really. There may be some theoretical way using ZK-SNARKs or encrypted enclaves (Intel SGX), but it's not practical. It also probably doesn't work because the oracle/enclave still needs input (raw crawl data / network HTTP bytes) which has to be trusted. One way could be for the project to make an unbreakable mining/crawling chip/box/OS/anti-cheat layer with a VM and supply it to everyone, each with different levels of breakability and complexity to build.
One way is to send the crawl_task to N random nodes and accept the result that is most similar across them?
Another way could be to build a messy network to solve a messy problem: build a reputation-based graph network where you accept index data from nodes you trust, so people will start unfollowing misbehaving nodes. There is no universal root view of the network; instead it's dynamic and different from the perspective of each node.
Or it could have one root view, if we store reputation data in a blockchain and use some type of quadratic voting to modify the chain?
Yeah, Bitcoin showed us a way to build a mathematically secure system without any trusted party, but it could do that because the problem it was solving is mathematically provable. For a problem like collecting and indexing crawl data, you have to trust somebody.
YaCy is decentralized, but without the credit system. Some tokens, like QBUX, have tried to develop decentralized hosting infrastructure.
I also have been wondering how this would play out with some kind of decentralized indexes. The nodes could automatically cluster with other nodes of users sharing the same interests, using some notion of distances between query distributions. The caching and crawling tasks could then be distributed between neighbors.
A big part of the problem I see with decentralized search is that you basically need to traverse the index in orthogonal axes to assemble search results. First you need to search word-wise in order to get result candidates, then sort them rank-wise to get relevant results (this also hinges upon an agreed-upon ranking of domains). That's a damn hard nut to crack for a distributed system.
Crawling is also not as resource consuming as you might think. Sure you can distribute it, but there isn't a huge benefit to this.
Actually I have an idea for you: I think you can use cryptography to prove that an SSL session really happened. So you could prove indexing of HTTPS sites.
Make it open source and syndicate it. The goal is to get people to contribute both resources and code. Think of Shopify as the model, where many people contribute to create a huge shopping place. People care about their own shop only, but ultimately they create a useful shopping area.
Also set up a foundation to guide its development and be able to hire a management team.
The real challenge is not the code development but setting up an organization that will outlast all the challenges that will appear. Wikipedia is the model to follow.
Do you yearn for explainability due to getting irrelevant search results? Is what you're searching for more specialized than what the public might consider general knowledge?
It’s really fast - nice job! Can you elaborate on the ranking algorithm you are using? It seems that this will become more important as you index more pages.
Thanks! A really simple one for now: number of matching terms, and then prioritising matches earlier in the result string. But this is something I'm looking forward to working on properly when I get a bigger index.
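In toy form, that looks something like this (a simplified sketch, not the actual code - tokenisation and the tie-break are made up for illustration):

```python
import re

def score(query, text):
    """Rank by number of matching terms, then by how early the first match is."""
    terms = re.findall(r"\w+", query.lower())
    text_lower = text.lower()
    matches = [term for term in terms if term in text_lower]
    earliness = 0.0
    if matches:
        first = min(text_lower.index(term) for term in matches)
        earliness = 1.0 / (1.0 + first)   # earlier first match -> higher score
    return (len(matches), earliness)

results = [
    "Best mobile phone brands of 2021",
    "Best car brands ranked by reliability",
]
print(sorted(results, key=lambda r: score("best car brands", r), reverse=True))
```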
I also want to incorporate a community aspect to ranking, allowing upvoting and downvoting of results. I've not yet figured out how to reconcile this idea with not having any tracking though. Perhaps a separate interface for logged-in users.
One ambitious project I've thought about over and over again over the years is search (and social sites / forums) where the votes, tags, and flags make a public dataset and users can manipulate their own weights (or even the ranking algorithm) to construct a "web of trust" that yields favorable results.
This way you can escape spammers, powertripping moderators, and the tyranny of the hive mind; it doesn't matter if there's a large population of spammers, shills, and idiots upvoting crap because you set their weights to zero (or negative). In fact, that becomes a feature, because by upvoting crap, they generate a crap filter for you. If the weights are also public, then you can automatically & algorithmically seed your web of trust (simplest algo for sake of example: give positive weight to identities who upvoted and downvoted the same way you did) but you could still override the algo with manually set values if it gave too much weight to bad actors.
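A toy sketch of that weighting (the identities, votes and weights are invented for illustration):

```python
my_weights = {           # how much I trust each pseudonymous identity
    "alice": 1.0,
    "bob": 0.5,
    "spammer42": -1.0,   # negative weight: their upvotes count against a result
}

votes_on_result = {      # identity -> vote on one search result (+1 up, -1 down)
    "alice": +1,
    "bob": +1,
    "spammer42": +1,
    "stranger": +1,      # unknown identities get zero weight by default
}

def personalised_score(votes, weights):
    return sum(weights.get(identity, 0.0) * vote for identity, vote in votes.items())

print(personalised_score(votes_on_result, my_weights))   # 0.5: the spammer's upvote cancels out
```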
Obviously this has privacy implications (all your votes and your network becomes public), and can generate a large dataset (performance challenge, how do you distribute it / give access to it?), so it's far from a trivial project. For the privacy angle, I'd start by keeping identities pseudonymous (e.g. a public key or random id -- you don't know who's behind the identity unless they blurt it out). Furthermore, I think it'd be useful to automagically split your actions across multiple identities so it's harder to link all your activity. I think the system should also explicitly allow switching identities, for privacy but also because sometimes you just want a different "filter bubble" which helps tailor the content you get to what you're looking for. Maybe the network that yields best shopping results isn't the same network that yields best cooking recipes or technical docs.
With this model, everyone is a moderator and everyone can defer moderation to identities they trust, but neither the hive mind nor individuals have the ultimate power to dictate what you see. If you want to read spam or conspiracy theories, you just switch to your identity which upvotes such content and has positive weights towards other identities with similar votes.
I doubt you're going to build this; I doubt people want this. I certainly want it. Maybe one day I'll try, but it probably won't work well without network effects (=reasonably large quantity of users). I just wanted to let you know about the idea because your project is inspiring and inspiring things inspire me to share ideas.. :)
This sounds like the sorting hat algorithm (tiktok) applied to query search engines. If there could be a way to visualize your recommendations network and switch to others without logging out, this could work really well. But a lot of research needs to be done, and the interest of big actors is to keep users blind inside their webs.
This topic is interesting to me because I'm building a faster search engine for programming queries and trying to solve the core issues that got us stuck with crappy engines.
I think there's something worth looking into here. Great insight about the "crap filter". What would you say is the most technically ambitious part about this project? I'd reckon that it is the resources to constantly pull updated content from all monitored pages at an acceptable frequency e.g. daily/hourly/minute-ly.
It looks interesting. However, the results appearing so fast as I type, and changing just as fast as I type more, makes it seem like it's flickering and it's painful on my eyes. Perhaps a slight delay and/or a fading effect as the results appear would be a bit easier for me to look at.
Update: there's been interest from a few people so I've started a Matrix chat here for anyone that wants to help out or provide feedback: https://matrix.to/#/#mwmbl:matrix.org
Congrats on the MVP path you took to launch your product. Generally, I think that there is a place for other variations of web search, be it in the way you crawl or perhaps how you monetize. I genuinely believe that it is really hard to build a general purpose search engine like DDG, Google and the like, but you can build a fairly good niche search engine. I'm particularly fond of the idea of community-powered curation in search. Just today I launched my own take on a community driven search engine - https://github.com/gkasev/chainguide. If you'd like to bounce ideas back and forth with somebody, I'll be very interested to talk to you.
Off-topic [0]: I would be very interested in an economic model that would work for such a search engine. Donations are fine, but (imho) it will take much more than that to keep the lights on, let alone expand...
The "fairest" solution for both sides I can think of is ads which no not send tracking information, and are shown primarily based on search terms and country, or even other parameters that the visitor has set explicitly. Any other ideas on how to finance such an engine so that incentives are aligned?
[0]: EDIT: off-topic because the page clearly states that this project will be financed with donations only.
The model my search uses is for the public search to essentially be a loss leader for the search as a service - site owners can pay a small fee to access extra features such as being able to configure what is indexed, trigger reindexing on demand, etc. It also heavily downranks pages with adverts, to try to eliminate the incentive for spamdexing.
Wikimedia has an estimated $157m in donations this year. If we could get a small fraction of this amount we should be able to build something pretty good.
I wish you luck, but I mean, I use Google and I haven’t seen a search ad for what, a decade (okay, less than a decade considering iOS)? Most people who don’t want to see search ads can pretty easily find an ad blocker.
I mean, it's not even remotely comparable. It's not like we have to look at paper search engines as an alternative to online search engines. The whole point is that the general public has no real reason to switch, never mind donate.
1. They somewhat get around this with their maps feature, but their regular search doesn't actually search by area; you always get national websites that optimize the best. That would be a nice feature to have starting out without having to type in the specific area you're looking for.
2. Search results for hotels that actually work! Not only if they're set up on OTA's! This could actually get your search engine some traction as the search engine to go to when making travel plans which would give you a nice niche to start out in.
If you filed to become a non-profit, could people "donate" their engineering time as a tax write-off? If you find out the legality of something like this and make it easy to do, that could inspire a lot of collaboration on the project, and I can see a bunch of other areas (outside of search) where services could be provided like this. I'm also sure having a non-profit would make it easier to find cheap hosting, which is a large part of the cost there.
Non-profit search engines are needed. It will probably still be vulnerable to SEO, but will more likely be resistant to becoming corrupted by the interests of "investors".
Congrats! Very nice to see results being lightning fast, I am getting 100-120ms response with network overhead included and that is impressive. The payload size of only 10-20kb helps immensely, good job!
I've built something similar called Teclis [1] and in my experience a new search engine should focus on a niche and try to be really, really good at it (I focused on non-commercial content for example).
The reason is to be able to narrow down the scope of content to crawl/index/rank and hopefully, with enough specialization, to be able to offer better results than Google for that niche. This could open doors to additional monetization paths, e.g. API access. Newscatcher [2] is an example of where this approach worked (they specialized in "news").
Okay, the cynical quip is "All search engines other than Google's are 'non-profit'." :-) But the reasons for that won't fit in the margin here.
Building search engines is cool and fun! They have what seems like an endless source of hard problems that have to be solved before they are even close to useful!
As a result people who start on this journey often end up crushed by the lack of successes between the start and the point where there is something useful. So if I may, allow me to suggest some alternatives which have all the fun of building a search engine and yet can get you to a useful place sooner.
Consider a 'spam' search engine. Which is to say a crawler that you work to train on finding spammy, useless web sites. Trust me when I say the current web is a "target rich environment" here. The purpose would be not so much to provide a complete search engine as to provide something like what the realtime blackhole lists did for email spam: come up with a list of URLs that could be easily checked against a modified DNS-type server (using the DNS protocol, but expressly for the purpose of answering the query 'Is this URI hosting spam?' in a rapid fashion).
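Not a design, just a hedged sketch of what such a lookup could look like, borrowing the convention email DNSBLs/URIBLs use (encode the domain as a subdomain of a blocklist zone; NXDOMAIN means "not listed"; the zone name below is made up):

```python
import socket

BLOCKLIST_ZONE = "spam-list.example.org"   # hypothetical blocklist zone

def is_listed_as_spam(domain):
    query = f"{domain}.{BLOCKLIST_ZONE}"
    try:
        socket.gethostbyname(query)        # any A record means "listed"
        return True
    except socket.gaierror:                # NXDOMAIN -> not on the list
        return False

print(is_listed_as_spam("totally-legit-pills.example"))
```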
There are two "go to market" strategies for such a site. One is a web browser plugin that would either pop up an interstitial page that said, "Don't go here, it is just spam" when someone clicked on a link. Or a monkey-script kind of thing which would add an indication to a displayed page that a link was spammy (like set the anchor display tag to blinking red or something). The second is to sell access to this service to web proxies, web filters, and Bing which could in the course of their operation simply ignore sites that appeared on your list as if they didn't exist.
You will know you are successful when you are approached by shady people trying to buy you out.
Another might be a "fact finding" search engine. This would be something like Wolfram Alpha but for "facts." There are lots of good AI problems here, one which develops a knowledge tree based on crawled and parsed data, and one which answers factual queries like 'capital of alaska' or 'recipe for baked alaska'. The nice things about facts is they are well protected against the claim of copyright infringement and so people really can't come after you for reproducing the fact that the speed of light is 300Mkps, even if they can prove you crawled their web site to get that fact.
I'd estimate that Google uses ~1000 TB of fast storage, Bing 500 TB and Yandex 100 TB, so the most basic useful search engine would use at least... 10 TB?
Actually, I doubt that this is a true statement rather than just something to discourage others. Check out these queries:
https://www.google.com/search?q=1 12B results
https://www.google.com/search?q=an 9B results
https://www.google.com/search?q=the 6B results
If we estimate that about half of all English pages contain 'the' or 'an' article we'll have about 15B English pages. If half of all pages contain the '1' then the total number of pages is about 24B. If half of all the pages are in English then the total number of all the pages is 30B. So even the maximum is less than the "hundreds". Similar numbers are at https://www.worldwidewebsize.com/
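Putting that back-of-the-envelope arithmetic in one place (the result counts are the ones quoted above; the factors of two are just guesses, not measurements):

```python
results = {"1": 12e9, "an": 9e9, "the": 6e9}     # reported Google result counts

english_pages = results["the"] + results["an"]   # ~15B, ignoring overlap
total_if_half_contain_1 = 2 * results["1"]       # ~24B pages overall
total_if_half_are_english = 2 * english_pages    # ~30B pages overall

print(f"~{english_pages / 1e9:.0f}B English pages")
print(f"~{total_if_half_contain_1 / 1e9:.0f}B to ~{total_if_half_are_english / 1e9:.0f}B pages in total")
```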
Ah. I thought that fast storage was a more specific type of storage that might be more expensive. I mean 1000 TB is expensive, but it’s feasible to get to that scale with the right funding.
See, that's the problem with engineers today. AltaVista ran on 3x300MHz 64-bit processors with 512MB of RAM back in 1995. Resources cost peanuts these days. It's just that we're so used to bloat and digital inflation that we can't even start considering unbloated implementations, as we perceive them as "not modern". Apparently we are also stuck in the centralized/ownership mentality. If you crowdsource search indexing and processing, it scales along with the number of users. Oh, and another bane of the internet is the need to monetize. FFS, if email were invented today, we would probably have to buy NFT poststamps.
I'm sorry what? Have you seen the rate at which data is being created today? I mean, if you want to index the size of the 1995 web with your raspberry pi, go ahead, but it costs insane amounts of money to index the December 2021 web and keep the index up-to-date.
edit / full disclosure: I work at google but nothing related to search
Is it really necessary to index EVERYTHING? It's true that we have much more data today than 26 years ago, but not all of these websites qualify or provide value (duplicate results, promotional content, outdated content).
The challenge then moves to the curation, but it's no longer infeasible.
I think the words "minimally viable" are being ignored here.
I think OP's point is: assume you only have 10/100 terabytes of space and limited compute ability - how would you approach the problem? I assume 90% of Google's searches probably come from less than 1% of their total index, not to mention that Google is also keeping full cached versions of whole websites, including images.
I just eyeballed my browser history from the last 2-3 days and I'd estimate 15% is current/latest news related, some 25% is programming related, 15% is e-commerce stuff, the rest random crap. I'd imagine 10-100 TB can easily serve all of _my_ search space (even the links I didn't look at on page 10) from the past few years, but that's the thing -- it's just my search space. How do you serve the rest of the world? I wish I knew the answer :)
Well, Google doesn't know the answer either. Results are complete trash when you try to find something niche or in a local language.
The challenge isn't to index the entire web, it's to index the useful parts of it, and I think an index covering most of the useful web can be seeded quite easily with some community effort.
You might think that. My search engine runs off <$5k worth of consumer hardware off domestic broadband. It survived the hacker news front page for a week, saw a sustained load of 8000 searches per hour for nearly a day.
It's got a fairly small index, but yeah, it's not particularly hardware-hungry.
Your project is fascinating, but 8000 queries/hr is just over 2 queries/sec. Even bursting up to 10x that (20 QPS), it doesn’t seem surprising that it can run on consumer hardware? Am I missing something?
Well, you've got to consider that a search engine isn't a blog where it just fetches a file off disk and is done with it. A trivial search query might as well be, but a non-trivial search query may have to search through dozens of megabytes of IDs to produce its response (given these are the 800k documents that contain 'search', which of those contain 'engine', and which of those contain 'algorithm').
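As a toy illustration of that work (the doc IDs are made up; real posting lists hold hundreds of thousands of entries, often read straight off a memory-mapped file):

```python
def intersect(a, b):
    """Merge-style intersection of two sorted doc-ID lists, O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

search    = [2, 5, 9, 14, 21, 30]     # doc IDs containing "search"
engine    = [5, 9, 13, 21, 40]        # doc IDs containing "engine"
algorithm = [9, 21, 99]               # doc IDs containing "algorithm"
print(intersect(intersect(search, engine), algorithm))   # [9, 21]
```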
I honestly don't know what the actual limit is, all I know is it dealt with 2 QPS without affecting response times. But 2 QPS for a search engine is actually kind of a lot. Most people don't actually search that much. Like you get a few queries per day. Put it this way: 2 QPS is what you'd expect if you had around a million regular users. That's not half bad for consumer hardware.
I run three separate indices at about 10-20mn documents each. But I'm fairly far off any sort of limit (ram and disk-wise I'm at maybe 40%).
I'm confident 100mn is doable with the current code, maybe 0.5bn if I did some additional space optimization. There are some low-hanging fruit that seem very promising. Sorted integers are highly compressible, and right now I'm not doing that at all.
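For anyone curious, the standard trick I mean looks roughly like this (a sketch, not my actual on-disk format): delta-encode the sorted doc IDs, then varint-encode the small gaps.

```python
def encode(sorted_doc_ids):
    out, prev = bytearray(), 0
    for doc_id in sorted_doc_ids:
        gap = doc_id - prev               # gaps are small because IDs are sorted
        prev = doc_id
        while gap >= 0x80:                # 7 bits per byte, high bit means "more"
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode(data):
    doc_ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap, shift = 0, 0
    return doc_ids

ids = [1000, 1003, 1020, 1021, 5000]
blob = encode(ids)
assert decode(blob) == ids
print(len(blob), "bytes instead of", 8 * len(ids))   # 7 bytes vs 40 for raw 64-bit IDs
```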
Yes doclist compression is a must. Higher intersection throughput and less bandwidth stress. Are you loading your doclists from persistent storage? What is your current max rps?
I'm loading the data off a memory mapped SSD, trivial questions will probably be answered entirely from memory, although the disk-read performance doesn't seem terrible either.
> What is your current max rps?
It depends on the complexity of the request, and repeated retrievals are cached, so I'm not even sure there is a good answer to this.
Is the size of the common Web already way too large to play catch-up against Google/Bing at this point?
My dream is a distributed/P2P index: each browser contributes to storing part of the overall index and handles queries coming from other users, so that how to fund huge data centers never becomes a question.
> Is the size of the common Web already way too large to play catch-up against Google/Bing at this point?
Probably. But I would prefer a search engine that didn't search the whole web. I would prefer a search engine that searched the sites related to fields that I'm interested in.
So I would pay for or donate to a search engine that provided me good results in e.g. software development. They could add additional fields as demand warrants, so long as quality is maintained. I would even like to see a faceting feature, so I could search for e.g. Matrix and get results on the mathematical concept when need be, without having to wade through movie reviews or fiddle with magic search keywords.
Not searching the whole Web makes sense, except that it isn't clear what belongs to your field of interest and what doesn't. Should indexing skip a blog page because the author usually writes about algorithms but here goes on and on about business while sporadically mentioning algorithmic technical aspects? And what if you want to search about pottery that Sunday morning - turn to Google?
I think segmenting search results by topic is a useful consumer query feature; I'm not sure segmenting what gets indexed would provide a useful service other than covering niches, and hence not really fulfilling the role of a web search engine.
I find the idea less ambitious, so maybe that's how an open search engine should approach the problem; a federation of hosts could cover the whole Web eventually.
> Should indexing skip a blog page because the author usually writes about algorithms but here goes on and on about business while sporadically mentioning algorithmic technical aspects?
Yes, because that author's pages wouldn't even be fetched at the point of development that we are discussing.
> And what if you want to search about pottery that Sunday morning - turn to Google?
Why not Google? What I find most beneficial in a search engine alternative is being able to fetch results on topics I'm less knowledgeable about.
I'm somewhat fine with very commercial search engines when searching for something within a field I'm familiar with, e.g. if I search for "kubernetes configmap best practice", skimming past infomercial sites, ads and junk is trivial.
if I search for "baby milk safety" I have no clue how to digest the results list, I'm totally unfamiliar with the media outlets catering to parenting, nutrition or food health, I know absolutely no authors in the field, no renowned media outlets for tips, not even brand reputations to make a somewhat informed decision on what to skim and what to read with attention.
But I don't disagree that an engine focusing on specific fields provides great value, and perhaps that's where it should start to have a chance to win.
I had problems with YaCy. Slow search and slow crawling. Regarding search I think it could be improved if instead of HTTP requests to other peers a more efficient protocol (UDP) could be used. Regarding crawling it's quite possible that doing this in Java and running it on a Rock64 may not be the best combination (: It started OOMing after some days.
How does Internet Archive verify dumps submitted by Archive Team and other groups? This may already be a solved problem.
Not knowing their implementation details, I’m guessing it could be doable without reinventing much. An oracle could dispatch a P2P archive job to a pool of clients randomly assigned tasks, with both the first to archive and the first to validate being recognized by the swarm somehow, with periodic re-archiving and re-verification, rate adjusted by popularity of site and of search keywords.
What do you do in order for your crawler to not accidentally veer into some naughty-naughty site and yield you a visit from your friendly FBI squad? That concern is why I decided to stay away from YaCy.
If I were to work on building a search engine from scratch, I would probably approach this from these directions:
(1) Investigate if running a DNS server will help me get a more robust picture of what websites exist.
(2) Investigate if supplying a custom browser would help me to leverage client PCs to do the crawling / processing for me.
(3) Investigate if there is any point in building a search engine with the data gathered in a non-profit way... Non-profits are not as sustainable as for-profit corporations.
Thank you for this, but I feel like you should make a profit and this is currently a missed opportunity to use web3 principles to do that.
Free and open software is a great ideal, but the reality is that people need money to live - and ads are the way to make that money on web2 platforms, which is why Google is in such a sad state. Why not do something similar to Brave? You can add tokenomics to the search engine and make money while keeping it 100% useable and open source.
How would web3 and "tokenomics" solve search? If the underlying problem is that spamdexers are given a profit incentive to game search engine results with low effort and low quality content, does it make a difference whether the profit incentive is via pages generating advertising revenue or pages generating cryptocurrency revenue?
I don't really want to address your questions; this has been talked to death. How do you think tokens could help? Maybe they'd create incentives for users to use the engine. Maybe they'd open a market for said tokens so the dev can extract currency out of it. Maybe tokens can be a meta-system on top of the search engine, so that the search functionality can be left to solving the search problem without interference.
Do you think there's an alternative to tokens in order to fund the project without degrading the search algorithm? If so, I'm all ears.
In fact, I think we all are. Please enlighten us. But you haven't proposed a solution while crypto devs have been working on one since 2008.
Any reason to prefer GPLv3 over AGPLv3? It might be useful to use the latter so that distributing it over a network requires distributing modifications as well.
> To be honest, at the moment the algorithm doesn't return any sensible results for anything
Is there an objective way to measure this? Do we just compare the output to Google or DDG? This seems like one of the many big hurdles in creating a competitive product in this space
The idea of building a distributed crawler that runs in users' browsers sounds fascinating!! Now that is better than the user burning power to mine bitcoins solving stupid puzzles.
I tried this back in 2006 - mozdex (only a Wikipedia article survives) - and it's not cheap. I was a fan of Lucene, which led to Nutch and eventually Hadoop. So: lots of servers running HDFS doing MapReduce jobs to compile and update indexes. No one in the end cared about open search… DuckDuckGo seems to do alright under the guise of security, but most non-major search engines are just meta-searches these days because economies of scale are highly disadvantageous to any upcoming search engine - and people just don't search like they used to, either.
I was spending $2,500 a month just on indexers, and had query traffic taken off, my costs would have shot through the roof, since you want query nodes to all be in an in-memory cache and that was expensive back then. Today I would have used some modern in-memory distributed doc DBs instead of query masters with heavy block buffer caches. I learned a lot but lost my shirt :)
Why would you need that much data? The average website has maybe 10kB worth of textual information without compression. To get tens of thousands of terabytes of data, you'd need to index on the order of 10^12 websites. That seems a bit much.
GuideStar is the veteran in this space. I agree that doing this based on web scrapes and robots.txt is probably going to make it pretty tough to get quality results. GuideStar always sells their product on the premise that they're vetting the financial statements of non-profits for the best results. The real money might be in figuring out a way to scale reading and classifying non-profit financials - then see if you can quality-control using a set of patterns.
Or perhaps a for-profit company running this search engine and not garnering profit from it by directly changing search results, or having a different model for profit that does not affect search rankings.