Hey, great project - the more competition in this space the better. To be honest, at the moment the algorithm doesn't return any sensible results for anything (at least that I can find), but I hope you can find a way past this, as it's a great space to have a project in.
I've included some search terms below that I've tried - I've not cherrypicked these and believe they are indicative of current performance. Some of these might be down to the size of the index - however I suspect it's actually how the search is being parsed/ranked (in particular, I think the top two examples show that).
> Search "best car brands"
Expected: Car Reviews
Returns a page showing the best mobile phone brands.
then...
> Then searching "Best Mobile Phone"
Expected: The article from the search above.
Returns a gizmodo page showing the best apps to buy... "App Deals: Discounted iOS iPhone, iPad, Android, Windows Phone Apps"
> Searching "What is a test?"
Expected result: Some page describing what a test is, maybe wikipedia?
Returns "Test could confirm if Brad Pitt does suffer from face blindness"
> Searching "Duck Duck Go"
Expected result: DDG.com
Returns "There be dragons? Why net neutrality groups won't go to Congress"
> Searching "Google"
Expected result: Google.com
Returns: An article from the independent, "Google has just created the world’s bluest jeans"
I guess that's the real problem. People like to wonder what would be the "ideal world" in a search engine. It may be wishful thinking, I don't know.
It seems really hard to produce quality search results. Takes a lot of investment. Makes it an expensive product. But no one wants to pay. So selling ads is the only way forward.
Maybe there's a way to convince people to pay what it takes? I dunno...
I would gladly pay $5 a month for a Google-quality search service that doesn’t track me. I’ve been using Duck Duck Go for most of this year, but frequently find myself falling back to !g because Google’s results really are much better.
I wonder how much money Google search makes from the average user. Is it more than $5/mo?
It is a bit above your price point, but I have been using Kagi.com (not affiliated, just impressed). They're in beta, but will charge ~$10 once they go GA. Like you, I tried DuckDuckGo for a while, but resorted to !g so often that I started using it for everything out of habit.
In contrast, Kagi provides Google-quality results most of the time, better-than-Google semi-often, and worse-than-Google rarely. They support !g, but I only use it a couple of times a week, usually for site-specific searches.
Additionally, I really like that I am their customer and not their product - incentives are aligned for them to continue respecting my privacy and preferences.
One only uses Google services if one absolutely needs to. Google, on the other hand, never needs you. You are absolutely unnecessary and super easy to replace. Whichever you choose, the next search engine really needs our queries. If we give them enough, they might be able to create a competitive product. If they do, Google will dramatically improve. I'm sure they have plenty of ideas; the incentive is just not there.
The problem with this approach is that the group most likely to pay to remove ads is also the same group that is most attractive to advertisers. So your pool of users to whom you serve ads is less attractive overall. That means you need to figure out not just how much you make serving those users ads, but also how much you lose by removing them from the ad viewer pool.
Likewise, I'd pay that as well (I'd like no ads as well, but tracking is the main issue for me).
Similarly, I do have DDG as my main search on all machines and devices just out of principle, but its region-aware searching (I'm in NZ, and often only want NZ results) is very close to useless in my experience (with NZ as the region ticked, it will still return results from .ca and .co.uk domains, which I would have hoped would be almost trivial to remove), and Google seems much better in this area (but not perfect).
Similarly, there's often technical/programming things I'll search for that DDG doesn't have indexed at all, and Google does.
Google also seems a lot better at ignoring spelling differences (color/colour, favorite/favourite) than DDG, which is often (but not always!) useful.
While search funds most of Google, most employees at Google don't work on search. A small proportion of Google's revenue would be plenty for good quality search.
I've been enjoying Neeva a lot - not affiliated at all, just a happy user :-)
I think they are $4.95/mo or something? I haven't paid a cent yet since they offer a few discounts to prompt you to learn how to use it (I really liked that, and it definitely made me more likely to stick with it!)
$5 per month is sadly out of budget for many third world citizens because that $5 can be urgently used elsewhere to plug a need. I think somewhere in the range of $0.10-$0.50 might be doable though.
If you have specific areas of interest that are unpopular enough, your local YaCy instance can index those [say] 100 or 10 000 websites and the results will blow you away. Google is still useful of course, but it's a joke by comparison.
Say for laughs you are only interested in yourself. You put every page by you and about you in the crawler. It will obviously render fantastic results. Using Google you would have to start every query with your full name OR user names, and whatever you type behind it doesn't even matter - it won't return pages with all keywords. With YaCy you just type the query and it will return EVERYTHING. To compare the two would be to compare useless with perfection.
The first thing I do to test a search engine is to search for my own username on various public sites to see if it can find me. It didn’t find me. But keep it up and I’m sure I’ll be in there eventually (or maybe I overestimate how interesting I am, hehe).
I get that this is your usual testing case for search engines, but if you'd read their README you'd have seen it's inappropriate for the project at the current stage.
I was curious and tried a bunch of other searches, with similarly disappointing results. My searches were a bit more esoteric than Closi's.
"langlands program" (pure mathematics thing): yup, top result is indeed related to the Langlands program, though it isn't obviously what anyone would want as their first result for that search. Not bad.
"asmodeus" (evil spirit in one of the deuterocanonical books of the Bible, features extensively in later demonology, name used for an evil god in Dungeons & Dragons, etc.): completely blank page, no results, no "sorry, we have no results" message, nothing. Not good.
"clerihew" (a kind of comic biographical short poem popular in the late 19th / early 20th century): completely blank page. Not good.
"marlon brando" (Hollywood actor): first few results are at least related to the actor -- good! -- but I'd have expected to see something like his Wikipedia or IMDB page near the top, rather than the tangentially related things I actually god.
"b minor mass" (one of J S Bach's major compositions): nothing to do with Bach anywhere in the results; putting quotation marks around the search string doesn't help.
"top quark" (fundamental particle): results -- of which there were only 7 -- do seem to be about particle physics, and in some cases about the top quark, but as with Marlon Brando they're not exactly the results one would expect.
"ferrucio busoni" (composer and pianist): blank page.
"dry brine goose" (a thing one might be interested in doing at this time of year): five results, none relevant; top two were about Untitled Goose Game.
"alphazero" (game-playing AI made by Google): blank page. Putting a space in results in lots of results related to the word "alpha", none of which has anything to do with AlphaZero.
OK, let's try some more mainstream things.
"harry potter": blank page. Wat. Tried again; did give some results this time. They are indeed relevant to Harry Potter, though the unexpected first-place hit is Eric Raymond's rave review of Eliezer Yudkowsky's "Harry Potter and the Methods of Rationality", which I am fairly sure is not what Google gives as its first result for "harry potter" :-).
"iphone 12" (confession: I couldn't remember what the current generation was, and actually this is last year's): top results are all iPhone-related, but first one is about the iPhone 6, second is from 2007, this is about the iPhone 6, fourth is from 2007, fifth is about the iPhone 4S, etc.
"pfizer vaccine": does give fairly relevant-looking results, yay.
Thanks for the detailed feedback! I think most of the problems here are because we have a really small index right now. Increasing the number of documents is our top priority. I agree that some kind of feedback when there are no results would be a good idea.
I actually think it’s probably the algorithm too - if I take one of the search items returned from a search that I know is in the index, but then search for it with slightly different terminology (or a different tense / pluralisation), the same item doesn’t come up.
> We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.
Is there a concern that volunteers could manipulate results through their crawler?
You already mentioned distributed search engines have their own set of issues. I'm wondering if a simple centralised non-profit fund a la Wikipedia could work better to fund crawling without these concerns. One anecdote: personally I would not install a crawler extension, not because I don't want to help, but because my internet connection is pitifully slow. I'd rather donate a small sum that would go way further in a datacenter... although I realise the broader community might be the other way around.
[edit]
Unless the crawler was clever enough to merely feed off the sites I'm already visiting and use minimal upload bandwidth. The only concern then would be privacy - oh, the irony - but trust goes a long way.
There could be legal issues if the crawler starts to crawl into regions which should be left alone. I don't mean things like the dark web, but for example if someone is a subscriber to an online magazine it could start crawling paywalled content if 3rd party cookies enable this.
Google already seems to crawl paywalled content somehow, this doesn't seem to be much of a legal issue since you cannot click through - it's just annoying as a user.
This might even be intentional, via robots.txt ... A browser extension that passively crawls visited sites could easily download robots.txt as its single extra, but minimal, download requirement.
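A minimal sketch of that check using Python's standard library (the user-agent name is made up, and this isn't the project's crawler - just the general shape of a passive extension deciding whether a visited page may be submitted):

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical user agent name; the submit logic itself is out of scope here.
USER_AGENT = "mwmbl-volunteer-crawler"

def allowed_to_submit(page_url: str) -> bool:
    parts = urlsplit(page_url)
    robots = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()                 # the single extra, but minimal, request
    except OSError:
        return False                  # be conservative if robots.txt is unreachable
    return robots.can_fetch(USER_AGENT, page_url)

if __name__ == "__main__":
    print(allowed_to_submit("https://en.wikipedia.org/wiki/Web_crawler"))
```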
Pretty sure most paywalled sites explicitly allow the googlebot to enter. If you spoof your UserAgent to be that of the googlebot they check your IP address to make sure you really are Google.
The new fly in the crawler ointment is Cloudflare: If you're not the googlebot and you hit a Cloudflare customer you need to be running javascript so they can verify you're not a bot. It's a continual arms race.
Google has instructions for paywall sites to allow crawling. I suppose it brings them traffic when users click on a search result and arrive at the sign up page.
The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.
That little thought experiment is true for many online services, from social networking to (marginally) publishing. But nowhere is it more true than for search results, which differ in two fundamental ways: being text-only, they don't bother me to anywhere near the degree of other ads. And, second, they are an order of magnitude more valuable than drive-by display ads, because people have indicated a need and a willingness to visit a website that isn't among their bookmarks. These two, combined, make this the worst possible case for replacing an ad-based business with a donation model.
The idea mentioned in this readme that "Google intentionally degrades search results to make you also view the second page" is also wrong, bordering on self-delusion. The typical answer to conspiracy theories works here: there are tens of thousands of people at Google. Such self-sabotage would be obvious to many people on the inside, far too many to keep something like this secret.
I would consider Google randomly excluding the most relevant words from my search query intentionally degrading results. It's incredibly frustrating. This shouldn't be the default behavior, maybe an optional link the user can click to try again with some of the terms excluded.
Yes, I know verbatim mode exists, but I always forget to enable it, and the setting eventually gets lost when my cookies are cleared or something.
Unfortunately I can't switch to another search engine because in my experience every other search engine has far inferior results, despite not having the annoying behaviors Google does. DuckDuckGo is only useful for !bangs for me.
> The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.
Is this relevant for non-profit project? Do you pay $30/year for Wikipedia?
DuckDuckGo is profitable despite not blanketing the first page with ads (just like Google once upon a time); you can also have no ads at all if you like. Do they make money in other ways? Sure, but not in a way that degrades the user experience.
Do you have a source for this figure? Maybe it's mostly true for the "average non-tech-savvy user in an English-speaking country", but I've found DuckDuckGo and everything other than Google inferior in many cases, especially when looking for Hungarian content.
Sorry - the "average non-tech-savvy user in an English-speaking country" is more or less what I meant. Although this would have been true for Google also in the early days.
1. What is the rationale behind choosing Python as an implementation language? Performance and efficiency are paramount in keeping operational costs low and ensuring a good user experience, even if the search engine will be used by many users. I guess Python is not the best choice for this, compared to C, Rust or Java.
2. What is the rationale behind implementing a search engine from scratch versus using existing Open Source search engine libraries like Apache Lucene, Apache Solr and Apache Nutch (crawler)?
Is keeping performance in mind and choosing the tech stack accordingly really a premature optimization?
This might be the most abused phrase in CS history. Perhaps we should add "Premature optimization fallacy" to the list of cognitive errors programmers use as an excuse to not seriously think about performance.
> Is keeping performance in mind and choosing the tech stack accordingly really a premature optimization? This might be the most abused phrase in CS history.
And it's often misquoted, and/or taken out of context. The full quote goes like this: "The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."
Back then - this was the early 1970s, remember - CPU cycles were costly.
But like all theories, it had its time in the sun. We have the same "problem" today, just in a different form. We call it "agile" now, though: make sure that the customer is happy before the programmer is happy. If the programmer is allowed to spend too much time trying to become happy, the customer is either gone, or someone else has come up with a better solution.
Regarding your specific "is keeping performance in mind and choosing the tech stack accordingly really a premature optimization" question, and keeping in mind OP's endeavour, you're on the right track. But the real question is how programmers _get there_.
By experience.
And in turn, by relating to clients more directly these days, programmers have to adhere to constraints that used to be kept away from them back in the day. And your business isn't worth shit without customers, even if you have the best programmers who can create the best code from day one.
Hence Knuth's quote, translated: "if you spend so much time planning your journey that you miss your flight, you are getting nowhere."
Agreed, this was my thinking - and since I'm better at Python, it's faster for me to get stuff done. I would like to rewrite it in Rust though, all help from Rustaceans gladly accepted!
In general, speed isn't the problem with search (at least the retrieval aspect), but memory efficiency is. Things like low per-object overhead and the ability to memory-map large data ranges are extremely beneficial in a language if you want to implement a search index.
But I agree, get it working first, then re-implement it in another language if it turns out to be necessary.
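To illustrate the memory-mapping point, here's a minimal sketch (the file layout is invented, not the project's index format): a term's postings stored as sorted 64-bit doc IDs in a flat file, binary-searched through mmap so a lookup only touches the pages it needs.

```python
import mmap
import struct

DOC_ID = struct.Struct("<Q")   # one unsigned 64-bit integer per posting

def write_postings(path, doc_ids):
    with open(path, "wb") as f:
        for doc_id in sorted(doc_ids):
            f.write(DOC_ID.pack(doc_id))

def contains(mm, doc_id):
    """Binary search over the memory-mapped posting list."""
    lo, hi = 0, len(mm) // DOC_ID.size
    while lo < hi:
        mid = (lo + hi) // 2
        value, = DOC_ID.unpack_from(mm, mid * DOC_ID.size)
        if value == doc_id:
            return True
        if value < doc_id:
            lo = mid + 1
        else:
            hi = mid
    return False

if __name__ == "__main__":
    write_postings("postings.bin", [3, 17, 42, 99, 1024])
    with open("postings.bin", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(contains(mm, 42), contains(mm, 43))   # True False
```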
"Ecosia is a search engine based in Berlin, Germany. It donates 80% of its profits to nonprofit organizations that focus on reforestation" [1]
"80% of profits will be distributed among charities and non-profit organizations. The remaining 20% will be put aside for a rainy day." [2]
"Ekoru.org is a search engine dedicated to saving the planet. The company donates 60% of revenue generated from clicks on sponsored search results to partner organizations who work on climate change issues" [3]
While Ecosia is not technically a non-profit, no one can sell Ecosia shares at a profit.
> Ecosia says that it was built on the premise that profits wouldn’t be taken out of the company. In 2018 this commitment was made legally binding when the company sold a 1% share to The Purpose Foundation, entering into a ‘steward-ownership’ relationship.
> The Purpose Foundation's steward-ownership of Ecosia legally binds Ecosia in the following ways:
- Shares can't be sold at a profit or owned by people outside of the company and
- No profits can be taken out of the company.
Also https://searchmysite.net/ for personal and independent websites (essentially a loss-leader for its open source self-hostable search as a service).
In the early days of Web 2.0, it was very in for things to be spelled unpronounceably. For the life of me I can only remember twttr (early Twitter), but I wanna say Spotify also had an unreadable name in the early days.
I have this feeling that most of the time I "search" for something I already know what I'm looking for, but Google, via Firefox's omnibox, is just the fastest way to get there, even though it's a bit indirect. Are they getting paid for that, or am I costing them money in the short term, but they get to build up a profile on me to provide more effective ads later?
I wonder if it's possible to take advantage of that type of search by putting a facade in front of the "search engine" and, based on the search term and the private local user history, either going direct to a known site or, if it seems a search is needed, going to a specific search engine. This may open up opportunities for, say, programming-language-specific search engines, searches specific to a program's error messages, or shopping-for-X sites.
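Something like this hypothetical sketch of the facade (the history entries, routing rules and URLs are all invented for illustration):

```python
from urllib.parse import quote_plus

# Invented examples: things my history says I navigate to directly.
LOCAL_HISTORY = {
    "hacker news": "https://news.ycombinator.com/",
    "python docs": "https://docs.python.org/3/",
}

# Invented topic hints: queries that look like error messages go to a
# programming-specific search instead of a general engine.
TOPIC_RULES = [
    ("traceback", "https://stackoverflow.com/search?q={q}"),
    ("error", "https://stackoverflow.com/search?q={q}"),
]

def route(query: str) -> str:
    q = query.strip().lower()
    if q in LOCAL_HISTORY:                         # navigational: skip search entirely
        return LOCAL_HISTORY[q]
    for keyword, template in TOPIC_RULES:          # topic-specific engine
        if keyword in q:
            return template.format(q=quote_plus(query))
    return "https://duckduckgo.com/?q=" + quote_plus(query)   # general fallback

print(route("hacker news"))
print(route("KeyError traceback when parsing json"))
```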
I bookmark every site I might possibly want to revisit - make a habit of Ctrl+D. They're totally unsorted, but the key is to wipe the regular history on exit, leaving only the bookmarks as source material for completion. That way I can type something in the url bar and get completion to interesting sites. The url bar (or omnibox) matches on page title as well as the actual address, so it's easy, and always faster than a search engine.
If you set DuckDuckGo as your default search provider, you can use bangs in the omnibox.
Also, you can toggle between local-area and global search. See https://duckduckgo.com/bang e.g. !yt !osm !gi
Most wikis or resource/documentation sites have a local search bar on their homepage, and Firefox has a feature that lets you add a search keyword for that specific site. So if you add, say, pydocs as a keyword for docs.python.org, you can type "@pydocs <query>" and it looks up the query on that site.
This is a business model I've been thinking about: what if users earned credits for running a crawler on their machine? In other words, as much as I hate crypto scams, a "tokenized" search engine where the "mining" power was put to good use, i.e. crawling and indexing.
Well they are in a sense: you can just do the task yourself. It’s expensive of course, so you can use methods applied to human labelling for ML: _periodically_ injecting tasks with known results and checking how trustworthy the party is, vending the task to multiple parties and aggregating results, blocking parties that make many “mistakes”, etc.
You build a system based on trust but verify.
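A minimal sketch of the "inject tasks with known answers" part (the tasks, thresholds and reward numbers are all made up for illustration):

```python
import random
from collections import defaultdict

GOLD_TASKS = {  # URLs whose correct content hash the coordinator already knows
    "https://example.com/known-page": "hash-abc123",
}

reputation = defaultdict(lambda: 1.0)      # every worker starts with neutral trust

def pick_task(real_tasks):
    """Occasionally hand out a gold task instead of a real one."""
    if real_tasks and random.random() > 0.1:
        return real_tasks.pop()
    return random.choice(list(GOLD_TASKS))

def record_result(worker, task, result):
    """Update trust on gold tasks; return whether the worker is still trusted."""
    if task in GOLD_TASKS:
        if result == GOLD_TASKS[task]:
            reputation[worker] = min(reputation[worker] + 0.05, 2.0)
        else:
            reputation[worker] *= 0.5      # wrong answer on a known task: halve trust
    return reputation[worker] > 0.25       # below threshold: stop using this worker

print(record_result("worker-1", "https://example.com/known-page", "hash-wrong"))
```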
If the output is the result of a known deterministic program on a known input, anyone can verify it.
So they just have to sign their work. If someone later finds that they have lied and provided a false result, they lose reputation/staked coins.
The second associated problem is how one would prevent them from appropriating the work of others by just re-signing it.
One way would be to allow the worker to introduce a few deliberate errors, but have a secret joker that allows them to pass the challenge of a failed verification.
An alternative way is based on data malleability. The worker picks a secret one-way function and computes F(data + secretFunction(data, epsilon)) ≈ F(data), then publishes the values of secretFunction(data, epsilon) but not the secretFunction itself. Only someone with knowledge of the secretFunction can make a claim on the work done. If there is a challenge, only the real worker will be able to publish the secret of the secretFunction (or use some zero-knowledge proof to convince you they know it).
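A simpler flavour of the same goal - only the holder of a secret can later claim the work - can be sketched with a keyed commitment. This is illustrative only, not the exact malleability scheme above:

```python
import hashlib
import hmac
import os

# The worker publishes the snapshot plus a tag; only whoever holds the secret
# key can later prove the tag (and hence the work) is theirs.

def commit(result, secret_key):
    return hmac.new(secret_key, result, hashlib.sha256).digest()

def prove(result, tag, revealed_key):
    return hmac.compare_digest(commit(result, revealed_key), tag)

secret = os.urandom(32)                       # kept private until challenged
snapshot = b"<html>snapshot of example.com</html>"
tag = commit(snapshot, secret)                # published alongside the snapshot

# During a dispute the worker reveals the key (or gives a zero-knowledge proof
# of knowing it), and anyone can check the tag really belongs to them.
assert prove(snapshot, tag, secret)
print("claim verified")
```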
That's why you don't index the webpage but a snapshot of it.
For example you index the Common Crawl archives, or some content-addressable storage like IPFS or a torrent file.
I think crawling and indexing should be treated differently.
Indexing is about extracting value from the data, while crawling is about gathering data.
Once a reference snapshot has been crawled, the indexing task is more easily verifiable.
The crawling task is harder to verify, because external websites could lie to the crawler. So the sensible thing to do is have multiple people crawl the same site and compare their results. Every crawler will publish its snapshots (which may or may not contain errors), and then it's the job of the indexer to combine the snapshots from the various crawlers, filter the errors out and do the de-duplication.
The crawling task is less necessary now than it was a few years ago, because there is already plenty of available data. Also, most of the valuable data is locked in walled gardens, and companies like Cloudflare make crawling difficult for the rest of the fat tail. So it's better to only have data submitted to you, and outsource the crawling.
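A toy sketch of the "multiple crawlers, compare snapshots" step (crawler IDs, quorum and page bodies are invented for illustration):

```python
import hashlib
from collections import Counter

def content_hash(body):
    return hashlib.sha256(body).hexdigest()

def pick_canonical(snapshots, quorum=2):
    """snapshots maps crawler id -> the page body that crawler reported."""
    counts = Counter(content_hash(body) for body in snapshots.values())
    best_hash, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None                       # no agreement yet; schedule a re-crawl
    for body in snapshots.values():
        if content_hash(body) == best_hash:
            return body

reports = {
    "crawler-a": b"<html>the real page</html>",
    "crawler-b": b"<html>the real page</html>",
    "crawler-c": b"<html>spam injected by a liar</html>",
}
print(pick_canonical(reports))            # the honest majority's snapshot wins
```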
One idea I have been kicking around is the idea of federating the indices, not the crawling.
If every contributor maintained their own index, then you could reward contributors based on how many hits their index generated.
This would open up the possibility of people maintaining indices for specialized topics that they were experts in, and give the federated search engine a shot at taking on Google.
You don’t want to reward people for quantity, but quality. The cost of creating a new webpage is effectively zero, so if you attach an incentive for creating them you are doomed.
If the miner chose what to crawl, they would be able to deliver any crap, so content deliveries would have to be judged in some way and rewarded differently.
If the pool provides addresses to crawl, the miner could be given crafted/dedicated URIs from time to time and lack of delivery of Proof Of Crawl could result in a penalty chosen in a way rendering "cheating" unprofitable. But then fresh URIs have to come from somewhere.
There would have to be some aspect of centralized moderation, I suppose. This is beyond my knowledge: Is there a way to accept output only from signed binaries, so that we assume if X cycles of work were performed by a signed binary, then it is legitimate output?
No, not really. There may be some theoretical way using ZK-SNARKs or encrypted enclaves (Intel SGX), but it's not practical. It also probably doesn't work because the oracle/enclave still needs input (raw crawl data / network HTTP bytes) which has to be trusted. One way could be for the project to make an unbreakable mining/crawling chip/box/OS/anti-cheat layer with a VM and supply it to everyone, each with different levels of breakability and complexity to build.
One way is to send the crawl_task to N random nodes and accept the result that is most similar across them?
Another way could be to build a messy network to solve a messy problem: build a reputation-based graph network where you accept index data from nodes you trust, so people will start unfollowing misbehaving nodes. There is no universal root view of the network; instead it's dynamic and different from the perspective of each node.
Or it could have one root view, if we store reputation data in a blockchain and use some type of quadratic voting to modify the chain?
Yeah, Bitcoin showed us a way to build a mathematically secure system without any trusted party, but it could do that because the problem it was solving is mathematically provable. For a problem like collecting and indexing crawl data, you have to trust somebody.
YaCy is decentralized, but without the credit system. Some tokens, like QBUX, have tried to develop decentralized hosting infrastructure.
I also have been wondering how this would play out with some kind of decentralized indexes. The nodes could automatically cluster with other nodes of users sharing the same interests, using some notion of distances between query distributions. The caching and crawling tasks could then be distributed between neighbors.
A big part of the problem I see with decentralized search is that you basically need to traverse the index in orthogonal axes to assemble search results. First you need to search word-wise in order to get result candidates, then sort them rank-wise to get relevant results (this also hinges upon an agreed-upon ranking of domains). That's a damn hard nut to crack for a distributed system.
Crawling is also not as resource consuming as you might think. Sure you can distribute it, but there isn't a huge benefit to this.
Actually I have an idea for you: I think you can use cryptography to prove that an SSL session really happened. So you could prove indexing of HTTPS sites.
Make it open source and syndicate it. The goal is to get people to contribute both resources and code. Think of Shopify as the model, where many people contribute to create a huge shopping place. People care about their own shop only, but ultimately they create a useful shopping area.
Also set up a foundation to guide its development and be able to hire a management team.
The real challenge is not the code development but setting up an organization that will outlast all the challenges that will appear. Wikipedia is the model to follow.
Do you yearn for explainability due to getting irrelevant search results? Is what you're searching for more specialized than what the public might consider general knowledge?
It’s really fast - nice job! Can you elaborate on the ranking algorithm you are using? It seems that this will become more important as you index more pages.
Thanks! A really simple one for now: number of matching terms, and then prioritising matches earlier in the result string. But this is something I'm looking forward to working on properly when I get a bigger index.
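In toy form, that looks something like this (a simplified sketch, not the actual code - tokenisation and the tie-break are made up for illustration):

```python
import re

def score(query, text):
    """Rank by number of matching terms, then by how early the first match is."""
    terms = re.findall(r"\w+", query.lower())
    text_lower = text.lower()
    matches = [term for term in terms if term in text_lower]
    earliness = 0.0
    if matches:
        first = min(text_lower.index(term) for term in matches)
        earliness = 1.0 / (1.0 + first)   # earlier first match -> higher score
    return (len(matches), earliness)

results = [
    "Best mobile phone brands of 2021",
    "Best car brands ranked by reliability",
]
print(sorted(results, key=lambda r: score("best car brands", r), reverse=True))
```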
I also want to incorporate a community aspect to ranking, allowing upvoting and downvoting of results. I've not yet figured out how to reconcile this idea with not having any tracking though. Perhaps a separate interface for logged-in users.
One ambitious project I've thought about over and over again over the years is search (and social sites / forums) where the votes, tags, and flags make a public dataset and users can manipulate their own weights (or even the ranking algorithm) to construct a "web of trust" that yields favorable results.
This way you can escape spammers, powertripping moderators, and the tyranny of the hive mind; it doesn't matter if there's a large population of spammers, shills, and idiots upvoting crap because you set their weights to zero (or negative). In fact, that becomes a feature, because by upvoting crap, they generate a crap filter for you. If the weights are also public, then you can automatically & algorithmically seed your web of trust (simplest algo for sake of example: give positive weight to identities who upvoted and downvoted the same way you did) but you could still override the algo with manually set values if it gave too much weight to bad actors.
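A toy sketch of that weighting (the identities, votes and weights are invented for illustration):

```python
my_weights = {           # how much I trust each pseudonymous identity
    "alice": 1.0,
    "bob": 0.5,
    "spammer42": -1.0,   # negative weight: their upvotes count against a result
}

votes_on_result = {      # identity -> vote on one search result (+1 up, -1 down)
    "alice": +1,
    "bob": +1,
    "spammer42": +1,
    "stranger": +1,      # unknown identities get zero weight by default
}

def personalised_score(votes, weights):
    return sum(weights.get(identity, 0.0) * vote for identity, vote in votes.items())

print(personalised_score(votes_on_result, my_weights))   # 0.5: the spammer's upvote cancels out
```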
Obviously this has privacy implications (all your votes and your network becomes public), and can generate a large dataset (performance challenge, how do you distribute it / give access to it?), so it's far from a trivial project. For the privacy angle, I'd start by keeping identities pseudonymous (e.g. a public key or random id -- you don't know who's behind the identity unless they blurt it out). Furthermore, I think it'd be useful to automagically split your actions across multiple identities so it's harder to link all your activity. I think the system should also explicitly allow switching identities, for privacy but also because sometimes you just want a different "filter bubble" which helps tailor the content you get to what you're looking for. Maybe the network that yields best shopping results isn't the same network that yields best cooking recipes or technical docs.
With this model, everyone is a moderator and everyone can defer moderation to identities they trust, but neither the hive mind nor individuals have the ultimate power to dictate what you see. If you want to read spam or conspiracy theories, you just switch to your identity which upvotes such content and has positive weights towards other identities with similar votes.
I doubt you're going to build this; I doubt people want this. I certainly want it. Maybe one day I'll try, but it probably won't work well without network effects (=reasonably large quantity of users). I just wanted to let you know about the idea because your project is inspiring and inspiring things inspire me to share ideas.. :)
This sounds like the sorting hat algorithm (tiktok) applied to query search engines. If there could be a way to visualize your recommendations network and switch to others without logging out, this could work really well. But a lot of research needs to be done, and the interest of big actors is to keep users blind inside their webs.
This topic is interesting to me because I'm building a faster search engine for programming queries and trying to solve the core issues that got us stuck with crappy engines.
I think there's something worth looking into here. Great insight about the "crap filter". What would you say is the most technically ambitious part about this project? I'd reckon that it is the resources to constantly pull updated content from all monitored pages at an acceptable frequency e.g. daily/hourly/minute-ly.
It looks interesting. However, the results appearing so fast as I type, and changing just as fast as I type more, makes it seem like it's flickering and it's painful on my eyes. Perhaps a slight delay and/or a fading effect as the results appear would be a bit easier for me to look at.
Update: there's been interest from a few people so I've started a Matrix chat here for anyone that wants to help out or provide feedback: https://matrix.to/#/#mwmbl:matrix.org
Congrats on the MVP path you took to launch your product. Generally, I think that there is a place for other variations of web search, be it in the way you crawl or perhaps how you monetize. I genuinely believe that it is really hard to build a general purpose search engine like DDG, Google and the like, but you can build a fairly good niche search engine. I'm particularly fond of the idea of community-powered curation in search. Just today I launched my own take on a community driven search engine - https://github.com/gkasev/chainguide. If you'd like to bounce ideas back and forth with somebody, I'll be very interested to talk to you.
Off-topic [0]: I would be very interested in an economic model that would work for such a search engine. Donations are fine, but (imho) it will take much more than that to keep the lights on, let alone expand...
The "fairest" solution for both sides I can think of is ads which no not send tracking information, and are shown primarily based on search terms and country, or even other parameters that the visitor has set explicitly. Any other ideas on how to finance such an engine so that incentives are aligned?
[0]: EDIT: off-topic because the page clearly states that this project will be financed with donations only.
The model my search uses is for the public search to essentially be a loss leader for the search as a service - site owners can pay a small fee to access extra features such as being able to configure what is indexed, trigger reindexing on demand, etc. It also heavily downranks pages with adverts, to try to eliminate the incentive for spamdexing.
Wikimedia has an estimated $157m in donations this year. If we could get a small fraction of this amount we should be able to build something pretty good.
I wish you luck, but I mean, I use Google and I haven’t seen a search ad for what, a decade (okay, less than a decade considering iOS)? Most people who don’t want to see search ads can pretty easily find an ad blocker.
I mean, it's not even remotely comparable. It's not like we have to look at paper search engines as an alternative to online search engines. The whole point is that the general public has no real reason to switch, never mind donate.
1. They somewhat get around this with their maps feature, but their regular search doesn't actually search by area; you always get national websites that optimize the best. That would be a nice feature to have starting out without having to type in the specific area you're looking for.
2. Search results for hotels that actually work! Not only if they're set up on OTA's! This could actually get your search engine some traction as the search engine to go to when making travel plans which would give you a nice niche to start out in.
If you filed to become a non-profit, could people "donate" their engineering time as a tax write-off? If you find out the legality of something like this and make it easy to do, that could inspire a lot of collaboration on the project, and I can see a bunch of other areas (outside of search) where services could be provided like this. I'm also sure having a non-profit would make it easier to find cheap hosting, which is a large part of the cost there.
Non-profit search engines are needed. It will probably still be vulnerable to SEO, but will more likely be resistant to becoming corrupted by the interests of "investors".
Congrats! Very nice to see results being lightning fast, I am getting 100-120ms response with network overhead included and that is impressive. The payload size of only 10-20kb helps immensely, good job!
I've built something similar called Teclis [1] and in my experience a new search engine should focus on a niche and try to be really, really good at it (I focused on non-commercial content for example).
The reason is to be able to narrow down the scope of content to crawl/index/rank and hopefully, with enough specialization, to be able to offer better results than Google for that niche. This could open doors to additional monetization paths, e.g. API access. Newscatcher [2] is an example of where this approach worked (they specialized in "news").
Okay, the cynical quip is "All search engines other than Google's are 'non-profit'." :-) But the reasons for that won't fit in the margin here.
Building search engines is cool and fun! They have what seems like an endless source of hard problems that have to be solved before they are even close to useful!
As a result people who start on this journey often end up crushed by the lack of successes between the start and the point where there is something useful. So if I may, allow me to suggest some alternatives which have all the fun of building a search engine and yet can get you to a useful place sooner.
Consider a 'spam' search engine. Which is to say a crawler that you work to train on finding spammy, useless web sites. Trust me when I say the current web is a "target rich environment" here. The purpose would be not so much to provide a complete search engine as to provide something like what the realtime blackhole lists did for email spam: come up with a list of URLs that could be easily checked against a modified DNS-type server (using the DNS protocol, but expressly for the purpose of answering the query 'Is this URI hosting spam?' in a rapid fashion).
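Not a design, just a hedged sketch of what such a lookup could look like, borrowing the convention email DNSBLs/URIBLs use (encode the domain as a subdomain of a blocklist zone; NXDOMAIN means "not listed"; the zone name below is made up):

```python
import socket

BLOCKLIST_ZONE = "spam-list.example.org"   # hypothetical blocklist zone

def is_listed_as_spam(domain):
    query = f"{domain}.{BLOCKLIST_ZONE}"
    try:
        socket.gethostbyname(query)        # any A record means "listed"
        return True
    except socket.gaierror:                # NXDOMAIN -> not on the list
        return False

print(is_listed_as_spam("totally-legit-pills.example"))
```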
There are two "go to market" strategies for such a site. One is a web browser plugin that would either pop up an interstitial page that said, "Don't go here, it is just spam" when someone clicked on a link. Or a monkey-script kind of thing which would add an indication to a displayed page that a link was spammy (like set the anchor display tag to blinking red or something). The second is to sell access to this service to web proxies, web filters, and Bing which could in the course of their operation simply ignore sites that appeared on your list as if they didn't exist.
You will know you are successful when you are approached by shady people trying to buy you out.
Another might be a "fact finding" search engine. This would be something like Wolfram Alpha but for "facts." There are lots of good AI problems here, one which develops a knowledge tree based on crawled and parsed data, and one which answers factual queries like 'capital of alaska' or 'recipe for baked alaska'. The nice things about facts is they are well protected against the claim of copyright infringement and so people really can't come after you for reproducing the fact that the speed of light is 300Mkps, even if they can prove you crawled their web site to get that fact.
I'd estimate that Google uses ~1000 TB of fast storage, Bing 500 TB and Yandex 100 TB, so the most basic useful search engine would use at least... 10 TB?
Actually, I doubt that this is a true statement rather than just something to discourage others. Check out these queries:
https://www.google.com/search?q=1 12B results
https://www.google.com/search?q=an 9B results
https://www.google.com/search?q=the 6B results
If we estimate that about half of all English pages contain 'the' or 'an' article we'll have about 15B English pages. If half of all pages contain the '1' then the total number of pages is about 24B. If half of all the pages are in English then the total number of all the pages is 30B. So even the maximum is less than the "hundreds". Similar numbers are at https://www.worldwidewebsize.com/
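Putting that back-of-the-envelope arithmetic in one place (the result counts are the ones quoted above; the factors of two are just guesses, not measurements):

```python
results = {"1": 12e9, "an": 9e9, "the": 6e9}     # reported Google result counts

english_pages = results["the"] + results["an"]   # ~15B, ignoring overlap
total_if_half_contain_1 = 2 * results["1"]       # ~24B pages overall
total_if_half_are_english = 2 * english_pages    # ~30B pages overall

print(f"~{english_pages / 1e9:.0f}B English pages")
print(f"~{total_if_half_contain_1 / 1e9:.0f}B to ~{total_if_half_are_english / 1e9:.0f}B pages in total")
```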
Ah. I thought that fast storage was a more specific type of storage that might be more expensive. I mean 1000 TB is expensive, but it’s feasible to get to that scale with the right funding.
See, that's the problem with engineers today. AltaVista ran on 3x300MHz 64-bit processors with 512MB of RAM back in 1995. Resources cost peanuts these days. It's just that we're so used to bloat and digital inflation that we can't even start considering unbloated implementations, as we perceive them as "not modern". Apparently we are also stuck in the centralized/ownership mentality. If you crowdsource search indexing and processing, it scales along with the number of users. Oh, and another bane of the internet is the need to monetize. FFS, if email were invented today, we would probably have to buy NFT poststamps.
I'm sorry what? Have you seen the rate at which data is being created today? I mean, if you want to index the size of the 1995 web with your raspberry pi, go ahead, but it costs insane amounts of money to index the December 2021 web and keep the index up-to-date.
edit / full disclosure: I work at google but nothing related to search
Is it really necessary to index EVERYTHING? It's true that we have much more data today than 26 years ago, but not all of these websites qualify or provide value (duplicate results, promotional content, outdated content).
The challenge then moves to the curation, but it's no longer infeasible.
I think the words "minimally viable" are being ignored here.
I think OP's point is: assume you only have 10/100 terabytes of space and limited compute ability - how would you approach the problem? I assume 90% of Google's searches probably come from less than 1% of their total index, not to mention that Google is also keeping full cached versions of whole websites, including images.
I just eyeballed my browser history from the last 2-3 days and I'd estimate 15% is current/latest news related, some 25% is programming related, 15% is e-commerce stuff, the rest random crap. I'd imagine 10-100 TB can easily serve all of _my_ search space (even the links I didn't look at on page 10) from the past few years, but that's the thing -- it's just my search space. How do you serve the rest of the world? I wish I knew the answer :)
Well, Google doesn't know the answer either. Results are complete trash when you try to find something niche or in a local language.
The challenge isn't to index the entire web, it's to index the useful parts of it, and I think an index covering most of the useful web can be seeded quite easily with some community effort.
You might think that. My search engine runs off <$5k worth of consumer hardware off domestic broadband. It survived the hacker news front page for a week, saw a sustained load of 8000 searches per hour for nearly a day.
It's got a fairly small index, but yeah, it's not particularly hardware-hungry.
Your project is fascinating, but 8000 queries/hr is just over 2 queries/sec. Even bursting up to 10x that (20 QPS), it doesn’t seem surprising that it can run on consumer hardware? Am I missing something?
Well, you've got to consider that a search engine isn't a blog where it just fetches a file off disk and is done with it. A trivial search query might as well be, but a non-trivial search query may have to search through dozens of megabytes of IDs to produce its response (given these are the 800k documents that contain 'search', which of those contain 'engine', and which of those contain 'algorithm').
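As a toy illustration of that work (the doc IDs are made up; real posting lists hold hundreds of thousands of entries, often read straight off a memory-mapped file):

```python
def intersect(a, b):
    """Merge-style intersection of two sorted doc-ID lists, O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

search    = [2, 5, 9, 14, 21, 30]     # doc IDs containing "search"
engine    = [5, 9, 13, 21, 40]        # doc IDs containing "engine"
algorithm = [9, 21, 99]               # doc IDs containing "algorithm"
print(intersect(intersect(search, engine), algorithm))   # [9, 21]
```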
I honestly don't know what the actual limit is, all I know is it dealt with 2 QPS without affecting response times. But 2 QPS for a search engine is actually kind of a lot. Most people don't actually search that much. Like you get a few queries per day. Put it this way: 2 QPS is what you'd expect if you had around a million regular users. That's not half bad for consumer hardware.
I run three separate indices at about 10-20mn documents each. But I'm fairly far off any sort of limit (ram and disk-wise I'm at maybe 40%).
I'm confident 100mn is doable with the current code, maybe 0.5bn if I did some additional space optimization. There are some low-hanging fruit that seem very promising. Sorted integers are highly compressible, and right now I'm not doing that at all.
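For anyone curious, the standard trick I mean looks roughly like this (a sketch, not my actual on-disk format): delta-encode the sorted doc IDs, then varint-encode the small gaps.

```python
def encode(sorted_doc_ids):
    out, prev = bytearray(), 0
    for doc_id in sorted_doc_ids:
        gap = doc_id - prev               # gaps are small because IDs are sorted
        prev = doc_id
        while gap >= 0x80:                # 7 bits per byte, high bit means "more"
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode(data):
    doc_ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap, shift = 0, 0
    return doc_ids

ids = [1000, 1003, 1020, 1021, 5000]
blob = encode(ids)
assert decode(blob) == ids
print(len(blob), "bytes instead of", 8 * len(ids))   # 7 bytes vs 40 for raw 64-bit IDs
```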
Yes doclist compression is a must. Higher intersection throughput and less bandwidth stress. Are you loading your doclists from persistent storage? What is your current max rps?
I'm loading the data off a memory mapped SSD, trivial questions will probably be answered entirely from memory, although the disk-read performance doesn't seem terrible either.
> What is your current max rps?
It depends on the complexity of the request, and repeated retrievals are cached, so I'm not even sure there is a good answer to this.
Is the size of the common Web already way too large to play catch-up against Google/Bing at this point?
My dream is a distributed/P2P index: each browser contributes to storing part of the overall index and handles queries coming from other users, so that how to fund huge data centers never becomes a question.
> Is the size of the common Web already way too large to play catch-up against Google/Bing at this point?
Probably. But I would prefer a search engine that didn't search the whole web. I would prefer a search engine that searched the sites related to fields that I'm interested in.
So I would pay for or donate to a search engine that provided me good results in e.g. software development. They could add additional fields as demand warrants, so long as quality is maintained. I would even like to see a faceting feature, so I could search for e.g. Matrix and get results on the mathematical concept when need be, without having to wade through movie reviews or fiddle with magic search keywords.
Not searching the whole Web makes sense, except that it isn't clear what belongs to your field of interest and what doesn't. Should indexing skip a blog page because the author usually writes about algorithms but here goes on and on about business while sporadically mentioning algorithmic technical aspects? And what if you want to search about pottery that Sunday morning - turn to Google?
I think segmenting search results by topic is a useful consumer query feature; I'm not sure segmenting what gets indexed would provide a useful service other than covering niches, and hence not really fulfilling the role of a web search engine.
I find the idea less ambitious, so maybe that's how an open search engine should approach the problem; a federation of hosts could cover the whole Web eventually.
> Should indexing skip a blog page because the author usually writes about algorithms but here goes on and on about business while sporadically mentioning algorithmic technical aspects?
Yes, because that author's pages wouldn't even be fetched at the point of development that we are discussing.
> And what if you want to search about pottery that Sunday morning - turn to Google?
Why not Google? What I find most beneficial in a search engine alternative is being able to fetch results on topics I'm less knowledgeable about.
I'm somewhat fine with very commercial search engines when searching for something within a field I'm familiar with, e.g. if I search for "kubernetes configmap best practice", skimming past infomercial sites, ads and junk is trivial.
if I search for "baby milk safety" I have no clue how to digest the results list, I'm totally unfamiliar with the media outlets catering to parenting, nutrition or food health, I know absolutely no authors in the field, no renowned media outlets for tips, not even brand reputations to make a somewhat informed decision on what to skim and what to read with attention.
But I don't disagree that an engine focusing on specific fields provides great value, and perhaps that's where it should start to have a chance to win.
I had problems with YaCy. Slow search and slow crawling. Regarding search I think it could be improved if instead of HTTP requests to other peers a more efficient protocol (UDP) could be used. Regarding crawling it's quite possible that doing this in Java and running it on a Rock64 may not be the best combination (: It started OOMing after some days.
How does Internet Archive verify dumps submitted by Archive Team and other groups? This may already be a solved problem.
Not knowing their implementation details, I’m guessing it could be doable without reinventing much. An oracle could dispatch a P2P archive job to a pool of clients randomly assigned tasks, with both the first to archive and the first to validate being recognized by the swarm somehow, with periodic re-archiving and re-verification, rate adjusted by popularity of site and of search keywords.
What do you do in order for your crawler to not accidentally veer into some naughty-naughty site and yield you a visit from your friendly FBI squad? That concern is why I decided to stay away from YaCy.
If I were to work on building a search engine from scratch, I would probably approach this from these directions:
(1) Investigate if running a DNS server will help me get a more robust picture of what websites exist.
(2) Investigate if supplying a custom browser would help me to leverage client PCs to do the crawling / processing for me.
(3) Investigate if there is any point in building a search engine with the data gathered in a non-profit way... Non-profits are not as sustainable as for-profit corporations.
Thank you for this, but I feel like you should make a profit and this is currently a missed opportunity to use web3 principles to do that.
Free and open software is a great ideal, but the reality is that people need money to live - and ads are the way to make that money on web2 platforms, which is why Google is in such a sad state. Why not do something similar to Brave? You can add tokenomics to the search engine and make money while keeping it 100% useable and open source.
How would web3 and "tokenomics" solve search? If the underlying problem is that spamdexers are given a profit incentive to game search engine results with low effort and low quality content, does it make a difference whether the profit incentive is via pages generating advertising revenue or pages generating cryptocurrency revenue?
I don't really want to address your questions; this has been talked to death. How do you think tokens could help? Maybe they'd create incentives for users to use the engine. Maybe they'd open a market for said tokens so the dev can extract currency out of it. Maybe tokens can be a meta-system on top of the search engine, so that the search functionality can be left to solving the search problem without interference.
Do you think there's an alternative to tokens in order to fund the project without degrading the search algorithm? If so, I'm all ears.
In fact, I think we all are. Please enlighten us. But you haven't proposed a solution while crypto devs have been working on one since 2008.
Any reason to prefer GPLv3 over AGPLv3? It might be useful to use the latter so that distributing it over a network requires distributing modifications as well.
> To be honest, at the moment the algorithm doesn't return any sensible results for anything
Is there an objective way to measure this? Do we just compare the output to Google or DDG? This seems like one of the many big hurdles in creating a competitive product in this space
The idea of building a distributed crawler that runs in users' browsers sounds fascinating!! Now that is better than the user burning power to mine bitcoins solving stupid puzzles.
I tried this back in 2006 - mozdex (only a Wikipedia article survives) - and it's not cheap. I was a fan of Lucene, which led to Nutch and eventually Hadoop. So: lots of servers running HDFS doing MapReduce jobs to compile and update indexes. No one in the end cared about open search… DuckDuckGo seems to do alright under the guise of security, but most non-major search engines are just meta-searches these days because economies of scale are highly disadvantageous to any upcoming search engine - and people just don't search like they used to, either.
I was spending $2,500 a month just on indexers, and had query traffic taken off, my costs would have shot through the roof, since you want query nodes to all be in an in-memory cache and that was expensive back then. Today I would have used some modern in-memory distributed doc DBs instead of query masters with heavy block buffer caches. I learned a lot but lost my shirt :)
Why would you need that much data? The average website has maybe 10kB worth of textual information without compression. To get tens of thousands of terabytes of data, you'd need to index on the order of 10^12 websites. That seems a bit much.
GuideStar is the veteran in this space. I agree that doing this based on web scrapes and robots.txt is probably going to make it pretty tough to get quality results. GuideStar always sells their product on the premise that they're vetting the financial statements of non-profits for the best results. The real money might be in figuring out a way to scale reading and classifying non-profit financials - then see if you can quality-control using a set of patterns.
Or perhaps a for-profit company running this search engine and not garnering profit from it by directly changing search results, or having a different model for profit that does not affect search rankings.