They want to know how the algorithms work, not the data itself.



"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."

-- Linus Torvalds


To be fair, Linus never had to deal with hundreds of petabytes of search data, nor with ML black boxes.


Google doesn't know what the algorithm is anymore. The whole site is a black box.


Same at FB as far as I could tell while I was there. "The algorithm" is a misnomer, popularized by the press but really kind of silly. There are really thousands of pipelines and models developed by different people running on different slices of the data available. Some are reasonably transparent. Others, based on training, are utterly opaque to humans. Then the weights are all combined to yield what users see. And it all changes every day if not every hour. Even if it could all be explained in a useful way, that explanation would be out of date as soon as it was received.

I'm not saying that to defend anyone BTW. This complexity and opacity (which is transitive in the sense that a combined result including even one opaque part itself becomes opaque) is very much the problem. What I'm saying is that it's likely impossible for the companies to comply without making fundamental changes ... which might well be the intent, but if that's the case it should be more explicit.
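
To make the combination step concrete, here's a toy sketch of how scores from independently built pipelines might get folded into the one number that decides what a user sees. Every model, name, and weight here is invented for illustration; it's nothing like FB's real code:

    # Toy sketch of the "thousands of pipelines, weights combined" idea.
    # Every model, name, and weight is invented for illustration.

    def engagement_model(item):      # stand-in for an opaque trained model
        return item["past_clicks"] / (item["impressions"] + 1)

    def freshness_heuristic(item):   # a transparent hand-written rule
        return 1.0 / (1.0 + item["age_hours"])

    def integrity_model(item):       # another opaque model; higher = riskier
        return item["report_rate"]

    # The final score is a weighted sum of whatever the pipelines emit.
    # The weights themselves are retuned constantly, so even a perfect
    # explanation of this step would go stale almost immediately.
    PIPELINES = [(engagement_model, 0.6), (freshness_heuristic, 0.3),
                 (integrity_model, -0.4)]

    def feed_score(item):
        return sum(weight * model(item) for model, weight in PIPELINES)

    item = {"past_clicks": 40, "impressions": 1000, "age_hours": 3.0,
            "report_rate": 0.01}
    print(feed_score(item))   # the one number that decides placement

Note that once even one of those component models is opaque, the combined score is too, which is the transitivity problem above.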


What needs to be shared is the high-level architecture, not the nuts and bolts.

At a broad level:

What are the input sources (IP address, clicks on other websites, etc.) used to feed the model?

What is the overall system optimized for: some combination of engagement, view time, etc.? Just listing them, ideally in order of preference, is good enough.

Alternatively: what does human management measure and monitor as the business metrics of success?

I want to know which behaviors are used (not necessarily how), and what the feed is trying to optimize for: more engagement, more view time, etc.

This is not adversarial; knowing it helps us modify our own behavior to make the model work better for us. Something like the sketch below would be plenty.
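
A purely hypothetical example of the kind of disclosure I mean; every field name and value here is invented, not anything YouTube or Google actually publishes:

    # Hypothetical disclosure: input signals and ranked objectives,
    # no implementation details. All names are invented.
    FEED_DISCLOSURE = {
        "input_signals": [
            "ip_address",
            "watch_history",
            "search_history",
            "clicks_on_partner_sites",
        ],
        # What the feed is optimized for, in order of preference.
        "objectives_by_priority": [
            "view_time",
            "engagement",          # likes, comments, shares
            "session_return_rate",
        ],
        # What human management tracks as business success.
        "business_metrics": ["daily_active_users", "ad_revenue_per_user"],
    }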

Users already have some sense of this and work around it blindly. For example, YouTube puts heavy emphasis on recent views and searches, so I (and I'm sure others) will use a signed-out session to watch content way outside my interest area so my feed isn't polluted with poor recommendations. I may have watched thousands of hours of educational content, but Google would still think some how-to video I watched once means I only want to see that kind of content.

Sure, Google knows it's me even when I'm signed out, but they don't use that to change my feed. That's the important part, and knowing it helps me improve my own user experience.


> Google doesn't know what the algorithm is anymore

You are an insider?


They haven't talked in much detail since Matt Cutts left, but over time they did sort of outline the basics: the core ranking is still some evolution of PageRank, weighting the scoring of page attributes/metadata and flowing it down through inbound links as well, then altered via various waves of ML, like Vince (authority/brand power), Panda (content quality), Penguin (inbound link quality), and many others that targeted other attributes (page layout, ad placement, etc.).

Even if some of that is off, the premise of a chain of processors, some ML and some not, means they probably can't tell you exactly why anything ranks where it does.
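
For reference, the PageRank core they started from is simple enough to sketch. A minimal power-iteration version (the textbook algorithm only, nothing like Google's production code):

    # Minimal PageRank via power iteration. Textbook version only.
    def pagerank(links, damping=0.85, iters=50):
        """links: dict mapping page -> list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new_rank = {p: (1 - damping) / n for p in pages}
            for p, outs in links.items():
                if not outs:                    # dangling page: spread evenly
                    for q in pages:
                        new_rank[q] += damping * rank[p] / n
                else:
                    share = damping * rank[p] / len(outs)
                    for q in outs:              # rank flows through links
                        new_rank[q] += share
            rank = new_rank
        return rank

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))

Everything described above (Vince, Panda, Penguin, and the rest) is layered on top of and around that core, which is where the opacity comes from.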


It's clear the public and lawmakers like the idea of knowing how the algorithm works, but what you posted is about as deep as people can reasonably understand at a high level. I don't think they realize how complex a system can become when it's been built over 20 years and is a trillion-dollar company's raison d'être.


Those sound like awesome potential features. Allow users to assign 0-100% weights to each of those scoring adjustments during search, and show them the calculations (if you can).


Supposedly there are thousands of different features that are scored; those are just the rolled-up categories that needed their own separate ML pipeline step.

For example, maybe a feature is "this site has a favicon.ico that is unique and not used elsewhere" (page quality). Or "this page has ads, but they are below the fold" (page layout). Or "this site has > X inbound links from a hand-curated list of 'legitimate branded sites'" (page/site authority).

Google then picks a starting weight for each of these things and has human reviewers score the quality of the results, the order of ranking, etc., based on a Google-written how-to-score document. It then tweaks the weights, re-runs the ML pipeline, and has the humans score again, in an iterative loop until the results seem good.
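
In miniature, that loop might look something like this. The features are the made-up ones from above, and the "reviewer" is a stand-in for the humans scoring against the guidelines; purely illustrative, not Google's system:

    import random

    # Miniature, made-up version of the tune/score/re-tune loop.
    FEATURES = ["unique_favicon", "ads_below_fold", "branded_inbound_links"]

    def score_page(page, weights):
        return sum(weights[f] * page[f] for f in FEATURES)

    def reviewer_rating(ranking):
        # Stand-in for humans scoring results against the quality
        # guidelines; here, just reward rankings that put the page
        # with the most branded inbound links first.
        return ranking[0]["branded_inbound_links"]

    pages = [{f: random.random() for f in FEATURES} for _ in range(20)]
    weights = {f: 1.0 for f in FEATURES}          # starting weights

    best = None
    for _ in range(100):                          # iterative loop
        trial = {f: w + random.gauss(0, 0.1) for f, w in weights.items()}
        ranking = sorted(pages, key=lambda p: score_page(p, trial),
                         reverse=True)
        rating = reviewer_rating(ranking)
        if best is None or rating > best:         # keep tweaks the
            best, weights = rating, trial         # reviewers liked
    print(weights)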

There's a never-acted-on FTC report[1] that describes how they used this system to rank their competition (comparison shopping sites) lower in the search results.

[1] http://graphics.wsj.com/google-ftc-report/

Edit: Note that a lot of detail is missing here, like topic relevance, where a site may rank well for some niche category it specializes in but wouldn't necessarily rank well for a completely different topic, even with good content, since it has no established signals that it should.


> and those are just the rolled-up categories that needed their own separate ML pipeline step.

AKA ensemble models.


I doubt it; they should know what the various algorithms are, especially the most important ones that drive most of the ranking. But disclosing them would put their competitive advantage on the line.


Data is already an algorithm



