Missing data hinder replication of artificial intelligence studies (sciencemag.org)
134 points by jonbaer on Feb 16, 2018 | 38 comments



Hell, I did an analysis on "Causality in Crypto Markets":

https://blog.projectpiglet.com/2018/01/causality-in-cryptoma...

I provided the code and the data (since I knew no one would believe me without evidence). Why do we believe researchers, who have far more incentive to publish than I do? Their jobs require publication, or at least are greatly improved by it, so they have every incentive to be less than honest about their "experiments". Most of the "A.I. revolution" isn't about algorithms; it's about having more data, and the right data. That's why Google and Facebook can out-compete everyone else in research as much as they do: they have all the data.

Personally, I've almost made it a life mission to try to replicate every study I can. After watching people in labs (private and public biology labs alike) just straight up make up data, I never believe anything.

I also wrote some stuff on backtesting (which in industry is viewed as the gold standard, but IMO is deeply prone to flaws): https://blog.projectpiglet.com/2018/01/perils-of-backtesting...
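
To make the flaw concrete, here's a purely hypothetical sketch (not taken from that post, just assuming numpy) of the classic look-ahead bug: the "signal" is computed from the same bar it's traded on, so the backtest looks great even on pure noise, while the properly lagged version has no edge.

    # Hypothetical sketch of look-ahead bias, one of the backtesting flaws mentioned above.
    import numpy as np

    rng = np.random.default_rng(0)
    returns = rng.normal(0, 0.01, size=5000)    # i.i.d. daily returns: no real edge exists

    signal = np.sign(returns)                   # "predicts" direction from the same bar: look-ahead
    lagged = np.roll(signal, 1)                 # what you could actually have known at trade time
    lagged[0] = 0

    print("PnL with look-ahead:", np.sum(signal * returns))  # large and positive, by construction
    print("PnL when lagged:    ", np.sum(lagged * returns))  # roughly zero, as it should be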

Having worked in industry and worked in academia... I don't believe 90% of the research I hear.


It's gotten to the point where the first thing I do when reading a paper is skim for the training data.

If it's not from a public, standard, replicable data source, my bullshit meter hops up a couple of steps.

"We improved on the current SotA by 15%, but you can't see the data or the code - trust us, it's for real."


A similar line of thinking to the "always provide the source code behind your paper" position.


In fact, I'd go so far as to say that data is source code in the case of machine learning; it just goes through a kind of compilation step.


Careful now, don't tempt the LISPers.


You might enjoy this site if you don’t already: http://retractionwatch.com

(No affiliation)


I'm honestly amazed at how some papers (especially in medicine) magically train deep nets on 1,000 or 2,000 cases and get absurdly high AUCs in the 90s.


I imagine getting samples/training sets that don't have some hefty source of bias is really tough in medicine, and that probably leads to overfitting.
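
One hypothetical way those numbers appear even when there's no signal at all: doing feature selection on the full dataset before cross-validating. A minimal sketch, assuming scikit-learn and purely synthetic noise (no real study implied):

    # Hypothetical sketch: selecting features with the labels of the *whole* dataset,
    # then cross-validating, leaks information and inflates AUC even on pure noise.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10000))      # 200 "patients", 10k random features, zero true signal
    y = rng.integers(0, 2, size=200)

    # Wrong: pick the "best" features using all labels, then cross-validate on them.
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5, scoring="roc_auc").mean()

    # Right: keep feature selection inside the cross-validation folds.
    pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    honest = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

    print("leaky AUC:", leaky)    # well above chance, despite the data being noise
    print("honest AUC:", honest)  # close to 0.5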


Yes, I have come to much prefer a blog post and/or a GitHub repo to any "published" paper. At this point it's almost like these journals are actively impeding progress in every way possible.


Ditto. Data can be manipulated, and so can results. Small changes demanded by a PM can have a huge impact on results, and the input data and the way it's handled are going to have a huge impact on any ML result.


There should be a major hit to the reputation of anyone publishing fake data. I am a bit surprised studies aren't independently reproduced immediately.


I considered starting a non-profit around basically that: you could donate to reproduce research, and we could then provide grants to reproduce it. Kind of like a bounty system.

Even registered the domain: http://researchdonor.com/

I know a few people with terminal illnesses who would donate large sums of money for reproducible research.


I hinted at this issue here on HN a few days ago:

This is why about 90% of the papers Google publishes are useless.

They describe deep learning successes, with datasets that aren't public, on hardware that isn't public, with software that isn't public.

They describe technologies, be it spanner, borg, flumejava or others, relying on closed hard- and software.

If an average university professor can't replicate the research 1:1 based solely on the paper and existing, open technology, the paper should not be accepted by any journal. In fact, it shouldn't even be accepted at conferences, or be able to get a DOI. That's not worthy of being called science.

And it's problematic when a large amount of AI research happens at Google, Facebook and co., and none of it can be replicated. Science requires replication; if AI research can't be replicated, it's not science.


The problem is that when BigCo does AI research, they are not doing it for the greater good; they are trying to find a competitive edge. Unfortunately there are no good incentives (besides doing the "right" thing) for companies to open their datasets to the public, so until there are, we can expect more of the same...


>they are not doing it for the greater good, they are trying to find a competitive edge

That's just fine. They don't really need to publish anything, but if they do want to publish, it should be done in a way that other people can reproduce their findings.


This isn't fair to researchers at companies. While the goal of the organization may be to make money, the individual researchers who publish the papers are very likely doing it for the greater good.


Yeah, I don't see that. If you're an employee of a company (not least one that's paying you a very fat salary), there is an expectation that you will do work that benefits the company's bottom line. If that happens to be aligned with the "greater good", then OK, but there is no such guarantee, and if it isn't... well, then your research will only benefit the company.


Of course your work benefits the bottom line, but being able to publish is often a perk that companies have to give to be able to attract and keep people who care about those things. It's actually incredibly insulting that you would assume you know the motivations of individual employees.

The world isn't as black and white as you'd like it to be.


If journals/conferences were to refuse non-replicable results, companies that wish to offer this perk would have to make a sufficient part of their datasets public in order to do so. This would be a good thing.


Fair enough, so don’t publish them. If you can’t replicate results, it really isn’t science.


The other problem is that academics simply may not have the computing resources to replicate anything Google does anyway, even if everything were open.


I agree. I'd even go as far as to say that not publishing non-replicable results is better for research than publishing them. These papers are noise that gets pointed to for marketing purposes. A higher signal-to-noise ratio will lead to faster research, because people won't have to sift through ten papers to find one good one.


If that's true, then you shouldn't read any journals that don't publish data and methods.

You've got to vote with your feet (or in this case, attention).


I was shocked in CS graduate school the first time I downloaded the code referred to by a very well-respected paper in the field, and found that it did not even compile. By the 20th time, I was less shocked, but I had lost a lot of respect for CS academia.

The correct standard should be, if you claim you wrote a program that does X, the program should be available along with simple instructions that explain how to run it. The reviewers should then follow the instructions and verify that it actually works, and returns the results described in the paper! This basic process would sadly invalidate most computer science research.


The vast majority of such papers don't really involve a claim that they wrote a program that does X. They probably did write such a program, but that's tangential to the paper and not particularly relevant; the fact that they did isn't one of the claims in the paper, and it's not something they're trying to prove or show in the conclusions. There are some papers in the style of "Software package for doing X", but those are a minority and generally limited to major releases of widely used open-source tools for that problem; most CS papers are not about software, just like most astronomy papers are not about telescopes.

E.g. the paper is about some new method of approaching X. In general, the paper could be valid even without any implementation whatsoever, but you'll likely supplement the method description with some experimental evaluation. The paper is not about the experimental evaluation or the "experimental apparatus", i.e. the code they used; it is making claims about the method as such, not about any single particular implementation of it, including their own. As part of the main claim "this method seems good and interesting" they're providing some evidence: "we tried, and in the conditions described here, applying method X was 10% better than method Y". But the actual code used to do that is just supplementary material; the code is a tool they used in the research, not the result of the research they published. The code is the "telescope" they used to make an observation, and the paper is about the observation, not about the telescope.

It's just as with medicine - we generally accept papers that say "well, we tried this procedure on 100 patients and 73 of them got better" without requiring video evidence of those 73 patients actually getting better; in the same manner, we accept CS papers that say "well, we tried this procedure on 100 datapoints and 73 of them got correct results" and don't require the reviewers to reproduce the experiments and verify if the experimental results aren't falsified; just as reviewers don't try to reproduce experiments in pretty much any other science.


The difference is that when a medical paper claims

> well, we tried this procedure on 100 patients

"this procedure" is the most important part, and they describe it in detail in the paper, hopefully well enough that someone could attempt to replicate it, and some do attempt to replicate it. (Not that procedure descriptions in such papers are always sufficient for this.)

The difference between that and

> we tried this procedure on 100 datapoints

is that it's nigh impossible to describe an ML procedure in enough detail to reproduce it from the paper alone. Tiny changes in the parameters and construction can completely change the result; the only way to reproduce it is to have the source code. And also the source data, which is just as important as the source code, if not more so (see the sibling thread).

The opportunity that academic CS has over every other science is that it could empower every reader with the capability to verify the results of every paper they read, and this is actually attainable. Reviewers in other sciences don't reproduce findings themselves for purely practical reasons that don't need to exist in CS.
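
The sensitivity point is easy to see on a toy problem; a hypothetical sketch assuming scikit-learn (no particular paper implied): the same architecture on the same data, with only the random seed changed, already gives a visible spread of test accuracies.

    # Hypothetical sketch: identical model, identical data, only the random seed differs.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, n_features=40, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    scores = []
    for seed in range(10):
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))

    print("test accuracy across seeds: min", min(scores), "max", max(scores))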


Trying out a procedure on one dataset amounts to a single data point. You need to run it on a bunch of datasets to establish the performance of the procedure under study.

I thought ML folks would have the statistical background to know you cannot infer a general claim from a single occurrence.
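
A hypothetical sketch of what that looks like in practice, assuming scikit-learn and its bundled toy datasets: run the same procedure on several datasets and report the spread, not a single number.

    # Hypothetical sketch: one dataset gives one number; several give a distribution.
    import numpy as np
    from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    scores = []
    for load in (load_iris, load_wine, load_digits, load_breast_cancer):
        X, y = load(return_X_y=True)
        scores.append(cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean())

    print("per-dataset accuracy:", np.round(scores, 3))
    print("mean / std across datasets:", np.mean(scores), np.std(scores))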


Hmm, it probably depends on your subfield of computer science. These AI papers almost always claim they wrote a program that does X, and so did the stuff I was working on in grad school, but more theoretical work is probably closer to what you describe.


You're completely wrong. So are most people in the comment section.

Research papers should NOT release code. Ever.

Researchers don't care, and should not care, about your setup, your IDE, your OS, your dependencies, your processor architecture, etc. None of that matters unless it's the central point of the paper. You're literally asking researchers to write coding tutorials or something. It's simply comedic.

Most machine learning and CS research revolves around Methods, Architectures, Processes and other abstract approaches to solving a problem. Their algorithms are always reproducible with any implementation you like.

Sure, the training data should be supplied, because in AI/learning it's part of the algorithm, but even when it isn't, you can try to reproduce the relative changes with another, similar dataset. If their point is that one training method works better than another, the result should still be reproducible with similar datasets.

I get where you're coming from; after all, most people on this site are programmers. But research is not about publishing code that works, it's about publishing ideas that work.


The big issue is the low quality of machine learning / applied CS literature in general. Pull a paper and try implementing its algorithm(s). You're almost guaranteed to encounter barriers, ranging from mistakes, omissions, and ambiguities in the paper to outright fabrications.

The entire field has a reproducibility problem, which implies that as an academic community we're just churning out pubs and not actually doing science.

I'm helping form Nature's foray into ML (https://www.nature.com/natmachintell/), and we're strongly considering enforcing a requirement that FOSS source code be published in a public venue (like github), and requiring that at least some benchmarks are established on publicly (but perhaps not freely) available data sets. This won't be appropriate for every paper of course, but our editorial staff is certainly going to be focused on maintaining scientific quality.


> Their algorithms are always reproducible with any implementation you like.

Well, that's the dream. In practice their algorithms are often not reproducible at all. Requiring code would get rid of the papers that cannot be made to work, where the authors stretched the truth because without code they could not be caught.


This seems pretty straightforward to resolve, then: just have conferences say they will add half a point to the average review score for papers that include code.

Most reviewers seem to agree that, aside from the top and bottom 10% of papers, the primary differentiator is luck, so we might as well add code to that mix.


>Gundersen says the culture needs to change. "It's not about shaming," he says. "It's just about being honest."

I think that it should be about shaming. If you shame those who don't provide enough information to replicate, you might create an incentive to publish properly.

Replication problems create so much noise for further research that I would classify a researcher who doesn't publish replicable results as hostile to the research community.


Welcome to the world of science... where the data is scant but the results are impressive!

Seriously though, publish your data in the supplementary materials. Quit being afraid of failure or someone catching your mistakes. This is science, not figure skating.


I would encourage going one step further and finding a repository for the data outside of the journal. Journals can be terrible at keeping supplementary files around; I have had people email me asking for the supplementary data after a website redesign or similar. Journals also sometimes do weird things like converting moderately sized tables to PDF (making them nearly useless), and it's not always clear what's going to happen when you submit the article.

The other advantage of a repository is that it gets included in large data collections, and is in a standardized format.

Uploading to repositories is often a lot of work, nearly as much as the analysis itself (if you have a good analysis pipeline), but it's well worth it, both for oneself and for the community.


Missing data hinders generalizability and long term stability of ML techniques even internal to an organization. This is just one symptom of the fact that the "data generating mechanism" changes constantly, in both obvious and subtle ways due to changes in business practice and market conditions.
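
A hypothetical sketch of the kind of monitoring that catches this, assuming scipy and a made-up feature: compare the feature's distribution at training time against what the model sees in production with a two-sample KS test.

    # Hypothetical sketch: detect a shift in the data generating mechanism for one feature.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train_feature = rng.normal(loc=0.0, scale=1.0, size=10000)  # distribution at training time
    live_feature = rng.normal(loc=0.3, scale=1.2, size=10000)   # subtly shifted in production

    stat, p_value = ks_2samp(train_feature, live_feature)
    if p_value < 0.01:
        print("distribution shift detected (KS=%.3f, p=%.2g); retrain or investigate" % (stat, p_value))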


For the last few weeks I've been reading a lot of scientific research on speech recognition algorithms and technology. Without open data, open algorithms, and open software, it's quite impossible to reproduce any of the results claimed by some articles. I think science, and especially computer science, has a fundamental problem in execution and research publishing. We need a revolutionary approach beyond peer review, where 'reviewability' is quick and within reach, if not instant, bundled inside the research documents themselves.


If you're interested in finding out more about data citation, you may find these interesting:

https://www.crossref.org/blog/the-research-nexus---better-re...

(edit - missing data citations from scholarly publications cause/propagate the replication problem described in the article)



