Artificial Intelligence faces a “Reproducibility Crisis”
20 September 2019
New York, 9/21/2019
A few years ago, Joelle Pineau, a computer science professor at McGill University, helped her students develop a new algorithm when they hit a roadblock. Her lab studies reinforcement learning, a branch of artificial intelligence used, among other things, to teach virtual characters how to move through virtual worlds on their own. It is a prerequisite for building autonomous robots and cars. Pineau’s students hoped to improve on a system from another laboratory, but first they had to rebuild it. For unknown reasons, their reconstruction fell short of the promised results. The students tried some “creative manipulations” that were not part of the other lab’s work, and suddenly the system began to work. Such luck is a symptom of a disturbing trend, Pineau told Wired.
Neural networks are often called black boxes because of how mysteriously they function. Getting them to perform well usually requires subtle tuning. And as the networks grow larger and more complex, and the data sets they process become enormous, replicating these models becomes expensive, if not impossible, for all but the best-funded laboratories.
“It’s not clear whether you’re demonstrating the superiority of your model or your budget,” says Anna Rogers, a machine learning researcher at the University of Massachusetts.
Pineau is trying to change the standards. She leads the reproducibility effort at NeurIPS, a leading conference on artificial intelligence, and now asks researchers to submit a “reproducibility checklist” covering details often omitted from papers, such as the number of models trained before the “best” one was selected, the computing power used, and links to code and data sets.
The idea, says Pineau, is to encourage researchers to offer a roadmap so that others can recreate their work; without such a map, even the most experienced researchers struggle to retrace how a system was built. Replicating these AI models matters not only for identifying new research directions, but also for scrutinizing algorithms that augment, and in some cases replace, human decision making.
Others are tackling the problem as well. Researchers at Google have proposed “model cards” to document how machine learning systems were tested, including results that point to possible biases. Others have tried to show how fragile the term “state of the art” can be when systems optimized for benchmark data sets are set loose in other contexts. Last week, researchers at the Allen Institute for Artificial Intelligence (AI2) published a paper that aims to extend Pineau’s reproducibility checklist to other parts of the experimental process. They call it “Show Your Work”.
“Starting where someone else stopped is so painful because we never fully describe the experimental setup,” says Jesse Dodge, an AI2 researcher who co-authored the paper. “People can’t reproduce what we did if we don’t talk about what we did.” It is a surprise, he adds, when people report even basic details about how a system was built.
Sometimes basic information is missing because it is proprietary, a problem especially for industry labs. But often it is a sign that the field has not kept pace with changing methods, says Dodge. A decade ago, it was easier to see what a researcher had changed to improve their results. Neural networks, by contrast, are finicky; getting the best results often means tuning thousands of small knobs, something Dodge calls a form of “black magic”. Choosing the best model often requires a large number of experiments, and the magic quickly becomes expensive.
Even the large industrial laboratories, which have the resources to design the biggest and most complex systems, have raised alarms. When Facebook tried to replicate AlphaGo, the system developed by Alphabet’s DeepMind to master the ancient game of Go, the researchers seemed exhausted by the task. The enormous computational demands – millions of experiments running on thousands of devices over days – combined with unavailable code made the system “very difficult, if not impossible, to reproduce, study, improve and expand,” they wrote in a May paper. (The Facebook team was eventually successful.)
The AI2 research suggests a remedy: report more data about the experiments that were run. You can still report the best model you found after, say, 100 experiments – the result that gets called “state of the art” – but you would also report the performance you would expect if you only had the budget to try 10 times, or just once.
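The idea can be sketched with a short calculation. Given the validation scores from a full hyperparameter search, one can estimate the best score a reader with a smaller budget should expect to find. The snippet below is a minimal illustration of that idea, not the paper’s exact estimator; the scores are made up, and it assumes trials are drawn uniformly at random, with replacement, from the observed runs.

```python
def expected_best(scores, budget):
    """Expected best validation score among `budget` trials drawn
    uniformly at random (with replacement) from the observed scores.
    Uses the order-statistic identity
        E[max] = sum_i v_(i) * ((i/N)**budget - ((i-1)/N)**budget)
    where v_(1) <= ... <= v_(N) are the sorted scores."""
    v = sorted(scores)
    n = len(v)
    return sum(s * ((i / n) ** budget - ((i - 1) / n) ** budget)
               for i, s in enumerate(v, start=1))

# Made-up validation accuracies from a hypothetical 100-run search.
scores = [0.60 + 0.25 * (i / 99) ** 2 for i in range(100)]

for budget in (1, 10, 100):
    print(f"budget {budget:3d}: expected best = {expected_best(scores, budget):.3f}")
```

With a budget of one trial you expect roughly an average run; with the full budget you approach the top score. The gap between those two numbers is exactly what the paper asks authors to disclose alongside the headline result.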
The point of reproducibility, Dodge says, is not to replicate results exactly; that would be nearly impossible given the inherent randomness in neural networks and the variation across hardware and code. The point is to offer a roadmap for reaching the same conclusions as the original research, especially when it comes to deciding which machine learning system is best suited to a particular task.
Such differences in methods are part of why the NeurIPS reproducibility checklist is voluntary. One stumbling block, especially for large laboratories, is proprietary code and data. If Facebook trains a system on your Instagram photos, for example, it cannot simply release that data publicly. Clinical research with health data is another sticking point. “We don’t want to go as far as excluding researchers from the community,” says Pineau.
It is difficult to develop reproducibility standards that work without constraining researchers, especially when methods evolve this quickly. But Pineau is optimistic. Another component of the NeurIPS reproducibility effort is a challenge that asks other researchers to replicate accepted papers. Compared with fields such as the biosciences, where old methods die hard, this field is more open to putting researchers’ work under that kind of scrutiny. “It is young both in terms of its people and its technology,” she says. “We have less inertia to overcome.”