“The Bitter Lesson” by Richard Sutton

Andrej Karpathy tweeted this essay by Richard Sutton:


First paragraph:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

Except speech recognition wouldn’t have been possible 20 years ago… So do we embrace the efficient solution that will ultimately prove wasted and counterproductive, or wait for decades for Moore’s law?


There’s been a lot of pushback on this essay, unsurprisingly. It also has a lot of fans. The lesson isn’t as clear-cut as the essay makes it out to be - convolutional nets and backpropagation were not discovered by brute-force calculation. Similarly, there are other underlying principles that have to be developed by means other than scaling up earlier, simpler approaches. Additionally, while methods that scale clearly win out over those that don’t, scaling up the underlying computational resource is itself the accumulation of a very large number of refinements. If progress relies upon scaling computation, then the developments that go into that scaling should also be considered part of the overall development of the resulting capabilities.

That said, I believe the central implication is true here. Focusing on techniques that have a lot of runway with respect to scaling is to be preferred over refinements that won’t scale. It was this insight in reverse that brought me back to working on neural networks in 2007. I thought, “What highly general techniques can use a million or a billion times as much computation and apply it to a large variety of important problems?”. Neural networks were the answer that came to me then.


Important to clarify that he’s not saying computation itself is the source of discoveries in AI. He’s saying that the discoveries in AI that stand the test of time are “general methods that leverage computation”.

I think Sutton is in agreement when he writes:

…we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

He is arguing that AI researchers need to develop new ways for AI to learn — not simply scale up existing approaches. His point, as I understand it, is that AI researchers shouldn’t try to encode into AI the things that humans know about the world. Instead, researchers should try to encode as little as possible and develop methods that can learn those things that humans know.

It’s funny, I had the opposite thought when reading Alex Irpan’s essay on reinforcement learning. The most general possible learning agent would be completely agnostic about the laws of physics — whether nested infinite physical objects like Hilbert’s hotel exist, whether time only moves forward or also backward, how many physical dimensions there are, and so on. Humans have all kinds of inductive biases specific to our world. We are biased toward seeing faces. Our visual system “expects” to see objects lit from above, rather than below, because it evolved with the Sun, not floor lighting.

Could progress in artificial learning be, in part, the process of making learning agents less general, and more biased toward the world we actually live in?

I’m sure these two thoughts are not incompatible and not actually opposites. We want a learning agent to have as much innate inductive bias as a human with regard to things like space, objects, multiple agents, symmetries, and so on, but we don’t want learned, explicit human knowledge about these things inserted into the agent at the beginning, before the learning process starts.
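To make that distinction concrete — an innate bias in the architecture versus explicit knowledge in the weights — here’s a minimal sketch of my own (not from the essay or the thread): weight sharing in a convolution bakes translation equivariance into the model before any learning happens. Shifting the input shifts the output, no matter what the weights are.

```python
import numpy as np

def conv1d(x, w):
    """'valid' 1-D convolution: the same weights are applied at every
    position, which hard-codes an assumption about space (translation
    equivariance) into the architecture itself."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=16)   # a toy 1-D "image"
w = rng.normal(size=3)    # filter weights -- could be anything

# Shifting the input by one step shifts the output by one step:
# the bias about space lives in the architecture, not in the weights.
shifted_then_conv = conv1d(np.roll(x, 1), w)
conv_then_shifted = np.roll(conv1d(x, w), 1)
print(np.allclose(shifted_then_conv[1:], conv_then_shifted[1:]))  # True
```

The equality holds for any weight vector, which is the point: the “knowledge” that nearby pixels relate the same way everywhere was put in by the designer, while everything the filter actually detects is left to learning.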

Have you read any interesting responses you can share?

Some food for thought: the use of human annotation in supervised deep learning is a way to get knowledge out of human heads and into a neural network. At a certain point, supervised learning ceases to scale with computation and becomes bottlenecked by data collection and annotation.

Maybe in the future we will do away with supervised learning and just do end-to-end hierarchical reinforcement learning.

A thought experiment is to imagine you have a computer the size of Jupiter. Can you solve the problem with that much computation? With supervised learning, the answer is no, unless you also have enough supervisory signal — i.e. enough human knowledge in neural network-consumable format. (Potential slogan for this idea: all supervised learning is imitation learning?)
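A toy way to see that bottleneck (my own sketch, not from the post): hold the labeled set fixed and keep increasing model capacity — a crude stand-in for throwing more computation at supervised learning. Past a point, held-out error stops improving, because the limiting resource is the supervisory signal, not the compute.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy labels for a simple ground-truth function."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(30)    # the fixed supervisory signal
x_test, y_test = make_data(1000)    # held-out data

# Polynomial degree is a crude stand-in for "more computation/capacity".
for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE {mse:.3f}")
```

With only 30 labels, test error improves going from degree 1 to 3 and then stalls (or worsens) — a Jupiter-sized computer fitting ever-bigger models to the same 30 labels wouldn’t help.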


It’s not my intention to criticize Sutton’s essay. I agree with it. But having digested some of the criticism from people who disagree with it, I can see it from a slightly different perspective too. It’s possible to read the essay as discouraging research on trying to understand what’s going on, in favor of focusing on things that will perform well as computation scales. I don’t think that’s Sutton’s intention, and I don’t personally read it that way.

When I say the pushback is unsurprising, what I mean is that suggesting the work of some academic or other is less useful is going to get pushback - some of it unjustified - from people who feel that their own work might be part of what’s being referred to. Additionally, any general statement about the relative merits of an approach to a complex subject is going to fail in corner cases, creating opportunities to disagree. Oh well.

The criticism I’ve come across is mainly Twitter retweets of the essay with alternative analyses attached. I’d just point you at Twitter if you want to read them; they shouldn’t be hard to find.

In supervised learning, human data labelling is a systematized way to get human knowledge into a neural network. In AI in general, hand crafting of knowledge specific to a narrow task (e.g. Go) is another way to get human knowledge into a neural network. They are two forms of the same thing.

A vexing open question is how much the power and generality of human intelligence is owed to inductive biases that encode general knowledge — gained from evolution — about the physics of our universe and, more narrowly, about how macroscopic phenomena on Earth behave. A maximally general intelligence would learn tasks on Earth just as fast as tasks in outer space, or in another universe with different physics. Robots that we want to function on Earth don’t need to be that general. So maybe we should encode them with the basic, general knowledge about the world of the sort that humans are born with.

I really like that observation by Sutskever. It’s a subtle ‘failure mode’ of tweaking a network’s training regimen: you can just be moving the hyperparameters into a nearby adjacent domain without actually introducing anything novel to its capabilities. These things are really hard to detect without a lot of testing, so they usually go undetected.