Alex Irpan: Deep Reinforcement Learning Doesn't Work Yet


The title is a bit grumpy, but it’s really a constructive essay on common problems encountered when applying deep RL from a deep RL researcher.

I have some musings on this part of the essay:

The DeepMind parkour paper (Heess et al, 2017), demoed below, trained policies by using 64 workers for over 100 hours. The paper does not clarify what “worker” means, but I assume it means 1 CPU.

At the same time, the fact that this needed 6400 CPU hours is a bit disheartening. It’s not that I expected it to need less time…it’s more that it’s disappointing that deep RL is still orders of magnitude above a practical level of sample efficiency.

There’s an obvious counterpoint here: what if we just ignore sample efficiency? There are several settings where it’s easy to generate experience. Games are a big example. But, for any setting where this isn’t true, RL faces an uphill battle, and unfortunately, most real-world settings fall under this category.

This part of the essay made me think of this tweet from Karpathy:

Maybe we should say it’s disheartening that we don’t have enough CPU hours to make deep RL work. Or ways for robots to get enough real world experience for deep RL to work.

The idea that deep RL should be more sample efficient comes from analogies to human learning, like this one from the essay:

RainbowDQN passes the 100% threshold [i.e. the human-level performance threshold] at about 18 million frames. This corresponds to about 83 hours of play experience, plus however long it takes to train the model. A lot of time, for an Atari game that most humans pick up within a few minutes.

That “a few minutes” overlooks 500 million years of evolution.

What if it’s a fundamental fact about cognition, intelligence, and physical information processing that learning certain tasks requires a certain minimum number of samples? What if it’s just unrealistic to expect deep RL to learn Atari games in as few samples as humans, without the benefit of 500 million years of prior learning?
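As a sanity check, the two figures quoted from the essay can be reproduced with a few lines of arithmetic. This is only a rough sketch: the 60 frames-per-second rate is my assumption (it is the rate that makes the 83-hour figure come out), and "1 worker = 1 CPU" is the essay's own guess.

```python
# Back-of-the-envelope checks for the two figures quoted from the essay.

WORKERS = 64             # parkour paper, as quoted in the essay
HOURS_PER_WORKER = 100   # "over 100 hours", so this is a lower bound
cpu_hours = WORKERS * HOURS_PER_WORKER   # assumes 1 worker == 1 CPU

FRAMES = 18_000_000      # RainbowDQN frames to pass the human-level threshold
FPS = 60                 # ASSUMPTION: Atari frame rate, not stated in the quote
play_hours = FRAMES / FPS / 3600         # frames -> seconds -> hours

print(f"{cpu_hours} CPU hours")          # 6400 CPU hours
print(f"{play_hours:.1f} hours of play") # 83.3 hours of play
```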

Maybe the approach we should take to deep RL is building the compute infrastructure and the physical robotics infrastructure to do massive training. Maybe if you want a robot to learn a task, you should build millions of copies of that robot, housed in giant warehouses in rural areas where land is cheap, and set them to task 24/7/365 for years on end. Maybe deep RL is extremely capital intensive, and we just haven’t accepted that fact yet.

This is just a conjecture. A thought experiment. Of course if we can copy the human brain well enough, we can steal evolution’s 500 million years of R&D, and get AIs to pick up new tasks as fast as humans do. We can try to improve sample efficiency through new techniques like hierarchical reinforcement learning. We can bootstrap deep RL using deep supervised learning with demonstrations from humans, which itself might be extremely capital intensive.

But just entertain the thought that our cultural expectations might be causing us to overlook the potential of current deep RL technology. We are used to the idea that anything that can be done with software should be able to be accomplished in a garage. That may be appropriate for the Software 1.0 paradigm, where the bottleneck to progress is the creative, intellectual labour of one person or a small team — which doesn’t scale with more people on the project. But if the bottleneck to progress for Software 2.0 is robots mucking around, maybe we need a ton of robots mucking around. Maybe we need software development on the physical scale of heavy industry. Industrial information technology is a foreign concept at our current moment in history, but that doesn’t mean it’s wrong.


if the bottleneck to progress for Software 2.0 is robots mucking around, maybe we need a ton of robots mucking around. Maybe we need software development on the physical scale of heavy industry. Industrial information technology is a foreign concept at our current moment in history, but that doesn’t mean it’s wrong.

I feel like you’re drawing a false comparison, as if today’s software infrastructure doesn’t already amount to heavy industrial scale. Some points to look at: how expensive is a data center? How expensive is a pod of TPUs? We’ve saturated the use of individual machines and continue to expand horizontally as fast as we can. Supply can scarcely keep up with the rate at which we saturate disk, RAM, and compute.

But just entertain the thought that our cultural expectations might be causing us to overlook the potential of current deep RL technology.

It’s economic, not cultural, and we’re not overlooking the potential at all. Profit-seeking corporations would love for RL to work, and in fact that’s why folks like Alex are working on the problem (and paid to do so). Your saying “just throw 100x more metal at the problem because that’s what it takes” points to exactly what Alex writes in his post: RL in its current iteration doesn’t work. For the same price you can do much better with a non-RL approximation, which is exactly what industrial robot arms do. Complex inverse-kinematics equation solvers + cleverly hand-written program >>> RL right now.

Now RL not working is a comment directed at economic viability. RL is clearly yielding state-of-the-art solutions to problems like computers playing games. But this is exactly because of the bottleneck (well, one of the biggest ones): sample inefficiency.

Your post reminds me of what people said before deep learning changed the supervised learning game: maybe SVMs with hand-engineered features are as good as it’ll get, and improving models doesn’t help anything. We just need to scale up our human feature engineering 1000x. Or increase data and hope that solves everything. Increase amount of thing we know to make thing we have work.

But this is a greedy way to look at something that needs a researcher’s eye to really improve: how do we go back to first principles and find where the bottlenecks are in our systems today, and how do we advance this state of the art? In the supervised learning analogy we figured out that stacking affine transformations, with nonlinearities between them, on dense samples lets us build powerful representations over data (instead of wringing our hands and doing manual feature engineering for the rest of time). RL is at that same point in its development – the way forward is to figure out how to stop hand-crafting shitty, not-general-enough policies and start doing something more fundamental and powerful.


AlphaStar and OpenAI Five didn’t solve the sample efficiency problem; they worked because they embraced sample inefficiency and gave neural networks thousands of years of experience to work with.

StarCraft and Dota are perfect simulators of themselves. Let’s say sim-to-real transfer learning doesn’t work for some robotics applications. Why not put 1000 robots in some buildings and have them work on, say, a step in a manufacturing process? In a year, they’ll have 1000 years of experience.

Similarly, why not pay 1000 people to operate the robots in order to bootstrap reinforcement learning with supervised learning? This is costly, but given the trillions of dollars spent on labour in the global manufacturing sector, successfully automating previously unautomatable production steps might provide a satisfactory return on this investment. (I specifically mean tasks that can only be performed by humans currently, but could in principle be performed by deep RL given sufficient training.)

It doesn’t exclude doing more fundamental research on deep RL, but it potentially means deep RL can start making a practical, economic impact via robotics applications with no further algorithmic innovation. If the only thing holding back deep RL from succeeding in factory robotics or warehouse robotics is sample inefficiency, and if fundamental research progress in deep RL is unpredictable and not guaranteed to advance quickly, then it seems worth simply scaling up robotics to overcome the sample inefficiency problem and make deep RL work.
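The "1000 robots for a year" arithmetic can be sketched as below. The function names and the `duty_cycle` parameter (the fraction of time a robot is actually collecting experience rather than being repaired or reset) are my own hypothetical additions, not anything from the projects mentioned above.

```python
def fleet_experience_years(n_robots, wall_clock_years, duty_cycle=1.0):
    """Robot-years of experience gathered by a fleet running in parallel."""
    return n_robots * wall_clock_years * duty_cycle


def wall_clock_years_needed(target_robot_years, n_robots, duty_cycle=1.0):
    """Calendar time needed for the fleet to accumulate a target experience."""
    return target_robot_years / (n_robots * duty_cycle)


# 1000 robots running around the clock for one year:
print(fleet_experience_years(1000, 1))            # 1000.0 robot-years
# Same target if the robots are only productive half the time:
print(wall_clock_years_needed(1000, 1000, 0.5))   # 2.0 calendar years
```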

This idea is mainly meant to be provocative. I’m questioning the idea that our expectations for a reasonable level of sample efficiency for deep RL should be based on 1) an analogy to human learning, which overlooks 500 million years of evolution or 2) whatever hardware we just happen to have available to us today, which is arbitrary as it’s a result of a combination of accidents of history, culture, business, politics, and so on. Why should we expect the fundamental nature of intelligence to be convenient to us in the era that we happen to be born, given only the tools we’ve set up for other purposes? Why should we base our theory of intelligence on the hardware we currently happen to have in place, rather than change our hardware based on the apparent requirements of intelligent systems?

Ultimately we will copy the human brain more and more until we get to AGI. But who knows at what rate generality will increase over time.

Edit: Robotics learning infrastructure on the scale of cloud compute infrastructure, or industrial manufacturing infrastructure, just doesn’t exist (yet). There is lots of cloud compute, but it’s still too expensive for a lot of AI researchers to run the experiments they would like to run. What if we re-framed the problem from “deep RL is too sample inefficient” to “compute is too expensive for most researchers to do big deep RL experiments”? How might we approach things differently?

One idea would be for governments to pass laws requiring big cloud computing companies like Amazon, Google, and Microsoft to donate 1% of their overall computing capacity to academic research. Since these big companies are tax dodgers anyway, this wouldn’t be unfair or onerous. AI research could benefit, but so could any area of science where intensive simulation is used.


In my field people have spent the last 40 years working to efficiently sample data. In the end we’re pretty much back where we started with just optimizing the efficiency and memory consumption of brute force sampling. Lots of very clever, incredibly efficient sampling papers were written. Lots of neat products came and went. Lots of experts in tuning samplers made a career out of wringing out the most of each generation. But in the end Moore’s law has ultimately won.

I have a suspicion that by the time we come up with extremely clever solutions, hardware will have advanced to the point where performance increases just make brute force good enough. I think we’re at the beginning of that curve right now with reinforcement learning. We only have to worry about balancing exploration/exploitation in training models because exploration is expensive. With 100x more processing in 10 years of Moore’s law you can explore 100x more local optima.
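For what it's worth, the 100x-in-10-years figure depends on which doubling time you assume. A quick sketch (the helper names are mine, and the 18-month and 2-year doubling times are just the two commonly quoted versions of Moore's law, not measured hardware trends):

```python
import math


def growth_factor(years, doubling_time_years):
    """Compute multiplier after `years` of exponential growth."""
    return 2 ** (years / doubling_time_years)


def implied_doubling_time(years, factor):
    """Doubling time needed to reach `factor` growth in `years`."""
    return years / math.log2(factor)


print(round(growth_factor(10, 1.5)))            # 102: ~100x with 18-month doubling
print(round(growth_factor(10, 2.0)))            # 32: only ~32x with 2-year doubling
print(round(implied_doubling_time(10, 100), 2)) # 1.51 years to get exactly 100x
```

So 100x in a decade corresponds to the optimistic 18-month doubling; with a 2-year doubling the same decade buys roughly 32x.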


What is your field?


Photoreal Visual Effects. So pretty much all of the pain that people doing deep learning are rediscovering. If you have a small gap under a door, how do you efficiently find the random path a photon can take from the sun, off a tree leaf, off a wall, through the crack under the door, off another wall and then onto fibrous carpet? Do you start from the camera and work randomly backward? But what are the odds that your random light path will “find” the crack under the door? If it does randomly find the crack under the door, should you weight the future paths toward that part of the room and assume that there isn’t another light path? Once you’re through the door, how do you prevent the weighted door crack from being oversampled in the opposite direction when you need to get out the window? Do you start from the sun? But what if it’s a whole city street? The chances of it starting from the sun and finding the door crack are also vanishingly remote. The local optimum for the dark room is the crack under the door, but the local optimum for the room with a window might be the wall where the sun is hitting directly. But the local optimum for the tree might be the sun itself. But there might be a window far away from the sun which acts as a mirror and fills light back in from the direction opposite the sun.

You want the sampler to explore because you want to find that light path to the door crack, but you also don’t want to waste sampling if the global optimum is in fact zero photons. Maybe the room really just is black.

How many rays returning 0 photons should you fire out before assuming that the pixel is black? Should you stop sampling that pixel or keep firing until you find a light path that returns illumination? Should you sample every light in the world for a surface? But what if the light is inside of a window in a building across the street? Will it contribute? What if you have a warehouse full of lights but no windows? It would be inefficient to sample every light in the warehouse, since none are visible from another building. But what if someone opens a door? How do you know that’s a door?

So uhh yeah… lots of similarities between rendering and training. Do you fire out 10 batches of 10 samples and then have your sampler fire out another 10 if the pixel intensity changed by more than 0.1 (i.e. keep going until the batch-to-batch variability has dropped off)? Or do you send out 20 batches of 5 samples? Maybe it’ll stop sampling too soon because the batches had too few samples, maybe it’ll take too long because your samples (steps) were too small. Maybe one pixel will discover a good light path but none of its surrounding pixels will? Maybe it’s a shiny chrome marble and only that pixel should be able to find a significant path to the door?
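To make the batch-and-threshold idea concrete, here is a toy sketch of an adaptive pixel sampler. It is not how any production renderer does it; `adaptive_pixel_estimate`, `rare_path_sampler`, and all their parameters are hypothetical names I made up for illustration. It fires batches of samples and stops once the running mean moves by no more than `tol` between consecutive batches.

```python
def adaptive_pixel_estimate(sample_fn, batch_size=10, min_batches=2,
                            max_batches=100, tol=0.1):
    """Fire batches of samples until the running mean changes by <= tol
    between consecutive batches. Returns (estimate, samples_used)."""
    total, count, prev_mean = 0.0, 0, None
    for batch in range(1, max_batches + 1):
        for _ in range(batch_size):
            total += sample_fn()
            count += 1
        mean = total / count
        converged = prev_mean is not None and abs(mean - prev_mean) <= tol
        if batch >= min_batches and converged:
            break
        prev_mean = mean
    return total / count, count


def rare_path_sampler(period=1000, payload=1000.0):
    """Toy 'crack under the door': only one sample in `period` carries light,
    so the true mean intensity is payload / period."""
    calls = 0

    def sample():
        nonlocal calls
        calls += 1
        return payload if calls % period == 0 else 0.0

    return sample


# The sampler sees 20 zeros, decides the pixel is black, and stops,
# even though the true mean intensity of this toy pixel is 1.0.
print(adaptive_pixel_estimate(rare_path_sampler()))   # (0.0, 20)
```

Smaller batches (say `batch_size=5`) give up even sooner, which is the "stop sampling too soon" failure described above.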

Anyone who is looking for experts in optimizing neural networks should be mining the VFX industry’s raytracing experts.

And then all of the stuff related to autonomous vehicles in VFX as well: Localization (camera tracking), photogrammetry, LIDAR scanning, traffic and crowd modeling (it’s incredibly difficult to create authentic looking traffic patterns in background people/cars), segmentation (rotoscoping/keying), etc.


A worker is a single thread gathering experience by running an instance of the current policy as a controller for a simulated agent. Normally you have around one worker per CPU core, though this varies depending on the simulated environment. 64 cores running 64 workers is a pretty common configuration for running the MuJoCo humanoid because it’s about the largest set of workers that can be run on a single server, which avoids the overhead of cluster setup.
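A toy sketch of that worker pattern, with everything simplified: the one-dimensional environment, the linear `policy`, and the function names are all hypothetical stand-ins, and nothing here touches MuJoCo. Each worker rolls out the same current policy in its own simulated environment and hands back the experience it gathered.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def policy(observation, params):
    """Hypothetical linear controller: push the state toward zero."""
    return -params["gain"] * observation


def run_worker(worker_id, params, steps=100):
    """One worker: run the current policy in a private toy environment and
    return the (observation, action, reward) tuples it collected."""
    rng = random.Random(worker_id)   # per-worker seed so rollouts differ
    obs = rng.uniform(-1.0, 1.0)
    experience = []
    for _ in range(steps):
        action = policy(obs, params)
        next_obs = obs + action + rng.gauss(0.0, 0.01)   # toy dynamics
        reward = -abs(next_obs)                          # closer to 0 is better
        experience.append((obs, action, reward))
        obs = next_obs
    return experience


def gather_experience(params, n_workers=64, steps=100):
    """Run n_workers rollouts concurrently and pool their experience."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        rollouts = list(pool.map(lambda i: run_worker(i, params, steps),
                                 range(n_workers)))
    return [step for rollout in rollouts for step in rollout]


batch = gather_experience({"gain": 0.5})
print(len(batch))   # 6400 transitions: 64 workers x 100 steps
```

In standard CPython, threads share the interpreter lock, so a pure-Python simulator like this one will not really use 64 cores; a real setup would more likely use one process per core or a simulator that releases the lock. The sketch only shows the shape of the loop.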