The title is a bit grumpy, but it’s really a constructive essay on common problems encountered when applying deep RL from a deep RL researcher.
I have some musings on this part of the essay:
The DeepMind parkour paper (Heess et al., 2017), demoed below, trained policies using 64 workers for over 100 hours. The paper does not clarify what a “worker” is, but I assume it means one CPU.
At the same time, the fact that this needed 6400 CPU hours is a bit disheartening. It’s not that I expected it to need less time…it’s more that it’s disappointing that deep RL is still orders of magnitude above a practical level of sample efficiency.
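As a sanity check on the number quoted above, here is the arithmetic, assuming (as the essay does) that one worker means one CPU:

```python
# Rough compute cost for the parkour setup: 64 workers running
# for over 100 hours of wall-clock time.
workers = 64          # assumed to be 64 CPUs
wall_clock_hours = 100
cpu_hours = workers * wall_clock_hours
print(cpu_hours)  # 6400
```

So 6400 CPU hours is a lower bound; “over 100 hours” means the true figure is somewhat higher.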
There’s an obvious counterpoint here: what if we just ignore sample efficiency? There are several settings where it’s easy to generate experience. Games are a big example. But, for any setting where this isn’t true, RL faces an uphill battle, and unfortunately, most real-world settings fall under this category.
This part of the essay made me think of this tweet from Karpathy:
Maybe we should say instead that it's disheartening that we don't have enough CPU hours to make deep RL work, or enough ways for robots to get real-world experience for deep RL to work.
The idea that deep RL should be more sample efficient comes from analogies to human learning, like this one from the essay:
RainbowDQN passes the 100% threshold [i.e. the human-level performance threshold] at about 18 million frames. This corresponds to about 83 hours of play experience, plus however long it takes to train the model. A lot of time, for an Atari game that most humans pick up within a few minutes.
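The frames-to-hours conversion in that passage checks out, assuming Atari's standard 60 frames per second:

```python
# Converting RainbowDQN's 18 million frames to hours of play,
# assuming the Atari standard of 60 frames per second.
frames = 18_000_000
fps = 60
hours = frames / fps / 3600  # seconds of play, converted to hours
print(round(hours, 1))  # 83.3
```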
That “a few minutes” overlooks 500 million years of evolution.
What if it’s a fundamental fact about cognition, intelligence, and physical information processing that learning certain tasks requires a certain minimum number of samples? What if it’s just unrealistic to expect deep RL to learn Atari games in as few samples as humans, without the benefit of 500 million years of prior learning?
Maybe the approach we should take to deep RL is building the compute infrastructure and the physical robotics infrastructure to do massive training. Maybe if you want a robot to learn a task, you should build millions of copies of that robot, housed in giant warehouses in rural areas where land is cheap, and set them to task 24/7/365 for years on end. Maybe deep RL is extremely capital intensive, and we just haven’t accepted that fact yet.
This is just a conjecture. A thought experiment. Of course if we can copy the human brain well enough, we can steal evolution’s 500 million years of R&D, and get AIs to pick up new tasks as fast as humans do. We can try to improve sample efficiency through new techniques like hierarchical reinforcement learning. We can bootstrap deep RL using deep supervised learning with demonstrations from humans, which itself might be extremely capital intensive.
But just entertain the thought that our cultural expectations might be causing us to overlook the potential of current deep RL technology. We are used to the idea that anything that can be done with software should be able to be accomplished in a garage. That may be appropriate for the Software 1.0 paradigm, where the bottleneck to progress is the creative, intellectual labour of one person or a small team — which doesn’t scale with more people on the project. But if the bottleneck to progress for Software 2.0 is robots mucking around, maybe we need a ton of robots mucking around. Maybe we need software development on the physical scale of heavy industry. Industrial information technology is a foreign concept at our current moment in history, but that doesn’t mean it’s wrong.