Mikael Henaff, Alfredo Canziani, and Yann LeCun recently published this paper on using reinforcement learning for autonomous highway driving in dense traffic:
Learning a policy using only observational data is challenging because the distribution of states it induces at execution time may differ from the distribution observed during training. We propose to train a policy by unrolling a learned model of the environment dynamics over multiple time steps while explicitly penalizing two costs: the original cost the policy seeks to optimize, and an uncertainty cost which represents its divergence from the states it is trained on. We measure this second cost by using the uncertainty of the dynamics model about its own predictions, using recent ideas from uncertainty estimation for deep networks. We evaluate our approach using a large-scale observational dataset of driving behavior recorded from traffic cameras, and show that we are able to learn effective driving policies from purely observational data, with no environment interaction.
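To make the two-cost idea in the abstract concrete, here is a rough toy sketch in Python. It is not the authors' code. The state is a single number, the "learned" dynamics model is just random weights, and I'm assuming dropout-style sampling as the uncertainty estimate (the abstract only says "recent ideas from uncertainty estimation for deep networks"). All names (`predict_next_state`, `uncertainty_cost`, `unrolled_loss`, `lam`) are made up for illustration.

```python
import math
import random

random.seed(0)
HIDDEN = 16

# Random weights stand in for a trained dynamics model (hypothetical, untrained).
W_IN = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(HIDDEN)]
W_OUT = [random.gauss(0, 0.3) for _ in range(HIDDEN)]


def predict_next_state(state, action, mask):
    """One step of the toy dynamics model; `mask` switches hidden units on or off."""
    hidden = [math.tanh(ws * state + wa * action) for ws, wa in W_IN]
    return state + sum(h * m * w for h, m, w in zip(hidden, mask, W_OUT))


def sample_mask(keep_prob=0.8):
    """Random dropout mask, rescaled so the expected activation is unchanged."""
    return [1.0 / keep_prob if random.random() < keep_prob else 0.0
            for _ in range(HIDDEN)]


def uncertainty_cost(state, action, n_samples=10):
    """Variance of the model's prediction across several dropout masks."""
    preds = [predict_next_state(state, action, sample_mask())
             for _ in range(n_samples)]
    mean = sum(preds) / n_samples
    return sum((p - mean) ** 2 for p in preds) / n_samples


def unrolled_loss(policy_gain, start_state, goal, horizon=5, lam=0.5):
    """Unroll the model for `horizon` steps, summing task cost and uncertainty cost."""
    no_dropout = [1.0] * HIDDEN
    state, total = start_state, 0.0
    for _ in range(horizon):
        # Toy policy: steer toward the goal, with a bounded action.
        action = math.tanh(policy_gain * (goal - state))
        # Cost 2: how unsure the model is about this state-action pair.
        total += lam * uncertainty_cost(state, action)
        state = predict_next_state(state, action, no_dropout)
        # Cost 1: the original task cost (here, squared distance to the goal).
        total += (state - goal) ** 2
    return total


print(unrolled_loss(policy_gain=1.0, start_state=0.0, goal=1.0))
```

In a real setup you would backpropagate this loss through the unrolled model to update the policy. The sketch only evaluates the loss, which is enough to show where the second cost enters.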
Choosing a policy for neighboring cars is challenging due to a cold-start problem: to accurately evaluate a learned policy, the other cars would need to follow human-like policies which would realistically react to the controlled car, which are not available. We take the approach of letting all the other cars in the environment follow their trajectories from the dataset, while a single car is controlled by the policy we seek to evaluate. This approach avoids hand-designing a policy for the neighboring cars which would likely not reflect the diverse nature of human driving. The limitation is that the neighboring cars do not react to the controlled car, which likely makes the problem more difficult as they do not try to avoid collisions.
This ties perfectly into what Waymo said in its ChauffeurNet paper. Imitation learning is a potential way to get human-like driving policies, and thereby to bootstrap reinforcement learning in simulation.
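The evaluation setup quoted above (neighboring cars replay their logged trajectories, one car is controlled by the policy) can be sketched as a toy single-lane loop in Python. This is only my illustration under simplifying assumptions, not the paper's environment: positions are 1-D, and `evaluate_policy`, `min_gap`, and both example policies are hypothetical.

```python
def evaluate_policy(policy, frames, ego_start, dt=0.5, min_gap=5.0):
    """Replay logged neighbour positions while `policy` drives a single ego car.

    frames: list over time of lists of neighbour positions (metres along the lane).
    Returns True if the ego car never comes within `min_gap` of a neighbour.
    """
    ego_pos, ego_speed = ego_start
    for t in range(len(frames) - 1):
        accel = policy(ego_pos, ego_speed, frames[t])
        ego_speed = max(0.0, ego_speed + accel * dt)
        ego_pos += ego_speed * dt
        # Neighbours follow the log no matter what the ego car does.
        if any(abs(ego_pos - n) < min_gap for n in frames[t + 1]):
            return False
    return True


def constant_speed(ego_pos, ego_speed, neighbours):
    """Never brakes or accelerates."""
    return 0.0


def cautious(ego_pos, ego_speed, neighbours):
    """Brakes hard when a car ahead is closer than 30 m."""
    gaps = [n - ego_pos for n in neighbours if n > ego_pos]
    return -10.0 if gaps and min(gaps) < 30.0 else 0.0


# One logged neighbour standing still 50 m ahead for 20 frames.
frames = [[50.0] for _ in range(20)]
print(evaluate_policy(constant_speed, frames, ego_start=(0.0, 10.0)))
print(evaluate_policy(cautious, frames, ego_start=(0.0, 10.0)))
```

The replayed car never reacts to the ego car, so avoiding a collision is entirely up to the policy. That is the limitation the authors point out.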
We have applied this approach to a large observational dataset of real-world traffic recordings, and shown it can effectively learn policies for navigating in dense traffic, which outperform other approaches which learn from observational data. However, there is still a sizeable gap between the performance of our learned policies and human performance. We release both our dataset and environment, and encourage further research in this area to help narrow this gap.
God bless these researchers for benchmarking their system against human performance. Love to see that!
Looks like this approach is quite a ways off from human-level performance, but there are at least two big limitations that I can see:
1. The training dataset is quite small, consisting of only 36 minutes of footage from overhead highway traffic cameras.
2. As mentioned, “neighboring cars do not react to the controlled car, which likely makes the problem more difficult as they do not try to avoid collisions.”
Tesla’s HW3 fleet provides a solution to (1). Once the fleet reaches 350,000 cars, it will be driving 1 billion miles per quarter. At an average speed of, say, 50 mph, that’s 20 million hours of driving per quarter. Even if data from just 1% of those hours is uploaded, that’s 200,000 hours, a steep increase from the 36 minutes used in this paper.
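For anyone who wants to check the back-of-envelope numbers, here they are in Python. The inputs are the same guesses as above (350,000 cars, 1 billion miles per quarter, 50 mph average, 1% uploaded), not measured figures, and the variable names are just mine.

```python
FLEET_SIZE = 350_000               # cars (assumed fleet size)
MILES_PER_QUARTER = 1_000_000_000  # fleet-wide miles per quarter (assumed)
AVG_SPEED_MPH = 50                 # assumed average speed
UPLOAD_PERCENT = 1                 # assumed share of driving hours uploaded
PAPER_MINUTES = 36                 # dataset size mentioned for the paper

total_hours = MILES_PER_QUARTER / AVG_SPEED_MPH
uploaded_hours = total_hours * UPLOAD_PERCENT / 100
paper_hours = PAPER_MINUTES / 60
ratio = uploaded_hours / paper_hours
miles_per_car = MILES_PER_QUARTER / FLEET_SIZE

print(f"Total driving hours per quarter: {total_hours:,.0f}")
print(f"Uploaded hours at {UPLOAD_PERCENT}%: {uploaded_hours:,.0f}")
print(f"Multiple of the paper's dataset: {ratio:,.0f}x")
print(f"Implied miles per car per quarter: {miles_per_car:,.0f}")
```

So even the 1% scenario would be several hundred thousand times the amount of driving data used in the paper.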
With regard to (2), imitation learning is a potential solution, as Waymo suggested in its ChauffeurNet paper. Tesla’s HW3 fleet data comes in handy again here, especially data on crashes and “crash-like events”, and other rare, weird, and tricky situations.
Mobileye’s approach to (2) is to use self-play. This gets rid of the need for fleet data or imitation learning. But what if the behaviour of the learned agent is different from human drivers? Isn’t there a risk it will learn to drive based on false assumptions about how other road users will behave?