Exploring the Limitations of Behavior Cloning for Autonomous Driving

Edit: This is a paper about cloning the behaviour of a software agent, not a human driver!


Driving requires reacting to a wide variety of complex environment conditions and agent behaviors. Explicitly modeling each possible scenario is unrealistic. In contrast, imitation learning can, in theory, leverage data from large fleets of human-driven cars. Behavior cloning in particular has been successfully used to learn simple visuomotor policies end-to-end, but scaling to the full spectrum of driving behaviors remains an unsolved problem. In this paper, we propose a new benchmark to experimentally investigate the scalability and limitations of behavior cloning. We show that behavior cloning leads to state-of-the-art results, including in unseen environments, executing complex lateral and longitudinal maneuvers without these reactions being explicitly programmed. However, we confirm well-known limitations (due to dataset bias and overfitting), new generalization issues (due to dynamic objects and the lack of a causal model), and training instability requiring further research before behavior cloning can graduate to real-world driving.

“spurious correlations cannot be distinguished from true causes in observed training demonstration patterns unless an explicit causal model or on-policy demonstrations are used.”

Was the black cat the cause of the crash or just incidental? Superstition is going to be hard to weed out. It’s stuck with humans for millennia.

Then again, 100 hours of training data is, in the grand scheme of things, effectively nothing.

I have no idea how to compare apples to orangutans like Go vs. driving cameras, but 100 hrs * 60 min * 60 sec * 30 fps ≈ 10.8 million frames vs. AlphaZero’s 5m games * 200 moves/game = 1 billion moves. That’s still a roughly 100x difference if a move is equivalent to a frame, and that’s an extremely generous assumption.

I would argue, though, that maybe half a second is equivalent to a “move” in behavior planning. That would put 100 hours of training at more like 720k driving events (360,000 seconds * 2 events per second) to train on vs. 1 billion Go moves.

For comparison, if big data is the solution (and clearly it isn’t yet, or Tesla would just dump $100 million into AWS and AT&T fees tomorrow), 500k HW2+ vehicles on the road in the morning, each driving 20 min to work, add up to roughly 166,000 hours of potential training data by lunch.
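The back-of-the-envelope numbers in this post are easy to check. A minimal sketch that only restates the figures quoted above (30 fps, one "move" per half second, 5m games of about 200 moves, 500k cars with a 20-minute commute); all of these are the post's rough assumptions, not measured values:

```python
# Rough counts using the figures quoted in this thread (all approximate).
HOURS = 100
SECONDS = HOURS * 60 * 60           # 360,000 seconds of driving

frames = SECONDS * 30               # camera frames at 30 fps
events = SECONDS * 2                # one "move" per half second

alphazero_moves = 5_000_000 * 200   # 5m games * ~200 moves per game

fleet_hours = 500_000 * 20 / 60     # 500k cars, each driving 20 minutes

print(f"frames:              {frames:,}")
print(f"half-second events:  {events:,}")
print(f"AlphaZero moves:     {alphazero_moves:,}")
print(f"moves per frame:     {alphazero_moves / frames:.0f}x")
print(f"fleet hours/commute: {fleet_hours:,.0f}")
```

Even under the generous one-frame-equals-one-move assumption the gap is about two orders of magnitude, and it grows to more than three if half a second is the unit.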

Academics should be focusing on novel approaches to effectively handling 100 million hours of simulated data, not 100.


OpenAI Five did 45,000 years of training (i.e. reinforcement learning via self-play) and AlphaStar (by my calculation) did 60,000 years in addition to some unknown amount of experience gained from imitation learning.


Thought-provoking excerpt:

Bias in Naturalistic Driving Datasets. The appeal of behavior cloning lies in its simplicity and theoretical scalability, as it can indeed learn by imitation from large off-line collected demonstrations (e.g., using driving logs from manually driven production vehicles). It is, however, susceptible to dataset biases like all learning methods. This is exacerbated in the case of imitation learning of driving policies, as most of real-world driving consists in either a few simple behaviors or a heavy tail of complex reactions to rare events. Consequently, this can result in performance degrading as more data is collected, because the diversity of the dataset does not grow fast enough compared to the main mode of demonstrations. This phenomenon was not clearly measured before. Using our new NoCrash benchmark (section 4), we confirm it may happen in practice.

This underscores the importance of not collecting common state-action pairs after a certain point and only collecting ones that are uncommon or rare. The more you water down your dataset with the same “few simple behaviors”, the more biased your agent will be toward those behaviors, and therefore the worse it will be at “complex reactions to rare events”.
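One simple way to act on this would be to cap how many examples of each common behavior go into the training set while keeping every rare one. The sketch below is only an illustration of that idea, not something the paper prescribes; the function name `cap_common_samples`, the behavior labels, and the cap of 500 are all made up for the example:

```python
import random
from collections import defaultdict

def cap_common_samples(samples, key, max_per_bin, seed=0):
    """Keep at most max_per_bin samples per behavior bin.

    Bins with fewer samples than the cap (the rare events) are kept whole.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sample in samples:
        bins[key(sample)].append(sample)

    kept = []
    for group in bins.values():
        if len(group) > max_per_bin:
            group = rng.sample(group, max_per_bin)  # random subset, no repeats
        kept.extend(group)
    return kept

# Hypothetical driving log dominated by two simple behaviors.
log = (
    [{"behavior": "lane_follow"}] * 9000
    + [{"behavior": "stopped"}] * 900
    + [{"behavior": "emergency_brake"}] * 10
)

balanced = cap_common_samples(log, key=lambda s: s["behavior"], max_per_bin=500)
print(len(log), "->", len(balanced))  # 9910 -> 1010
```

In this toy log the rare behavior goes from about 0.1% of the data to about 1%. How to bin real state-action pairs is the hard part, and this sketch does not attempt it.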

Another wonderful excerpt:

Causal Confusion. Related to dataset bias, end-to-end behavior cloning can suffer from causal confusion [14]: spurious correlations cannot be distinguished from true causes in observed training demonstration patterns unless an explicit causal model or on-policy demonstrations are used. Our new NoCrash benchmark confirms the theoretical observation and toy experiments of [14] in realistic driving conditions. In particular, we identify a typical failure mode due to a subtle dataset bias: the inertia problem. When the ego vehicle is stopped (e.g., at a red traffic light), the probability it stays static is indeed overwhelming in the training data. This creates a spurious correlation between low speed and no acceleration, inducing excessive stopping and difficult restarting in the imitative policy. Although mediated perception approaches that explicitly model causal signals like traffic lights do not suffer from this theoretical limitation, they still under-perform end-to-end learning in unconstrained environments, because not all causes might be modeled (e.g., some potential obstacles) and errors at the perception layer (e.g., missed detections) are irrecoverable.
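To make the inertia problem concrete, here is a toy illustration with invented numbers (not data from the paper): one stop at a traffic light, logged as 99 frames waiting at red and 1 frame pulling away at green. Conditioning on speed alone makes "stay" look almost certain, even though the light is the actual cause.

```python
# Toy log of a single stop; the frame counts are made up for illustration.
frames = [{"speed": 0.0, "light": "red", "action": "stay"}] * 99
frames += [{"speed": 0.0, "light": "green", "action": "go"}]

def p_stay(rows):
    """Fraction of rows in which the demonstrated action is 'stay'."""
    return sum(r["action"] == "stay" for r in rows) / len(rows)

stopped = [r for r in frames if r["speed"] == 0.0]
stopped_at_green = [r for r in stopped if r["light"] == "green"]

print(p_stay(stopped))           # 0.99: "speed is zero" predicts "stay"
print(p_stay(stopped_at_green))  # 0.0: the light, not the speed, decides
```

A policy that latches onto the first statistic will sit at green lights, which is the excessive stopping and difficult restarting the excerpt describes.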

This reminds me of what Waymo did with ChauffeurNet. From the paper (page 8):

4.2 Past Motion Dropout

During training, the model is provided the past motion history as one of the inputs (Fig. 1(g)). Since the past motion history during training is from an expert demonstration, the net can learn to “cheat” by just extrapolating from the past rather than finding the underlying causes of the behavior. During closed-loop inference, this breaks down because the past history is from the net’s own past predictions. For example, such a trained net may learn to only stop for a stop sign if it sees a deceleration in the past history, and will therefore never stop for a stop sign during closed-loop inference. To address this, we introduce a dropout on the past pose history, where for 50% of the examples, we keep only the current position (u0,v0) of the agent in the past agent poses channel of the input data. This forces the net to look at other cues in the environment to explain the future motion profile in the training example.
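A rough sketch of what that dropout could look like. This is not Waymo's code: ChauffeurNet renders past poses into an image channel, whereas here the history is assumed to be a plain coordinate array of shape (batch, T, 2) with the current position (u0, v0) stored last, and dropped poses are simply zeroed. The function name and array layout are assumptions for the example.

```python
import numpy as np

def past_motion_dropout(past_poses, drop_prob=0.5, rng=None):
    """Randomly erase the pose history, keeping only the current position.

    past_poses: float array of shape (batch, T, 2); index -1 along T is
    assumed to hold the current position (u0, v0).
    Returns the modified copy and the boolean per-example drop mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = past_poses.copy()
    drop = rng.random(out.shape[0]) < drop_prob  # one coin flip per example
    out[drop, :-1, :] = 0.0                      # erase all but the last pose
    return out, drop

# Usage on a dummy batch: 8 examples, 5 past poses each.
rng = np.random.default_rng(0)
poses = np.ones((8, 5, 2))
dropped, mask = past_motion_dropout(poses, drop_prob=0.5, rng=rng)
print(mask.sum(), "of", len(mask), "examples had their history removed")
```

Because each example gets an independent coin flip, a given batch only drops about half of its histories rather than exactly 50%. Zeroing is also a simplification: if (0, 0) is a valid coordinate in your frame, a separate validity mask would be the safer choice.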

I think a cool name for this sort of thing would be counterfactual training.


For some reason, this paper tries to imitate a software agent, not a human. Huh…