Tesla AI and behaviour cloning: what’s really happening?

What I saw was just smartphone snapshots of two slides touting the ‘pros’ and ‘cons’ of end-to-end. The slides were clearly promoting the idea that end-to-end had some advantages. I didn’t save copies, and they seem to have fallen off my Twitter stream now.

If your question about AlphaGo is ‘why won’t the same method work?’, then the main issue is probably that AlphaGo has the advantage of a perfect model of the environment - this means there’s no noise in its feedback signal. Additionally, AlphaGo’s model is extremely compact, so it doesn’t need to be particularly sample efficient. Developing path planning from RL will need an approach that is more noise resistant and more sample efficient than AlphaGo needed to be.

Working from the latent space of a perception system that is trained on labeled data (what Waymo seems to be doing) helps with the sample efficiency issue but doesn’t resolve the noise issue.

This is not to say that RL cannot be made to work for path planning. I think it’s likely that all of these issues will be overcome in time. But you probably can’t do it naively today. Tomorrow? Who knows.


That’s very helpful, thanks. What do you think is the source of feedback signal noise in our environment models for autonomous driving? Is this a perception problem, or something else?

Based on what I’ve heard and read recently, the actual physics of the world isn’t hard to model in a car context; the hard part is modelling the behaviour of other agents in the environment.

Pieter Abbeel suggests using inverse reinforcement learning to derive a reward function from observation of human driving. This is an interesting idea to me because a company like Tesla (or whoever else in the future might have a large enough production fleet with the right hardware) could, in theory:

  • Upload 10 billion miles of mid-level representation data
  • Use inverse reinforcement learning to derive a reward function
  • Use reinforcement learning in simulation to search for a policy that optimizes for that reward function

The derived reward function would — presumably — include a tacit model of how agents in the world behave, and how to interact with them.
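As a concrete (and heavily simplified) illustration of deriving a reward function from demonstrations, here is a toy max-entropy-style IRL loop in pure Python. The three “driving” actions, their feature vectors, and the expert feature counts are all invented for illustration; a real system would work over full trajectories and a learned policy, not single actions.

```python
# Toy sketch of max-ent-style inverse RL: recover a linear reward w . phi(a)
# such that a softmax policy under that reward reproduces the expert's
# average feature counts. Everything here is invented for illustration.

import math

# Per-action feature vectors: (progress_made, closeness_to_lead_car)
FEATURES = {
    "keep_speed": (1.0, 0.5),
    "slow_down":  (0.4, 0.1),
    "accelerate": (1.6, 1.2),
}

# Average features of logged human driving
# (as if the expert keeps speed 70%, slows 20%, accelerates 10% of the time).
expert = (0.94, 0.49)

def policy(w):
    """Softmax distribution over actions under the reward w . phi."""
    scores = {a: math.exp(sum(wi * fi for wi, fi in zip(w, f)))
              for a, f in FEATURES.items()}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

w = [0.0, 0.0]
for _ in range(20000):
    p = policy(w)
    # Expected feature counts under the current policy.
    exp_phi = [sum(p[a] * FEATURES[a][i] for a in FEATURES) for i in range(2)]
    # Gradient of the max-entropy objective: expert counts minus policy counts.
    w = [wi + 0.5 * (e - m) for wi, e, m in zip(w, expert, exp_phi)]

p = policy(w)
exp_phi = [sum(p[a] * FEATURES[a][i] for a in FEATURES) for i in range(2)]
```

The recovered weights define a reward whose induced policy matches the expert’s average feature counts - that is the sense in which the derived reward function tacitly encodes how the expert behaves.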

I wonder how far you could get with a randomization approach. That is, populate a simulator with randomized variants of the ego car. Or run multiple simulations in parallel each with a different, randomized variant populating the roads.

If you can’t accurately simulate the physics relevant to your problem, you can just train a neural network on many random variants of the physics (like OpenAI did with their robot hand). Maybe if you can’t accurately simulate the behaviour of other agents in the environment, you can just train a neural network on many random variants of behaviour.

In a self-driving car context, this is analogous to, though different from, self-play. As the ego car gets better at driving, the randomized other cars will, on average, get better at driving too. Hopefully this means the end product isn’t a car that is impractically cautious for the real world. The need for caution will decrease as the surrounding drivers get better. And if you need bad drivers for your simulator, you have plenty of older versions of the ego car you can use.
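A minimal sketch of what “randomized variants populating the roads” could mean in practice, assuming per-episode sampling of driver behaviour parameters - all the parameter names and ranges here are invented:

```python
# Toy sketch of behaviour randomization for simulated traffic, in the spirit
# of domain randomization: every episode draws other drivers with different
# aggressiveness and attentiveness.

import random

random.seed(42)

def sample_driver():
    """Randomized behaviour parameters for one simulated road user."""
    return {
        "target_speed_mps": random.uniform(8.0, 35.0),  # slow to fast
        "headway_s":        random.uniform(0.5, 3.0),   # tailgater to cautious
        "lane_change_rate": random.uniform(0.0, 0.2),   # changes per second
        "reaction_time_s":  random.uniform(0.3, 2.0),   # attentive to distracted
    }

def build_episode(n_cars):
    """Populate one simulation episode with n randomized drivers."""
    return [sample_driver() for _ in range(n_cars)]

episode = build_episode(20)
```

An ego policy trained against many such episodes never sees the same traffic “personality” twice, which is the mechanism by which randomization is hoped to cover the real distribution of human drivers.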

A potential problem I can see is that the simulated cars might develop unhumanlike behaviours even if the end result is good driving by human standards. So the ego car might not know how to predict the behaviour of real cars in the real world. The design space/possibility space of driver behaviour might be too large for randomization to be an effective workaround.


I say ‘noise’ just to differentiate it from signal. Strictly speaking it is information which correlates poorly with your objective function. When you are simulating a Go game there’s almost nothing uncorrelated in the simulation, but the driving problem involves absorbing very high dimensional and highly abstracted data which is largely uncorrelated with the objective of driving. The car doesn’t care what kind of trees are on the side of the road, whether the fence is plastic or wood, if the clouds are cirrus or cumulus. The car also doesn’t care about the vast majority of the actions of other road users - it only cares about the ones that plausibly affect its future options. This is all on top of the fact that the other information is stochastic. To an RL agent all of that stuff hides the stuff it really needs to pay attention to. Training an RL agent on ‘noisy’ data presents kinds of problems that AlphaGo didn’t need to consider and those problems will need addressing.

I’m personally sanguine about the potential for RL to make contributions to the driving problem. I think that eventually even ‘end to end’ can probably be made to work - with enough computation, some new techniques, and a lot of refinement. It’s surprising how far you can get with simple approaches. It’s almost as if the universe was designed with this kind of problem solving in mind.

I’m also a fan of physics simulations making contributions. Humans are actually really bad at physics compared to even simple computers. Having accurate physical simulation integrated into driving agents is going to be a big advantage.


That makes sense. I suppose the perception network can narrow it down somewhat by outputting only key variables as mid-level representations. Different types of clouds and trees just won’t be represented. But even if you narrow it down to just the stuff you see in verygreen’s videos, there is still a lot of irrelevant information, and the relevant information is highly stochastic.

Or that this kind of problem solving evolved in organisms adapted to a universe designed this way.

I’m thinking about doing this step by step (which is something Waymo suggests):

  1. Use supervised imitation learning to get to a certain point.

  2. Use inverse reinforcement learning to derive a reward function from human driving.

  3. Use reinforcement learning in simulation to find a policy that optimizes the derived reward function.
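To make the hand-off between the stages concrete, here is a toy skeleton in pure Python. The one-parameter linear “policy”, the 1-D state, and the stubbed-in learned reward are all invented; the point is only how stage 1 warm-starts stage 3 and stage 2 supplies its objective.

```python
# Toy skeleton of the staged recipe: behaviour cloning, then a (stubbed)
# learned reward, then RL fine-tuning warm-started from the cloned policy.

import random

random.seed(0)

# Stage 1: behaviour cloning -- fit a policy to (state, expert_action) pairs.
demos = [(s / 10.0, 0.5 * (s / 10.0)) for s in range(11)]  # expert: a = 0.5 * s

def clone_policy(demos, steps=2000, lr=0.1):
    k = 0.0  # single-parameter linear policy: a = k * s
    for _ in range(steps):
        s, a = random.choice(demos)
        k -= lr * (k * s - a) * s  # squared-error gradient step
    return k

k_bc = clone_policy(demos)

# Stage 2: inverse RL -- stubbed here as a reward that scores closeness
# to the expert's behaviour pattern.
def learned_reward(s, a):
    return -(a - 0.5 * s) ** 2

# Stage 3: RL fine-tuning, warm-started from the cloned policy.
def finetune(k, steps=500, lr=0.05):
    for _ in range(steps):
        s = random.random()
        # Finite-difference policy gradient on the learned reward.
        grad = (learned_reward(s, (k + 1e-3) * s)
                - learned_reward(s, (k - 1e-3) * s)) / 2e-3
        k += lr * grad
    return k

k_final = finetune(k_bc)
```

The cloned policy gives RL a sensible starting point, and the derived reward gives it something to optimize beyond pure mimicry - which is the whole motivation for doing this step by step.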

Can you help me understand what this means in practice? I’m especially interested since Karpathy also recommended using one big network.

In Tesla’s case, let’s say Software V10 pre-alpha uses one big network, and they’re using a neural network approach for path planning. Does this mean the bottom layer of the perception network feeds into the top layer of the path planning network, making them one network? If so, how do you train the perception network with labelled images?

End to end - as I understand the term - doesn’t refer to the architecture of a system being used but rather how it is trained. A system could also operate end-to-end but not be trained end-to-end. My sense of Karpathy’s suggestion is that the ideal system is end-to-end, or nearly so, in operation. But it might be trained on synthetic data, a curated dataset, partially self-supervised, or some other mix and still satisfy Karpathy.

An end-to-end trained system is the easiest to create in a sense because the process demands very little of the developer in terms of work on the training and test set. The reason you don’t see that done a lot right now is that work on the training set - for NNs that are in common use today - generally results in substantially better performance. That benefit might gradually go away as techniques improve.

An operationally end-to-end system potentially allows for higher performance than a modular system. Modular systems have interfaces between the modules, and those interfaces are generally designed by humans to present a particular set of abstractions - abstractions almost always selected based on human intuition about what kind of data should be crossing that interface. End-to-end systems instead allow the process of training the NN to determine the appropriate data representation at every boundary.

For instance - hand coded kernels for ConvNets were the gold standard before back propagation was fully developed. Those hand coded kernels always had a rationale for why they were constructed in a particular fashion, and that rationale was derived from human intuition about what kinds of features were interesting and useful. Automatically generated kernels greatly outperform hand coded kernels today (though they might be less efficient), and part of the reason for this is that they are not constrained to representations which are easy to ‘explain’. In a similar fashion, the internal interfaces in a system composed of multiple elements are likely to ‘discover’ better internal representations if they aren’t restricted to those which are easy for a human to understand.
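The kernel example can be made concrete. A Sobel filter is the classic hand coded kernel: a human decided that an “edge” is a left-right intensity difference and wrote the weights down by hand. This is exactly the kind of design that learned kernels replaced (the tiny image and responses below are just a toy):

```python
# The hand-coded-kernel era in miniature: a Sobel filter built from human
# intuition about what an edge looks like, applied in pure Python.

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def conv_at(image, r, c, kernel):
    """Apply a 3x3 kernel centred at pixel (r, c)."""
    return sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))

# 5x5 image with a vertical edge: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

edge_response = conv_at(image, 2, 2, SOBEL_X)  # centred on the edge
flat_response = conv_at(image, 2, 3, SOBEL_X)  # centred in the bright region
```

A learned kernel serving the same role typically has weights with no such human-readable rationale, which is part of why it can outperform the hand coded version.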


I just realized I didn’t respond to your final question. The answer is, there are various ways to train subsets of a ‘monolithic’ network, and which is best depends on a lot of factors. But just to provide an example: you could do multi-head training, where an intermediate representation is brought out and provided with an auxiliary loss function. During back propagation you can inject training signals into that intermediate network head to encourage earlier stages of the network to converge towards particular solutions. In this particular case you might bring out an intermediate representation and train it to generate segmentation maps and bounding boxes. Later stages would take those as inputs - as well as perhaps other intermediate products from the earlier stages - and train against a target which might be a path planning output. This multi-head approach is pretty common in deep non-residual networks. For instance - Inception is trained with, if I recall correctly, 4 heads: the main objective plus 3 auxiliary objectives. Also - the AKNET_V9 network in its operational form has something like ten heads. It might have even more than that in its training form.
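Here is a deliberately tiny sketch of that multi-head idea: a single shared “trunk” parameter feeds a main head and an auxiliary head, and during backprop the trunk receives gradient from both losses. The scalar parameters and targets are invented; a real network would have segmentation or bounding-box heads in place of these toys.

```python
# Toy multi-head training with an auxiliary loss: gradients from BOTH heads
# flow back into the shared trunk, steering what the intermediate
# representation converges to.

import random

random.seed(1)

w_trunk, w_main, w_aux = 0.1, 0.1, 0.1
AUX_WEIGHT = 0.3   # how strongly the auxiliary head shapes the trunk
lr = 0.05

def losses(x):
    h = w_trunk * x                       # shared intermediate representation
    main = (w_main * h - 2.0 * x) ** 2    # main objective: predict 2x
    aux = (w_aux * h - 1.0 * x) ** 2      # auxiliary objective: predict x
    return main, aux

initial = sum(sum(losses(x / 10.0)) for x in range(1, 11))

for _ in range(5000):
    x = random.uniform(0.1, 1.0)
    h = w_trunk * x
    e_main = w_main * h - 2.0 * x
    e_aux = w_aux * h - 1.0 * x
    # Backprop: the trunk's gradient sums contributions from both heads.
    g_trunk = 2 * e_main * w_main * x + AUX_WEIGHT * 2 * e_aux * w_aux * x
    g_main = 2 * e_main * h
    g_aux = AUX_WEIGHT * 2 * e_aux * h
    w_trunk -= lr * g_trunk
    w_main -= lr * g_main
    w_aux -= lr * g_aux

final = sum(sum(losses(x / 10.0)) for x in range(1, 11))
```

The auxiliary weight plays the same role as the loss weighting on a segmentation head: it decides how much the intermediate representation is pulled towards a particular, human-chosen solution.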


Thank you so much for the very thorough explanation… Lots of stuff I had no idea about & need to learn more about.

I think there is a challenge in defining End-To-End. For instance would a pooling layer be considered breaking End-To-End? I would say no. I would also say that breaking up networks in a modular fashion isn’t necessarily breaking End-To-End.

I can’t see anyone making progress without a modular system. The problem with the AlphaGo approach, as I see it, is that AlphaGo can find exotic solutions that humans can’t comprehend. A self-driving path-finding algorithm needs to be indistinguishable from an excellent human driver. And it’s really hard to teach humans, let alone machines, through reinforcement learning. That’s how you end up with cargo cults on jungle runways mimicking the behavior of air traffic controllers, trying to guide in planes that will never come. The Google talk touches on this with the importance of “meaning”.

When I’m driving I’m very modal in my behavior. I have one behavior when driving down a city street. “Stay in lane. Watch for jaywalkers. Stay in lane. Watch for Jaywalkers.” I will then get into modes like “approaching cross walk, can I see the entire crosswalk or is my vision obstructed… is there a pedestrian possibly behind that delivery van?” I clear the intersection and it’s back to “Stay in lane. Watch for jaywalkers.”

Given enough data, a NN could of course eventually derive these different modes intuitively and learn purely through mimicry, but it’s much more efficient to guide the NN. It’s much more efficient to create a mode such as “van randomly stopped in a 4 lane road… maybe they stopped for a pedestrian you can’t see, so don’t pass quickly until you can see that they are actually turning or broken down.” That’s a lesson I had to learn as a driver, and one I would teach another human as a discrete scenario: I would break it out into a unique driving strategy for a very specific situation. A NN could also eventually learn that a black and white rectangle in frame is a “speed limit”, but you can far more quickly teach the meaning of speed limits through directed learning and sign segmentation feeding a concept of “current speed limit”. The risk of pure end-to-end is that it learns a McDonalds sign means you should drive 25 mph because McDonalds are always on city streets.

I would define an End-to-End NN pretty liberally and say that as long as all pathfinding and driving logic is performed in a NN, it’s an E2E network. If you’re only using traditional code as the glue between your networks, it’s still an E2E network. That’s just my personal opinion. I’m not sure how you would differentiate that from a NN that’s trained by just dumping in tens of billions of miles of driving video with only car controls as output, but I don’t think one or the other is more or less an end-to-end approach.

It’s not a well-defined term - I was just commenting on how I’ve seen it used amongst practitioners. I can see that different people would have different expectations from the plain English meaning.

Here is a different approach to imitation learning from the one in ChauffeurNet. The Waymo paper says ChauffeurNet is a recurrent neural network (RNN). Here is a paper from DeepMind where they use a deep Q-network: Deep Q-learning from Demonstrations.

Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator’s actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD’s performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
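The core of DQfD is combining a TD loss with a large-margin supervised loss on the demonstrator’s actions. Here is a tabular sketch of just those two terms; the Q-values, margin, and tiny MDP numbers below are made up for illustration, and the real algorithm adds n-step returns, L2 regularization, and prioritized replay.

```python
# Sketch of the two loss terms DQfD combines on a tabular toy.

def margin_loss(q, expert_a, margin=0.8):
    """Large-margin loss: max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E)."""
    return max(q[a] + (0.0 if a == expert_a else margin)
               for a in range(len(q))) - q[expert_a]

def td_error(q_s, a, r, q_next, gamma=0.99):
    """One-step TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return r + gamma * max(q_next) - q_s[a]

q_s = [1.0, 2.0, 0.5]     # Q-values at state s; the demonstrator took action 1
q_next = [0.0, 1.5, 1.0]  # Q-values at the successor state s'

supervised = margin_loss(q_s, expert_a=1)
td = td_error(q_s, a=1, r=0.3, q_next=q_next, gamma=0.99)
total = td ** 2 + 1.0 * supervised  # lambda weighting between the two terms
```

The margin term is zero only when the demonstrator’s action beats every other action by at least the margin, which is what pulls the early policy towards the demonstrations before the TD signal becomes useful.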

This wonderful paper gives an overview of three approaches to imitation learning (IL) for autonomous driving:

  • Behavioural cloning (BC), “which treats IL as a supervised learning problem”

  • Inverse reinforcement learning (IRL), which “assumes that the expert follows an optimal policy with respect to an unknown reward function” and then uses reinforcement learning (RL) “to find a policy that behaves identically to the expert”

  • Generative Adversarial Imitation Learning (GAIL), which imitates behaviour by “training a policy to produce actions that a binary classifier mistakes for those of an expert”
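The GAIL bullet can be unpacked with a toy: train a logistic “discriminator” to separate expert (state, action) pairs from non-expert ones, then use its logit as the imitation reward. The 1-D state, the hand-picked feature, and the data here are all invented for illustration; real GAIL alternates discriminator updates with policy-gradient updates on the imitating policy.

```python
# Toy GAIL-style discriminator: a logistic classifier separates expert
# (state, action) pairs from a random policy's, and its logit serves as the
# imitation reward.

import math
import random

random.seed(3)

# Expert pairs: action tracks state (a ~ s); policy pairs: actions are random.
expert = [(s, s + random.gauss(0, 0.1))
          for s in [random.random() for _ in range(200)]]
policy = [(random.random(), random.random()) for _ in range(200)]

def features(s, a):
    return [1.0, (a - s) ** 2]  # how far the action is from the expert pattern

w = [0.0, 0.0]
for _ in range(2000):
    (s, a), label = random.choice(
        [(random.choice(expert), 1.0), (random.choice(policy), 0.0)])
    z = sum(wi * fi for wi, fi in zip(w, features(s, a)))
    d = 1.0 / (1.0 + math.exp(-z))  # D(s, a): probability "came from expert"
    # Logistic-regression SGD step.
    w = [wi + 0.5 * (label - d) * fi for wi, fi in zip(w, features(s, a))]

def reward(s, a):
    """Imitation reward: the discriminator's logit, log D - log(1 - D)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))
```

A policy trained to maximize this reward is being pushed to produce actions the classifier mistakes for the expert’s, which is exactly the adversarial game the quoted definition describes.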

Hi! Thank you for the clarification about this term; I had the same question.