Tesla AI and behaviour cloning: what’s really happening?


#1

When The Information recently published two articles on Tesla and autonomy, the strangest thing to come out of that reporting was this bit under the subheading “Behavior Cloning”:

Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object. Such an approach has its limits, of course: behavior cloning, as the method is sometimes called… But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.

As I understand it, when software engineers who work on self-driving cars say “behaviour cloning”, they mean the same thing as “end to end learning”: the entire system is one big neural network that takes sensor data as its input and outputs steering, acceleration, and braking.

What’s not made clear in the article is the difference between end to end learning and neural networks in general. If you use neural networks, but not end to end learning, that’s still a situation where humans don’t need to write code for specific scenarios.

Amnon Shashua has a really good talk on end to end learning vs. the “semantic abstraction” approach to using neural networks:

As Amnon says, if Tesla were using end to end learning, it would not need to label images. The only “labelling” that occurs is the human driver’s actions: the steering angle, accelerator pushes, and brake pedal pushes. The sensor data is the input, and the one big neural network tries to learn how to map that sensor data onto the human driver’s actions. Since we know Tesla is labelling images, we know Tesla can’t be using end to end learning. Since “end to end learning” and “behaviour cloning” are synonymous, we know Tesla can’t be using behaviour cloning.
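To make that distinction concrete, here is a minimal sketch of what end to end learning / behaviour cloning would look like, assuming a PyTorch-style setup with made-up layer sizes and names (this is purely illustrative, not anything Tesla has described): camera frames are the input, and the driver’s steering/accelerator/brake are the only “labels”.

```python
import torch
import torch.nn as nn

# Hypothetical end-to-end ("behaviour cloning") model: raw camera pixels in,
# control commands out. No human-labelled images anywhere; the only
# supervision is what the human driver actually did with the controls.
class EndToEndDriver(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 3)  # steering angle, accelerator, brake

    def forward(self, frames):        # frames: (batch, 3, H, W)
        return self.head(self.backbone(frames))

model = EndToEndDriver()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One training step on a fake batch: the "label" is just what the driver did.
frames = torch.randn(8, 3, 120, 160)   # camera input
driver_actions = torch.randn(8, 3)     # logged steering/accel/brake
optimizer.zero_grad()
loss = loss_fn(model(frames), driver_actions)
loss.backward()
optimizer.step()
```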

So, what did Amir Efrati at The Information hear from his sources that led him to report that Tesla is using “behaviour cloning”? Amir writes that how humans drive is used “to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object.” What this makes me think is that perhaps Tesla is working on a neural network for path planning (or motion planning) and/or control. Perhaps a path planning neural network and/or control neural network is being trained not with sensor data as input, but with the metadata outputted by the perception neural networks. The Tesla drivers’ behaviour — steering, acceleration, brake — “labels” the metadata in the same way that, in end to end learning, the human driver’s behaviour “labels” the sensor data.

This approach would solve the combinatorial explosion problem of end to end learning (described by Amnon in the video above) by decomposing perception and action. Perception tasks and action tasks would be handled independently by separate neural networks that are trained independently.
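Here is the same kind of sketch for the decomposed approach I’m conjecturing. Again, the feature layout and names are my own invention: the planning network never sees pixels, only the metadata from separately trained perception networks, and it is still supervised by the driver’s logged actions.

```python
import torch
import torch.nn as nn

# Hypothetical planning/control network. The input is not sensor data but the
# "metadata" produced by separately trained perception networks, flattened
# into a feature vector (e.g. lane geometry, nearby objects, current speed).
PERCEPTION_FEATURES = 64   # assumed size of the perception summary vector

planner = nn.Sequential(
    nn.Linear(PERCEPTION_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),     # steering, accelerator, brake
)

optimizer = torch.optim.Adam(planner.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Training step: perception outputs are the input, the human driver's logged
# actions are the target. It is the same "labelled by driving" idea as end to
# end learning, but perception and action are learned by separate networks.
perception_metadata = torch.randn(8, PERCEPTION_FEATURES)
driver_actions = torch.randn(8, 3)
optimizer.zero_grad()
loss = loss_fn(planner(perception_metadata), driver_actions)
loss.backward()
optimizer.step()
```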

Using human drivers’ actions as the supervisory signal/training signal for path planning and/or control actually makes sense to me (whereas end to end learning does not). What are the alternatives?

a) Use a hand-coded algorithm. While this may be effective, we have lots of examples where fluid neural networks outperform brittle hand-crafted rules.

b) Use simulation. Until recently, I didn’t appreciate how much trouble we have simulating the everyday physics of the real world. From OpenAI:

Learning methods for robotic manipulation face a dilemma. Simulated robots can easily provide enough data to train complex policies, but most manipulation problems can’t be modeled accurately enough for those policies to transfer to real robots. Even modeling what happens when two objects touch — the most basic problem in manipulation — is an active area of research with no widely accepted solution. Training directly on physical robots allows the policy to learn from real-world physics, but today’s algorithms would require years of experience to solve a problem like object reorientation.

While simulation may be a part of the development and training process for path planning and/or control, it probably can’t be the whole process.

If Tesla were to train a neural network using the behaviour of Tesla drivers — and use human review to remove examples of bad driving — then it would avoid hand-coded algorithms’ brittleness and simulations’ lack of verisimilitude. I think (but I’m not sure) it would then be possible to use reinforcement learning or supervised learning to improve on this. Tesla could put the path planning and/or control neural network into cars running Enhanced Autopilot and more advanced future features, and then use disengagements, aborts, crashes, and bug reports to identify failures. These failures would then become part of the training signal.
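A toy sketch of what that data curation loop might look like, with invented field names, just to illustrate splitting fleet logs into imitation examples and failure cases:

```python
from dataclasses import dataclass

# Hypothetical log record; the field names are invented for illustration.
@dataclass
class DrivingSegment:
    perception_metadata: list    # per-frame perception outputs
    driver_actions: list         # per-frame steering/accel/brake
    reviewed_as_good: bool       # human reviewer judged the driving acceptable
    caused_disengagement: bool   # disengagement, abort, crash, or bug report

def build_training_sets(segments):
    """Split fleet logs into imitation examples and failure cases."""
    imitation_data = [s for s in segments
                      if s.reviewed_as_good and not s.caused_disengagement]
    failure_cases = [s for s in segments if s.caused_disengagement]
    # Imitation examples train the planner to copy good human driving;
    # failure cases become additional training signal for the next iteration.
    return imitation_data, failure_cases
```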

If my conjecture is correct, I can see how this would be an extremely fast way to solve path planning and/or control. I can also see how it’s an approach that Tesla is uniquely suited to pursue, given a fleet of HW2 cars that is driving something like 400 million miles a month (300,000 cars x 1,380 miles per month). Based on (admittedly scant) anecdotal evidence, each HW2 car might be uploading an average of 30 MB+ per day.
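For what it’s worth, the back-of-envelope arithmetic behind those figures (rough estimates only):

```python
# Back-of-envelope arithmetic for the figures above (rough estimates only).
cars = 300_000
miles_per_car_per_month = 1_380
mb_uploaded_per_car_per_day = 30

fleet_miles_per_month = cars * miles_per_car_per_month                    # 414,000,000
fleet_upload_tb_per_day = cars * mb_uploaded_per_car_per_day / 1_000_000  # ~9 TB
print(fleet_miles_per_month, fleet_upload_tb_per_day)
```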

I can’t help but think of Elon’s comments:

…I think no one is likely to achieve a generalized solution to self-driving before Tesla. I could be surprised, but… You know, I think we’ll get to full self-driving next year. As a generalized solution, I think. … Like we’re on track to do that next year. So I don’t know. I don’t think anyone else is on track to do it next year. … I would say, unless they’re keeping it incredibly secret, which is unlikely, I don’t think any of the car companies are likely to be a serious competitor.


#2

Behavior cloning isn’t a term I’ve seen before, but I can imagine how it would differ from end-to-end. In ETE you train a network on actual input and have it generate the actual, final output to control the system. Its advantage is that it’s very simple to implement and doesn’t require manual labeling. The goal of ETE is to make an NN that actually controls the system.

But there are many potential benefits that could come from predicting likely driver behavior that don’t involve controlling the vehicle directly. For instance, you could look at where a human tends to position a car in the lane in various situations: when there’s a barrier at the left edge of the lane, drivers sit a bit more to the right; when a truck on the right is wandering out of its lane, a driver might move a bit to the left. That prediction could be used as one of the inputs to bias lane positioning, as in the sketch below.
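A toy illustration of what I mean, with made-up numbers and names: the prediction of where a human would sit in the lane only biases the lane-keeping target, it doesn’t drive the car directly.

```python
# Toy illustration only: invented names, weights, and limits.
def target_lane_offset(lane_center_offset, predicted_human_offset,
                       confidence, max_bias_m=0.4):
    """Bias the lane-keeping target toward predicted human positioning.

    lane_center_offset:      geometric lane center (meters, + = right)
    predicted_human_offset:  where the NN predicts a human would place the car
    confidence:              0..1 confidence in that prediction
    """
    bias = confidence * (predicted_human_offset - lane_center_offset)
    bias = max(-max_bias_m, min(max_bias_m, bias))   # clamp the NN's influence
    return lane_center_offset + bias

# Barrier on the left: the prediction nudges the target a bit to the right.
print(target_lane_offset(0.0, 0.3, confidence=0.8))  # 0.24
```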


#3

I’m not sure what specific term (if any) applies to the hypothetical approach I described of using a neural network with 1) metadata outputted by perception neural networks as the input and 2) human drivers’ actuation of the steering wheel, accelerator, and brake as the desired output.

I think the term “imitation learning” or “learning from demonstration” is an umbrella term that includes behaviour cloning, but also includes other approaches I’m just hearing about for the first time:

  • inverse reinforcement learning
  • apprenticeship learning
  • max entropy inverse reinforcement learning
  • guided cost learning
  • generative adversarial imitation learning

I’m confused by the terminology since different people seem to use it differently. :thinking:

Perhaps behaviour cloning isn’t synonymous with end to end learning, but whenever I encounter the term “behaviour cloning” in an autonomous car context, it always seems to be framed as a way to directly map pixels onto actuators.

Here’s an example where “imitation learning” is defined the same way as I would define end to end learning: “The goal is to learn a function f that maps from sensor readings x_t to actions.” This is a terminological morass… :confused:

This idea is consistent with one thing that Amir wrote (my emphasis):

…the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object.

But then what he writes shortly afterward suggests a different conclusion:

But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team.

Maybe the discrepancy is between what Tesla engineers are actually working on now (“an additional factor”) vs. what they believe is ultimately possible (“directly predict the correct steering, braking, and acceleration”).


#4

There isn’t really a standardized way to refer to most of these things, because there are too many of them and they are too varied right now. There are a lot of ideas about how to generate useful input data and a lot of ideas about how to apply the output product, and relatively few combinations have broadly recognized labels. “End to end” is easy to name because it is one of the extremes, but there are many more hybrid methods which are mainly described in their specifics. As some approaches become dominant they’ll get labels. Right now we’re in exploration mode; downselection and optimization come later.


#5

I think labeling is still needed due to the underlying problem being solved. For example, training an NN’s behavior at a freeway off-ramp would be difficult, since some drivers take the ramp and some don’t.

Labeling could also speed training, since we know the car shouldn’t cross lane lines unless it is changing lanes or avoiding an obstacle. Unlabeled training data would include instances of drivers changing lanes and would thus need metadata to justify that action and gate the training. Even if all drivers used the turn signal, the NN would still need to link that one input to the behavior on its own, without a training nudge.
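A toy sketch of that gating idea (field names are invented): lane-line crossings only make it into the training set when the metadata justifies them.

```python
# Toy sketch with invented field names: keep a sample for training only if any
# lane-line crossing in it is justified by the accompanying metadata.
def keep_for_training(sample):
    if not sample["crossed_lane_line"]:
        return True   # ordinary lane keeping is always usable
    # Unjustified crossings would teach the NN that drifting over the line is OK.
    return sample["turn_signal_on"] or sample["avoiding_obstacle"]

fleet_samples = [
    {"crossed_lane_line": False, "turn_signal_on": False, "avoiding_obstacle": False},
    {"crossed_lane_line": True,  "turn_signal_on": True,  "avoiding_obstacle": False},
    {"crossed_lane_line": True,  "turn_signal_on": False, "avoiding_obstacle": False},
]
training_set = [s for s in fleet_samples if keep_for_training(s)]  # drops the last one
```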


#6

Apparently there are people out there trying to make the case for end-to-end in self-driving. Uber did a presentation at NeurIPS yesterday arguing that decomposing the NN into stages to enable more conventional engineering methods invariably reduces performance and extends development time. I expect that we’ll see systems become more end-to-end-ish over time as various obstacles to it decline.

Still, it seems too early right now to seriously try end-to-end for a real product, especially something as high-stakes as driving a car. So for the time being, labeling data is going to continue to be an important part of engineering these systems.


#7

Are you at NeurIPS? Anywhere I can find out more about the talk?


#8

Ah no - somebody I know tweeted a couple of slides from an Uber presentation.


#9

Oh, cool. Do you mind sharing the tweets of those slides? I am very curious…