Tesla AI and behaviour cloning: what’s really happening?

Edit (April 5, 2019): When I wrote this post, I was confused about the term “behaviour cloning” or “behavioural cloning”. I mistakenly thought the term was synonymous with end-to-end learning. While some examples of behavioural cloning are also examples of end-to-end learning, behavioural cloning doesn’t have to be end-to-end learning.

Behavioural cloning falls under the umbrella of imitation learning, a family of machine learning techniques wherein a neural network attempts to learn from a human demonstrator. Behavioural cloning is distinct from other forms of imitation learning in that it “treats IL as a supervised learning problem”. That is, it learns to map states to actions the same way a neural network competing in the ImageNet challenge learns to map images to labels. (The definition I quoted is from this paper.)
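
To make the supervised framing concrete, here is a minimal, purely illustrative sketch of behavioural cloning. The “state” features (lane offset, road curvature), the linear policy, and all the numbers are invented for the example; a real system would use a deep network and far richer inputs, but the training loop is ordinary supervised regression either way.

```python
# Minimal sketch of behavioural cloning as plain supervised learning.
# All features and numbers are made up for illustration: the "state" is
# (lane_offset, road_curvature) and the "action" is a steering angle.

# Demonstrations: (state, action) pairs recorded from a human driver.
demos = [
    ((0.0, 0.0), 0.0),
    ((0.5, 0.0), -0.25),   # drifted right -> steer left
    ((-0.5, 0.0), 0.25),   # drifted left -> steer right
    ((0.0, 0.2), 0.2),     # road curves -> follow the curve
]

# A linear "policy": action = w1*offset + w2*curvature + b.
w = [0.0, 0.0]
b = 0.0

def predict(state):
    return w[0] * state[0] + w[1] * state[1] + b

# Ordinary gradient descent on squared error, exactly as in any
# supervised regression problem -- nothing driving-specific here.
lr = 0.1
for _ in range(2000):
    for state, action in demos:
        err = predict(state) - action
        w[0] -= lr * err * state[0]
        w[1] -= lr * err * state[1]
        b -= lr * err

# The cloned policy now imitates the demonstrator on a familiar state.
print(round(predict((0.5, 0.0)), 2))  # close to -0.25
```

The point is that the human’s recorded action plays exactly the role a label plays in image classification.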

What follows is the original post from December 1, 2018.

When The Information recently published two articles on Tesla and autonomy, the strangest thing to come out of that reporting was this bit under the subheading “Behavior Cloning”:

Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object. Such an approach has its limits, of course: behavior cloning, as the method is sometimes called… But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.

As I understand it, when software engineers who work on self-driving cars use the term “behaviour cloning”, this means the same thing as “end to end learning”, i.e. the entire system is just one big neural network that takes sensor data as its input and outputs steering, acceleration, and braking.

What’s not made clear in the article is the difference between end to end learning and neural networks in general. If you use neural networks, but not end to end learning, that’s still a situation where humans don’t need to write code for specific scenarios.

Amnon Shashua has a really good talk on end to end learning vs. the “semantic abstraction” approach to using neural networks:

As Amnon says, if Tesla were using end to end learning, it would not need to label images. The only “labelling” that occurs is the human driver’s actions: the steering angle, accelerator pushes, and brake pedal pushes. The sensor data is the input, and the one big neural network tries to learn how to map that sensor data onto the human driver’s actions. Since we know Tesla is labelling images, we know Tesla can’t be using end to end learning. Since “end to end learning” and “behaviour cloning” are synonymous, we know Tesla can’t be using behaviour cloning.

So, what did Amir Efrati at The Information hear from his sources that led him to report that Tesla is using “behaviour cloning”? Amir writes that how humans drive is used “to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object.” What this makes me think is that perhaps Tesla is working on a neural network for path planning (or motion planning) and/or control. Perhaps a path planning neural network and/or control neural network is being trained not with sensor data as input, but with the metadata outputted by the perception neural networks. The Tesla drivers’ behaviour — steering, acceleration, brake — “labels” the metadata in the same way that, in end to end learning, the human driver’s behaviour “labels” the sensor data.

This approach would solve the combinatorial explosion problem of end to end learning (described by Amnon in the video above) by decomposing perception and action. Perception tasks and action tasks would be handled independently by separate neural networks that are trained independently.
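
A sketch of that decomposition, with every interface invented for illustration: a perception stage reduces raw sensor data to compact metadata, and a separate planning stage maps that metadata to controls. Each stage would be its own independently trained neural network; here they are stand-in functions so the data flow is concrete.

```python
# Hypothetical decomposed pipeline: perception -> metadata -> planner.
# All field names and rules here are invented stand-ins.

def perception(camera_frame):
    """Stand-in for the perception network(s): pixels -> scene metadata."""
    # A real system would run detectors/segmenters here; we pretend the
    # frame has already been reduced to a small structured description.
    return {
        "lane_offset_m": camera_frame["true_offset"],
        "lead_vehicle_dist_m": camera_frame["true_lead"],
    }

def planner(metadata):
    """Stand-in for the planning/control network: metadata -> actions.
    This is the stage that could be trained on human driving, with the
    driver's steering/brake/accelerator as the supervised labels."""
    steer = -0.5 * metadata["lane_offset_m"]            # re-centre in lane
    brake = 1.0 if metadata["lead_vehicle_dist_m"] < 10 else 0.0
    return {"steer": steer, "brake": brake}

frame = {"true_offset": 0.4, "true_lead": 8.0}
actions = planner(perception(frame))
print(actions)
```

Because the planner only ever sees the metadata, the two stages can be trained, debugged, and improved independently.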

Using human drivers’ actions as the supervisory signal/training signal for path planning and/or control actually makes sense to me (whereas end to end learning does not). What are the alternatives?

a) Use a hand-coded algorithm. While this may be effective, we have lots of examples where fluid neural networks outperform brittle hand-crafted rules.

b) Use simulation. Until recently, I didn’t appreciate how much trouble we have simulating the everyday physics of the real world. From OpenAI:

Learning methods for robotic manipulation face a dilemma. Simulated robots can easily provide enough data to train complex policies, but most manipulation problems can’t be modeled accurately enough for those policies to transfer to real robots. Even modeling what happens when two objects touch — the most basic problem in manipulation — is an active area of research with no widely accepted solution. Training directly on physical robots allows the policy to learn from real-world physics, but today’s algorithms would require years of experience to solve a problem like object reorientation.

While simulation may be a part of the development and training process for path planning and/or control, it probably can’t be the whole process.

If Tesla were to train a neural network using the behaviour of Tesla drivers — and use human review to remove examples of bad driving — then it would avoid hand-coded algorithms’ brittleness and simulations’ lack of verisimilitude. I think (but I’m not sure) it would then be possible to use reinforcement learning or supervised learning to improve on this. Tesla could put the path planning and/or control neural network into cars running Enhanced Autopilot and more advanced future features, and then use disengagements, aborts, crashes, and bug reports to identify failures. These failures would then become part of the training signal.

If my conjecture is correct, I can see how this would be an extremely fast way to solve path planning and/or control. I can also see how it’s an approach that Tesla is uniquely suited to pursue, given a fleet of HW2 cars that is driving something like 400 million miles a month (300,000 cars x 1,380 miles per month). Based on (admittedly scant) anecdotal evidence, each HW2 car might be uploading an average of 30 MB+ per day.

I can’t help but think of Elon’s comments:

…I think no one is likely to achieve a generalized solution to self-driving before Tesla. I could be surprised, but… You know, I think we’ll get to full self-driving next year. As a generalized solution, I think. … Like we’re on track to do that next year. So I don’t know. I don’t think anyone else is on track to do it next year. … I would say, unless they’re keeping it incredibly secret, which is unlikely, I don’t think any of the car companies are likely to be a serious competitor.

Behavior cloning isn’t a term I’ve seen before, but I can imagine how it would differ from end-to-end. In ETE you train a network on actual input and have it generate the actual, final output to control the system. Its advantage is that it’s very simple to implement and doesn’t require manual labeling. The goal of ETE is to make an NN that actually controls the system.

But there are many potential benefits that could come from predicting likely driver behavior that don’t involve controlling the vehicle directly. For instance, you could look at where a human tends to position a car in the lane in various situations - when there’s a barrier at the left edge of the lane it’s more to the right - when a truck on the right is wandering out of its lane a driver might move a bit to the left. That prediction could be used as one of the inputs to bias lane positioning.

I’m not sure what specific term (if any) applies to the hypothetical approach I described of using a neural network with 1) metadata outputted by perception neural networks as the input and 2) human drivers’ actuation of the steering wheel, accelerator, and brake as the desired output.

I think the term “imitation learning” or “learning from demonstration” is an umbrella term that includes behaviour cloning, but also includes other approaches I’m just hearing about for the first time:

  • inverse reinforcement learning
  • apprenticeship learning
  • max entropy inverse reinforcement learning
  • guided cost learning
  • generative adversarial imitation learning

I’m confused by the terminology since different people seem to use it differently. :thinking:

Perhaps behaviour cloning isn’t synonymous with end to end learning, but whenever I encounter the term “behaviour cloning” in an autonomous car context, it always seems to be framed as a way to directly map pixels onto actuators.

Here’s an example where “imitation learning” is defined the same way as I would define end to end learning: “The goal is to learn a function f that maps from sensor readings xt to actions.” This is a terminological morass… :confused:

This idea is consistent with one thing that Amir wrote (my emphasis):

…the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object.

But then what he writes shortly afterward suggests a different conclusion:

But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team.

Maybe the discrepancy is between what Tesla engineers are actually working on now (“an additional factor”) vs. what they believe is ultimately possible (“directly predict the correct steering, braking, and acceleration”).

There isn’t really a standardized way to refer to most of these things because they are too many and too varied right now. There are a lot of ideas about how to generate useful input data and a lot of ideas about how to apply the output product. Relatively few combinations have broadly recognized labels. “end to end” is pretty simple because it is one of the extremes, but there are many more hybrid methods which are mainly described in their specifics. As some approaches become dominant they’ll get labels. Right now we’re in exploration mode - downselection and optimization come later.

I think labeling is still needed due to the underlying problem being solved. For example, training an NN’s behavior at a freeway off-ramp would be difficult, since some drivers take the ramp and some don’t.

Labeling could also speed training since we know the car shouldn’t cross lane lines unless changing lanes or avoiding obstacles. Unlabeled training data would have instances of drivers changing lanes and would thus need metadata to justify that action and gate the training. Even if all drivers used the turn signal, the NN would still need to link that one input to the behavior on its own without a training nudge.
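
The gating idea above can be sketched in a few lines. All field names are invented for illustration; the point is simply that a lane-line crossing only enters the training set when accompanying metadata justifies it.

```python
# Sketch of gating training data with metadata. A lane-line crossing is
# acceptable training data only when the driver signalled a lane change
# or was avoiding an obstacle. All field names are hypothetical.

samples = [
    {"crossed_lane_line": False, "turn_signal": False, "obstacle": False},
    {"crossed_lane_line": True,  "turn_signal": True,  "obstacle": False},
    {"crossed_lane_line": True,  "turn_signal": False, "obstacle": True},
    {"crossed_lane_line": True,  "turn_signal": False, "obstacle": False},  # unjustified
]

def justified(s):
    return (not s["crossed_lane_line"]) or s["turn_signal"] or s["obstacle"]

train_set = [s for s in samples if justified(s)]
print(len(train_set))  # 3 of the 4 samples survive the gate
```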

Apparently there are people out there trying to make the case for end-to-end in self-driving. Uber did a presentation at NeurIPS yesterday arguing that decomposing the NN into stages to enable more conventional engineering methods invariably reduces performance and extends development time. I expect that we’ll see systems become more end-to-end-ish over time as various obstacles to it decline.

Still, it seems to be too early right now to seriously try end-to-end for a real product, especially something as high stakes as driving a car. So for the time being labeling data is going to continue to be an important part of engineering these systems.

Are you at NeurIPS? Anywhere I can find out more about the talk?

Ah no - somebody I know tweeted a couple of slides from an Uber presentation.

Oh, cool. Do you mind sharing the tweets of those slides? I am very curious…

Follow up here from Waymo engineers:

I found this article today; it appears to be somewhat related (if not, please move it), but either way it’s a very good read from the Waymo point of view.

Definitely related: this is the blog post that summarizes the paper Amir tweeted. Thank you! Thank you also @thenonconsensus for sharing Amir’s tweet.

Amir quotes his own previous tweet that Tesla is using “behaviour cloning” (or perhaps imitation learning) for path planning specifically. This helps clear up some confusion.

Amir’s tweet is still a bit confusing because end to end learning (as demonstrated in Nvidia’s BB8 prototype) means pixels to actuators. Tesla is not doing that.

Woah. Aha moment! This tidbit from the Waymo blog post:

In order to drive by imitating an expert, we created a deep recurrent neural network (RNN) named ChauffeurNet that is trained to emit a driving trajectory by observing a mid-level representation of the scene as an input. A mid-level representation does not directly use raw sensor data, thereby factoring out the perception task, and allows us to combine real and simulated data for easier transfer learning.

By using metadata a.k.a. mid-level representations, you can train action tasks (like path planning) with simulation, without worrying about tainting your neural networks that do perception tasks with synthetic sensor data.
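
One way to see why this works: once both sources are reduced to the same mid-level schema, the planner never knows whether a sample came from real sensors or from a simulator. A toy sketch, with all field names invented:

```python
# Sketch of mixing real and simulated data at the mid-level, as in the
# ChauffeurNet quote above. Field names are hypothetical placeholders.

real_sample = {"source": "fleet", "lane_offset_m": 0.3, "lead_dist_m": 22.0}
sim_sample  = {"source": "sim",   "lane_offset_m": -0.1, "lead_dist_m": 8.0}

def to_training_example(sample):
    # Strip provenance: the planner trains on the mid-level fields only,
    # so synthetic data never contaminates the perception networks.
    return (sample["lane_offset_m"], sample["lead_dist_m"])

batch = [to_training_example(s) for s in (real_sample, sim_sample)]
print(batch)  # [(0.3, 22.0), (-0.1, 8.0)]
```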

Lots of good stuff!

High-level question I’m asking myself about simulation: why can’t we do AlphaGo for path planning?

A partial answer from the blog post (my emphasis):

This work demonstrates one way of using synthetic data. Beyond our approach, extensive simulations of highly interactive or rare situations may be performed, accompanied by a tuning of the driving policy using reinforcement learning (RL). However, doing RL requires that we accurately model the real-world behavior of other agents in the environment, including other vehicles, pedestrians, and cyclists. For this reason, we focus on a purely supervised learning approach in the present work, keeping in mind that our model can be used to create naturally-behaving “smart-agents” for bootstrapping RL.

This reminds me of a paper that Oliver Cameron (CEO of Voyage) tweeted about:

In theory, Tesla could also leverage production fleet data for this purpose… :thinking:

Useful tweet thread from Oliver Cameron explaining Waymo’s paper:

(open tweet to see the rest of the thread)

My tweets, doing some back-of-the-envelope math:

An important difference between Waymo and Tesla. ChauffeurNet was trained on less than 100,000 miles of human driving (60 days * 24 hours * 65 mph = 93,600 miles). HW2 Teslas drive something like 250 million miles per month (30 miles per day * 30 days * 300,000 vehicles = 270 million).

We don’t know how many (if any!) of those ~250 million miles/month are logged and uploaded to Tesla. Anecdotal evidence suggests 30 MB+ per HW2 car per day is uploaded. If the metadata (i.e. mid-level perception network output representations) is 1 MB per mile, it could be ~100%.

Based on data from Tesla, there is a crash or crash-like event every 2.06 million miles — if we assume Autopilot is 10% of miles. That’s 121 events per 250 million miles.

There’s no reason Tesla can’t use simulation also, but there are plenty of real world perturbations to use.

Suppose Tesla can collect 10 billion miles of path planning metadata from HW2 drivers. That’s 100,000x more than ChauffeurNet.

Actually, since a more realistic estimate for ChauffeurNet is 50,000 miles (assuming an average speed of 35 mph instead of 65 mph), it’s 200,000x.
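
The back-of-the-envelope numbers from the tweets above, made explicit (same assumptions as stated there):

```python
# ChauffeurNet's training data, per the estimates in the tweets:
chauffeurnet_miles_fast = 60 * 24 * 65   # 60 days of driving at 65 mph
chauffeurnet_miles_slow = 60 * 24 * 35   # more realistic 35 mph average
print(chauffeurnet_miles_fast)           # 93,600
print(chauffeurnet_miles_slow)           # ~50,000

# Tesla HW2 fleet miles per month:
fleet_miles = 30 * 30 * 300_000          # 30 mi/day * 30 days * 300k cars
print(fleet_miles)                       # 270,000,000

# Crash-like events, assuming one per 2.06 million miles:
print(round(250e6 / 2.06e6))             # ~121 per 250M miles

# Scale advantage if Tesla collected 10 billion miles of metadata:
print(round(10e9 / chauffeurnet_miles_slow))  # roughly 200,000x
```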

Caveat: Tesla has to solve perception before the metadata will be fully reliable.

ChauffeurNet uses supervised learning. I wonder if reinforcement learning could be used at some point.

Waymo proposes this idea in their blog post:

we focus on a purely supervised learning approach in the present work, keeping in mind that our model can be used to create naturally-behaving “smart-agents” for bootstrapping RL.

Suppose Tesla uses a ChauffeurNet-like approach to simulating how Tesla drivers drive — without filtering out or training against all the bad stuff that human drivers actually do. The idea here is to get a realistic simulation of how humans drive, good and bad. Tesla populates its simulator with Tesla drivers. The ego car (i.e. the car Tesla wants to train to be superhuman) then drives around this simulated world filled with synthetic Tesla drivers. It uses reinforcement learning to minimize its rate of crashes and near-crashes.

This is an AlphaGo-ish approach. First, use supervised learning to copy how humans behave. Second, use reinforcement learning and self-play (i.e. simulation) to improve on that.
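
A toy illustration of that two-phase recipe, with a deliberately tiny invented world (states, actions, rewards, and the “human” are all made up): phase one clones a human policy by supervised majority vote, including a bad habit, and phase two uses a simulator’s reward signal to improve on it.

```python
import random

# Phase 1: clone a "human" policy (supervised). Phase 2: improve it
# against a simulator (RL-style). Everything here is a toy invention.

random.seed(0)
STATES = [-2, -1, 0, 1, 2]          # lane-offset buckets; 0 = centred
ACTIONS = [-1, 0, 1]                # steer left / straight / right

def human(s):
    # The human corrects toward centre, but has a bad habit: when
    # already centred, they drift right 80% of the time.
    if s > 0:
        return -1
    if s < 0:
        return 1
    return 1 if random.random() < 0.8 else 0

# Phase 1: behavioural cloning by majority vote per state.
counts = {s: {a: 0 for a in ACTIONS} for s in STATES}
for _ in range(1000):
    s = random.choice(STATES)
    counts[s][human(s)] += 1
cloned = {s: max(ACTIONS, key=lambda a: counts[s][a]) for s in STATES}

# Phase 2: improvement against a simulator. The reward penalizes being
# off-centre, so the drift habit scores worse than staying put.
def reward(s, a):
    s2 = max(-2, min(2, s + a))
    return -abs(s2)

improved = {s: max(ACTIONS, key=lambda a: reward(s, a)) for s in STATES}

print(cloned[0], improved[0])  # cloning keeps the drift habit; RL removes it
```

Supervised learning faithfully copies the demonstrator, bad habits included; the reward signal is what lets the policy become better than its teacher.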

In the case of Tesla’s driving AI, an intermediate step (before reinforcement learning) would be to do what Waymo did with ChauffeurNet and use supervised learning to train against all the labelled examples of crashes, near-crashes, or other undesirable perturbations.

Let me propose, then, a possible Tesla Master Plan to master path planning:

  1. Solve perception.

  2. Collect 10 billion miles of path planning data from HW2 cars to learn how human drivers do path planning. (It’s possible this data could also be collected about surrounding vehicles, not just the Teslas themselves.)

  3. Use supervised learning to, like Waymo did with ChauffeurNet, train against examples of bad driving scenarios.

  4. Populate a simulated world with naturally behaving synthetic human drivers. Use reinforcement learning to improve path planning over many billions or even trillions of miles of simulated driving.

  5. Surpass human performance.

Interesting tweet about the Waymo paper from a deep learning engineer:

I think step 5 needs to be GOTO 1.

One of the strongest uses I see for something like ChauffeurNet isn’t necessarily driving; it’s seeing when ChauffeurNet fails. Inevitably the net will fail, and you can start to bin failures into categories. Some of those categories are solvable through further training, but some will require a return to the fundamentals (perception): the telltale is when the expert driver is reacting to some detail in the real world that doesn’t exist in the mid-level data set. For instance, if drivers are reacting to blinkers, you need to Solve Perception in regard to adding blinker metadata for every vehicle. If a driver sometimes departs the roadway to go around a stopped vehicle but sometimes doesn’t, you have a good data set of “departing roadway” examples to start adding metadata for road surface type: “dirt, gravel, requires human intervention (uneven terrain with rocks)”.

And of course there will need to be ‘divine’ intervention where commandments are handed down from on high like “Thou shalt not back down the shoulder to take an exit you missed, no matter how much time it saves you.”

Woah. This feels like a very deep insight: we don’t know a priori what self-driving cars need to perceive.

If this sounds counterintuitive to anyone, think about this: we don’t know how humans drive. We just do it. What we think we know about how humans drive — beyond the explicit knowledge we learn from driver’s ed — is mostly a posthoc reconstruction of our implicit knowledge. For all we know, we might be wrong in many parts of that reconstruction.

Or consider that, in general, neural networks are good at doing things that we have no idea how to tell them to do. We assume — or I assume — that we know how to tell a robotic system to drive. But why? Maybe we don’t know how to tell a robot to drive any more than we know how to tell a robot to walk, or to see. Maybe driving involves an array of subtasks that are cognitively impenetrable and opaque to introspection.

im.thatoneguy, I don’t know who you are or what your background is, but it seems like you have really good instincts because you proposed months ago that Tesla could just upload mid-level representations instead of sensor data. When I said above:

I think it was your post on TMC that had planted the seed in my mind. It’s pretty cool that your hunch has turned into a Waymo research paper and some reporting that suggests Tesla might actually be trying this approach.

What you said about using path planning failures to notice perception failures jibes with what Karpathy said in this talk about Tesla’s “data engine”:

Perhaps the development process is a loop. Get far enough with perception to deploy a path planning feature (e.g. Navigate on Autopilot), then notice failures with that feature and identify them as either failures in perception or path planning, and then go back and work on perception some more or work on path planning some more. At the same time, keep working on new perception features (e.g. stop sign recognition) to enable new path planning features (e.g. automatic stopping for stop signs). Repeat the loop with those features.
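
The loop described above can be sketched as a triage pipeline. Every function and field here is a hypothetical stand-in; in reality the triage step is human review of disengagements, aborts, crashes, and bug reports.

```python
# Sketch of the "data engine" loop: deploy a feature, triage field
# failures into perception vs. planning buckets, and route each bucket
# back into training for the matching subsystem. All names invented.

failures = [
    {"id": 1, "cause": "perception"},   # e.g. missed a stop sign
    {"id": 2, "cause": "planning"},     # e.g. awkward lane-change path
    {"id": 3, "cause": "perception"},
]

def triage(failure):
    # Stand-in for human review deciding which subsystem failed.
    return failure["cause"]

buckets = {"perception": [], "planning": []}
for f in failures:
    buckets[triage(f)].append(f)

# Each bucket becomes new training signal for its subsystem, and the
# loop repeats with the retrained networks redeployed to the fleet.
print(len(buckets["perception"]), len(buckets["planning"]))  # 2 1
```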

I think the way I have been thinking about autonomous car development may be wrong because I have been thinking that we know what we need to solve. We know what all the parts of the problem are, we can solve those parts independently, and when we put all the parts together, that will be a complete solution. But this overlooks the fact that we have no idea why features will fail. The behaviour of the overall system is emergent from complex interactions within the system and with the environment, and it’s often unexpected.

Neural networks are black boxes, and even hand-coded software, which is in theory transparent and deterministic, often fails in ways we don’t expect.

If you try to build something without testing it in wild and varied conditions as quickly as possible, you run the risk that your posthoc reconstruction of what needs to be solved will diverge more and more over time from what actually needs to be solved.

My mental model has largely been “feed neural networks lots and lots of data and eventually they might solve the problem”. But this implies you already know a priori the problem that needs to be solved. And that knowledge of what needs to be solved comes from a posthoc reconstruction which is fallible. You need to test your whole system in the wild as early as possible to narrow the gap between your posthoc reconstruction and real driving.

To use an analogy, it won’t do to move closer and closer to hitting a target. You also have to keep checking whether that’s the right target to hit. You can’t just keep making progress on solving a problem. You have to make sure that’s the right problem to solve.

This is a made-up example just to illustrate the point. I can’t think of a real example, and I think the point I’m making is that real examples are hard to think of because they’re gaps between our explicit knowledge via posthoc reconstruction and how humans really drive using implicit knowledge.

Say that figuring out speed limits was a really hard problem for self-driving car engineers. And say that engineers thought this was a vital problem to solve because human drivers follow speed limits.

But say that, in reality, it turned out that human drivers completely ignore speed limits and just follow the natural flow of traffic, which emerges organically. (There might be a grain of truth in this; it’s inspired by a theory I read but only half-remember and can’t find now. I think some people argue it’s safer to increase speed limits because driving is safest when the traffic flows at an organic speed.)

You wouldn’t notice that until you deployed your self-driving car and found that it was getting into trouble because it was going a different speed than all the other vehicles (either driving too fast or too slow). You would be operating on a false theory about how driving is done, and you might put a lot of work into developing a solution to the speed limit problem before finally deploying and realizing that you solved the wrong problem. Not only is the solution you built unnecessary, it’s also insufficient.

To get a self-driving car working in the real world, you need to solve it feature by feature, and test the smallest possible features (atomic features?) as quickly as possible in the real world with the whole system running. If you don’t, you might solve problems that don’t need to be solved (like detecting speed limits, in the made-up example), and you might not solve problems that need to be solved (like how to follow the flow of traffic).

This is a whole new way of thinking for me that I’m not used to. I will have to think about this more and revisit some of my old assumptions.

It’s a super exciting conceptual revelation. What’s particularly interesting to me here on a meta level is that you can derive an engineering approach from epistemology, i.e. thinking carefully about what you know and how you know it, about how human knowledge is created (especially with regard to complex systems), what humans can and can’t know in different contexts (e.g. you can’t predict the discovery of a failure mode without making that discovery), and the difference between human competence and human comprehension (implicit knowledge and explicit knowledge).

Epistemology, either explicit or implicit (or a combination of both), is arguably behind the success of science and engineering as approaches and cultures of solving problems. I’m always excited when really abstract, dreamy concepts unexpectedly collide with nitty gritty technical concepts. It’s a reminder that thinking dreamy thoughts isn’t a waste of time and actually impacts the physical world in big ways.