The analogy between DeepMind’s AlphaStar and Tesla’s Full Self-Driving

#1

Here’s my hunch:

The rate of progress on Tesla’s Full Self-Driving Capability software will depend on whether — particularly after Hardware 3 launches — Tesla can use its training fleet of hundreds of thousands of HW3 cars to do for autonomous driving what DeepMind’s AlphaStar did for StarCraft II. That is, use imitation learning on state-action pairs from real-world human driving, then augment in simulation with reinforcement learning.

Imitation learning and reinforcement learning

A state-action pair in this context is everything the perception neural networks perceive (the state) and the actions taken by the driver, like steering, accelerating, braking, and signalling (the action). Similar to the way a human annotator labels an image with the correct category (e.g. stop sign, pedestrian, vehicle), the human driver “labels” a set of neural network perceptions with the correct action (e.g. brake, turn, accelerate, slow down).
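To make that concrete, here is a minimal sketch of what one logged state-action pair might look like. Every field name is my own illustrative assumption, not Tesla’s actual data schema:

```python
# A minimal sketch of a state-action pair as it might be logged for
# behavioural cloning. Every field name is an illustrative assumption,
# not Tesla's actual data schema.
from dataclasses import dataclass
from typing import List

@dataclass
class State:
    # what the perception neural networks report at one timestep
    detected_objects: List[dict]   # e.g. {"class": "pedestrian", "distance_m": 12.3}
    lane_geometry: List[float]     # e.g. polynomial coefficients of the lane lines
    traffic_light_state: str       # e.g. "red", "green", "unknown"
    ego_speed_mps: float

@dataclass
class Action:
    # what the human driver actually did at (or just after) that timestep
    steering_angle_rad: float
    accelerator: float             # 0..1 pedal position
    brake: float                   # 0..1 pedal position
    turn_signal: str               # "left", "right", "none"

# One training example: the driver's action serves as the label for the state.
example = (
    State(
        detected_objects=[{"class": "stop sign", "distance_m": 30.0}],
        lane_geometry=[0.0, 0.0, 0.01],
        traffic_light_state="unknown",
        ego_speed_mps=13.4,
    ),
    Action(steering_angle_rad=0.0, accelerator=0.0, brake=0.4, turn_signal="none"),
)
```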

This form of imitation learning is just the deep supervised learning we all know and love. Another name for it is behavioural cloning. A report by Amir Efrati in The Information cited one or more unnamed sources claiming that Tesla is doing this:

Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object. Such an approach has its limits, of course: behavior cloning, as the method is sometimes called…

But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.
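To make the “deep supervised learning we all know and love” point concrete, here is a minimal sketch of behavioural cloning with random stand-in tensors. It shows the general shape of the idea, not Tesla’s actual pipeline:

```python
# A minimal sketch of behavioural cloning as ordinary supervised learning:
# a network maps a perception state vector to the driver's control outputs.
# The data here is random noise standing in for real logged driving.
import torch
import torch.nn as nn

# 1,024 logged examples: 64-dim perception state -> 3 controls (steer, accel, brake)
states = torch.randn(1024, 64)
driver_actions = torch.randn(1024, 3)

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    predicted_actions = policy(states)
    loss = loss_fn(predicted_actions, driver_actions)  # the "label" is what the human did
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```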

There are other forms of imitation learning as well, such as inverse reinforcement learning. Pieter Abbeel, an expert in imitation learning and reinforcement learning, has expressed support for the idea of using inverse reinforcement learning for autonomous cars. Drago Anguelov, head of research at Waymo, says that Waymo uses inverse reinforcement learning for trajectory optimization. But from what I understand, Waymo uses supervised learning rather than inverse reinforcement learning in cases where it has more data on human driving behaviour.

Anguelov’s perspective is super interesting. In his talk for Lex Fridman’s MIT course, he used this diagram to represent machine learning replacing more and more hand coding in Waymo’s software:

[image: Anguelov’s diagram of machine learning replacing hand-coded components in Waymo’s software]

This diagram is strikingly similar to the one Andrej Karpathy used to visualize Tesla’s transition from Software 1.0 code (traditional hand coding) to Software 2.0 code (neural networks).

Anguelov’s talk is the most detailed explanation I’ve seen of what Waymo is doing.

As many people already know, reinforcement learning is essentially trial and error for AI. In theory, a company working on autonomous driving could do reinforcement learning in simulation from scratch. Mobileye is doing this. One of the problems is that a self-driving car has to learn how to respond appropriately to human driving behaviour. Vehicles in a simulation don’t necessarily reflect real human driving behaviour. They might be following a simple algorithm, like the cars in a video game like Grand Theft Auto V.
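To show what “trial and error” means mechanically, here is a toy tabular Q-learning example in a made-up one-dimensional lane-keeping world. It has nothing to do with Mobileye’s actual system; it just illustrates an agent learning a driving behaviour from scratch, with no human data at all:

```python
# Toy reinforcement learning from scratch: the agent learns to steer back to
# the lane centre purely by trial and error in a made-up 1-D simulator.
import random

NUM_POSITIONS = 7          # discretised lateral positions; index 3 is the lane centre
ACTIONS = [-1, 0, +1]      # steer left, hold, steer right

def step(pos, action):
    new_pos = max(0, min(NUM_POSITIONS - 1, pos + action))
    reward = -abs(new_pos - 3)          # best reward at the lane centre
    return new_pos, reward

# Tabular Q-learning: the agent starts with zero knowledge of the lane.
q = {(s, a): 0.0 for s in range(NUM_POSITIONS) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(2000):
    pos = random.randrange(NUM_POSITIONS)
    for _ in range(20):
        if random.random() < epsilon:
            action = random.choice(ACTIONS)                    # explore (trial...)
        else:
            action = max(ACTIONS, key=lambda a: q[(pos, a)])   # exploit what it learned
        new_pos, reward = step(pos, action)                    # (...and error)
        best_next = max(q[(new_pos, a)] for a in ACTIONS)
        q[(pos, action)] += alpha * (reward + gamma * best_next - q[(pos, action)])
        pos = new_pos

# After training, the greedy policy steers back toward the centre from any offset.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(NUM_POSITIONS)])
```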

If Mobileye’s approach works, why wouldn’t Waymo collaborate with DeepMind, its sister company under Alphabet, and solve autonomous driving with reinforcement learning? On this very topic, Oriol Vinyals, one of the creators of AlphaStar, said:

Driving a car is harder. The lack of (perfect) simulators doesn’t allow training for as much time as would be needed for Deep RL to really shine.

Reinforcement learning from scratch worked for OpenAI Five on Dota 2. Surprisingly, OpenAI Five converged on many tactics and strategies used by human players in Dota 2, simply by playing against versions of itself. So, who knows, maybe Mobileye will be vindicated.

Perhaps one key difference between Dota 2 and driving is that there are driving laws and cultural norms. In Dota 2, everything that is possible to do in the game is allowed, and players are constantly looking for whatever play styles will lead to more victories. Driving, unlike Dota 2, is a coordination problem. Some of the rules are arbitrary and not discoverable through reinforcement learning alone. To use a toy example, with no prior knowledge a virtual agent might learn to drive on the right side of the road, or the left side of the road. It would have no way of guessing the arbitrary rule in the country it’s going to be deployed in.

This is solvable because you can, for example, penalize the agent for driving on the wrong side of the road. Human engineers essentially hand code the knowledge into the agent via the reward function (i.e. the points system). But what if there are more subtle norms and rules that human drivers follow? Can an agent learn all of them with no knowledge of human behaviour? Maybe, maybe not.
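As a toy sketch of what “hand code the knowledge into the agent via the reward function” could look like, here is a made-up reward function that bakes in the arbitrary drive-on-the-right convention. The signal names are assumptions for illustration, not any real simulator’s API:

```python
# A made-up reward function for a simulated driving agent. The arbitrary local
# convention (drive on the right) is hand-coded as a penalty, since the agent
# could never discover it by trial and error alone.
def driving_reward(progress_m, lateral_offset_m, side_of_road, collided):
    reward = progress_m                      # encourage making progress along the route
    reward -= 0.1 * abs(lateral_offset_m)    # encourage staying near the lane centre
    if side_of_road == "left":               # the arbitrary rule for this country:
        reward -= 10.0                       # penalise driving on the wrong side
    if collided:
        reward -= 100.0                      # heavily penalise crashes
    return reward

print(driving_reward(progress_m=5.0, lateral_offset_m=0.2, side_of_road="left", collided=False))
```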

Imitation learning can be used to create so-called “smart agents” that learn to drive based on human behaviour. These agents can be used in a simulation, and reinforcement learning can occur in that simulation. In theory, this simulation would be a much better model of real-world driving than one populated by agents that started from scratch and only ever drove against versions of themselves. If imitation learning is successful in copying human behaviours, then in theory what is learned through reinforcement learning in simulation could actually transfer to the real world.

AlphaStar and Full Self-Driving

With imitation learning alone, AlphaStar achieved a high level of performance. DeepMind estimates it was equivalent to a human player in the Gold or Platinum league in StarCraft II, roughly the middle of the ranked ladder. So AlphaStar may have achieved roughly median human performance just with imitation learning. When AlphaStar was augmented with population-based, multi-agent reinforcement learning — a tournament style of self-play called the AlphaStar league — it reached the level of professional StarCraft II players.

AlphaStar took about 3 years of development, with little to no publicly revealed progress. The version of AlphaStar that beat MaNa — one of the world’s top professional StarCraft II players — was trained with imitation learning for 3 days and with reinforcement learning for 14 days (on a compute budget estimated at around $4 million), for a total of 17 days of training.

In June, Andrej Karpathy will have been at Tesla for 2 years. He joined as Director of AI in June 2017. Since at least around that time (perhaps earlier, I don’t know), Tesla has been looking for Autopilot AI interns with expertise in (among other things) reinforcement learning. Karpathy himself spent a summer as an intern at DeepMind working on reinforcement learning. He also worked on reinforcement learning at OpenAI.

The internship job postings also mention working with “enormous quantities of lightly labelled data”. I can think of at least two interpretations:

  1. State-action pairs for supervised learning (i.e. imitation learning) of path planning and driving policy (i.e. how to drive).

  2. Sensor data weakly labelled by driver input (e.g. image of traffic light labelled as red by driver braking) for weakly supervised learning of computer vision tasks. (An example of weakly supervised learning is Facebook training a neural network on Instagram images using hashtags as labels.)
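As a sketch of interpretation 2, here is a hypothetical heuristic that weakly labels camera frames from driver input alone. The thresholds and field names are made up; the point is just that the driver’s behaviour, not a human annotator, supplies the (noisy) label:

```python
# A hypothetical weak-labelling heuristic: guess a traffic-light label for a
# camera frame from what the driver did, instead of from human annotation.
def weak_traffic_light_label(frame):
    approaching_light = frame["distance_to_intersection_m"] < 50.0
    if not approaching_light:
        return None                       # no usable weak label for this frame
    if frame["brake"] > 0.2 and frame["ego_speed_mps"] < 2.0:
        return "red"                      # the driver stopped at the light
    if frame["accelerator"] > 0.1 and frame["ego_speed_mps"] > 5.0:
        return "green"                    # the driver rolled through without braking
    return None                           # ambiguous: discard rather than mislabel

# Frames with a non-None label could then train a traffic-light classifier with
# ordinary supervised learning, accepting some label noise.
print(weak_traffic_light_label({"distance_to_intersection_m": 20.0,
                                "brake": 0.5, "ego_speed_mps": 0.0,
                                "accelerator": 0.0}))   # -> "red"
```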

Tesla’s Full Self-Driving is different from AlphaStar in that Tesla has a plan to roll out features to customers incrementally, so progress is a lot more publicly visible. We didn’t get to see the agents that DeepMind trained, say, 6 months ago, so we don’t really know how fast the agents went from completely incompetent to pro-level. What’s cool and interesting, though, is that Demis Hassabis (the CEO of DeepMind) seemed totally surprised after AlphaStar beat MaNa.

I don’t think I would be super surprised if, 3 years from now, Tesla is way behind schedule and progress has been plodding and incremental. I would be amazed, but not necessarily taken totally off guard, if 3 years from now Tesla’s FSD is at an AlphaStar-like level of performance on fully autonomous (unsupervised) driving.

We can’t predict how untried machine learning projects will turn out. That’s why researchers publish surprising results—we wouldn’t be surprised if we could predict what would happen in advance. The best I can do in my lil’ brain is draw analogies from completed projects like AlphaStar to what Tesla is doing (or might be doing), and then try to identify what relevant differences might change the outcome in Tesla’s case.

Some differences that come to mind:

  • perfect perception in a virtual environment vs. imperfect perception in a real world environment
  • optimizing for one long-term goal (winning the game), which tolerates individual mistakes, vs. a task where a single error could lead to a crash
  • no need to communicate with humans vs. some need to communicate with humans
  • self-play with well-defined conditions for victory and defeat vs. no inherent win/loss conditions in driving, although maybe you could design a driving game to enable self-play

What are other important differences? What are other reasons this approach might not work?

#2

What are other reasons this approach might not work?

More need for understanding human behavior. In order to predict how people will behave, it really, really helps to understand why they do what they do.

There is a classic psychology test where a test participant puts an object in a box while being watched by an observer. That participant then leaves the room. A third party enters the room, opens the box, removes the object, and places it somewhere else. The observer is then asked where the first person will look for the object.

Young children and primates will say the first person will look where the object was moved to. Once children reach a certain level of development, they are able to understand that the first person has knowledge different from their own.

Lots of driving involves creating little stories in your head for what other drivers are doing. “They are driving under the speed limit. They are probably looking for a parking space. If they stop in front of a parking space they will want to back up into it and I should watch out for their nose swinging out when I pass if there is room to pass.”

Or “That person is walking toward the crosswalk, but they’re looking around a lot and talking on the phone. They probably won’t cross, as they are looking to meet up with someone.”

If you can’t put yourself into their head and understand why they are behaving the way they are, then lots of very, very similar situations appear to have completely random resolutions and are incredibly difficult to predict. E.g. an Uber driver and someone looking for parking drive similarly but have different outcomes: stopped in the street vs. backing into a parking space.

And that’s on top of general understanding. We can perform basic physics calculations in our heads. If we see an overloaded pickup that’s poorly secured, we can see that the load is catching the wind and starting to lean. That’s thanks to our generalized understanding of how the world works. You’ll never catch a similar situation in your training data, and who knows what clues a network would find to predict “why” that load fell over but others didn’t.

#3

Right, so, do self-driving cars need theory of mind?

Maybe… But I wonder if this can be taken care of with a simple mapping (via supervised learning) of a road user’s past few seconds of behaviour to their likely next few seconds of behaviour. Can we predict behaviour on a low level using deep learning, as for instance Waymo is trying to do?
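As a minimal sketch of that low-level framing, the example below fits a supervised predictor that maps a road user’s past two seconds of positions to their next two seconds. The data is synthetic and the model is plain linear least squares standing in for a deep network; it only illustrates the past-behaviour-in, future-behaviour-out setup:

```python
# Low-level behaviour prediction as supervised learning: past 2 s of (x, y)
# positions in, next 2 s of (x, y) positions out. Synthetic data stands in
# for real logged trajectories of other road users.
import numpy as np

rng = np.random.default_rng(0)

def make_trajectory(num_steps=40, dt=0.1):
    """Toy noisy constant-velocity trajectory of (x, y) positions."""
    velocity = rng.uniform(-10, 10, size=2)
    start = rng.uniform(-50, 50, size=2)
    t = np.arange(num_steps)[:, None] * dt
    return start + velocity * t + rng.normal(0, 0.1, size=(num_steps, 2))

# Build (past, future) pairs: 20 past steps in, 20 future steps out.
past, future = [], []
for _ in range(1000):
    traj = make_trajectory()
    past.append(traj[:20].ravel())
    future.append(traj[20:].ravel())
X, Y = np.array(past), np.array(future)

# Fit a linear predictor by least squares; a real system would use a deep net,
# but the supervised framing is the same.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
predicted = X @ W
print("mean coordinate error (m):", float(np.mean(np.abs(predicted - Y))))
```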

A related question: without predicting the future behaviour of other road users, can we determine the optimal response simply based on their past behaviour? For example, can we copy humans’ response to past behaviour via imitation learning? Can we learn the optimal response via reinforcement learning?

#4

I saw @strangecosmos’ post on SeekingAlpha discussing using imitation learning and state-action pairs to get to median human driving. Some questions that I had coming out of that:

  • how much do we actually know about that statement from DeepMind that the agents trained using imitation learning were actually around Gold/Platinum level? They mentioned it was a bit difficult to measure. I’m asking because I wonder if they truly are at that level; their statements didn’t really inspire much confidence if we’re talking about using imitation learning to drive big hunks of metal that can kill people.
  • I wonder to what extent it is a good idea to have something that is trained primarily off of humans. Wouldn’t imitation learning also pick up some of the bad behaviors humans have? I don’t know enough about imitation learning/ML, so I might be missing some commonly known techniques that are used to weed this sort of stuff out. Don’t self-driving cars have to be maybe at least 2x safer than a human driver in order for society to more broadly accept them?
  • I suppose the next thought would then be to put these human agents into a simulation and use reinforcement learning to make them superhuman at driving, as @strangecosmos has talked about. But I can’t help but wonder if simulation is really going to be good enough to cover all these different scenarios, even with human agents that can mimic different kinds of driving. It seems to my ignorant mind that the human-like agents would mostly just drive normally in the simulation and not create any edge case scenarios. Don’t the edge case scenarios have to then be explicitly programmed into the simulation? And if so, what good is reinforcement learning? Isn’t this sort of like a long roundabout way of just getting to the explicit programming that Waymo does?

I have to put a disclaimer at the end of my posts since my knowledge is admittedly extremely high level. Sorry if these are silly questions.

#5

You’re in good company!