High-level question I’m asking myself about simulation: why can’t we do AlphaGo for path planning?
A partial answer from the blog post (my emphasis):
This work demonstrates one way of using synthetic data. Beyond our approach, extensive simulations of highly interactive or rare situations may be performed, accompanied by a tuning of the driving policy using reinforcement learning (RL). However, doing RL requires that we accurately model the real-world behavior of other agents in the environment, including other vehicles, pedestrians, and cyclists. For this reason, we focus on a purely supervised learning approach in the present work, keeping in mind that our model can be used to create naturally-behaving “smart-agents” for bootstrapping RL.
This reminds me of a paper that Oliver Cameron (CEO of Voyage) tweeted about:
In theory, Tesla could also leverage production fleet data for this purpose…
An important difference between Waymo and Tesla. ChauffeurNet was trained on less than 100,000 miles of human driving (60 days * 24 hours * 65 mph = 93,600 miles). HW2 Teslas drive something like 250 million miles per month (30 miles per day * 30 days * 300,000 vehicles = 270 million).
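As a quick sanity check on the arithmetic above, here is a small Python sketch. The variable names are mine, and the inputs are the same rough guesses used in the post, not measured figures.

```python
# Back-of-envelope mileage estimates; every input is a rough assumption from the post.
chauffeurnet_miles = 60 * 24 * 65           # 60 days of driving, 24 h/day, at 65 mph
tesla_miles_per_month = 30 * 30 * 300_000   # 30 miles/day, 30 days, 300,000 HW2 cars

print(chauffeurnet_miles)      # 93600
print(tesla_miles_per_month)   # 270000000
```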
We don’t know how many (if any!) of those ~250 million miles/month are logged and uploaded to Tesla. Anecdotal evidence suggests 30 MB+ per HW2 car per day is uploaded. If the metadata (i.e. mid-level perception network output representations) is 1 MB per mile, it could be ~100%.
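To make the "~100%" figure explicit, here is the same estimate as a few lines of Python. All three inputs are guesses (the anecdotal upload volume, the assumed daily mileage from earlier, and the hypothetical 1 MB per mile), so the result is only as good as those.

```python
# Rough upload-coverage estimate; all three inputs are guesses, not Tesla figures.
upload_mb_per_day = 30      # anecdotal upload volume per HW2 car
miles_per_day = 30          # assumed average daily driving per car
metadata_mb_per_mile = 1    # hypothetical size of mid-level representations

coverage = upload_mb_per_day / (miles_per_day * metadata_mb_per_mile)
print(f"{coverage:.0%}")    # 100%
```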
Based on data from Tesla, there is a crash or crash-like event every 2.06 million miles — if we assume Autopilot is 10% of miles. That’s 121 events per 250 million miles.
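The event count follows directly from dividing the two figures. A minimal check, taking the 2.06 million miles per event as given:

```python
# Events per month, assuming one crash or crash-like event every 2.06 million miles.
miles_per_event = 2.06e6
fleet_miles_per_month = 250e6

events_per_month = fleet_miles_per_month / miles_per_event
print(round(events_per_month))   # 121
```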
There’s no reason Tesla can’t use simulation also, but there are plenty of real world perturbations to use.
Suppose Tesla can collect 10 billion miles of path planning metadata from HW2 drivers. That’s 100,000x more than ChauffeurNet.
Actually, since a more realistic estimate for ChauffeurNet is 50,000 miles (assuming an average speed of 35 mph instead of 65 mph), it’s 200,000x.
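For anyone who wants to see where the rounded 100,000x and 200,000x come from, here is the unrounded version. The 10 billion miles is a supposition, and both ChauffeurNet figures assume 60 days of round-the-clock driving.

```python
# Ratio of hypothetical Tesla data to ChauffeurNet's training data.
tesla_miles = 10e9                     # supposed fleet data, not a real figure
chauffeurnet_at_65mph = 60 * 24 * 65   # 93,600 miles
chauffeurnet_at_35mph = 60 * 24 * 35   # 50,400 miles

print(round(tesla_miles / chauffeurnet_at_65mph))   # 106838
print(round(tesla_miles / chauffeurnet_at_35mph))   # 198413
```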
Caveat: Tesla has to solve perception before the metadata will be fully reliable.
we focus on a purely supervised learning approach in the present work, keeping in mind that our model can be used to create naturally-behaving “smart-agents” for bootstrapping RL.
Suppose Tesla uses a ChauffeurNet-like approach to simulating how Tesla drivers drive — without filtering out or training against all the bad stuff that human drivers actually do. The idea here is to get a realistic simulation of how humans drive, good and bad. Tesla populates its simulator with Tesla drivers. The ego car (i.e. the car Tesla wants to train to be superhuman) then drives around this simulated world filled with synthetic Tesla drivers. It uses reinforcement learning to minimize its rate of crashes and near-crashes.
This is an AlphaGo-ish approach. First, use supervised learning to copy how humans behave. Second, use reinforcement learning and self-play (i.e. simulation) to improve on that.
In the case of Tesla’s driving AI, an intermediate step (before reinforcement learning) would be to do what Waymo did with ChauffeurNet and use supervised learning to train against all the labelled examples of crashes, near-crashes, or other undesirable perturbations.
Let me propose, then, a possible Tesla Master Plan to master path planning:
Collect 10 billion miles of path planning data from HW2 cars to learn how human drivers do path planning. (It’s possible this data could also be collected about surrounding vehicles, not just the Teslas themselves.)
Use supervised learning to, like Waymo did with ChauffeurNet, train against examples of bad driving scenarios.
Populate a simulated world with naturally behaving synthetic human drivers. Use reinforcement learning to improve path planning over many billions or even trillions of miles of simulated driving.
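To make the shape of that plan concrete, here is a deliberately tiny Python sketch. It is only a toy: the driving "styles", the logged counts, the crash probabilities and the time costs are all made up, and the improvement step is a one-shot Monte Carlo comparison of actions rather than real reinforcement learning. The point is only the order of operations: clone human behaviour, fill the simulator with the clones, then search for something better than the clones.

```python
import random
from collections import Counter

STYLES = ["aggressive", "typical", "cautious"]

# Made-up stand-in for logged human driving data.
human_log = ["aggressive"] * 20 + ["typical"] * 70 + ["cautious"] * 10

def clone_policy(log):
    """Supervised step: estimate how often humans pick each style."""
    counts = Counter(log)
    return {s: counts[s] / len(log) for s in STYLES}

# Hypothetical simulator: crash probability for (ego style, other driver's style).
CRASH_PROB = {
    ("aggressive", "aggressive"): 0.30, ("aggressive", "typical"): 0.10,
    ("aggressive", "cautious"): 0.05,
    ("typical", "aggressive"): 0.10, ("typical", "typical"): 0.03,
    ("typical", "cautious"): 0.01,
    ("cautious", "aggressive"): 0.04, ("cautious", "typical"): 0.01,
    ("cautious", "cautious"): 0.005,
}
TIME_COST = {"aggressive": 0.0, "typical": 1.0, "cautious": 2.0}

def rollout(ego, human_policy, rng):
    """One simulated encounter with a synthetic driver sampled from the clone."""
    weights = [human_policy[s] for s in STYLES]
    other = rng.choices(STYLES, weights=weights)[0]
    crashed = rng.random() < CRASH_PROB[(ego, other)]
    return -(100.0 if crashed else 0.0) - TIME_COST[ego]

def improve(human_policy, episodes=20000, seed=0):
    """Improvement step: estimate each style's average reward, keep the best."""
    rng = random.Random(seed)
    value = {}
    for ego in STYLES:
        total = sum(rollout(ego, human_policy, rng) for _ in range(episodes))
        value[ego] = total / episodes
    return max(value, key=value.get), value

policy = clone_policy(human_log)
most_human = max(policy, key=policy.get)
best_style, value = improve(policy)
print(most_human, best_style)
```

With these invented numbers the cloned drivers mostly drive in the "typical" style, while the ego car ends up preferring a different style once crashes are penalized. A real system would replace the lookup table with an actual simulator and the comparison with a proper RL algorithm.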
One of the strongest uses I see for something like ChauffeurNet isn’t necessarily driving; it’s seeing when ChauffeurNet fails. Inevitably the Net will fail, and you can start to bin failures into categories. Some of those categories are solvable through further training, but some will require a return to the fundamentals (perception), namely when the expert driver is reacting to some detail in the real world that doesn’t exist in the mid-level data set. For instance, if drivers are reacting to blinkers, you need to Solve Perception in regard to adding blinker metadata for every vehicle. If a driver sometimes departs the roadway to go around a stopped vehicle but sometimes doesn’t, you have a good data set of “departing roadway” to start adding metadata for road surface type: “dirt, gravel, requires human intervention (uneven terrain with rocks)”.
And of course there will need to be ‘divine’ intervention where commandments are handed down from on high like “Thou shalt not back down the shoulder to take an exit you missed, no matter how much time it saves you.”
Woah. This feels like a very deep insight: we don’t know a priori what self-driving cars need to perceive.
If this sounds counterintuitive to anyone, think about this: we don’t know how humans drive. We just do it. What we think we know about how humans drive — beyond the explicit knowledge we learn from driver’s ed — is mostly a posthoc reconstruction of our implicit knowledge. For all we know, we might be wrong in many parts of that reconstruction.
Or consider that, in general, neural networks are good at doing things that we have no idea how to tell them to do. We assume, or I assume, that we know how to tell a robotic system to drive. But why? Maybe we don’t know how to tell a robot to drive any more than we know how to tell a robot to walk, or to see. Maybe driving involves an array of subtasks that are cognitively impenetrable and opaque to introspection.
im.thatoneguy, I don’t know who you are or what your background is, but it seems like you have really good instincts because you proposed months ago that Tesla could just upload mid-level representations instead of sensor data. When I said above:
I think it was your post on TMC that had planted the seed in my mind. It’s pretty cool that your hunch has turned into a Waymo research paper and some reporting that suggests Tesla might actually be trying this approach.
What you said about using path planning failures to notice perception failures jibes with what Karpathy said in this talk about Tesla’s “data engine”:
Perhaps the development process is a loop. Get far enough with perception to deploy a path planning feature (e.g. Navigate on Autopilot), then notice failures with that feature and identify them as either failures in perception or path planning, and then go back and work on perception some more or work on path planning some more. At the same time, keep working on new perception features (e.g. stop sign recognition) to enable new path planning features (e.g. automatic stopping for stop signs). Repeat the loop with those features.
I think the way I have been thinking about autonomous car development may be wrong because I have been thinking that we know what we need to solve. We know what all the parts of the problem are, we can solve those parts independently, and when we put all the parts together, that will be a complete solution. But this overlooks the fact that we have no idea why features will fail. The behaviour of the overall system is emergent from complex interactions within the system and with the environments, and it’s often unexpected.
Neural networks are black boxes, and even hand-coded software which is in theory transparent and deterministic often fails in ways we don’t expect.
If you try to build something without testing it in wild and varied conditions as quickly as possible, you run the risk that your posthoc reconstruction of what needs to be solved will diverge more and more over time from what actually needs to be solved.
My mental model has largely been “feed neural networks lots and lots of data and eventually they might solve the problem”. But this implies you already know a priori the problem that needs to be solved. And that knowledge of what needs to be solved comes from a posthoc reconstruction which is fallible. You need to test your whole system in the wild as early as possible to narrow the gap between your posthoc reconstruction and real driving.
To use an analogy, it won’t do to move closer and closer to hitting a target. You also have to keep checking whether that’s the right target to hit. You can’t just keep making progress on solving a problem. You have to make sure that’s the right problem to solve.
This is a made-up example just to illustrate the point. I can’t think of a real example, and I think the point I’m making is that real examples are hard to think of because they’re gaps between our explicit knowledge via posthoc reconstruction and how humans really drive using implicit knowledge.
Say that figuring out speed limits was a really hard problem for self-driving car engineers. And say that engineers thought this was a vital problem to solve because human drivers follow speed limits.
But say that, in reality, it turned out that human drivers completely ignore speed limits and just follow the natural flow of traffic, which emerges organically. (There might be a grain of truth in this; it’s inspired by a theory I read but only half-remember and can’t find now. I think some people argue it’s safer to increase speed limits because driving is safest when the traffic flows at an organic speed.)
You wouldn’t notice that until you deployed your self-driving car and found that it was getting into trouble because it was going a different speed than all the other vehicles (either driving too fast or too slow). You would be operating on a false theory about how driving is done, and you might put a lot of work into developing a solution to the speed limit problem before finally deploying and realizing that you solved the wrong problem. Not only is the solution you built unnecessary, it’s also insufficient.
To get a self-driving car working in the real world, you need to solve it feature by feature, and test the smallest possible features (atomic features?) as quickly as possible in the real world with the whole system running. If you don’t, you might solve problems that don’t need to be solved (like detecting speed limits, in the made-up example), and you might not solve problems that need to be solved (like how to follow the flow of traffic).
This is a whole new way of thinking for me that I’m not used to. I will have to think about this more and revisit some of my old assumptions.
It’s a super exciting conceptual revelation. What’s particularly interesting to me here on a meta level is that you can derive an engineering approach from epistemology, i.e. thinking carefully about what you know and how you know it, about how human knowledge is created (especially with regard to complex systems), what humans can and can’t know in different contexts (e.g. you can’t predict the discovery of a failure mode without making that discovery), and the difference between human competence and human comprehension (implicit knowledge and explicit knowledge).
Epistemology, either explicit or implicit (or a combination of both), is arguably behind the success of science and engineering as approaches and cultures of solving problems. I’m always excited when really abstract, dreamy concepts unexpectedly collide with nitty gritty technical concepts. It’s a reminder that thinking dreamy thoughts isn’t a waste of time and actually impacts the physical world in big ways.
What I saw was just smartphone snapshots of two slides touting the ‘pros’ and ‘cons’ of end-to-end. The slides were clearly promoting the idea that end-to-end had some advantages. I didn’t save copies and they seem to have fallen off of my twitter stream now.
If your comment about AlphaGo is ‘why won’t the same method work’, then the main issue there is probably that AlphaGo has the advantage of a perfect model of the environment - this means there’s no noise in their feedback signal. Additionally AlphaGo’s model is extremely compact so they don’t need to be particularly sample efficient. Developing path planning from RL will need an approach that is more noise resistant and more sample efficient than AlphaGo needed to have.
Working from the latent space of a perception system that is trained on labeled data (what Waymo seems to be doing) helps with the sample efficiency issue but doesn’t resolve the noise issue.
This is not to say that RL cannot be made to work for path planning. I think it’s likely that all of these issues will be overcome in time. But you probably can’t do it naively today. Tomorrow? Who knows.
That’s very helpful, thanks. What do you think is the source of feedback signal noise in our environment models for autonomous driving? Is this a perception problem, or something else?
Based on what I’ve heard and read recently, the actual physics of the world isn’t hard to model for a car context, it’s modelling the behaviour of agents in the environment.
Pieter Abbeel suggests using inverse reinforcement learning to derive a reward function from observation of human driving. This is an interesting idea to me because a company like Tesla (or whoever else in the future might have a large enough production fleet with the right hardware) could, in theory:
Upload 10 billion miles of mid-level representations data
Use inverse reinforcement learning to derive a reward function
Use reinforcement learning in simulation to search for a policy that optimizes for that reward function
The derived reward function would — presumably — include a tacit model of how agents in the world behave, and how to interact with them.
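Here is a minimal Python sketch of the "derive a reward function, then optimize against it" idea. It is heavily simplified and is not the method from any particular paper: each decision is a single step, the reward is assumed to be linear in two made-up features (progress and closeness to the car ahead), and the weights are recovered with a perceptron-style update that nudges the reward toward whatever the expert chose. The hidden "true" weights exist only to generate the expert data.

```python
# Each context offers three candidate actions, described by made-up features:
# (progress, closeness_to_lead_car).
CONTEXTS = [
    [(1.0, 0.9), (0.6, 0.2), (0.1, 0.0)],
    [(0.8, 0.5), (0.7, 0.1), (0.1, 0.0)],
    [(0.9, 1.0), (0.5, 0.3), (0.4, 0.1)],
]
TRUE_W = (1.0, -2.0)   # hidden from the learner; only used to simulate the expert

def score(w, phi):
    """Linear reward: weighted sum of an action's features."""
    return sum(wi * fi for wi, fi in zip(w, phi))

def best_action(w, actions):
    """Greedy 'policy' under reward weights w (the optimization step)."""
    return max(range(len(actions)), key=lambda i: score(w, actions[i]))

expert = [best_action(TRUE_W, acts) for acts in CONTEXTS]

def fit_reward(contexts, expert_actions, epochs=1000):
    """Perceptron-style reward recovery from observed expert choices."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        mistakes = 0
        for acts, a_star in zip(contexts, expert_actions):
            a_hat = best_action(w, acts)
            if a_hat != a_star:
                mistakes += 1
                # Move the reward toward the expert's features, away from ours.
                for k in range(len(w)):
                    w[k] += acts[a_star][k] - acts[a_hat][k]
        if mistakes == 0:
            break
    return w

w = fit_reward(CONTEXTS, expert)
learned = [best_action(w, acts) for acts in CONTEXTS]
print(w, learned == expert)
```

The recovered weights need not equal the hidden ones; they only need to rank actions the way the expert does. In this toy that is enough to see that the learned reward likes progress and dislikes closeness, which is the kind of tacit knowledge the post hopes a derived reward function would capture.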
I wonder how far you could get with a randomization approach. That is, populate a simulator with randomized variants of the ego car. Or run multiple simulations in parallel each with a different, randomized variant populating the roads.
If you can’t accurately simulate the physics relevant to your problem, you can just train a neural network on many random variants of the physics (like OpenAI did with their robot hand). Maybe if you can’t accurately simulate the behaviour of other agents in the environment, you can just train a neural network on many random variants of behaviour.
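A toy Python sketch of what randomizing behaviour might look like. Nothing here is a neural network: the "policy" is just a following gap chosen by grid search, the kinematics are the textbook stopping-distance formula, and every constant (speed, decelerations, reaction time, crash penalty) is an assumption of mine. It compares a gap tuned against one nominal lead driver with a gap tuned against many randomly sampled lead drivers.

```python
import random

SPEED = 30.0          # m/s, both cars (assumed)
EGO_DECEL = 6.0       # m/s^2 (assumed)
REACTION_TIME = 0.5   # seconds before the ego car starts braking (assumed)
GAPS = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]   # candidate time gaps, seconds

def crashes(gap_s, lead_decel):
    """True if the ego car hits the lead car when the lead brakes to a stop."""
    lead_stop = SPEED ** 2 / (2 * lead_decel)
    ego_stop = SPEED * REACTION_TIME + SPEED ** 2 / (2 * EGO_DECEL)
    return gap_s * SPEED + lead_stop < ego_stop

def best_gap(lead_decels, crash_penalty=100.0):
    """Pick the gap with the lowest average cost over the sampled lead drivers."""
    def cost(gap_s):
        n_crashes = sum(crashes(gap_s, d) for d in lead_decels)
        return crash_penalty * n_crashes / len(lead_decels) + gap_s
    return min(GAPS, key=cost)

rng = random.Random(0)
nominal_gap = best_gap([6.0])   # one "typical" lead driver
randomized_gap = best_gap([rng.uniform(4.0, 9.0) for _ in range(200)])
print(nominal_gap, randomized_gap)
```

The gap tuned against the randomized population should come out larger than the one tuned against the single nominal driver, because it has to survive the hardest brakers in the sample. That is the trade-off raised in this thread: randomization buys robustness, possibly at the price of extra caution.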
In a self-driving car context, this is analogous to — though different than — self-play. As the ego car gets better at driving, the randomized other cars will generally, on average get better at driving too. Hopefully this means the end product isn’t a car that is impractically cautious for the real world. The need for caution will decrease as the surrounding drivers get better. And if you need bad drivers for your simulator, you have plenty of older versions of the ego car you can use.
A potential problem I can see is that the simulated cars might develop unhumanlike behaviours even if the end result is good driving by human standards. So the ego car might not know how to predict the behaviour of real cars in the real world. The design space/possibility space of driver behaviour might be too large for randomization to be an effective workaround.
I say ‘noise’ just to differentiate it from signal. Strictly speaking it is information which correlates poorly with your objective function. When you are simulating a Go game there’s almost nothing uncorrelated in the simulation, but the driving problem involves absorbing very high dimensional and highly abstracted data which is largely uncorrelated with the objective of driving. The car doesn’t care what kind of trees are on the side of the road, whether the fence is plastic or wood, if the clouds are cirrus or cumulus. The car also doesn’t care about the vast majority of the actions of other road users - it only cares about the ones that plausibly affect its future options. This is all on top of the fact that the other information is stochastic. To an RL agent all of that stuff hides the stuff it really needs to pay attention to. Training an RL agent on ‘noisy’ data presents kinds of problems that AlphaGo didn’t need to consider and those problems will need addressing.
I’m personally sanguine about the potential for RL to make contributions to the driving problem. I think that eventually even ‘end to end’ can probably be made to work - with enough computation, some new techniques, and a lot of refinement. It’s surprising how far you can get with simple approaches. It’s almost as if the universe was designed with this kind of problem solving in mind.
I’m also a fan of physics simulations making contributions. Humans are actually really bad at physics compared to even simple computers. Having accurate physical simulation integrated into driving agents is going to be a big advantage.
That makes sense. I suppose the perception network can narrow it down somewhat by outputting only key variables as mid-level representations. Different types of clouds and trees just won’t be represented. But even if you narrow it down to just the stuff you see in verygreen’s videos, there is still a lot of irrelevant information, and the relevant information is highly stochastic.
Or that this kind of problem solving evolved in organisms adapted to a universe designed this way.
Can you help me understand what this means in practice? I’m especially interested since Karpathy also recommended using one big network.
In Tesla’s case, let’s say Software V10 pre-alpha uses one big network, and they’re using a neural network approach for path planning. Does this mean the bottom layer of the perception network feeds into the top layer of the path planning network, making them one network? If so, how do you train the perception network with labelled images?
End to end - as I understand the term - doesn’t refer to the architecture of a system being used but rather how it is trained. A system could also operate end-to-end but not be trained end-to-end. My sense of Karpathy’s suggestion is that the ideal system is end-to-end, or nearly so, in operation. But it might be trained on synthetic data, a curated dataset, partially self-supervised, or some other mix and still satisfy Karpathy.
An end-to-end trained system is the easiest to create in a sense because the process demands very little of the developer in terms of work on the training and test set. The reason you don’t see that done a lot right now is that work on the training set - for NNs that are in common use today - generally results in substantially better performance. That benefit might gradually go away as techniques improve.
An operationally end-to-end system potentially allows for higher performance compared to a modular system. Modular systems have interfaces between the modules and those interfaces are generally designed by humans to present a particular set of abstractions which are almost always selected based on human intuition about what kind of data should be crossing that interface. End to end systems allow the process of training the NN to determine the appropriate data representation at every boundary. For instance - hand coded kernels for ConvNets were the gold standard before back propagation was fully developed. Those hand coded kernels always had a rationale for why they were constructed in a particular fashion and that rationale was derived from human intuition about what kinds of features were interesting and useful. Automatically generated kernels greatly outperform hand coded kernels today (though they might be less efficient) and part of the reason for this is that they are not constrained to representations which are easy to ‘explain’. In a similar fashion the internal interfaces in a system composed of multiple elements are likely to ‘discover’ better internal representations if they aren’t restricted to those which are easy for a human to understand.
I just realized I didn’t respond to your final question. The answer is, there are various ways to train subsets of a ‘monolithic’ network and which is best depends on a lot of factors. But just to provide an example: you could do multi-head training where an intermediate representation is brought out and provided with an auxiliary loss function. During back propagation you can inject training signals into that intermediate network head to encourage earlier stages of the network to converge towards particular solutions. In this particular case you might bring out an intermediate representation and train it to generate segmentation maps and bounding boxes. Later stages would take those as inputs as well as perhaps other intermediate products from the earlier stages and train against a target which might be a path planning output. This multi-head approach is pretty common in deep non-residual networks. For instance - inception is trained with, if I recall correctly, 4 heads: the main plus 3 auxiliary objectives. Also - the AKNET_V9 network in its operational form has something like ten heads. It might have even more than that in its training form.
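As an illustration of the auxiliary-loss idea, here is a minimal sketch in plain Python with hand-written gradients. It is a toy with three scalar weights, nothing like a real perception or planning network: one shared "layer" feeds an auxiliary head and a main head, and the targets (2x and 6x) are arbitrary stand-ins for a perception label and a planning output. The thing to notice is that the shared weight receives gradient from both losses.

```python
XS = [-1.0, -0.5, 0.5, 1.0]   # toy inputs

def aux_target(x):
    return 2.0 * x   # stand-in for e.g. a segmentation/bounding-box label

def main_target(x):
    return 6.0 * x   # stand-in for e.g. a path planning output

def train(aux_weight=1.0, lr=0.02, steps=5000):
    w_shared, w_aux, w_main = 0.5, 0.5, 0.5
    for _ in range(steps):
        g_shared = g_aux = g_main = 0.0
        for x in XS:
            h = w_shared * x   # shared intermediate representation
            err_aux = w_aux * h - aux_target(x)
            err_main = w_main * h - main_target(x)
            # Gradients of err_main**2 + aux_weight * err_aux**2.
            g_main += 2 * err_main * h
            g_aux += 2 * aux_weight * err_aux * h
            # Both heads push a training signal back into the shared layer.
            g_shared += 2 * err_main * w_main * x
            g_shared += 2 * aux_weight * err_aux * w_aux * x
        n = len(XS)
        w_shared -= lr * g_shared / n
        w_aux -= lr * g_aux / n
        w_main -= lr * g_main / n
    return w_shared, w_aux, w_main

w_shared, w_aux, w_main = train()
print(w_shared * w_aux, w_shared * w_main)   # should approach 2 and 6
```

Setting aux_weight to 0 turns the auxiliary signal off, so the shared weight is then shaped by the main objective alone. In a real framework the same structure is usually written as a sum of weighted losses and left to automatic differentiation.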
I think there is a challenge in defining End-To-End. For instance would a pooling layer be considered breaking End-To-End? I would say no. I would also say that breaking up networks in a modular fashion isn’t necessarily breaking End-To-End.
I can’t see anyone making progress without a modular system. The problem with the AlphaGo approach as I see it is that AlphaGo can find exotic solutions that humans can’t comprehend. A self driving path finding algorithm needs to be indistinguishable from an excellent human driver. And it’s really hard to teach humans let alone machines through reinforcement learning. That’s how you end up with Cargo Cults on jungle runways mimicking the behavior of air traffic controllers trying to guide in planes that will never come. The Google talk touches on this with the importance of “meaning”.
When I’m driving I’m very modal in my behavior. I have one behavior when driving down a city street. “Stay in lane. Watch for jaywalkers. Stay in lane. Watch for Jaywalkers.” I will then get into modes like “approaching cross walk, can I see the entire crosswalk or is my vision obstructed… is there a pedestrian possibly behind that delivery van?” I clear the intersection and it’s back to “Stay in lane. Watch for jaywalkers.”
Given enough data of course a NN could eventually derive these different modes intuitively and learn purely through mimicry but it’s much more efficient to guide the NN. It’s much more efficient to create a mode such as “van randomly stopped in a 4 lane road… maybe they stopped for a pedestrian you can’t see, don’t pass quickly until you can see that they are actually turning or broken down.” That’s a lesson I had to learn as a driver, and I would teach it to another human as a discrete scenario; I would break that out into a unique driving strategy for a very specific situation. A NN could also eventually learn that a black and white rectangle in frame is a “speed limit” but you can far more quickly teach the meaning of speed limits through directed learning and sign segmentation feeding a concept of “current speed limit”. The risk of the pure end-to-end is that it learns a McDonalds sign means you should drive 25mph because McDonalds are always on city streets.
I would define an End to End NN pretty liberally and say that as long as all pathfinding and driving logic is performed in a NN it’s an E2E network. If you’re only using traditional code as the glue for your network it’s an E2E network. That’s just my personal opinion. I’m not sure how you differentiate that from a NN that’s trained by just dumping in tens of billions of miles of driving video and that outputs only car controls, but I don’t think one or the other is more or less an end to end approach.
Here is a different approach to imitation learning than ChauffeurNet. The Waymo paper says ChauffeurNet is a recurrent neural network (RNN). Here is a paper from DeepMind where they use a deep Q-network: Deep Q-learning from Demonstrations.
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator’s actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD’s performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
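To give a feel for what "combining temporal difference updates with supervised classification of the demonstrator’s actions" can mean, here is a rough tabular sketch in Python. It is my own simplification, not the paper's implementation: the TD term uses a plain 1-step Q-learning target, the classification term is written as a large-margin loss, and the margin value, the weighting and the toy Q-table are all illustrative assumptions. Anything else in the full algorithm is left out.

```python
def td_loss(q, s, a, r, s_next, gamma):
    """Squared 1-step temporal-difference error (plain Q-learning target)."""
    target = r + gamma * max(q[s_next])
    return (target - q[s][a]) ** 2

def margin_loss(q, s, a_expert, margin=0.8):
    """Large-margin classification loss on the demonstrator's action.

    Zero only when the expert's action beats every other action by `margin`.
    """
    best = max(
        q[s][a] + (0.0 if a == a_expert else margin)
        for a in range(len(q[s]))
    )
    return best - q[s][a_expert]

def combined_loss(q, transition, gamma=0.9, lam=1.0):
    """TD term plus a weighted supervised term for one demonstration step."""
    s, a, r, s_next = transition
    return td_loss(q, s, a, r, s_next, gamma) + lam * margin_loss(q, s, a)

# Toy Q-table: two states, three actions each (values are arbitrary).
q_table = {"s0": [1.0, 2.0, 0.5], "s1": [0.0, 1.0, 0.0]}
demo_step = ("s0", 0, 1.0, "s1")   # expert took action 0 in s0, got reward 1

print(margin_loss(q_table, "s0", 0))       # 1.8
print(combined_loss(q_table, demo_step))   # about 2.61
```

In this toy the Q-table currently prefers action 1 in s0, so the margin term is large; minimizing it pushes the expert's action above the alternatives, while the TD term keeps the values consistent with observed rewards.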
Inverse reinforcement learning (IRL), which “assumes that the expert follows an optimal policy with respect to an unknown reward function” and then uses reinforcement learning (RL) “to find a policy that behaves identically to the expert”
Generative Adversarial Imitation Learning (GAIL), which imitates behaviour by “training a policy to produce actions that a binary classifier mistakes for those of an expert”