Wayve: Urban Driving with End-to-End Deep Learning


Short demo with no narration:

Various clips with narration:

Blog post:



As I understand it:

End-to-end: Camera pixels go into a neural network, then steering, accelerator, and brake commands come out. The network learns to produce the actuator commands directly from the camera pixels.

Mid-to-mid: Camera pixels go into a perception neural network trained on large datasets of labelled images. A “mid-level representation” comes out. It goes into a second neural network, and that second network produces the steering, accelerator, and brake commands.
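
To make the contrast concrete, here’s a toy sketch in NumPy. All the names and sizes are made up just to show the shape of the two approaches; the random-weight nets stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight feedforward net standing in for a trained network."""
    return [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for w in layers:
        x = np.tanh(x @ w)
    return x

PIXELS, MID, ACT = 64, 16, 3    # ACT = steering, accelerator, brake

# End-to-end: one network maps pixels directly to actuator commands.
end_to_end = mlp([PIXELS, 32, ACT])

# Mid-to-mid: a perception network produces a mid-level representation,
# and a separately trained driving network maps it to actuator commands.
perception = mlp([PIXELS, 32, MID])
driving = mlp([MID, ACT])

frame = rng.standard_normal(PIXELS)                       # stand-in camera frame
cmds_e2e = forward(end_to_end, frame)                     # one network, pixels to controls
cmds_m2m = forward(driving, forward(perception, frame))   # two independently trained stages
```

Either way three actuator commands come out; the difference is only in how many separately trained pieces sit between the pixels and the controls.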

In either case, the neural network that produces the actuator commands is trained with imitation learning and/or reinforcement learning.

The pros of end-to-end: 1) it allows training of a neural network on an unlimited amount of driving data with no human labelling or other bottlenecks, and 2) it allows the neural network to learn representations from pixels, rather than outputting representations (like 3D bounding boxes around vehicles) that are hand-crafted by humans.

The con: since perception errors and driving errors aren’t corrected independently, the amount of training data needed to correct both kinds of errors is combinatorially larger than the amount needed to train each independently.
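
A toy way to see the combinatorial point, with purely hypothetical numbers:

```python
# Hypothetical counts, purely to illustrate the scaling argument:
P = 1000   # distinct perception situations the system must get right
D = 1000   # distinct driving situations the system must get right

independent = P + D   # perception and driving each corrected on their own data
joint = P * D         # end-to-end must see combinations of the two to untangle
                      # which subsystem caused a given error

assert independent == 2_000
assert joint == 1_000_000
```

The exact numbers mean nothing, but the additive-versus-multiplicative gap is the claimed disadvantage.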

The pro of mid-to-mid: perception and driving are trained independently, allowing for errors to be corrected independently.

The cons: 1) it requires massive labelling of images to train perception, and 2) it relies on hand-crafted representations that may perform worse than learned representations, due to the fallibility of human understanding and/or the differences between human cognition and neural network cognition.

But what I’ve just said might be an oversimplification. In another thread, @jimmy_d gave this explanation:

I am trying to wrap my head around this. Does this mean the driving network, or the driving part of a monolithic network, can take as its input both hand-crafted representations and learned representations?


Tesla AI and behaviour cloning: what’s really happening?
Tesla from an Investor Perspective

I guess a third option would be to do unsupervised learning of representations. DeepMind recently showed this working on StarCraft:

Unsupervised representation learning would decompose perception and action, while still allowing the neural networks to be trained on an unlimited amount of driving data with no need for human labelling.
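
The simplest instance of the idea is an autoencoder: compress the pixels to a small code, reconstruct them, and train on reconstruction error alone, with no labels anywhere. This is not how the DeepMind StarCraft work did it, just a minimal NumPy sketch of label-free representation learning:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))   # stand-in for unlabelled sensor frames

enc = rng.standard_normal((32, 8)) * 0.1   # encoder: "pixels" -> 8-dim code
dec = rng.standard_normal((8, 32)) * 0.1   # decoder: code -> "pixels"

lr = 0.01
for _ in range(500):
    code = X @ enc
    err = code @ dec - X                      # reconstruction error is the only signal
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ dec.T) / len(X)
    enc -= lr * grad_enc
    dec -= lr * grad_dec

loss = float(np.mean((X @ enc @ dec - X) ** 2))
# The learned 8-dim code is a representation discovered without any human
# labels; a driving network could then be trained on top of it.
```

No human ever labelled a pedestrian here; the code dimensions are whatever the optimizer found useful for reconstruction.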



My explanation was probably needlessly complicated. Let me try again:

There are a lot of different ways you can decompose the problem of driving. You can also choose not to decompose it, in which case you have a monolithic system. For a driving system that is an NN, that would be a monolithic NN, usually with sensors as inputs and actuators as outputs. To train such a network you need to supply paired sensor/actuation examples. This works really well for demonstrations and is quite easy to do up to a level that shows some useful functionality. But this approach has numerous drawbacks. One is that it’s very data intensive for the level of functionality it provides. Another is that, by design, it does not interface with other systems that would allow for regulation, inspection, and so forth. These limitations are a big enough problem that nobody is currently developing commercial products using this method.
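
The “supply paired sensor/actuation examples” part is just supervised regression, i.e. behaviour cloning. A minimal sketch on synthetic data (the “expert” here is a made-up linear policy; real systems would log sensor frames and human control inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert demonstrations": sensor frames paired with actuator commands.
true_policy = rng.standard_normal((16, 3))   # the unknown mapping we hope to imitate
sensors = rng.standard_normal((500, 16))
actuators = sensors @ true_policy            # expert's steering/accel/brake

# Monolithic network (a single linear layer here, for brevity):
# sensors in, actuator commands out, fit directly to the demonstration pairs.
W = np.zeros((16, 3))
lr = 0.05
for _ in range(300):
    pred = sensors @ W
    grad = sensors.T @ (pred - actuators) / len(sensors)
    W -= lr * grad

imitation_loss = float(np.mean((sensors @ W - actuators) ** 2))
```

The training loop never sees a label like “pedestrian” or “lane line”; the only supervision is what the expert’s actuators did, which is exactly why the approach is data hungry and hard to inspect.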

If you decompose the problem into multiple blocks you can often get better results on each block than a monolithic system. This requires that you do a good job of defining the blocks and their interfaces, which is not trivial but has been studied in depth and is relatively well understood for some arrangements. Some blocks might be best done by NNs and other blocks by other techniques. Today this is the most common approach for developing commercial products. The main reason this approach is taken today is that it allows for better block performance given appropriate labor and capital. Organizations that take this approach are deciding to invest additional resources to get better performance sooner.

As the underlying NN technology improves, more data becomes available, and more computational resources become available, any given level of performance can be achieved with less labor and capital using fewer, bigger blocks. There’s a good chance that this ultimately leads to systems that have very few blocks, maybe only one, and which are trained in a fairly simple and general way, though they may take immense computational resources and data. This also removes the inherent performance limitations that block interfaces impose on the final system, by allowing the training process to define whatever internal interfaces, representations, and restrictions result in the best performance on the objective function.

So this whole topic of end-to-end versus not is very complicated, because there are a lot of options and the tradeoffs are not simple. A lot of people, myself included and apparently Karpathy as well, expect that more and more blocks will become NNs. When two adjacent blocks are NNs they can be merged, and over time the great majority of the system becomes a monolithic NN. So ‘end-to-end’ is a conceptual description of a simple and powerful technique which is not currently capable of making the best products. We may never get to true end-to-end but we will probably get fairly close, because over time those will be the best-performing systems that take the least human labor and capital to construct.
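
The “two adjacent NN blocks can be merged” step is easiest to see with linear blocks, where the composition is literally one matrix. For real nonlinear networks, merging means stacking the layers into one network and optionally fine-tuning the whole stack on the final objective; this sketch only shows the linear case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two adjacent blocks, plain matrices here for simplicity:
vision = rng.standard_normal((64, 16))    # pixels -> mid-level representation
planner = rng.standard_normal((16, 3))    # representation -> actuator commands

frame = rng.standard_normal(64)

# Operating as two blocks: the output of one feeds the next.
two_block = (frame @ vision) @ planner

# Merged into one block: same function, one weight matrix, and the 16-dim
# interface between the blocks is now just an internal detail.
merged = vision @ planner
one_block = frame @ merged

assert np.allclose(two_block, one_block)
```

Once merged, nothing forces the internal representation to stay at 16 dimensions or to mean anything a human chose, which is exactly the interface limitation that merging removes.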



Okay, thanks as always Jimmy for being a patient and helpful teacher. I think I got confused because Karpathy had these slides in his talk at the PyTorch Developer Conference:

I’m guessing that, back in the halcyon days of October 2018, I completely misunderstood what he was talking about. I interpreted this as meaning that, for instance, the vision neural network and the path planning neural network should be a single neural network. But in retrospect, he probably meant something else.

In the other thread, you said:

That’s the part I’m still straining to understand. What’s the difference between operating end-to-end and being trained end-to-end?



I think you probably do understand it and I’m just not expressing myself very well. Ideally we do want vision and path planning subsumed into a single network. But that sentiment doesn’t express the optimal design for today - it describes where we want to eventually end up after iterating the design sufficiently. Today we (probably) separate them because it’s easier to come up with an effective objective function to optimize if you can start with human intuition about what a certain part of the system should be doing and tune for that. So we train the vision system to know about depth, driveable space, and pedestrians because those are important in human intuition and we know how to score them. And we train the path planner to pick smoothly curving routes that maximize safety cushion and are aware of lane markings because that is also intuitive and something we know how to score. Eventually these two networks will become functional and complete enough that they can interface directly, and at that point you can start thinking about removing the intermediate representations (pedestrians, “driveable space”) and just tune for the final objective: high median user happiness and zero nasty surprises.

So yeah, eventually almost everything in the box is pink (neural network). We’d do that now if we could, but we can’t. What we can do is proceed in a manner that gradually moves us towards that goal while also giving us something usable and understandable along the way.

As for operating end-to-end versus training end-to-end: you can build two separately trained networks, say one for vision and one for path planning, and put them together in a system. If the interfaces support it, those two networks can pass information directly between them and effectively operate as if they were a single network. But the parts are trained separately, even though operationally they are a single network. In theory they could be trained as a single network, with the intermediate abstraction they pass between them being discovered in the training process instead of being designed by an engineer. Right now we don’t know how to do that efficiently. Eventually we will.
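
A sketch of the distinction, with plain matrices standing in for the two networks (all names illustrative). At inference both systems look identical; the difference is that in end-to-end training, one loss at the output updates both stages via the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-stage system: vision (pixels -> representation), planner (-> controls).
vision = rng.standard_normal((64, 16)) * 0.1
planner = rng.standard_normal((16, 3)) * 0.1

frame = rng.standard_normal(64)
target = rng.standard_normal(3)   # stand-in target from the final objective

# Operating end-to-end: the stages pass information directly at inference and
# behave like one network, regardless of how they were trained.
mid = frame @ vision
controls = mid @ planner

# Training end-to-end: one loss at the output, and the chain rule carries the
# error back through BOTH stages, so even the intermediate representation is
# shaped by the final objective rather than designed by an engineer.
err = controls - target                       # grad of 0.5 * ||controls - target||^2
grad_planner = np.outer(mid, err)
grad_vision = np.outer(frame, err @ planner.T)
vision -= 0.01 * grad_vision
planner -= 0.01 * grad_planner
```

With separate training, `grad_vision` would instead come from a loss on human-designed intermediate labels (pedestrians, driveable space), and the planner would never influence what the vision stage learns.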



So this is an instance of the Sutton effect. Over time, the role of human domain knowledge in AI decreases and the role of learning and search increases.

Do you think currently Tesla or anyone has vision and path planning networks that operate end-to-end?


For the benefit of anyone who doesn’t catch the reference:

From this talk.



oops - blue. :slight_smile:
