End-to-end learning vs. world models

If a neural network takes in raw pixels from a car’s cameras and outputs a path, that is end-to-end learning. (Or technically I guess it would be end-to-mid learning since the path is still executed by hand-coded control software.)

George Hotz (President of Comma AI) makes an interesting argument that we should feed raw pixels into a path planning network. He argues it’s futile to attempt to capture the information that humans glean about the world from vision with a set of ~100 hand-designed representations (such as driveable space, bounding boxes, etc.). I find this argument appealing because it accords with the theoretical perspective that learning should be learning and not hand coding. So, that puts Hotz into the end-to-end learning (or end-to-mid learning) camp.

But Yann LeCun takes the same theoretical perspective and spins it in a different direction. We do need world models, and they need to be learned models. That’s what self-supervised learning is for: learning the rich internal structure of video data, so that a model can do things like predict the next frame of video, without the limitations of hand-made labels like “car” or abstracted representations like bounding boxes.
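To make "predict the next frame" concrete, here is a minimal sketch of how self-supervised training pairs come straight from the video itself, with no human labels involved. Everything in it is a toy assumption of mine, not anything LeCun or a production system uses: the "video" is a single bright pixel sliding along a 1-D strip, the model is a plain linear map, and the names make_video, next_frame_pairs, train_linear_predictor, and predict are hypothetical.

```python
import random


def make_video(num_frames, width=8, seed=0):
    """Toy 1-D 'video': one bright pixel moves right by one position per frame, wrapping around."""
    rng = random.Random(seed)
    start = rng.randrange(width)
    frames = []
    for t in range(num_frames):
        frame = [0.0] * width
        frame[(start + t) % width] = 1.0
        frames.append(frame)
    return frames


def next_frame_pairs(frames):
    """Self-supervision: the training target for frame t is simply frame t+1."""
    return list(zip(frames[:-1], frames[1:]))


def predict(weights, frame):
    """Linear prediction: output pixel i is a weighted sum of the input pixels."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def train_linear_predictor(pairs, width, lr=0.5, epochs=200):
    """Fit the linear map with per-example gradient steps on squared error."""
    # weights[i][j] is the contribution of input pixel j to predicted pixel i.
    weights = [[0.0] * width for _ in range(width)]
    for _ in range(epochs):
        for current, target in pairs:
            pred = predict(weights, current)
            for i in range(width):
                err = pred[i] - target[i]
                for j in range(width):
                    weights[i][j] -= lr * err * current[j]
    return weights


frames = make_video(num_frames=16)
pairs = next_frame_pairs(frames)  # no labels like "car", only (frame, next frame)
weights = train_linear_predictor(pairs, width=8)

# Ask the trained model what comes after a frame with the bright pixel at position 3.
query = [0.0] * 8
query[3] = 1.0
predicted = predict(weights, query)
print(max(range(8), key=lambda i: predicted[i]))
```

The only "annotation" here is the passage of time: the model has to pick up the motion rule on its own. A real system would swap the linear map for a deep network and the sliding pixel for camera footage, which is where the open research problems are.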

Self-supervised learning is still in the early research phase, whereas supervised learning has received intense interest since around 2012, and is at a more mature phase of R&D. For now, if you want to use a world model in a commercial application like Autopilot, it’s probably gotta be mostly supervised learning.

Supervised learning is World Models v1.0 and self-supervised learning is World Models v2.0. Supervised learned world models are less learned and more human-designed than self-supervised learned world models. But, for now, they are the best world models we can use in commercial applications. And Yann LeCun’s arguments about why we need world models also apply to supervised learned world models, not just self-supervised learned ones. Namely:

I think this explains how building an explicit world model using ~100 hand-designed representations isn’t necessarily antithetical to the Bitter Lesson thesis; it’s just the best available alternative to self-supervised learning at the moment. World models are still important even if they’re largely hand-crafted.