Tesla Autonomy Day: watch the full event

Karpathy (2:24:00):

If I was to summarize my entire talk in one slide, it would be this. … We see a lot of things coming from the fleet. And we see them at some rate — like, a really good rate compared to all of our competitors. So, the rate of progress at which you can actually address these problems, iterate on the software, and really feed the neural networks with the right data — that rate of progress is really just proportional to how often you encounter these situations in the wild. And we encounter them significantly more frequently than anyone else, which is why we’re going to do extremely well.

If the city driving software Tesla demoed really is the product of just 3 months of neural network training and software development, then the rate of progress does seem very fast.


New, better distillation of what we know about Tesla’s collection and labelling of training data post-Autonomy Day:


Positive comments from folks at DeepMind and OpenAI:


I published a Waymo vs. Tesla article recently that covers five arguments I’ve heard over and over:

If you’re reading this, you’ve most likely heard these arguments too:

  • “Self-driving cars need lidar.”

  • “Waymo is years ahead of Tesla.”

  • “Google and DeepMind are the world leaders in machine learning, so Waymo is the leader in self-driving cars.”

  • “Waymo has the lowest rate of disengagements by safety drivers.”

  • “Waymo is already operating a self-driving taxi service.”

Not too much new on self-driving from Tesla’s shareholder meeting. Elon comments on it a bit at 1:37:35.

Facebook AI researcher and co-creator of PyTorch:


I’m really skeptical that the Tesla hacking community has a good handle on how the fleet gets used to support training AP. For one thing, all the data they have comes from just a few cars, and only one or two individual cars have had their processes examined in detail while running in actual use. Nobody has yet succeeded in decompiling the code to a format that can be interpreted at a high level, so everything that is known comes from tapping OS process-monitoring features to observe running processes and then looking at their I/O. And Tesla knows about the cars that have been rooted, and it’s strongly believed that Tesla treats those cars differently in terms of fleet pull requests, firmware pushes, and so forth.

Shadow mode often gets interpreted as separate processes running in parallel, alternate configuration files, and so forth. The inability of the hacking community to find these particular elements in the few cars they’ve looked at closely has led many to believe that shadow mode doesn’t exist. I think that’s an over-interpretation. AK described the use of shadow mode during development of the cut-in feature, and nothing in his talk implied that it requires separate parallel processes. His description can be satisfied just by gathering data from the running system about features which are present in the planner, controller, or perception system but which are not currently being employed to make driving decisions. For instance, the NN can (and does) have an output labeled ‘CUTIN’ which can be present in cars that currently don’t employ it for driving, but data about the behavior of that output can still be captured in triggers and uploaded to Tesla for use in the data engine. This kind of feedback would be invisible unless you knew exactly how to trigger it and happened to be observing one of the cars that currently had that trigger enabled.
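One way to picture this kind of passive shadow-mode capture is a trigger that merely observes an output the planner never consumes. This is a hypothetical sketch, not Tesla’s actual firmware; all names and the threshold are invented for illustration:

```python
# Hypothetical sketch: the NN's "CUTIN" head exists in the output vector but
# is ignored for driving decisions. A trigger observes it and queues a
# snapshot for upload when it disagrees with what actually happened.
upload_queue = []

def cutin_trigger(nn_outputs, actual_cutin, threshold=0.5):
    """Fire when the unused CUTIN head disagrees with reality."""
    predicted = nn_outputs["CUTIN"] > threshold
    return predicted != actual_cutin

def observe(frame_id, nn_outputs, actual_cutin):
    # Driving decisions elsewhere never read nn_outputs["CUTIN"]; this
    # observer just logs disagreements for later upload to the data engine.
    if cutin_trigger(nn_outputs, actual_cutin):
        upload_queue.append({"frame": frame_id, "outputs": nn_outputs})

observe(1, {"CUTIN": 0.9}, actual_cutin=False)  # false positive: queued
observe(2, {"CUTIN": 0.1}, actual_cutin=False)  # agreement: nothing stored
```

No separate parallel process is needed; the observer is just extra bookkeeping on outputs the running system already produces.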


This is an opinion, but I’m going to state it emphatically because I’m pretty confident in it.

Dojo isn’t going to be a training computer deployed into the car; it’s going to be training infrastructure optimized to perform unsupervised learning from video at scale. Tesla is probably going to produce custom silicon to enable this because available commercial hardware is inadequate to the task, but it should be doable with a level of silicon-development effort comparable to what it took to create Tesla’s FSD chip.

Unsupervised learning works, but it’s too resource-intensive to use on video right now. It’s been employed to generate language models from large collections of curated but unlabeled text with great success, and I believe Tesla’s objective in building Dojo is to bring a similar capability to the task of building world models from large collections of curated but unlabeled video. In a few years, with the right hardware, it should be possible to build world models that would be the self-driving-car equivalent of GPT-2. By training against a proxy task like predicting future video frames, it will be possible to train a network that can extract high-level representations of the world as seen by a camera. Using this as a foundation to train interpretive heads, or by fine-tuning it on target objectives, it will be possible to achieve performance levels that are not economically attainable with human-labeled data.
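A toy illustration of the proxy-task idea: predict the next frame, and a model of the scene’s dynamics falls out as a byproduct. Everything here is invented for illustration (a linear model on a synthetic 8-pixel “video” where each frame is the previous one cyclically shifted); real systems would use deep networks on real video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "video": each 8-pixel frame is the previous frame cyclically
# shifted by one pixel, so the true dynamics are a fixed linear operator.
frames = [rng.random(8)]
for _ in range(49):
    frames.append(np.roll(frames[-1], 1))
clip = np.stack(frames)

X, Y = clip[:-1], clip[1:]                 # (frame t) -> (frame t+1)

# Fit a linear next-frame predictor; least squares stands in for SGD here.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# W has learned the shift operator, i.e. a "world model" of this toy scene,
# with no labels anywhere in the pipeline.
mse = float(np.mean((X @ W - Y) ** 2))
```

The interesting part is what `W` represents: a reusable model of how the scene evolves, learned purely from the raw signal.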

One characteristic of networks that are trained in this manner is that they are very large - typically many times larger than the effective network size needed to learn from labeled data. A major limitation of the FSD chip’s NNE is the 32 MB of SRAM used for activation caching. A network trained on unsupervised video might well need more than 32 MB for activations, or it might need some mechanism for compressing activations during caching, and Tesla’s NNE patents do not include such functionality. If so, the current NNE could not efficiently run the networks that might be produced by a mature implementation of the Dojo system, and this might be something that the HW4 chip would be targeting.


Yeah, I’m still suspicious that they can dynamically modify the weights on a per-vehicle basis at run time. Or, as you said, they could even theoretically encode data into the cut-in output:
00 = No cut-in
01 = Shadow cut-in
10 = Regular cut-in
11 = Shadow and regular cut-in

You’re now running shadow mode on every car, and it’s a one-bit difference on the visible outputs. Fleet-wide, that small tweak gets you A/B testing with a single bit: “Shadow -> 80% accuracy, Regular -> 70% accuracy.”

And it could be per-feature as well to minimize performance impact. “Stop lights right now get to run two slightly different versions since that’s our current dev focus.”
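The 2-bit scheme above can be sketched directly. The constants and feature names are hypothetical, purely to show how one bit per variant keeps the overhead scoped per feature:

```python
# Hypothetical per-feature flags: bit 0 = shadow variant, bit 1 = regular.
NO_RUN, SHADOW_ONLY, REGULAR_ONLY, BOTH = 0b00, 0b01, 0b10, 0b11

def runs_shadow(flags):
    return bool(flags & 0b01)

def runs_regular(flags):
    return bool(flags & 0b10)

# Per-feature configuration keeps the performance cost scoped to whatever
# the current development focus is (e.g. stop lights).
car_config = {"cut_in": REGULAR_ONLY, "stop_lights": BOTH}
```

Flipping a single bit in `car_config` switches a car between the A and B populations for that one feature, with no other visible change.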

My initial snap judgment was that it would be unfeasible for Tesla to upload up to an hour of raw HD video from up to 8 cameras from up to 1 million cars every day. But now I’m rethinking this.

Each car would upload about 10 GB per day if the video is 720p, 30 fps, and compressed. If I’m getting this right, Azure charges $0.05 per GB for bandwidth. That would be $0.50 per car per day, $500,000 for 1 million cars per day, and $183 million for 1 million cars per year.

This is also an upper estimate because 1) it assumes 100% of the time the car is driving the cameras record video that is later uploaded and 2) services like Azure offer volume discounts to customers.

To estimate storage costs, let’s say each 10 GB daily upload is kept in Amazon S3 Standard for 3 months at a cost of $0.021 per GB per month, then moved to S3 Glacier Deep Archive at a cost of $0.00099 per GB per month. That’s $0.63 for S3 Standard and about $0.59 for 5 years of Glacier Deep Archive. Let’s say all storage costs are paid upfront just because it’s easier to calculate that way. Total: about $1.22 per car per day, $1.22 million for 1 million cars per day, and $447 million for 1 million cars per year.

So, bandwidth and storage combined would be roughly $629 million/year. That’s not an unfeasible amount and, as I said, this is an upper estimate.
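Redoing the arithmetic at the quoted rates (note the 5-year Glacier term works out to about $0.59 for 10 GB, so the all-in figure lands near $629 million/year):

```python
GB_PER_DAY = 10
CARS = 1_000_000

bandwidth = GB_PER_DAY * 0.05            # Azure egress at $0.05/GB
s3_standard = GB_PER_DAY * 0.021 * 3     # 3 months at $0.021/GB-month
glacier = GB_PER_DAY * 0.00099 * 60      # 5 years at $0.00099/GB-month

per_car_per_day = bandwidth + s3_standard + glacier   # ~ $1.72
fleet_per_year = per_car_per_day * CARS * 365         # ~ $629 million
```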

Is this what’s called unsupervised representation learning? I think I understand the idea that by learning to predict the next few frames of video from the past few frames a neural network will learn the semantic features of the scene. (A recent DeepMind paper found this.) The part I don’t quite understand is this:

Can you explain more?

The bandwidth on the cloud side is feasible, but how do you get it to the internet? If through cellular, you’re looking at at least $1/GB. If you try to send 10GB per day through the home wifi of Tesla owners, they’re all going to go over their bandwidth caps.

IMO the way to do this is to have the computer constantly predicting what is going to happen next. If the situation does not play out as expected, that situation becomes a training example and is uploaded.
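That filter can be sketched in a few lines. The threshold and error metric here are arbitrary placeholders, just to show the shape of the idea:

```python
import numpy as np

# Upload only the surprises: compare the car's prediction of the next
# observation with what actually happened, and keep the clip only when
# the prediction error exceeds a threshold.
SURPRISE_THRESHOLD = 0.5  # arbitrary, illustrative

def should_upload(predicted, observed):
    surprise = float(np.mean((np.asarray(predicted) - np.asarray(observed)) ** 2))
    return surprise > SURPRISE_THRESHOLD
```

An accurate prediction is discarded on the spot; only the clips the model got wrong consume bandwidth.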

Whether this works as training data or merely validation data is an open question though. I don’t know what the true structure of the long tail is. It could be that there are millions of types of rare events, in which case, if you have a big enough fleet, you can learn all of them and make a safe vehicle. It could be though that there are billions of types of rare events, and that there is a substantial area under the curve of events that essentially only happen once ever. In that case, trying to learn all of them doesn’t help you because you’re learning how to deal with a situation that will never happen again. Meanwhile, the thing that’s going to kill you never appears in your training data.

If this is the case, the computer is going to have to be able to do higher level reasoning to be able to deal with novel situations, or at least be able to recognize when it’s in one of them and be able to fail safe, or punt to a remote driver if there’s enough time.


That’s a great point. There may be some exceptions in countries without bandwidth caps or if Tesla allows users to opt-in to unlimited uploads if they have an unlimited bandwidth plan. Still, a lot of North American Tesla drivers are going to be capped.

This sounds right to me. It follows the method for training the cut-in detector that Karpathy described on Autonomy Day:

The cut-in detector is an example of a neural network attempting to predict how another car will behave. An idea I find intriguing is running NNs in shadow mode that attempt to predict how the Tesla (the “ego-vehicle”, as they say) will behave while under full human control. This is the same principle as the cut-in detector. Tesla doesn’t have to continually upload more and more random demonstrations for imitation learning. It can upload just the instances where the imitation networks failed to imitate correctly.

In terms of what @jimmy_d was talking about with unsupervised learning for computer vision, the same principle is potentially applicable. The NNs in the car attempt to predict the next frames of video, and when they fail, a video clip (combined with other data) is stored and later uploaded over wifi. The 10 GB per day is winnowed down: if the NNs are 90% accurate, the maximum useful amount becomes only 1 GB per day.

Thank you for articulating this so clearly. I am really curious whether this is a) just an unanswerable question right now or b) if there is some reason to think that novel situations don’t come up frequently enough to defeat imitation learning and reinforcement learning.

Tweet that brings to mind jimmy_d’s theory that the Project Dojo computer will be used for unsupervised/self-supervised learning:

Today I watched the Chris Urmson interview that Lex Fridman did, and I have to say that it convinces me more than ever that the Waymo/Cruise/Aurora sensor-heavy/HD-maps/heuristic-code approach is best thought of as a collection of brittle stop-gap measures and wishful thinking. The long-term solution is going to be machines that navigate the way humans (and animals) do - perception heavy, highly dynamic, and intuitive.

As ever - it’s hard to say who will accomplish what and when. And I’m happy that various approaches are being pursued because diversity is strength. But I really found Urmson’s arguments unconvincing and his narrative self-serving.

Karpathy is right here - shortcuts will become limitations and are evolutionary dead ends.

Quite interesting how Urmson says at 38:40 that if he could wave a magic wand and magically solve one part of the system, it would be prediction: predicting what will happen around the car over the next 5 seconds. Same as Anthony Levandowski said.



Bad machine transcript:

On the topic of unsupervised learning for computer vision: DeepMind demonstrated that an unsupervised model, CPC, surpasses AlexNet on ImageNet. Unless I’m somehow misinterpreting these results, this feels astonishing. AlexNet is what spurred the rush into deep supervised learning. If unsupervised learning is starting to beat AlexNet, then will we see a similar rush into unsupervised learning over the next few years? Yann LeCun has been advocating unsupervised (a.k.a. self-supervised) learning as the key to future progress in machine learning.

DeepMind also demonstrated that CPC works well with semi-supervised learning. When it gets the full labelled training dataset, its accuracy is close to the best supervised ResNet. And with fewer than 200 labelled images, CPC performs much better than the supervised baseline.
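For the curious, the core of CPC’s training signal is a contrastive (InfoNCE) loss: given a context representation, pick out the representation of the true future from a set of distractors. Here is a minimal numpy rendering on tiny hand-built vectors, not the paper’s actual architecture:

```python
import numpy as np

def info_nce(context, candidates, positive_idx):
    """Cross-entropy of picking the true future out of the candidate set."""
    scores = candidates @ context          # dot-product similarity
    scores = scores - scores.max()         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(-np.log(probs[positive_idx]))

# Tiny hand-built example: the positive candidate points almost the same
# way as the context; the distractors are orthogonal or opposite.
context = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],    # true future (index 0)
    [0.0, 1.0, 0.0],    # distractor
    [0.0, 0.0, 1.0],    # distractor
    [-1.0, 0.0, 0.0],   # distractor
])

loss = info_nce(context, candidates, positive_idx=0)
```

Minimizing this loss forces the encoder to produce representations in which the true future is more similar to the context than any distractor, which is what makes the features useful downstream.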

Blog post explaining the results:


Dedicated thread on CPC:


This makes sense if you consider that the label in labeled data is just one of the signals that the network uses to discover correlations. In the case of supervised learning the label is often chosen to be the most useful possible input for a particular target application so it is exceptionally useful and networks that train on labeled data tend to converge to utility much faster than networks which have to work from more indirect signals.

But labeled data can be hard to come by at scale. Naturally occurring labeled data is especially rare if you want super valuable labels for super narrow tasks, which is why we create labeled data by having humans assemble it by hand.

NNs have the property that they perform better with more data, however. So if you have enough data you will eventually overcome the advantage of the super precise and useful signal that labeling provides. Today this happens when we have a good proxy goal to train against - one that has a clean relationship to the eventual target application. But with enough data and computation every task will succumb to the advantage of big enough data. So in the long run the most powerful and useful techniques are going to be unsupervised training on massive datasets. This is what the brain does.

But unsupervised training comes later than supervised training because of the need for big data and big computation. The low hanging fruit applications are the ones where we can quickly generate useful amounts of labeled data, and those applications are what we are seeing now in the early days, but it’s going to change. In NLP we have already made the transition to the unsupervised learning regime with the advent of BERT and GPT-2. It’s now faster to train your application on top of BERT than it is to build from scratch with labeled data. Unsupervised training takes over sooner with language than with images because the signal provided by labeling is not as strong with language so the relative advantage of having labels is less in NLP than it is with images. It’ll take a bit longer with images, but the outcome will be similar: we’ll see large, general, unsupervised models used as the basis for the best performing systems. I’d guess we are 3 to 5 years from that transition with images, but it’s just a guess.
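The “train your application on top of a pretrained model” pattern reduces to a simple shape: a frozen feature extractor plus a small task head fitted on a modest labeled set. In this numpy toy, a fixed random projection stands in for the big pretrained model, and least squares stands in for a few steps of SGD; both are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_feat = 10, 32
W_pretrained = rng.normal(size=(d_in, d_feat))  # stands in for a big model

def features(x):
    # Frozen: W_pretrained is never updated during task training.
    return x @ W_pretrained

# Small labeled task dataset; the target is a simple function of the input.
X = rng.normal(size=(200, d_in))
y = X[:, 0]

# Train only the task head on top of the frozen features.
F = features(X)
head, *_ = np.linalg.lstsq(F, y, rcond=None)

mse = float(np.mean((F @ head - y) ** 2))
```

Only `head` gets trained, which is why a few hundred labeled examples can be enough once the heavy representation work is already paid for.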


I think one of the more interesting things about CPC from my brief reading is that it attempts to generalize unsupervised recognition.

Humans are really good at generalized image recognition. If you show me a TV remote once, I can find it in a near-black room with massive occlusion and extreme lighting changes from a single reference. I can also understand the 3D form of the object, which helps immensely with perspective shifts.

I think that while the current approaches obviously have massive interim applications, until I see a model that can compete with humans on single-reference classification I’m going to assume the underlying model is far more fragile than human recognition.

If I had to guess, once a model has robust unsupervised labeling capabilities, it should also be able to output a rough point cloud volume of the object from a single reference as well since it’ll have to be happening internally. Which is why 3D bounding boxes are so important IMO for Tesla’s training. I bet that not only is it providing useful spatial data critical to predicting vehicle paths but it’s also improving the robustness of their segmentation since it’s forcing the network to have a volumetric understanding of “car”. I wonder if feeding image/volume pairs would help train a more generalized model. Unfortunately we don’t have any large data sets yet, but my intuition says the group that creates one for themselves will spectacularly outperform everybody working off of Google Images and 2D bounding boxes.
