Cruise misses its milestones, and how to measure a self-driving car's performance

Blog post by Brad Templeton:

(Note: There is a typo where he says “2109” instead of “2019”.)

The original reporting from Amir Efrati at The Information, which you can read by giving your email:

According to Amir’s reporting, Cruise made an inductive error:

What is clear from the documents is that Cruise misjudged how quickly its software would reach the final milestone, known as Apollo. Engineers led by then-CEO Kyle Vogt believed that partly by hiring more people Cruise would be able to improve its software exponentially. Instead, Cruise and other major programs in the industry improved quickly at first, but then saw diminishing returns from software development.

Brief summary of Amir’s other article on Cruise (which you can’t read unless you pay $40/month for a subscription):

Cruise apparently has an abrupt braking event on average once per mile. Oof.

And Cruise vehicles reportedly induced discomfort for passengers — by braking abruptly, for example — around 10 times every 10 miles during a recent 30,000-mile stretch of testing, which was similar to a rate of around six to 12 incidents every 10 miles last year, according to The Information.

As much as I want self-driving cars to become a reality, I worry that it simply isn’t possible via the pathway Waymo, Cruise, Uber ATG, Argo, Zoox, Voyage, and others are trying to take. At least not without new fundamental advances in machine learning that allow for vastly better data efficiency.

The pathway of large-scale data collection that Tesla is trying to take (and other companies like Mobileye may try) might not work either. I worry even more about that because right now it looks like our best hope.

Without large-scale data, you can’t leverage machine learning to its fullest extent. Without leveraging machine learning to its fullest extent, it might be impossible to develop software that is superhuman at the constituent tasks of autonomous driving: computer vision, behaviour prediction, and path planning/driving policy. Even with large-scale data, it might be impossible, but it’s more likely to be possible with large-scale data than without it.

If it isn’t possible, then self-driving cars just won’t work. Not until we make new breakthroughs that push robotics and AI closer to general intelligence.

It’s disheartening to see Kyle Vogt’s optimism crash against the rocks. Especially after the same thing happened to John Krafcik; Waymo was supposed to have a robotaxi service with no safety drivers last year, and now Krafcik downplays robotaxis and emphasizes freight trucking instead.

Uber ATG’s culture was so bad that they disabled a critical safety feature — braking for pedestrians — to impress the new CEO with a smooth demo ride. I’m not even sure Uber should be allowed to continue public road testing. I definitely don’t trust Uber to do a competent job at self-driving car development. I’m sure they have some fantastic engineers, but the culture is bad. The problems at Uber ATG may not be isolated from the cultural problems at Uber more broadly, like HR and management covering up sexual harassment.

I don’t know much about Argo or Ford Autonomous Vehicles. I don’t think they are doing anything fundamentally different or better than Waymo or Cruise. So I don’t see why Argo/Ford would be any more successful. Same for other startups like Zoox.

Voyage and Nuro are trying to solve somewhat more limited and easier problems first (gated retirement communities and grocery delivery, respectively), but I don’t see how that ultimately makes it easier to solve the robotaxi problem.

Wayve is taking a fundamentally different approach — end-to-end learning — but end-to-end learning seems to require large-scale data in the same way that all machine learning requires large-scale data. Uber ATG tried doing end-to-end learning for a year and abandoned it. If this approach works, I think it will be at the scale of millions of cars, not hundreds.

So, I see Waymo’s and Cruise’s inability to meet their goals as telling us that small data self-driving car companies in general will not be able to meet their goals. Not until Waymo and Cruise do.

Big data self-driving car companies (of which Tesla is the only confirmed example so far) might find more success. Let’s hope so because if not, it bodes poorly for everyone.

The worst thing for everyone would be if self-driving cars aren’t possible.

If Tesla (and/or others) succeed with the big data approach, that would be a good thing for Waymo and Cruise. GM could start equipping its production cars with Cruise hardware. Alphabet could acquire a car company to vertically integrate with Waymo, or forge partnerships to integrate Waymo’s hardware and software with mass produced vehicles. Or possibly even sell Waymo to a car company or a consortium of car companies, although it would be hard to get $100 billion (which is what Waymo reportedly values itself at).

These companies are competing with each other, but more importantly they are also collectively competing against time to deploy a commercial product before the people providing capital give up. So Waymo and Cruise should root for Tesla’s success. And vice versa, since if Waymo and Cruise can do it with small data, surely Tesla can do it with big data.


In particular, the report states that the forecast is that Cruise will, by the end of 2019, have a vehicle that performs at between 5% and 11% of the safety level of average human driving, when it comes to frequency of crashes.

I was just thinking about this. If 5-11% of human level can be achieved by the end of the year, that’s not bad, actually. An improvement of roughly 9-20x to meet human safety and 18-40x to double it doesn’t sound out of this world. It’s better than the 100x I’ve heard from Mobileye’s Amnon Shashua and others.
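
The multiples are just simple arithmetic on the report's 5-11% figure; here's a quick back-of-envelope check (the figures are from the quoted report, the rest is my own calculation):

```python
# Improvement factors implied by Cruise being at 5-11% of human crash-frequency safety.
human_relative = [0.05, 0.11]

for level in human_relative:
    to_match = 1.0 / level   # factor needed to reach human parity
    to_double = 2.0 / level  # factor needed to reach 2x human safety
    print(f"{level:.0%} of human level -> about {to_match:.0f}x to match, {to_double:.0f}x to double")
```

So the range works out to about 9-20x to match human safety and 18-40x to double it.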

I would love to know what the figure is at Waymo, and how much it’s been improving year by year. Has Waymo’s rate of improvement slowed? Plateaued, even? Or do they see steady improvement that, if extrapolated naively, would put them at 100%+ human driving ability within 10 years?

Disheartening to hear. If I remember correctly, the MIT study by Lex suggested that Tesla saw 1 disengagement every 10 miles. Lots of caveats to that, obviously, since we don’t know what the disengagements were for, and this was using data from a while ago.

I would love to see the dashboard of data that Elon sees every week…

Me too!

I couldn’t find in the paper whether this was Hardware 1 Autopilot (using the Mobileye EyeQ3 vision system) or Hardware 2 Autopilot (with the Tesla Vision neural networks).

Cruise etc., though, put in very few highway miles. Tesla only engages on pretty simple roads, and even then I was up to my ears yesterday with NoAP’s confirmation-free automatic lane changes.

Tesla, even on the latest firmware, will still change lanes into an obviously fast oncoming car and get rear-ended if left to its own devices. I guarantee Waymo and Cruise wouldn’t, with LIDAR.

Re: Big Data. Unfortunately training becomes more difficult the larger the dataset and provides at best logarithmic returns. After 1M+ samples, diversity, hard negatives, and clean labels become more important. I don’t think big data is sufficient to bring AVs to the road. We’ll need significant advances in multi-object tracking and prediction architectures, which are underserved by deep learning today.

I also think we’ll need better training hardware and strategies. Even with TPUs or multi-GPU strategies like all-reduce, perception data (images, point clouds) is just too large to train on quickly enough to test hypotheses cost-effectively.


Thank you for articulating this. This is a somewhat subtle point that was initially not obvious to me. What does more data mean? More doesn’t just have to mean more of the same. More can mean you now have 1,000 examples of a rare semantic class where previously you had zero. Or more can mean surfacing only the false negatives and false positives (e.g. for something like cut-in detection) rather than just uploading every example (e.g. of a cut-in) you encounter.
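
The "surface only the errors" idea can be sketched very simply: upload a clip only when the model's prediction disagrees with what actually happened a moment later. This is a hypothetical illustration, not Tesla's actual trigger logic; the function and event names are made up:

```python
# Hypothetical fleet-side filter: keep correct predictions on the car,
# upload only the false positives and false negatives for training.

def should_upload(predicted_cut_in: bool, observed_cut_in: bool) -> bool:
    """Upload only disagreements between prediction and outcome."""
    return predicted_cut_in != observed_cut_in

# A fleet sees mostly correct predictions; only the rare errors get sent back.
events = [
    (True, True),    # true positive  -> stays on car
    (False, False),  # true negative  -> stays on car
    (True, False),   # false positive -> upload
    (False, True),   # false negative -> upload
]
uploads = [e for e in events if should_upload(*e)]
print(len(uploads))  # -> 2
```

The point is that the upload rate scales with the error rate, not the raw mileage, so the collected data stays concentrated on exactly the cases the model gets wrong.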

When studying human driving behavior, it appears that it originates at two rather distinct levels: the conscious and the subconscious. The most utilized level happens subconsciously. From our earliest driver’s training days to having driven a car for decades, driving is mostly a conditioned response skill. We often rely on this “mindless” mode of driving to get to a destination along a route which has been traveled many times, leaving our thinking mind to do other things. Driving is bumped up to the conscious level when triggered by something that doesn’t fit into our normal driving pattern.

Conditioned-response driving is so firmly programmed in our subconscious that it can cause embarrassment. Like leaving for work in the morning to a new job and, after having driven several miles, realizing that you’re on the route to your former employer. Even little things like flipping the stalk up to go into reverse (which is what you do in a Tesla) but instead activating the wipers because you’re in your other car. We do most of our driving “automatically” with little conscious effort.

This is the kind of driving that I assume will be achieved by a well-trained autonomous car. Autopilot will never be “conscious” of the driving experience. Autonomous vehicles will respond to visual inputs the way they are conditioned to respond. As long as nothing atypical happens en route, the car will carry out its mission beautifully.

The question is whether this will be enough to achieve FSD with an acceptable degree of safety. Some driving tasks require more than simple conditioned-response behavior. I think that’s where the real challenge will be, and I don’t think we are even remotely close to solving that level of AI. For now, the cognitive problem-solving capability in Waymo’s autonomous driving system is provided by the safety driver they continue to place in their robotaxis.

I think this is the fear we all have, that lacking a kind of artificial consciousness, a built-in reasoning layer on top of the instinctive driving layer, there will always come a situation that the car will not be able to handle. The best we can do is to have some kind of default behavior that keeps the car and its passengers safe from injury. But even that is a difficult problem to solve.

It’s people’s experience with these difficult driving situations that convinces them that FSD will never be achievable. They could be right. Getting to SAE Level 3 for generalized self-driving may be as far as we can go until some new advance in AI is developed. Until we develop a self-driving car that recognizes what it doesn’t know, I think even Level 3 will be hard to achieve.


I concur with your description of driving and with the notion that the challenging part right now is extending the envelope of situations which are reliably managed by the ‘conditioned’ responses of the system - i.e. those that are managed without having to revert to some higher level analytical system. As you point out humans drive almost entirely via conditioned responses and an experienced driver who is driving in their usual environment will almost never have to employ anything other than conditioned responses. We get so used to this that it becomes the cause of various accidents where we didn’t drop out of conditioned response mode when we should have. Nonetheless, humans perform remarkably well when you exclude the situations where they are either impaired or distracted. And I think that should be considered a lower bound for what is possible with current approaches to self-driving given enough scaling.

Is that enough for robotaxi operation with zero interventions, ever? Maybe, or maybe not. But even if it’s not, it’s still very useful. At a minimum it would be safer than humans by a large margin - 10x or 100x when you look at the causes of serious accidents. Which would make it a tremendous ADAS system, a major aid to people who drive for a living (including taxi drivers), and a boon to society in terms of reduced accidents. It’s also good enough to operate without supervision in a lot of circumstances and business models.

What ‘more data’ means has been studied extensively, and while it’s not something that can be simply and accurately described it is more or less understood. In the kind of perception problems that are limiting for AP today you can expect an approximately linear relationship between -log(err) and log(data_size) assuming that the distribution of data follows the natural distribution. This appears to scale without limit, so each doubling of data will provide the same fractional reduction in error rate - forever.
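
That linear relationship between -log(err) and log(data_size) is just a power law, err = C * n^(-k), and it implies that every doubling of data cuts error by the same constant fraction, 2^(-k). A toy illustration (C and k are made-up constants, chosen only to show the shape):

```python
# Power-law error scaling: err = C * n**(-k),
# equivalently -log(err) = k*log(n) - log(C).
C, k = 1.0, 0.3  # illustrative constants, not measured values

def err(n):
    """Error rate as a function of training-set size n."""
    return C * n ** (-k)

# Each doubling of data cuts error by the same constant fraction, 2**(-k):
for n in [1e4, 1e5, 1e6]:
    ratio = err(2 * n) / err(n)
    print(f"n={n:.0e}: err={err(n):.4f}, ratio after doubling = {ratio:.3f}")
```

The ratio is the same at every scale, which is exactly the "same fractional reduction forever" property described above.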

But it’s becoming clear that there are ways to beat this and reduce error at a faster rate by adjusting the distribution of data which is included in the training set. It’s likely that this is what the brain does - resources are distributed according to how important various phenomena are and not simply by how frequently they occur. As Karpathy put it “you only need so much data for driving straight in your lane - at some point the neural network just gets it”.

This is what Tesla’s system is doing - they use the fleet to specifically sample interesting cases which are at the margin of what the system can currently do and feed those into the training pool. They don’t sample stuff that is already over-represented in the pool and they don’t bother sampling phenomena which are well outside the current set of capabilities. As the capability envelope expands they are always adding data that gives the most incremental benefit considering the current system capabilities. This approach allows Tesla to reduce the error rate faster than is suggested by the simple -log(err) ≈ K·log(data_size) scaling. At the same time they are working to dramatically scale up data_size and their ability to manage it by extending their data-center capability. Project Dojo will be focused exactly on that task.
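
"Sampling at the margin" is essentially uncertainty sampling from active learning. A hypothetical sketch of what curating the training pool that way might look like (the clip names and thresholds are invented for illustration):

```python
# Hypothetical margin sampling: keep only clips where the model is unsure,
# skipping both the already-mastered and the far-beyond-capability cases.

def select_for_training(candidates, low=0.2, high=0.8):
    """candidates: list of (clip_id, model_confidence). Thresholds are made up."""
    return [cid for cid, conf in candidates if low <= conf <= high]

pool = select_for_training([
    ("easy_lane_keep", 0.99),    # over-represented, skip
    ("tricky_cut_in", 0.55),     # at the margin, keep
    ("odd_construction", 0.35),  # at the margin, keep
    ("total_mystery", 0.05),     # beyond current capability, skip
])
print(pool)  # -> ['tricky_cut_in', 'odd_construction']
```

As the model improves, the confidence band shifts over new phenomena, so the pool keeps tracking the edge of the capability envelope rather than the raw frequency of events.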

The above only considers scaling with data and ignores process refinements, algorithmic improvements, and improvements in domain understanding on the part of the development staff. These other items each multiply the rate of improvement beyond what the optimized data scaling already provides.

You wonder what Elon sees on his dashboard? He’s seeing the trends in all these metrics; that they improve dramatically year after year and that they show no sign of tapering. This is what provides so much confidence that they will succeed. But what he isn’t seeing, and what Waymo/Cruise also can’t see, is where the point is that allows for unfettered use in a robotaxi. We won’t know where that line is until we cross it.


I’ve had this experience too, but I see it a bit differently. AP’s improvement curve isn’t monotonic. Sometimes it gets worse, occasionally dramatically worse, when a new capability is brought online. When cut-in was activated there was a rash of phantom braking experienced by AP users. My sister and brother-in-law (who are on the Alpha program) were complaining mightily for two weeks and I got to experience it myself a bit later. But over the next few weeks the phantom events dissipated, and after they did, the new cut-in detection ability dramatically raised the bar on how well AP behaved in traffic.

This isn’t an unusual occurrence. I’ve seen it regularly in the four years that I’ve been closely tracking AP performance. As a result my habit is to compare current AP to AP from 3 or 6 months previously and to do so holistically - don’t get caught up in focusing on the one thing that is giving you problems right now, consider the whole picture.

When viewed in this light AP’s progress is shockingly rapid. It’s far from perfect, but when I think about it compared to 6 months ago it’s amazing. And that’s been true at every point in time since I bought my first AP car back in 2015.


I, too, am amazed at how much AP has improved over the 8 months that I’ve used it. One of the advantages that Tesla has over Waymo, Cruise, and others who are aiming directly for Level 4/5 is that for Tesla, it’s not an all or nothing bet. As you say, AP will become a great ADAS even if Level 4 eludes them. That would be a shame, of course, because it would preclude their Tesla Network business prospects until the AI ceiling can be broken through. Still, as a consumer product, AP is on a trajectory to becoming a ‘must have’ feature which is bound to have a positive effect on Tesla sales.

Waymo appears to have bumped into a ceiling on their way to FSD. They’ve been at this goal for 10 years! I just hope that the reason is their mistaken approach, not some fundamental limitation in our understanding of AI. It’s frustrating to find so little information about what Waymo is doing to address their problems. For a company valued at what, $50 billion?, you’d think Wall Street and the media would be all over this company for more transparency.

If Enhanced Summon is all it’s promised to be, Tesla will have the first truly driverless car on the market. Granted, its operation will be limited to parking lots and within some radial distance from the summoner, but still a major first. It’s brilliant! This bottom-up introduction of self-driving into the public’s consciousness is measured and, I think, powerful. One of the biggest hurdles facing autonomous vehicle manufacturers is winning public acceptance. Enhanced Summon also lends itself to solving the problem that Waymo has navigating to pickup and drop-off points off the road network.

Navigating a parking lot seems like it would be a difficult problem to solve. If Tesla can pull it off, it’s a strong indication that they can achieve general self-driving in more complex situations at higher speeds. Parking lots present dozens of edge cases that AP will have to manage. I suspect that’s why it’s taking longer to release it than was initially estimated. I hope it turns out to be an impressive, useful, and safe feature.
