Toyota AI: Self-supervised learning and self-driving cars

#1

This is super interesting!

That’s where self-supervised learning comes in. It is a unique form of ML that is a variant of unsupervised learning. An example we’ve seen in a number of startups is the ability to predict the depth of a scene from a single image, by using prior knowledge about the geometry of scenes. They are essentially creating geometric rules that supervise the network automatically.
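To make "geometric rules that supervise the network" a bit more concrete, here is a minimal toy sketch. It is not from the article: it swaps in a stereo pair as the source of geometry, and `photometric_loss` plus the 1-D "images" are names and data made up for illustration. The idea is that if a predicted disparity (which is inversely related to depth) is right, shifting the right view by it should reproduce the left view, so the reconstruction error can act as a training signal with no human labels.

```python
def photometric_loss(left, right, disparity):
    """Mean absolute error between `left` and `right` warped by `disparity`.

    left, right: lists of pixel intensities for one image row (toy 1-D images).
    disparity:   predicted per-pixel horizontal shift, in whole pixels.
    """
    errors = []
    for x, d in enumerate(disparity):
        src = x - d  # where this left-view pixel should appear in the right view
        if 0 <= src < len(right):  # ignore pixels that fall outside the image
            errors.append(abs(left[x] - right[src]))
    return sum(errors) / len(errors) if errors else 0.0


# Toy scene: the right view is the left view shifted by 2 pixels.
left = [0, 0, 10, 50, 90, 50, 10, 0, 0, 0]
right = left[2:] + [0, 0]

good = photometric_loss(left, right, [2] * len(left))  # correct disparity
bad = photometric_loss(left, right, [0] * len(left))   # wrong disparity
print(good, bad)
```

The correct disparity gives a lower loss than the wrong one, which is the signal a network could be trained to minimize. Real systems would use sub-pixel warping and more robust losses.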

We’ve also seen a few startups that can predict future frames of video from past ones, by using prior knowledge about time and causal reasoning. Behind every two-dimensional (2D) image, there is a three-dimensional (3D) world that explains it. When the 3D world is compressed down to a single 2D image, a lot of data is lost, and the way that it is compressed is not random. If you can harness that relationship between the 3D world and its compression into 2D images, you can then work backwards and input an image that allows an AV to understand the 3D world.
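For the future-frame idea, the "labels" come straight from the video itself: the frame that actually came next is the target. A minimal sketch, assuming frames are already decoded into a list (`make_prediction_pairs` and the `context` parameter are hypothetical names, not anything from the article):

```python
def make_prediction_pairs(frames, context=3):
    """Turn a frame sequence into (past_frames, next_frame) training pairs."""
    pairs = []
    for t in range(context, len(frames)):
        past = frames[t - context:t]  # the `context` frames before time t
        target = frames[t]            # the frame the model should predict
        pairs.append((past, target))
    return pairs


# Strings stand in for image arrays to keep the example small.
frames = ["f0", "f1", "f2", "f3", "f4"]
pairs = make_prediction_pairs(frames, context=3)
print(pairs)
```

Every video of length N yields N minus `context` training examples without anyone annotating anything.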

Another unique self-supervised approach that some startups and OEMs have been testing is placing cameras in their cars and comparing the video footage to the driver’s interaction with each specific road condition. Essentially, every time a driver is on the road, he or she is labeling driving data for the company through natural/self-supervision — without the company having to pay a human to physically annotate what the driver’s reaction was to the road conditions. Also, the spatial/temporal coherence of videos has a lot of latent structure that could be explored. Toyota AI Ventures is very interested in meeting with startups that can think of innovative ways to label “open space” on the road by harnessing these natural behaviors of data. Supervised or not, it needs to scale!
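One way to picture the driver-as-labeler idea is as a timestamp join between camera frames and the car's control log. This is only a sketch under assumptions: the log format `(timestamp, steering, brake)`, the function name `label_frames_with_controls`, and nearest-in-time matching are all made up here, and the log is assumed to be non-empty and sorted by time.

```python
from bisect import bisect_left


def label_frames_with_controls(frame_times, control_log):
    """Attach the nearest-in-time driver control sample to each frame.

    control_log: non-empty list of (timestamp, steering, brake), sorted by timestamp.
    """
    times = [t for t, _, _ in control_log]
    labeled = []
    for ft in frame_times:
        i = bisect_left(times, ft)
        # Compare the control samples just before and just after the frame.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - ft))
        _, steering, brake = control_log[nearest]
        labeled.append({"frame_time": ft, "steering": steering, "brake": brake})
    return labeled


control_log = [(0.00, 0.0, 0.0), (0.10, -0.1, 0.0), (0.20, -0.2, 0.8)]
labeled = label_frames_with_controls([0.02, 0.19], control_log)
print(labeled)
```

Each frame ends up paired with what the driver actually did at that moment, which is the "free" label the paragraph above describes.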

Also, information on the cost of labelling images:

approximately $4-$8 per image of 1920x1080 density for semantic segmentation, depending on quality of service

Semantic segmentation involves assigning every pixel in an image to a category (e.g. road, sidewalk, vehicle, pedestrian, tree, etc.), so it is probably more labour intensive than, say, drawing bounding boxes around traffic lights and labelling the colour of the light.
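A rough way to see the difference in annotation volume, as a sketch only (the class IDs and box coordinates below are invented for illustration):

```python
WIDTH, HEIGHT = 1920, 1080

# Semantic segmentation: one class ID for every pixel in the image.
labels_per_segmented_image = WIDTH * HEIGHT

# Toy 3x4 mask (hypothetical class IDs: 0 = road, 1 = sidewalk, 2 = vehicle).
tiny_mask = [
    [1, 0, 0, 1],
    [1, 0, 2, 1],
    [1, 0, 2, 1],
]

# Bounding box: a class name plus four numbers (x, y, width, height).
traffic_light_box = {"label": "traffic_light_red", "box": (912, 140, 24, 60)}
values_per_box = len(traffic_light_box["box"]) + 1

print(labels_per_segmented_image, values_per_box)
```

In practice annotators trace region outlines rather than clicking individual pixels, so the per-pixel count overstates the manual effort, but every pixel still has to end up with a correct class.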


#2

Makes me think of Elon’s comment:

And we’re really starting to get quite good at not even requiring human labelling. Basically the person, say, drives the intersection and is thereby training Autopilot what to do.

Could be self-supervised learning for computer vision tasks!

Also note this part of Tesla’s Autopilot internship postings:

Devise methods to use enormous quantities of lightly labelled data in addition to a diverse set of richly labelled data.


#3

I could generate a few hundred thousand photorealistic, perfectly segmented 1080p images for less than $4 a frame in CG… Hmmm…

Also I find it interesting that both of those asks are exactly what wayze is marketing…


#4

$4-$8? Is that right? How long do you think it takes someone to label an image to those requirements? If it takes 15 minutes, then you’ve got at minimum a $16/hr job. Seems pretty pricey.
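Spelling out that arithmetic as a quick sketch (the 15 minutes is just a guess, and `implied_hourly_rate` is a made-up helper name):

```python
def implied_hourly_rate(price_per_image, minutes_per_image):
    """Revenue per hour of labelling if every image takes `minutes_per_image`."""
    return price_per_image * 60 / minutes_per_image


low = implied_hourly_rate(4, 15)   # bottom of the quoted $4-$8 range
high = implied_hourly_rate(8, 15)  # top of the quoted range
print(low, high)
```

Note this is what the labelling service bills per hour of work, not necessarily what the person doing the labelling gets paid, and the number swings a lot with the time-per-image guess.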

This seems really interesting. Self-supervised learning seems to fit with the Tesla comments you’ve highlighted. I’m trying to think about how this might be applied in practice. What is the label when a driver slams on the brakes because Autopilot didn’t recognize a truck? I can see it being a flag, but a label that is auto-applied? What other kinds of information could be used? I’m not imaginative or knowledgeable enough about ML to think this through.


#5

This is my first time hearing about this, so I’m also struggling to imagine how it works in practice.


#6

Here is a paper on self-supervised/unsupervised depth mapping from monocular video: https://arxiv.org/pdf/1904.04998.pdf

We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal.

It looks like supervised learning approaches to depth estimation significantly outperform this latest, state-of-the-art unsupervised approach.

But I wonder if:

  1. Driver input data is a useful signal for training depth mapping. (The above-linked paper just used video and the information inherent within.)

  2. Unsupervised learning can catch up if you train on absurd quantities of video (especially in combination with other signals like driver input and also maybe radar, ultrasonics, IMU, speedometer, GPS, etc.).

  3. Unsupervised learning can be helpful in augmenting supervised learning.
