Elon Musk: Autopilot rewrite

Video > video grammatry > 3D segmentation, Depth and labeling (¿Automatic?) >>> Train New model against videogrammatry 3D scene as ground truth with all vehicle sensors input together as single tensor, more route planning in model output.

Or at least that is what I heard. (And consistent with previous Tesla statements).

1 Like

Can you explain for folks like me who aren’t familiar with videogrammatry how you think this might work?

Just like how everyone else is labeling lidar data, as was demonstrated at autonomy day, you take, say, 1,000 frames and use the parallax of the car motion to build a point cloud. https://youtu.be/Ucp0TTmvqOE?t=8297

Not sure how they are doing moving objects though beyond just filtering them out. Or relying on multi cam Depth estimation.


Yes. I have the same interpretation. Some process is run on the video which correlates known things about the vehicle and scene with all the different camera views to synthesize a dynamic 3d representation of the scene. This is the vector space representation. Once that representation is constructed a human can perform a single labeling pass on it and then those labels can be pushed back to the original pixel space frames that were used to compose the scene. So you get many frames of labels out of a single labeled scene. These labels are useful for training conventional NNs that work in pixel space.

The video-grammetry processing is very computation intensive so you can’t run it in real time on a car. It has to be run as a post processing operation in a data center. But the output from it can be used both to improve labeling efficiency and to facilitate creation of a training corpus for an NN that could directly generate the vector space representation.

In the long run it’s that second NN that you want to have driving your car but it’s much harder to get that network working so in the short run you use networks that output pixel space labels and heuristics that work with those pixel space labels to make driving decisions. When you have the vector space network working well it takes you to the next level. Vector space representation is inherently superior to pixel space because it maps well onto physics - which as a representation is both extremely simple and extremely accurate.


Thanks guys. Really helpful to hear the explanation

1 Like