This is super interesting!
That’s where self-supervised learning comes in. It is a variant of unsupervised learning in which the training signal comes from the data itself rather than from human annotation. An example we’ve seen in a number of startups is the ability to predict the depth of a scene from a single image, by using prior knowledge about the geometry of scenes. They are essentially creating geometric rules that supervise the network automatically.
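To make "geometric rules that supervise the network" a bit more concrete, here is a minimal sketch of one well-known formulation: photometric consistency between a rectified stereo pair. The quoted text doesn't say which rules these startups actually use, so treat this as an assumption. The function names and the toy 1D "images" are made up for illustration. The idea is that a correct disparity (which is inversely proportional to depth) lets you rebuild the left image from the right one, so the reconstruction error can serve as the training loss with no human labels.

```python
def reconstruct_left(right_row, disparity_row):
    """Rebuild a left-image row by sampling the right row at x - d.

    Nearest-neighbour sampling, clamped at the image border.
    Assumes a rectified stereo pair (hypothetical setup).
    """
    width = len(right_row)
    out = []
    for x, d in enumerate(disparity_row):
        src = min(max(x - int(round(d)), 0), width - 1)
        out.append(right_row[src])
    return out


def photometric_loss(left_row, right_row, disparity_row):
    """Mean absolute difference between the real and the rebuilt left row."""
    recon = reconstruct_left(right_row, disparity_row)
    return sum(abs(a - b) for a, b in zip(left_row, recon)) / len(left_row)


# Toy scanlines: the left view is the right view shifted by one pixel.
right_row = [10, 20, 30, 40, 50, 60]
left_row = [10, 10, 20, 30, 40, 50]

loss_good = photometric_loss(left_row, right_row, [1] * 6)  # correct disparity
loss_bad = photometric_loss(left_row, right_row, [0] * 6)   # wrong disparity
print(loss_good, loss_bad)
```

In a real system the disparity would come from a network looking at a single image, and minimising this loss over many stereo pairs is what trains it. The second camera is only needed at training time.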
We’ve also seen a few startups that can predict future frames of video from past ones, by using prior knowledge about time and causal reasoning. Behind every two-dimensional (2D) image, there is a three-dimensional (3D) world that explains it. When the 3D world is compressed down to a single 2D image, a lot of information is lost, and the way that it is compressed is not random. If you can harness that relationship between the 3D world and its compression into 2D images, you can then work backwards from an input image to recover the 3D structure that an AV needs to understand.
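For the future-frame case, the "label" is simply the next frame of the same video. Here is a rough sketch of how training pairs could be cut from raw footage. The function name and the `context` parameter are my own, and the frames are placeholder strings standing in for images.

```python
def make_next_frame_pairs(frames, context=2):
    """Turn an ordered frame sequence into (past_frames, target_frame) pairs.

    The target is taken from the video itself, so no human annotation
    is involved. `context` is the number of past frames per example.
    """
    if context < 1:
        raise ValueError("context must be at least 1")
    pairs = []
    for t in range(context, len(frames)):
        pairs.append((frames[t - context:t], frames[t]))
    return pairs


video = ["f0", "f1", "f2", "f3", "f4"]  # placeholders for real frames
pairs = make_next_frame_pairs(video, context=2)
for past, target in pairs:
    print(past, "->", target)
```

Every extra minute of footage produces more training examples for free, which is the scaling property the quote is getting at.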
Another unique self-supervised approach that some startups and OEMs have been testing is placing cameras in their cars and comparing the video footage to the driver’s interaction with each specific road condition. Essentially, every time a driver is on the road, he or she is labeling driving data for the company through natural/self-supervision — without the company having to pay a human to physically annotate what the driver’s reaction was to the road conditions. Also, the spatial/temporal coherence of videos has a lot of latent structure that could be explored. Toyota AI Ventures is very interested in meeting with startups that can think of innovative ways to label “open space” on the road by harnessing these natural behaviors of data. Supervised or not, it needs to scale!
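The driver-as-labeller idea boils down to pairing each camera frame with whatever the driver was doing at that moment. Below is a minimal sketch, assuming you have frame timestamps and a time-sorted log of steering angles. Both data layouts and all names are hypothetical, and a real pipeline would also need clock synchronisation and more signals such as brake and throttle.

```python
from bisect import bisect_left


def label_frames_with_controls(frame_times, control_log):
    """Attach the nearest-in-time steering angle to each frame timestamp.

    control_log: list of (timestamp, steering_angle), sorted by timestamp.
    Returns a list of (frame_time, steering_angle) pairs.
    """
    if not control_log:
        raise ValueError("control_log is empty")
    times = [t for t, _ in control_log]
    labeled = []
    for ft in frame_times:
        i = bisect_left(times, ft)
        # The nearest entry is either just before or just after the frame.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - ft))
        labeled.append((ft, control_log[nearest][1]))
    return labeled


frame_times = [0.00, 0.10, 0.20]                         # seconds
control_log = [(0.01, -1.5), (0.09, 0.0), (0.21, 2.0)]   # (time, angle in degrees)
labels = label_frames_with_controls(frame_times, control_log)
print(labels)
```

Each resulting pair is a training example where the driver's reaction plays the role of the annotation.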
Also, some information on the cost of labelling images:
Approximately $4-$8 per image at 1920x1080 resolution for semantic segmentation, depending on quality of service.
Semantic segmentation involves assigning every pixel in an image to a category (e.g. road, sidewalk, vehicle, pedestrian, tree, etc.), so it is probably more labour intensive than, say, drawing bounding boxes around traffic lights and labelling the colour of the light.
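To put rough numbers on that, here is a quick back-of-the-envelope calculation. The $4-$8 range and the 1920x1080 resolution are the figures quoted above. The 10,000-image dataset size is a made-up example, not a figure from any source.

```python
WIDTH, HEIGHT = 1920, 1080
COST_PER_IMAGE_USD = (4, 8)  # low and high end of the quoted range


def labels_per_image(width, height):
    """Semantic segmentation needs one category per pixel."""
    return width * height


def annotation_cost_range(num_images, cost_per_image=COST_PER_IMAGE_USD):
    """Return the (low, high) total labelling cost in USD."""
    low, high = cost_per_image
    return num_images * low, num_images * high


pixels = labels_per_image(WIDTH, HEIGHT)
low, high = annotation_cost_range(10_000)  # hypothetical dataset size
print(f"{pixels:,} pixel labels per image")
print(f"${low:,} to ${high:,} for 10,000 images")
```

That is about two million per-pixel decisions for every image, versus a handful of boxes for a bounding-box task, which fits the guess that segmentation is the more labour-intensive job.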