This suggests that the way forward in AI is what I call self-supervised learning. It’s similar to supervised learning, but instead of training the system to map data examples to classifications, we mask part of each example and ask the machine to predict the missing pieces. For instance, we might mask some frames of a video and train the machine to fill in the blanks based on the remaining frames.
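To make the masking idea concrete, here is a toy sketch in NumPy. The "video" is just a stack of frames with drifting brightness, and the "model" is a trivial interpolation between the neighbouring frames; a real system would train a learned predictor on the masked input, but the setup (hide part of the data, use the hidden part as the training target) is the same.

```python
import numpy as np

def mask_frames(video, mask_idx):
    """Return a copy of `video` with the frames at `mask_idx` zeroed out,
    plus the original frames kept aside as prediction targets."""
    masked = video.copy()
    targets = video[mask_idx].copy()
    masked[mask_idx] = 0.0
    return masked, targets

# Toy "video": 8 frames of 4x4 pixels whose brightness drifts upward.
rng = np.random.default_rng(0)
video = np.linspace(0.0, 1.0, 8)[:, None, None] + 0.01 * rng.standard_normal((8, 4, 4))

# Self-supervised task: hide frame 4, then predict it from the rest.
masked, targets = mask_frames(video, [4])

# Trivial stand-in "model": interpolate the hidden frame from its neighbours.
prediction = 0.5 * (masked[3] + masked[5])
error = np.abs(prediction - targets[0]).mean()
```

Note that no human labels appear anywhere: the supervision signal (`targets`) is manufactured from the data itself, which is the whole point of the approach.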
This approach has been extremely successful lately in natural language understanding. Models such as BERT, RoBERTa, XLNet, and XLM are trained in a self-supervised manner to predict words missing from a text. Such systems hold records in all the major natural language benchmarks.
In 2020, I expect self-supervised methods to begin learning features of video and images. Could there be a similar revolution in high-dimensional continuous data such as video?
I wonder if it would be possible for HW3 Teslas to run video prediction on the FSD Computer while driving in manual mode, when everything else is turned off. That’s 144 TOPS just for video prediction. It wouldn’t have to be real time: record a video clip, attempt to predict future frames, and if successful, discard the clip. If unsuccessful, save the clip as a training example and upload it when on Wi-Fi.
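The triage loop described above can be sketched in a few lines. This is purely illustrative: the linear-extrapolation predictor and the error threshold are stand-in assumptions, not anything Tesla is known to run; a real pipeline would substitute a learned video-prediction model.

```python
import numpy as np

# Assumed tolerance for "prediction succeeded"; would be tuned per model.
PREDICTION_ERROR_THRESHOLD = 0.05

def predict_next_frame(clip):
    """Stand-in predictor: extrapolate linearly from the last two frames.
    A real system would run a learned video-prediction model here."""
    return 2.0 * clip[-1] - clip[-2]

def triage_clip(clip):
    """Return 'discard' if the predictor handles the clip well, else 'save'
    so the clip can be kept as a training example and uploaded later."""
    context, actual = clip[:-1], clip[-1]
    predicted = predict_next_frame(context)
    error = np.abs(predicted - actual).mean()
    return "discard" if error < PREDICTION_ERROR_THRESHOLD else "save"

# Smooth clip: brightness drifts linearly, so it is easy to predict.
smooth = np.linspace(0.0, 1.0, 6)[:, None, None] * np.ones((6, 4, 4))

# Surprising clip: the last frame jumps abruptly, so prediction fails.
surprising = smooth.copy()
surprising[-1] += 0.5
```

The appeal of this scheme is that the fleet filters its own data: only clips the current model finds surprising ever consume upload bandwidth or training compute.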