In this project, we propose to employ “self-supervision”: using the data as its own supervisory signal. The team will also explore the use of temporal and spatial context as a source of free and plentiful supervision for training a rich visual representation. This will be achieved in two ways: predicting the relative arrangement of pairs of patches, and predicting the actual content of patches from their context. The team will build upon and extend preliminary work to consider arrangement prediction not only within a single image, but more broadly the spatial and temporal arrangement of patches across entire scenes. Outdoor street data, already collected for many driving applications, would be a rich source of imagery for such a training approach. This will make the arrangement prediction harder, leading to a representation that is both better and more specific to the driving task.
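To make the first task concrete, the following is a minimal sketch of how patch-pair training examples could be generated from a single image. The layout (a 3×3 grid of patches with a small gap, labelling one of the 8 neighbour positions relative to the centre) and all names here are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of the self-supervised patch-arrangement task (assumed setup:
# 3x3 patch grid; the label is the neighbour's position relative to the
# centre patch, giving an 8-way classification problem).
import numpy as np

def sample_patch_pair(image, patch_size=32, gap=8, rng=None):
    """Return (centre_patch, neighbour_patch, label) from one image.

    label is in 0..7 and indexes the neighbour's cell in the 3x3 grid
    (row-major order, skipping the centre cell).
    """
    rng = np.random.default_rng() if rng is None else rng
    step = patch_size + gap
    h, w = image.shape[:2]
    # Top-left corner of the 3x3 grid, chosen so the grid fits the image.
    y0 = int(rng.integers(0, h - 3 * step + 1))
    x0 = int(rng.integers(0, w - 3 * step + 1))
    # The centre patch occupies grid cell (1, 1).
    cy, cx = y0 + step, x0 + step
    centre = image[cy:cy + patch_size, cx:cx + patch_size]
    # Pick one of the 8 surrounding cells as the neighbour.
    label = int(rng.integers(0, 8))
    offsets = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    r, c = offsets[label]
    ny, nx = y0 + r * step, x0 + c * step
    neighbour = image[ny:ny + patch_size, nx:nx + patch_size]
    return centre, neighbour, label
```

A classifier would then be trained to predict `label` from the patch pair; because solving this reliably requires recognising objects and scene layout, the learned features transfer to downstream tasks.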
In the second part of the project, the team will work to predict the content of parts of the scene directly from their surroundings. This task is much more challenging than predicting the spatial arrangement, and potentially provides a much stronger supervisory signal. To succeed at it, a model must both understand the content of the image and produce a plausible hypothesis for the missing parts.
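As a sketch of the second task, the snippet below shows one way to turn an image into a context-prediction (inpainting) training pair: mask out a central region and keep its original pixels as the reconstruction target. The central-square masking and the function name are assumptions for illustration only.

```python
# Sketch of the context-prediction (inpainting) setup: zero out a
# central square region and keep it as the reconstruction target.
import numpy as np

def make_inpainting_pair(image, hole_frac=0.25):
    """Split one image into (context, target, mask).

    context: the image with a central rectangle zeroed out,
    target:  the original pixels of that rectangle,
    mask:    boolean array marking the hole.
    """
    h, w = image.shape[:2]
    hh, hw = int(h * hole_frac), int(w * hole_frac)
    y0, x0 = (h - hh) // 2, (w - hw) // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y0 + hh, x0:x0 + hw] = True
    target = image[y0:y0 + hh, x0:x0 + hw].copy()
    context = image.copy()
    context[mask] = 0  # remove the hole's pixels from the input
    return context, target, mask
```

A generative model would then be trained to predict `target` from `context`, with a reconstruction loss (and optionally an adversarial one) providing the supervisory signal; no manual labels are needed at any point.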