We present DeepMVS, a deep convolutional neural network (ConvNet) for multi-view stereo reconstruction. Taking an arbitrary number of posed images as input, we first produce a set of plane-sweep volumes and then use the proposed DeepMVS network to predict high-quality disparity maps. The key contributions that enable these results are (1) supervised pre-training on a photorealistic synthetic dataset, (2) an effective method for aggregating information across a set of unordered images, and (3) integrating multi-layer feature activations from the pre-trained VGG-19 network. We validate the efficacy of DeepMVS using the ETH3D benchmark. Our results show that DeepMVS compares favorably against state-of-the-art conventional MVS algorithms and other ConvNet-based methods, particularly for near-textureless regions and thin structures.
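As background, the plane-sweep construction referenced above can be sketched as follows. For each depth (or disparity) hypothesis, a neighboring source image is warped into the reference view via the homography induced by a fronto-parallel plane at that depth; stacking the warped images yields the volume fed to the network. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, nearest-neighbor resampling, and the zero-filled out-of-bounds handling are assumptions for clarity.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, depth):
    """Homography mapping reference pixels to source pixels, induced by a
    fronto-parallel plane (normal n = [0, 0, 1]) at the given depth."""
    n = np.array([[0.0, 0.0, 1.0]])
    H = K_src @ (R - (t.reshape(3, 1) @ n) / depth) @ np.linalg.inv(K_ref)
    return H / H[2, 2]

def plane_sweep_volume(ref_shape, src_img, K_ref, K_src, R, t, depths):
    """Build a plane-sweep volume: one copy of the source image per depth
    hypothesis, each inverse-warped into the reference view."""
    h, w = ref_shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous coordinates of every reference pixel, shape (3, h*w).
    pix = np.stack([xs.ravel().astype(float), ys.ravel().astype(float),
                    np.ones(h * w)])
    volume = np.zeros((len(depths), h, w), dtype=src_img.dtype)
    for i, d in enumerate(depths):
        H = plane_homography(K_ref, K_src, R, t, d)
        warped = H @ pix
        # Nearest-neighbor resampling; out-of-bounds pixels stay zero.
        u = np.round(warped[0] / warped[2]).astype(int)
        v = np.round(warped[1] / warped[2]).astype(int)
        valid = (u >= 0) & (u < src_img.shape[1]) & \
                (v >= 0) & (v < src_img.shape[0])
        plane = np.zeros(h * w, dtype=src_img.dtype)
        plane[valid] = src_img[v[valid], u[valid]]
        volume[i] = plane.reshape(h, w)
    return volume
```

A quick sanity check of the geometry: with identical intrinsics, identity rotation, and zero translation, the induced homography is the identity at every depth, so each slice of the volume reproduces the source image unchanged.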
Training deep ConvNets for disparity reconstruction requires a large number of ground-truth disparity maps. One solution is to train the network on a combination of a large-scale synthetic dataset and a smaller real-world dataset. Synthetic datasets provide dense pixel-wise ground-truth labels for training, but they do not reflect the complexity of realistic photometric effects, illumination, and natural image noise. Real-world datasets, on the other hand, are limited in scale and often lack labels for regions in which ground-truth data is difficult to obtain, such as sky and reflective surfaces. To address this issue, we introduce the MVS-SYNTH dataset, a set of 120 photorealistic sequences of synthetic urban scenes for learning-based MVS algorithms. We show that using a photorealistic synthetic dataset greatly improves the quality of disparity prediction.