CNNs use statistical shortcuts to classify images


A great feature of the BagNets is their transparent decision-making. For example, we can now look at which image features are most predictive for a given class (see below). A tench (a very big fish) is typically recognized by fingers on top of a greenish background. Why? Because most images in this category feature a fisherman holding up the tench like a trophy. Whenever the BagNet wrongly classifies an image as a tench, it’s often because there are some fingers on top of a greenish background somewhere in the image.
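To make the "bag of local features" idea concrete, here is a minimal NumPy sketch, not the BagNet implementation itself. It scores every patch independently, averages the scores over all positions, and keeps the per-patch scores as a heatmap of class evidence. The function names and `toy_patch_classifier` are hypothetical stand-ins; the real BagNets use a deep network with a restricted receptive field in place of the hand-written scorer.

```python
import numpy as np


def bag_of_features_logits(image, patch_size, stride, patch_classifier):
    """Score every local patch independently, then average the scores.

    Returns (image_logits, heatmaps), where heatmaps has shape
    (rows, cols, n_classes) and holds each patch's class evidence.
    """
    height, width = image.shape[:2]
    rows = (height - patch_size) // stride + 1
    cols = (width - patch_size) // stride + 1
    per_patch = []
    for r in range(rows):
        for c in range(cols):
            top, left = r * stride, c * stride
            patch = image[top:top + patch_size, left:left + patch_size]
            per_patch.append(patch_classifier(patch))
    heatmaps = np.stack(per_patch).reshape(rows, cols, -1)
    # Averaging over positions discards where each patch was located.
    image_logits = heatmaps.mean(axis=(0, 1))
    return image_logits, heatmaps


def toy_patch_classifier(patch):
    """Hypothetical scorer: class 0 likes green pixels, class 1 likes red."""
    return np.array([patch[..., 1].mean(), patch[..., 0].mean()])


# An 8x8 RGB image whose top-left quadrant is pure green.
image = np.zeros((8, 8, 3))
image[0:4, 0:4, 1] = 1.0

logits, heatmaps = bag_of_features_logits(
    image, patch_size=4, stride=4, patch_classifier=toy_patch_classifier
)
# Location of the patch contributing the most evidence for class 0.
best_patch = np.unravel_index(heatmaps[..., 0].argmax(), heatmaps.shape[:2])
```

Because the image-level score is just a mean over patches, the heatmap shows exactly which patches drove the decision, which is what makes the "fingers on a greenish background" observation possible.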


Karpathy had a good response to this:

NNs are incredibly flexible and are strongly incentivized to find the lowest-effort method of achieving a goal. If the task is designed such that bag-of-features will work on it, then it’s a good solution from the standpoint of the NN. Just because they will use bag-of-features when they can doesn’t mean that they cannot find other solutions.

The authors have a good point and it’s a very interesting result. But this paper is being interpreted by some as meaning that CNNs cannot do anything other than bag-of-features, which is an erroneous conclusion from the data presented. Additionally, there are other results in the field that clearly are not the result of bag-of-features processing. For example, it is very hard to explain GANs if bag-of-features is the limit of what CNNs can do.


That is a really interesting point. Maybe the discussion should be reframed from designing neural network architectures to designing training datasets that make shortcuts impossible. At the very least, that would be a way to test competing conjectures about neural networks and to learn how far existing NN architectures can be pushed.
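A toy version of that kind of test, as a rough sketch under made-up assumptions: a two-feature logistic regression where feature 0 is a noisy "real" signal and feature 1 is a clean "shortcut" whose agreement with the label is controlled by the dataset design. Everything here (the data generator, the feature names, the hyperparameters) is hypothetical and only illustrates the idea of evaluating on data where the shortcut no longer carries information.

```python
import numpy as np


def make_data(n, shortcut_agreement, rng):
    """Feature 0: noisy real signal. Feature 1: clean shortcut that
    matches the label with probability `shortcut_agreement`."""
    y = rng.integers(0, 2, n)
    signs = 2.0 * y - 1.0
    real = signs + rng.normal(0.0, 1.0, n)
    agree = rng.random(n) < shortcut_agreement
    shortcut = np.where(agree, signs, -signs)
    return np.stack([real, shortcut], axis=1), y


def train_logreg(X, y, steps=500, lr=0.1):
    """Plain gradient descent on the logistic loss (no bias term)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w


def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y == 1)))


rng = np.random.default_rng(0)
X_biased, y_biased = make_data(2000, 1.0, rng)  # shortcut always works
X_fair, y_fair = make_data(2000, 0.5, rng)      # shortcut is uninformative
X_test, y_test = make_data(2000, 0.5, rng)      # shortcut is uninformative

w_biased = train_logreg(X_biased, y_biased)
w_fair = train_logreg(X_fair, y_fair)

acc_biased = accuracy(w_biased, X_test, y_test)
acc_fair = accuracy(w_fair, X_test, y_test)
```

The model trained where the shortcut always works puts most of its weight on it and does noticeably worse once the shortcut stops tracking the label, while the model trained on the decorrelated set has to rely on the real feature. The architecture is identical in both cases; only the dataset design changed.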


That’s a major focus of designing training corpora: trying to suppress the ability of the NN to find a shortcut to reducing the training loss. More broadly, how to get NNs to generalize is a widely studied topic, and many practical methods have been developed, but it’s still quite challenging. It is unsurprising that a particular academic training benchmark from several years ago fails, by itself, to regularize a large network enough to prevent overfitting beyond the needs of the validation set. For most groups that develop networks with the objective of scoring well on a benchmark, forcing the NN to generalize is a secondary concern. The same is generally true in industrial applications and in human cognition. Humans generalize very well but use shortcuts prolifically where useful. Everything from visual illusions to confirmation bias to rules of thumb shows that humans prefer to remember where doing so is more efficient than trying to understand. In the real world, success is often a blend of generalization and memorization. If NNs could only generalize and not memorize, then they’d be inferior learning systems. The fact that they can do both and can switch between them fluidly is a strength, not a weakness.