Tsinghua-Tencent 100K is a training and testing dataset for traffic sign recognition using 100,000 high-resolution panoramic images from Tencent Street View:
Some interesting findings:
There are about 10,000 annotated images that include traffic signs.
About 6,600 of these images are used for training, and about 3,300 are used for testing.
The best performance I’ve found on this benchmark is 92.82% (F1 score).
Human performance has not been tested on the Tsinghua-Tencent 100K dataset, but it has been tested on the German Traffic Sign Recognition Benchmark (GTSRB) at 99.22% (CCR). Even in 2011, convolutional neural networks outperformed this score with an accuracy of 99.46%.
The GTSRB is a very different dataset from the Tsinghua-Tencent 100K (TTK) dataset. It’s hard to say which is likely to be harder for humans, even though GTSRB is clearly easier for neural networks. A lot of the images in the GTSRB are blurry or overexposed.
If we guess that human accuracy on TTK is 99.9% or 1 error in 1,000 examples, we can guess how much improvement is needed by neural networks to surpass humans. 92.82% accuracy is about 1 error in 14 examples. A 75x improvement would be 1 error in 1,042 examples, or 99.904% accuracy.
If we instead guess humans will have 99.5% accuracy on TTK, or make one error per 200 examples, then neural networks are only 15x away. A 15x improvement would mean 1 error per 208 examples, or 99.52% accuracy.
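The arithmetic in the two guesses above can be checked with a rough sketch. It treats the 92.82% F1 score as if it were plain accuracy, as the estimates above do, and the function names are made up for illustration.

```python
def errors_per_n(accuracy_pct):
    """Return N such that the error rate is roughly 1 error in N examples."""
    return 100.0 / (100.0 - accuracy_pct)


def improvement_factor(current_pct, target_pct):
    """How many times smaller the target error rate is than the current one."""
    return (100.0 - current_pct) / (100.0 - target_pct)


def accuracy_after(current_pct, factor):
    """Accuracy reached if the current error rate is divided by `factor`."""
    return 100.0 - (100.0 - current_pct) / factor


# Assumption: the 92.82% F1 score stands in for accuracy.
current = 92.82

print(round(errors_per_n(current), 1))              # about 1 error in 14
print(round(improvement_factor(current, 99.9), 1))  # factor needed to reach 99.9%
print(round(improvement_factor(current, 99.5), 1))  # factor needed to reach 99.5%
print(round(accuracy_after(current, 75), 3))        # accuracy after a 75x improvement
print(round(accuracy_after(current, 15), 2))        # accuracy after a 15x improvement
```

The exact factors come out slightly under 75x and 15x, so the round numbers above err on the side of asking a bit more of the neural network.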
On this point, the most interesting fact to me is that 92.82% accuracy was achieved by training on only around 6,600 images. There is no reason a neural network used in an autonomous car couldn’t be trained on tens of millions of labelled images. A 3,000x increase in training data is well within reason.
I may be getting this part wrong, but I believe a bigger neural network could be used for traffic sign recognition as well. The neural network that achieved 92.82% (F1 score) on TTK is based on Faster R-CNN, which has around 100 million parameters. The winner of the 2017 ImageNet challenge, SENet-154, has 146 million parameters. ResNeXt-101 32x48d has 861 million parameters.
I saw a sign today that made me think about sign recognition:
This Speed Limit 10 sign is apparently intended for snowmobiles(?) driving next to the road. The sign is on a short pole and a little far from the edge of the road, but basically looks like a normal speed limit sign.
Now that’s a true edge case.
You might be overestimating the performance of humans when you set such a high bar. In discussions about NN performance being compared to humans there seems to be a tendency to believe that humans are very nearly perfect on many of these basic tasks. When humans are actually tasked with the same function as an NN under laboratory conditions their performance is generally very far from perfect. 95% accurate human performance is probably a better generalization than 99%.
For example, human recognition of single words from standardized recordings tends to run about 5% error rate assuming good listening conditions. When noise is added or acoustic distortions that mimic common listening environments are introduced the error rate can be much higher.
Similarly when AK himself tried doing ImageNet he had about a 6% error rate:
Most of these basic tasks, when performed on real world data, are much more difficult than people seem to believe.
And in the same vein, surpassing human performance is not as hard as generally imagined. In isolated domains that have seen a lot of NN development it’s not uncommon to see NNs outperform even highly trained humans.
Can you give any more examples of this?
I’d have to google for the numbers, but categorically: individual human face discrimination, voice recognition for small phrases, English to Chinese transcription, cell biopsy analysis, radiograph interpretation, and the game of Go are a few examples of things considered to be human specialities where, in reality, neural networks outperform human experts according to standardized metrics. There are many more.
In another five years the list is probably going to be endless.
Translation is an interesting exception because it’s not a narrow task like the others, but a general, open-ended task. Translation at the level of human experts requires an understanding of the world and a general reasoning ability that vastly exceeds today’s neural networks. Superhuman translation is a Turing test-level problem. When we have superhuman translation, humans will no longer rule the Earth.
Douglas Hofstadter, a cognitive scientist I respect a lot, wrote about this earlier this year:
The reason why you can’t use a machine-calculated metric for translation like the BLEU score is the same reason why you can’t use a ConvNet to label the training images for ImageNet. To get ground truth, you need human evaluation.
Hofstadter’s ideas about how humans’ unique ability for understanding, reasoning, and language is implemented in the human brain are high-level and abstract. If you wanted to turn these ideas into something you could implement in a machine, I don’t know where you would start. Somebody could probably do a whole PhD on that topic, and maybe not get anywhere.
One of my favourite talks ever.
You might recall that Vicarious quoted Hofstadter in their blog post on their Recursive Cortical Network architecture. I’m happy that cognitive science is influencing today’s AI researchers.
I’m trying to find human vs. neural network papers not about ImageNet, which is complicated, but about some simple image classification task that is more similar to traffic sign recognition. Like, “Is this a stop sign? Yes/No?” From the paper that describes Karpathy and one other person benchmarking themselves on ImageNet (my emphasis):
We found the task of annotating images with one of 1000 categories to be an extremely challenging task for an untrained annotator. The most common error that an untrained annotator is susceptible to is a failure to consider a relevant class as a possible label because they are unaware of its existence.
Therefore, in evaluating the human accuracy we relied primarily on expert annotators who learned to recognize a large portion of the 1000 ILSVRC classes. During training, the annotators labeled a few hundred validation images for practice and later switched to the test set images.
The German Traffic Sign Recognition Benchmark (GTSRB) paper I linked to above benchmarks human vision against ConvNets for traffic sign recognition, but Tsinghua-Tencent 100K is a better benchmark because it tries to reproduce naturalistic viewing conditions with panoramic Tencent Street View photos. The GTSRB uses cropped images of just traffic signs.
I should add: it’s entirely possible that Tsinghua-Tencent 100K is a harder dataset for humans to do traffic sign recognition on (especially if you introduce a time limit in order to more accurately reflect real world driving) than the GTSRB. So human accuracy might be a lot closer to 95.0% than 99.5%. But in the absence of better evidence than results on the GTSRB, 99.5% is a nice conservative figure to use.
To get from 92.8% to 99.5%, is it enough to use 3,000x more labelled training images and a 10x to 1000x bigger/more computationally intensive neural network? Of course we can’t know, but it’s the kind of thing that can make you go, “Hm, maybe.” Or at least to conclude that we can’t rule it out since nothing like this has (to my knowledge) ever been tried. A company willing to throw a few hundred million dollars at developing self-driving cars could achieve this kind of scale increase.
For about $10 million, you could pay people in data capturing vehicles to drive every road in the U.S., in every lane. So you could capture images of most traffic signs in the U.S., of which there are (I think) tens of millions, with relatively little money. Say you capture 20 million images of traffic signs and want all of them to be labelled with bounding boxes. If it takes 10 minutes to label an image, and labellers are paid $15/hour, that’s $50 million to label all the images. So, for about $60 million total you could build this massive dataset, 3,000x bigger than the Tsinghua-Tencent 100K training dataset.
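Here is the same back-of-envelope estimate as a small sketch. Every input is one of the guesses from the paragraph above (the $10 million capture cost, 20 million images, 10 minutes per image, $15/hour), not measured data, and the function name is made up for illustration.

```python
def labelling_cost(num_images, minutes_per_image, hourly_wage):
    """Total labelling cost in dollars for a given per-image labelling time."""
    hours = num_images * minutes_per_image / 60
    return hours * hourly_wage


# All of these are guesses taken from the estimate above.
capture_cost = 10_000_000      # driving every road in the U.S.
num_images = 20_000_000        # traffic sign images captured
ttk_training_images = 6_600    # approximate size of the TTK training set

label_cost = labelling_cost(num_images, minutes_per_image=10, hourly_wage=15)
total_cost = capture_cost + label_cost
scale_up = num_images / ttk_training_images

print(f"labelling: ${label_cost:,.0f}")
print(f"total:     ${total_cost:,.0f}")
print(f"dataset is about {scale_up:,.0f}x the TTK training set")
```

The labelling time dominates the total, so halving the minutes per image would save more than the entire capture budget.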
Jimmy, you’re the one to ask about how the representational power of Faster R-CNN compares to AKnet_V9. Or size, or computation.
If every 2x increase in training data yields a 10% increase in classification accuracy, a 3,000x increase would mean a 1500% increase, or the 15x needed to get from 92.82% to 99.52%. If it’s every 5x increase, then that’s only good for a 6x improvement, and the other 9x has to be made up by the neural network architecture.
BLEU scores aren’t perfect, but if you’re going to perform a comparison you need to have a repeatable metric whose implications are at least somewhat understood. BLEU is used because it’s the best of what we can actually measure right now. And if you’re going to try to compare humans to NNs you have to have an unbiased metric to work from.
NNs lag behind people, unquestionably, in many categories. But a funny thing happens when you formalize a measurement for one of these categories; the process of creating the metric creates the need to state clearly what you’re measuring and why it matters. Once you do that you create both the framework and the metric for groups to work at building NNs that perform well against the designed metric. And the NNs often win given some time.
And of course, as soon as the NNs are winning we declare the metric invalid as a measurement of ‘true intelligence’ and move the goalpost. Not that it was
Which is fine. It’s all a process of improvement after all. But just to speak very specifically of things we have designed rigorous metrics for and for which we can compare humans to NNs - there are a lot of metrics where the NNs outperform humans even though conventional wisdom is that they don’t.
What algorithm would you use to determine whether a ConvNet has correctly classified an image?
With image classification, there is no getting around human labelling as the source of ground truth. We can’t use machines to judge the image classification accuracy of machines. We need humans to tell us what’s really in the images.
With translation, it’s no different. Human evaluation has to be the source of ground truth. We need humans to tell us what a sentence really says. The BLEU score is blind to many objectively wrong translations. It was never intended to replace human evaluation.
The plainest way to see this is that even machine translators that get higher BLEU scores than human translators can’t perform the same economically useful work as human translators. A publisher would never use a machine translator to translate a book because the translation would be filled with mistakes.
“I am not afraid of a machine that passes the Turing test, I fear one that fails it intentionally. So tell me, what do you have to hide?”
Interesting Reddit thought experiment.
The better an AI is able to perform on an open ended task, with major variables, then perhaps the closer we get to an AI that may one day be sentient.
With regard to the topic of the thread itself, wouldn’t the network during the process of inference perhaps see that the sign might be a bit small and that its distance from the road and the way it’s facing may mean that the sign isn’t for the path in question?
The problem I have with my own statement is that my brain just buckled thinking about the sheer number of calculations needed to do this in real time for one object while still needing to drive the bloody car.
I’m a layman when it comes to such matters (still studying a course on NN) and this all truly astounds me.
Well, you make a good point in that Tsinghua-Tencent 100K (TTK) might overstate the difficulty of the problem for self-driving cars because they only need to recognize the subset of traffic signs that are directly relevant to their driving at the moment.
Recently I’ve been in cars with two different drivers who both unintentionally blew through a stop sign because it was dark and the sign was hard to see. All the TTK photos are from Tencent Street View, so they’re all daytime photos. In that way TTK is easier than real world driving. But my recent anecdotal experience makes me wonder if stop sign recognition in humans might be a lot worse than 1 error per 200 signs (99.5% accuracy).
Maybe human drivers are so bad that neural network accuracy doesn’t even need to be that high for self-driving cars to be better. Here’s some anecdotal evidence.
This study claims that 32% of cars don’t stop at a stop sign when there is obvious conflicting traffic, and that 83% don’t stop when there isn’t obvious conflicting traffic. But I’m not sure they’re including rolling stops. Rolling stops are not the same as ignoring a stop sign completely.
Humans get used as reference models in situations where it is convenient to use them either to generate labels or as a benchmark. Treating them as ‘ground truth’ can lead to problems. Humans used via Mechanical Turk ‘vote’ on the ‘correct’ label because individuals are not consistent. Humans used to generate reference baselines for translations similarly have a significant ‘error’ - given more time or more humans the translation will often change.
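To make the ‘voting’ point concrete, here is a minimal sketch of majority-vote labelling. The votes are made-up placeholders, not real annotation data, and the function name is hypothetical; real crowdsourcing pipelines often weight annotators or handle ties differently.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement) for the most common vote.

    Agreement is the fraction of annotators who chose the winning label.
    Ties go to whichever tied label was seen first.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


# Hypothetical votes from five annotators on one image.
votes = ["stop sign", "stop sign", "yield sign", "stop sign", "no sign"]
label, agreement = majority_label(votes)
print(label, agreement)  # the 'correct' label is just the one most annotators picked
```

An agreement well below 1.0 is exactly the inconsistency described above: the label gets recorded as ground truth even though two of five people disagreed with it.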
The error rate among human professional translators is substantially lower than that of translation neural networks, as evidenced by the fact that neural networks can’t do the economically valuable work that human translators do. So, neural networks only have subhuman performance on translations.
Since translation is an open-ended problem with a multiplicity of correct ways to translate any sentence, there is no single correct English “label” for a French or Japanese sentence. I think human evaluation has to come in after the translation, to determine whether the sentence was translated correctly or not.
There is an epistemological problem here. Say that I invent a score called ROUGE. With the ROUGE algorithm, a perfect score is given if the translation output is “purple monkey dishwasher” — no matter what the input is — and all other translation outputs get a score of zero. According to ROUGE, the only correct translation for any string of text is “purple monkey dishwasher”.
Why is BLEU better than ROUGE, and how do we know? If human evaluation is irrelevant and all we need is a formal metric, well, ROUGE is a formal metric just as much as BLEU is. What makes a formal metric better or worse than anything? This question reduces to: what makes translations correct or incorrect? Why isn’t “purple monkey dishwasher” a correct translation for any string of text, and how do we know?
The only basis for saying that BLEU is a better score than ROUGE is that BLEU correlates more highly with what human experts say is correct.
Humans, is this a stop sign?
I look forward to the day when my car will tell me if it thinks this is a stop sign.
There’s the stop line, there’s the same shape of a stop sign, and you’re at an intersection where at least one of the other sides doesn’t have a stop sign… I’d say it’s a stop sign!