This post was originally published by Tyler Ganter at Towards Data Science
In Depth Analysis: I reassessed Open Images with a SOTA object detection model, only to discover that over 1/3 of all false positives were annotation errors!
As the performance of deep learning models trained on massive datasets continues to advance, large-scale dataset competitions have become the proving ground for the latest and greatest computer vision models. We’ve come a long way as a community from the times where MNIST — a dataset with only 70,000 28×28 pixel images — was the de facto standard. New, larger datasets have arisen out of a desire to train more complex models to solve more challenging tasks: ImageNet, COCO and Google’s Open Images are among the most popular.
But even on these huge datasets, the performance differences between top models are becoming narrow. The 2019 Open Images Detection Challenge saw the top five teams fighting over a margin of less than 0.06 in mean average precision (mAP). The margin on COCO is even smaller.
There’s no doubt that our research community is delivering when it comes to developing innovative new techniques to improve model performance, but the model is only half of the picture. Recent findings have made it increasingly clear that the other half — the data — plays at least as critical a role, perhaps an even greater one.
Just this year…
- …researchers at Google and DeepMind reassessed ImageNet, and their findings suggest that recent developments may not be finding meaningful generalizations, instead just overfitting to the idiosyncrasies of the ImageNet labeling procedure.
- …MIT has withdrawn the Tiny Images dataset after a paper brought to light that a portion of the 80 million images contained racist and misogynistic slurs.
- …Jo and Gebru, from Stanford and Google respectively, argued that more attention needs to be paid to data collection and annotation procedures, drawing an analogy to the more mature practices of data archives.
- …researchers from UC Berkeley and Microsoft performed a study showing that when using self-supervised pre-training, one could achieve gains on downstream tasks by focusing not on the network architecture or task/loss selection, but on a third axis, the data itself. To paraphrase: focusing on the data is not only a good idea, it’s a novel idea in 2020!
And here’s what two leaders of the field are saying about this:
- “In building practical systems, often there’s more manual error analysis and more human insight that goes into these systems than sometimes deep learning researchers like to acknowledge.” — Andrew Ng
- “Become one with the data” — Andrej Karpathy in his popular essay on training neural networks
How many times have you found yourself spending hours, days, weeks poring over samples in your data? Have you been surprised by how much manual inspection was necessary? Or can you think of a time when you trusted macro statistics perhaps more than you should have?
The computer vision community is starting to wake up to the idea that we need to be close to the data. If we want accurate models that behave as expected, it’s not enough to have a large dataset; it needs to have the right data and it needs to be accurately labeled.
Every year, researchers are battling it out to climb to the top of a leaderboard with razor thin margins determining fates. But do we really know what’s going on with these datasets? Is a 0.01 margin in mAP even meaningful?
With another Open Images Challenge just wrapping up, it seemed only appropriate to investigate this popular benchmark dataset and try to better understand what it means to have an object detection model with high mAP. So, I took it upon myself to do some basic error analysis of a pre-trained model with the goal being to observe patterns in errors in the context of the dataset, not the model. To my surprise, I found that a significant portion of these errors were in fact not errors; instead, the dataset annotations were incorrect!
What is error analysis?
Error analysis is the process of manually inspecting a model’s prediction errors, identified during evaluation, and making note of the causes of the errors. You don’t need to look at the whole dataset, but at least enough examples to know that you are correctly approximating a trend; let’s say 100 samples as a bare minimum. Open up a spreadsheet, or grab a piece of paper, and start jotting down notes.
Why do this? Perhaps the majority of images your model is struggling on are low resolution or have poor lighting. If this is the case, adding more high resolution well-lit images to the training set is unlikely to manifest as significant improvement in model accuracy. Any number of other qualitative characteristics of your dataset may be at play; the only way to find out is to analyze your data!
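To make the note-taking concrete: once you have jotted down a cause for each error you inspected, tallying them is a few lines of Python (the notes list below is hypothetical, just to illustrate the bookkeeping):

```python
from collections import Counter

# Hypothetical notes from manually reviewing false positives:
# one free-form cause per inspected detection
notes = ["missing", "loc", "missing", "bg", "dup", "missing", "sim", "unclear"]

counts = Counter(notes)
total = sum(counts.values())

# Print causes sorted from most to least common, with percentages
for cause, n in counts.most_common():
    print(f"{cause:8s} {n:3d}  ({100 * n / total:.1f}%)")
```

A spreadsheet works just as well; the point is that the aggregation step is trivial once the manual inspection is done.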
Preparing for the analysis
I generated predictions on the Open Images V4 test set using this FasterRCNN+InceptionResNetV2 network. This network seemed like an ideal choice: it is trained and evaluated on Open Images V4, has a relatively high mAP of 0.58, and is readily available to the public via TensorFlow Hub. I then needed to evaluate each image individually.
Open Images uses a sophisticated evaluation protocol that considers hierarchy and groups, and even specifies known-present and known-absent classes. Even with the TensorFlow Object Detection API, which specifically supports evaluation on Open Images, it took some non-trivial code to get per-image evaluation results. Why isn’t this supported natively? At any rate, I was eventually able to determine exactly which detections were true positives and which were false positives for each image.
I decided to filter the detections, only looking at those with confidence > 0.4. This threshold turned out to be roughly the point where the number of true positives surpasses the number of false positives.
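Given per-detection evaluation results, finding that crossover point is straightforward. A minimal sketch, where the `(confidence, is_true_positive)` pair format is my own illustration, not part of the Open Images tooling:

```python
def tp_fp_counts(detections, threshold):
    """Count true and false positives among detections with
    confidence >= threshold. `detections` is a list of
    (confidence, is_true_positive) pairs."""
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    num_tp = sum(kept)
    return num_tp, len(kept) - num_tp

def crossover_threshold(detections, step=0.01):
    """Return the lowest threshold at which true positives outnumber
    false positives, or None if they never do."""
    for i in range(int(1 / step) + 1):
        threshold = i * step
        num_tp, num_fp = tp_fp_counts(detections, threshold)
        if num_tp > num_fp:
            return threshold
    return None
```

Sweeping this over the full set of evaluated detections is how one would land on a working threshold like 0.4.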
Types of Errors
The structure of this analysis is inspired by a 2012 study in which Hoiem’s group took two state-of-the-art (at the time) detectors and performed manual error analysis. They created categories such as localization error, confusion with semantically similar objects and false positives on background, but, interestingly, nothing related to ground truth error!
I broke the causes of error down into three groups: model errors, ground truth errors and other errors, each consisting of a few specific causes. The next sections define these specific errors and provide examples, giving context before we look at the aggregate results of the error analysis.
Model Errors
Model errors are the familiar set of errors that Hoiem’s paper established and many researchers have subsequently used in their own publications. A small modification I made here was to omit “other” errors and to add “duplicate” errors, which are split out from the “localization” errors.
- loc: localization error, i.e. IoU below threshold of 0.5
- sim: confusion with semantically similar objects
- bg: confusion with background
- dup: duplicate box, meaning both localization error and a true positive exists; this was made as a separate category from loc because it was observed to be a very common type of error
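For reference, the localization check above hinges on intersection over union (IoU); here is a minimal implementation for boxes in `(x1, y1, x2, y2)` format:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Coordinates of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes yield no intersection
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection whose best IoU against any same-class ground truth box falls below 0.5 is counted as a loc error.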
Examples of Model Errors. Top Left: localization error from a Clothing box that doesn’t capture the sleeves. Top Right: confusion with a semantically similar object, as a stuffed animal is mistaken for a Dog. Bottom Left: confusion with background, where background is mistaken for a Boat. Bottom Right: two duplicate Dog boxes around the same dog with bad localization.
Ground Truth Errors
Ground truth errors are causes of false positives whose “fault” is in the annotation, not the model prediction. Were these to be corrected, they would be reassigned as true positives.
- missing: the ground truth box should exist but does not
- incorrect: the ground truth box exists but the label is incorrect or it is not as specific in the label hierarchy as it could be
- group: the ground truth box should be marked as a group but is not
An example of missing ground truth: the back wheel of the car is clearly visible and the model detects it; however, the detection is incorrectly marked as a false positive because the ground truth box is missing.
An example of an incorrect ground truth label; in this case, the label is not adequately specific. The ground truth label for these meerkats is Animal, while the prediction for the meerkat standing on the right is Carnivore, which is technically correct; the ground truth is simply not specific enough.
An example of group error. Three of the four predictions on the corn are correctly localized and labeled Vegetable boxes (the fourth is a dup). However, the ground truth is a single box around all three pieces. To fix this annotation, one would need to flag the ground truth bounding box as a group or replace it with three individual boxes, as was done for the zucchini to the right.
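Checks like the meerkat case can be automated once the label hierarchy is machine-readable. A sketch using a toy slice of the hierarchy (the parent map below is illustrative, not the official Open Images hierarchy file):

```python
# Illustrative parent relationships; the real hierarchy ships with
# Open Images as a JSON file
PARENT = {
    "Jaguar": "Carnivore",
    "Meerkat": "Carnivore",
    "Carnivore": "Mammal",
    "Mammal": "Animal",
    "Animal": None,
}

def ancestors(label):
    """All labels above `label` in the hierarchy, nearest first."""
    out = []
    while PARENT.get(label):
        label = PARENT[label]
        out.append(label)
    return out

def is_hierarchy_consistent(pred_label, gt_label):
    """True if the prediction matches the ground truth exactly, or the
    ground truth is an ancestor of the prediction (i.e. merely less
    specific, as with Animal vs. Carnivore)."""
    return pred_label == gt_label or gt_label in ancestors(pred_label)
```

Under a check like this, a Carnivore prediction against an Animal ground truth would be flagged as a specificity mismatch rather than an outright model error.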
Other Errors
Lastly, we have unclear errors: edge cases where it isn’t apparent whether the prediction is correct or not.
- unclear: model error? ground truth error? ask ten different people and you’ll get ten different answers
Fortunately, this category only accounted for roughly 6.5% of errors, but it is still important to note that when trying to create a label hierarchy that can categorize everything in the world, there will always be edge cases like this circuit lady, which the model predicted as a Toy.
Would you consider this to be a toy? An example of error with unclear “blame”
The following table shows the results of analyzing a subset of 178 of the 125,436 total test images.
Results from error analysis of 275 false positive bounding boxes detected by a FasterRCNN object detection model on Open Images V4 test set.
This is crazy! 36% of the false positives should actually be true positives! It’s not immediately clear what impact this would have on mAP, given that it is a rather complex metric; however, it’s safe to say that the officially reported mAP of 0.58 underestimates the true performance of the model.
The single most common cause of error was missing ground truth annotation, accounting for over ¼ of all errors. This is a challenging problem. It’s unrealistic to ask for a dataset that is not missing boxes. Many of these missing annotations are peripheral objects, not the central focus of the image. But this only emphasizes the need for easy, possibly automated, identification of annotations that will go through an additional round of review. There are other implications as well. Peripheral objects are generally smaller; how do these missing annotations affect accuracy metrics when split into small/medium/large bounding box sizes?
A few of these other causes of error — duplicate bounding boxes, incorrect ground truth labels and group errors, in particular — signal the importance of labeling ontology and annotation protocols. Complex label hierarchies can lead to incorrect ground truth labels, though this study indicates that this is not the case for Open Images.
Handling groups is another complication that needs to be carefully defined and reviewed; while not as prevalent as other causes of error, the 7.6% of errors due to boxes that should have been flagged as a group is certainly not insignificant.
Finally, duplicate bounding boxes could be, at least in part, a byproduct of the expanded hierarchy. In the Open Images object detection challenge, a model is tasked with generating a bounding box for each label in the hierarchy. For example, for an image containing a jaguar, the challenge expects boxes to be generated not just for Jaguar, but also for Carnivore, Mammal and Animal. Could this unintentionally lead to a model generating multiple Jaguar boxes for the same animal? Faster R-CNN applies classification as a post-processing step after region proposal, so if the model is trained to generate four boxes for every jaguar it sees, it shouldn’t be surprising that these four boxes sometimes get the same classification label.
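One way to probe the duplicate hypothesis is to flag, per label, lower-confidence boxes that heavily overlap a higher-confidence box of the same class (essentially class-wise non-maximum suppression). A sketch, where the `(label, confidence, box)` detection format is my own illustration:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def flag_duplicates(dets, iou_thresh=0.5):
    """Flag detections that share a label with, and heavily overlap, a
    higher-confidence detection. `dets` is a list of
    (label, confidence, box) tuples; returns a parallel list of bools."""
    order = sorted(range(len(dets)), key=lambda i: -dets[i][1])
    keep, dup = [], [False] * len(dets)
    for i in order:
        label, _, box = dets[i]
        if any(dets[j][0] == label and iou(box, dets[j][2]) >= iou_thresh
               for j in keep):
            dup[i] = True  # a stronger same-label box already covers this region
        else:
            keep.append(i)
    return dup
```

Running something like this over a model's raw output would show how often the hierarchy-expanded training signal manifests as stacked same-label boxes.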
What would happen to the Open Images Leaderboard if these ground truth errors were corrected? How could this affect our understanding of what strategies work best?
It should be noted that these errors aren’t just statistical noise. Similar to the findings of the DeepMind team that analyzed ImageNet, there are patterns to the annotation errors in Open Images. For example, missing face annotations are a very common cause of false positives, and bounding boxes around trees should often be flagged as groups but are not.
The purpose of this article is not to criticize the creators of Open Images — to the contrary, this dataset and its corresponding challenges have instigated great achievements — but rather to shed light on a blind spot that could be holding back progress. The impact of these popular open datasets is far-reaching, as they are often used as the starting point for fine-tuning/transfer learning. Furthermore, if popular datasets suffer from annotation errors, then most likely we will encounter the same while inspecting our own datasets. But I’m certainly not speaking from experience…*ahem*
We’re at the forefront of a shift in focus, where the data itself is being rightfully acknowledged as every bit as important as the model trained on it, if not more! Perhaps we will see smaller, more carefully curated datasets rising in popularity, or more demand for methods like active or semi-supervised learning that allow us to automate and scale annotation work. Either way, a key challenge will be creating the infrastructure to manage dynamic datasets that grow in size and evolve based on feedback from humans and machine learning models. There’s a lot of potential in this nascent topic!
To perform this error analysis study, I used FiftyOne, a Python package that makes it really easy to load your datasets and interactively search and explore them, both through code and in a visualization app.
Want to explore this data for yourself? Download it here!
Want to evaluate your own model on Open Images? Try this tutorial!
Want to learn more about best practices for inspecting visual datasets? Check out this post!
L. Beyer, et al., Are we done with ImageNet? (2020)
V. Prabhu and A. Birhane, Large image datasets: A pyrrhic win for computer vision? (2020)
E. Jo and T. Gebru, Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning (2020), Conference on Fairness, Accountability, and Transparency
R. Rao, et al., Quality and Relevance Metrics for Selection of Multimodal Pretraining Data (2020), Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
D. Hoiem, et al., Diagnosing Error in Object Detectors (2012), European Conference on Computer Vision (ECCV)
S. Ren, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015), Advances in Neural Information Processing Systems (NeurIPS)