A new study out of MIT showed that 10 of the most commonly used AI test datasets are full of labeling errors. Specifically, the researchers found:

  • An average of 3.4% of labels are erroneous across the datasets, rising to as much as 5.8% in ImageNet and 10.1% in QuickDraw. 
  • Once labels were corrected in the benchmark datasets, lower-capacity models outperformed higher-capacity ones. Inaccurate training data may be driving a perceived – but false – need for larger networks.

The conclusions could reshape how we evaluate model performance and pick winners among the algorithms that govern AI. The study also puts numbers to a well-known, systemic issue in computer vision: hand-labeling images is extraordinarily difficult to do well, and bad labels are holding back progress in AI. 

Why Is Hand Labeling So Difficult?

Human annotators face multiple challenges. Their role is to examine thousands of images – including ones that are small, blurry, occluded, or dimly lit – and determine what they are looking at. The task is difficult even before subjectivity enters the picture. 

The New York Times last month reported on data labels that have injected racism into AI. Sexist labels are equally problematic. But the MIT report highlighted the more basic issue that labels are often plain wrong: “A mushroom is labeled a spoon, a frog is labeled a cat, and a high note from Ariana Grande is labeled a whistle.”

The Proposed Solution

There are two solutions. 

First, the study’s authors have released open-source code that can be used to check label accuracy in training datasets. The tool may reduce the often resource-intensive work of auditing and correcting data. 
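The general idea behind such tools can be sketched simply: train a model on the noisy data, then flag examples where the model confidently predicts a class that disagrees with the assigned label. The function name, toy data, and confidence threshold below are illustrative, not the authors' actual implementation.

```python
import numpy as np

def flag_suspect_labels(pred_probs, given_labels, threshold=0.9):
    """Flag examples where a trained model confidently disagrees with
    the assigned label. A simplified sketch of the confident-learning
    idea; the 0.9 threshold is an illustrative choice."""
    predicted = pred_probs.argmax(axis=1)       # model's best guess per example
    confidence = pred_probs.max(axis=1)         # how sure the model is
    suspect = (predicted != given_labels) & (confidence >= threshold)
    return np.where(suspect)[0]

# Toy example: 4 samples, 3 classes.
probs = np.array([
    [0.95, 0.03, 0.02],   # confidently class 0
    [0.10, 0.85, 0.05],   # class 1, but below the confidence threshold
    [0.02, 0.01, 0.97],   # confidently class 2
    [0.40, 0.35, 0.25],   # uncertain
])
labels = np.array([0, 0, 0, 1])  # given (possibly noisy) labels
print(flag_suspect_labels(probs, labels))  # → [2]
```

Only sample 2 is flagged: the model is 97% sure it is class 2, yet the dataset labels it class 0 – exactly the "mushroom labeled a spoon" pattern the study describes.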

Second, a more precise alternative has come to the fore: synthetic data now performs like real data, and it comes with perfect, procedurally generated annotations. Because the content of each image is labeled at creation, there is no need for human interpretation: we know the contents of each image by name, as well as by weight, density, depth, and even temperature. 
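The key property is that the annotation is a by-product of generation rather than an after-the-fact judgment. A minimal sketch, with placeholder field names and a stand-in for the actual rendering step:

```python
import random

def render_synthetic_sample(object_class, rng):
    """Sketch of a synthetic-data pipeline: the generator places a known
    object into the scene, so the label and metadata are exact by
    construction. All fields here are illustrative placeholders."""
    annotation = {
        "class": object_class,                              # known, never guessed
        "weight_kg": round(rng.uniform(0.1, 5.0), 2),
        "depth_m": round(rng.uniform(0.5, 10.0), 2),
        "temperature_c": round(rng.uniform(-10.0, 40.0), 1),
    }
    image = f"<rendered image of {object_class}>"           # stand-in for a real render
    return image, annotation

rng = random.Random(0)
image, annotation = render_synthetic_sample("mushroom", rng)
print(annotation["class"])  # → mushroom, with no annotator in the loop
```

Because the generator chose the object before rendering it, a "mushroom" can never come back labeled "spoon" – the error mode the MIT study quantifies simply cannot occur.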

The more we can introduce automated precision labeling into AI training, the more accurate and equitable the resultant algorithms will be.