Company
Date Published
Nov. 29, 2022
Author
Aditya Thyagarajan, Elías Snorrason, Curtis Northcutt, Jonas Mueller
Word count
1434
Language
English
Hacker News points
1

Summary

Image and document tagging is an important instance of multi-label classification, where each example can belong to multiple classes. Such datasets often contain many label errors that harm the performance of machine learning (ML) models. Researchers have developed algorithms that detect incorrect annotations in any multi-label classification dataset; they are available in the open-source cleanlab package. The algorithms are model-agnostic: they work with any existing or future ML model to efficiently find and fix errors in its training set or in a test set used for benchmarking, to reduce the number of annotations needed, and to support other data-centric tasks. The EMA label quality score is a robust way to produce a single label quality score for each example in a dataset: it computes an exponential moving average over the model's self-confidences for every tag/annotation given to the example. Cleanlab's multi-label algorithms have been benchmarked against nine other approaches, demonstrating their effectiveness at detecting both examples with any error in their annotation and examples that are severely mislabeled.
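
To make the EMA label quality score more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not cleanlab's implementation: the function name `ema_label_quality`, the smoothing weight `alpha=0.8`, the choice to treat every class as a binary present/absent annotation, and the descending sort that gives the least confident annotation the most weight are all assumptions made for this example.

```python
def ema_label_quality(given_tags, pred_probs, alpha=0.8):
    """Score one example; lower values suggest a likelier annotation error.

    given_tags: set of class indices annotated as present (hypothetical input format).
    pred_probs: model-predicted probability that each class applies.
    alpha: EMA smoothing weight (assumed value, not taken from the source).
    """
    # Per-class self-confidence: the probability the model assigns to the
    # given binary annotation (tag present -> p, tag absent -> 1 - p).
    scores = [p if k in given_tags else 1.0 - p for k, p in enumerate(pred_probs)]

    # Visit scores from most to least confident so that the least confident
    # annotation is folded in last and therefore carries the largest weight.
    scores.sort(reverse=True)

    ema = scores[0]
    for s in scores[1:]:
        ema = alpha * s + (1.0 - alpha) * ema
    return ema


# The model believes classes 0 and 2 apply and class 1 does not.
pred_probs = [0.9, 0.1, 0.8]
clean = ema_label_quality({0, 2}, pred_probs)  # annotation agrees with the model
noisy = ema_label_quality({0, 1}, pred_probs)  # class 1 tagged, class 2 missed
print(clean, noisy)
```

In this sketch the example whose tags disagree with the model's predictions gets a much lower score than the one whose tags agree, so sorting a dataset by ascending score would surface the most suspicious annotations first.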