Automatic Error Detection for Image/Text Tagging and Multi-label Datasets

Company

Cleanlab

Date Published

Nov. 29, 2022

Author

Aditya Thyagarajan, Elías Snorrason, Curtis Northcutt, Jonas Mueller

Word count

1434

Language

English

Hacker News points

URL

cleanlab.ai/blog/multilabel

Summary

Image and document tagging are important instances of multi-label classification, in which each example can belong to multiple classes. These datasets often contain many label errors that harm the performance of machine learning (ML) models. Researchers have developed algorithms, available in the open-source cleanlab package, that detect incorrect annotations in any multi-label classification dataset. The algorithms are model-agnostic: they can be used with any existing or future ML model to efficiently find and fix errors in a training set, clean a test set for more reliable benchmarking, reduce the number of annotations needed, and support other data-centric tasks. The EMA label quality score produces a robust quality score for each example in a dataset by computing an exponential moving average over the model's self-confidences for every tag/annotation given to the example. Cleanlab's multi-label algorithms have been benchmarked against nine other approaches and were shown to be effective at detecting both examples with any error in their annotations and examples that are severely mislabeled.
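
To make the EMA label quality score more concrete, here is a minimal pure-Python sketch of the idea, not cleanlab's actual implementation. Several details are assumptions that the summary does not state: the function names `self_confidences` and `ema_label_quality` are hypothetical, the self-confidence of a class that was not tagged is taken to be one minus its predicted probability, the scores are sorted from highest to lowest so that the least confident annotations carry the most weight, and the smoothing factor `alpha = 0.8` is an illustrative value.

```python
from typing import List, Sequence


def self_confidences(pred_probs: Sequence[float], tags: Sequence[int]) -> List[float]:
    """Per-class confidence that the given annotation is correct.

    pred_probs[k] is the model's predicted probability that class k applies.
    tags[k] is 1 if the example was annotated with class k, else 0.
    Assumption: for an absent tag, the confidence is 1 - pred_probs[k].
    """
    return [p if t == 1 else 1.0 - p for p, t in zip(pred_probs, tags)]


def ema_label_quality(
    pred_probs: Sequence[float], tags: Sequence[int], alpha: float = 0.8
) -> float:
    """Aggregate per-class self-confidences into a single score in [0, 1].

    Assumption: scores are processed in descending order, so the lowest
    (most suspicious) scores are seen last and weighted most heavily.
    """
    scores = sorted(self_confidences(pred_probs, tags), reverse=True)
    ema = scores[0]
    for s in scores[1:]:
        # Standard exponential moving average update.
        ema = alpha * s + (1.0 - alpha) * ema
    return ema


# Hypothetical example: the model is confident about classes 0 and 2 only.
probs = [0.9, 0.1, 0.8]
clean_score = ema_label_quality(probs, tags=[1, 0, 1])  # tags agree with the model
noisy_score = ema_label_quality(probs, tags=[1, 1, 1])  # class 1 looks mistagged
print(clean_score, noisy_score)
```

Under these assumptions, a single doubtful tag pulls the example's score down sharply, while one unusually low probability cannot drive the score all the way to that minimum. That is one plausible reading of why the summary describes the score as robust.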