Handling Label Errors in Text Classification Datasets

Company

Cleanlab

Date Published

May 10, 2022

Author

Wei Jing Lok, Jonas Mueller, Hui Wen Goh

Word count

3490

Language

English

Hacker News points

None

URL

cleanlab.ai/blog/label-errors-text-datasets

Summary

Recent studies have found that even highly curated machine learning benchmark datasets contain label errors, which can significantly degrade model performance. The open-source cleanlab library provides a standard framework for identifying and addressing such issues in real-world data. In this hands-on blog post, the authors demonstrate how to use cleanlab to find label problems in the IMDb movie review text classification dataset and to improve a model's performance without changing the model itself. They also provide code examples for applying the same workflow to other datasets.
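
To make the idea of "finding label problems" concrete, here is a minimal pure-Python sketch of one way to flag suspicious labels from a classifier's out-of-sample predicted probabilities. It is loosely inspired by the confident-learning approach that cleanlab builds on, but it is a simplified illustration and not the library's implementation. The helper names (`class_thresholds`, `find_suspect_labels`) and the toy probabilities are hypothetical. In practice one would use the library itself (for example, `cleanlab.filter.find_label_issues` in cleanlab 2.x) rather than this sketch.

```python
# Simplified illustration of flagging likely label errors.
# NOT cleanlab's implementation; function names and data are hypothetical.

def class_thresholds(labels, pred_probs, num_classes):
    """Per-class threshold: the mean predicted probability of class k
    over the examples whose given label is k."""
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # Assumption: a class with no examples can never be "confident".
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)
    return thresholds


def find_suspect_labels(labels, pred_probs, num_classes):
    """Return indices of examples whose given label disagrees with a
    confidently predicted class, least-confident given label first."""
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears the class threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                suspects.append(i)
    # Rank by how little probability the model assigns to the given label.
    suspects.sort(key=lambda i: pred_probs[i][labels[i]])
    return suspects


# Toy sentiment data: 0 = negative review, 1 = positive review.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled negative, but the model strongly predicts positive
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]

suspects = find_suspect_labels(labels, pred_probs, num_classes=2)
print(suspects)
```

On this toy input only the third example (index 2) is flagged, because it is the only one where a class other than the given label clears its threshold. The predicted probabilities are assumed to come from a model that did not see the example during training (for instance via cross-validation); otherwise the model may simply memorize the noisy labels.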