Company
Date Published
Dec. 6, 2024
Author
Harpreet Sahota
Word count
1682
Language
English
Hacker News points
None

Summary

The paper addresses a significant issue in machine learning research: benchmark datasets are often contaminated with errors, which leads to overestimated model performance and hinders scientific progress. The authors propose SELFCLEAN, a data cleaning method that employs self-supervised learning (SSL) to identify and mitigate data quality issues in benchmark datasets. SELFCLEAN uses a two-step process: it first learns representations of the data with SSL, then applies distance-based indicators to those representations to flag potential data quality issues. The method offers two operating modes, fully automated and human-in-the-loop, allowing users to choose between automatic cleaning and manual verification. Experiments demonstrate the effectiveness of SELFCLEAN in detecting off-topic samples, near duplicates, and label errors, highlighting its practical importance for accurate model evaluation and restoring confidence in benchmark results.
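To make the second step more concrete, here is a minimal sketch of how distance-based indicators could rank candidate issues once embeddings exist. It is not SELFCLEAN's implementation: the toy 2-D vectors stand in for SSL representations, and the three scoring rules (smallest pairwise distance for near duplicates, mean distance to the k nearest neighbours for off-topic samples, and a same-label versus other-label distance ratio for label errors) are simplified assumptions chosen for illustration. All function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pairwise_distances(embeddings):
    """Symmetric matrix of cosine distances between all samples."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(embeddings[i], embeddings[j])
    return dist

def rank_near_duplicates(dist):
    """Pairs sorted by ascending distance: closest pairs are duplicate candidates."""
    n = len(dist)
    return sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))

def rank_off_topic(dist, k=2):
    """Samples sorted by descending mean distance to their k nearest neighbours.
    (Assumed heuristic: isolated samples are off-topic candidates.)"""
    scores = []
    for i, row in enumerate(dist):
        nearest = sorted(d for j, d in enumerate(row) if j != i)[:k]
        scores.append((sum(nearest) / len(nearest), i))
    return sorted(scores, reverse=True)

def rank_label_errors(dist, labels):
    """Samples sorted by descending ratio of nearest same-label distance to the
    sum of nearest same-label and nearest other-label distances.
    (Assumed heuristic: a score near 1 means the sample sits in another class.)"""
    scores = []
    for i, row in enumerate(dist):
        same = [d for j, d in enumerate(row) if j != i and labels[j] == labels[i]]
        other = [d for j, d in enumerate(row) if labels[j] != labels[i]]
        if same and other:
            scores.append((min(same) / (min(same) + min(other) + 1e-12), i))
    return sorted(scores, reverse=True)

# Toy stand-ins for SSL embeddings (illustrative values, not real data).
embeddings = [
    [1.0, 0.0],     # 0: cat
    [0.999, 0.01],  # 1: cat, almost identical to sample 0
    [0.9, 0.3],     # 2: cat
    [0.0, 1.0],     # 3: dog
    [0.1, 0.95],    # 4: dog
    [0.05, 1.0],    # 5: labelled cat but lies in the dog cluster
    [-1.0, -0.2],   # 6: far from every other sample
]
labels = ["cat", "cat", "cat", "dog", "dog", "cat", "dog"]

dist = pairwise_distances(embeddings)
top_duplicate_pair = rank_near_duplicates(dist)[0][1:]   # (0, 1)
top_off_topic = rank_off_topic(dist)[0][1]               # 6
top_label_error = rank_label_errors(dist, labels)[0][1]  # 5
print(top_duplicate_pair, top_off_topic, top_label_error)
```

Under these assumptions, the two operating modes differ only in how the rankings are consumed: a fully automated run would drop everything beyond a chosen score cutoff, while a human-in-the-loop run would hand the top of each ranked list to a reviewer for confirmation.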