The NeurIPS 2024 Preshow: Are We Measuring What We Think We Are? The Perils of Contaminated Benchmark Datasets
The paper addresses a significant issue in machine learning research, where benchmark datasets are often contaminated with errors, leading to overestimation of model performance and hindering scientific progress. The authors propose SELFCLEAN, a data cleaning method that employs self-supervised learning (SSL) to identify and mitigate data quality issues in benchmark datasets. SELFCLEAN uses two-step process: representation learning using SSL and distance-based indicators to identify potential data quality issues. The method offers two operating modes, fully automated and human-in-the-loop, allowing users to choose between automatic cleaning and manual verification. Experiments demonstrate the effectiveness of SELFCLEAN in detecting off-topic samples, near duplicates, and label errors, highlighting its practical importance for accurate model evaluation and restoring confidence in benchmark results.
Company
Voxel51
Date published
Dec. 6, 2024
Author(s)
Harpreet Sahota
Word count
1682
Language
English
Hacker News points
None found.