The NeurIPS 2024 Preshow: Are We Measuring What We Think We Are? The Perils of Contaminated Benchmark Datasets

Company

Voxel51

Date Published

Dec. 6, 2024

Author

Harpreet Sahota

Word count

1682

Language

English

Hacker News points

None

URL

voxel51.com/blog/the-neurips-2024-preshow-are-we-measuring-what-we-think-we-are-the-perils-of-contaminated-benchmark-datasets

Summary

The paper addresses a significant issue in machine learning research: benchmark datasets are often contaminated with errors, which leads to overestimated model performance and hinders scientific progress. The authors propose SELFCLEAN, a data cleaning method that uses self-supervised learning (SSL) to identify and mitigate data quality issues in benchmark datasets. SELFCLEAN follows a two-step process: it first learns representations of the data with SSL, then applies distance-based indicators to those representations to flag potential data quality issues. The method offers two operating modes, fully automated and human-in-the-loop, so users can choose between automatic cleaning and manual verification of the flagged samples. Experiments demonstrate that SELFCLEAN is effective at detecting off-topic samples, near duplicates, and label errors, which underlines its practical importance for accurate model evaluation and for restoring confidence in benchmark results.
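
To make the second step concrete, here is a minimal sketch of distance-based indicators computed over embeddings. It assumes the embeddings have already been produced by some self-supervised encoder (that step is not shown), and the three scoring functions are simplified stand-ins chosen for illustration, not the exact scores used by SELFCLEAN. The function names, the `k` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances between the rows of `embeddings`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T


def rank_near_duplicates(dist):
    """Index pairs (i, j) with i < j, sorted from closest to farthest."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j])
    return list(zip(i[order].tolist(), j[order].tolist()))


def rank_off_topic(dist, k=2):
    """Sample indices sorted by mean distance to their k nearest
    neighbours, most isolated first (a simple kNN outlier score)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    knn = np.sort(d, axis=1)[:, :k]
    return np.argsort(-knn.mean(axis=1)).tolist()


def rank_label_errors(dist, labels):
    """Sample indices sorted by how much closer they sit to another class
    than to their own. Assumes every class has at least two samples."""
    labels = np.asarray(labels)
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    same = labels[:, None] == labels[None, :]
    nearest_same = np.where(same, d, np.inf).min(axis=1)
    nearest_other = np.where(~same, d, np.inf).min(axis=1)
    score = nearest_same / (nearest_same + nearest_other + 1e-12)
    return np.argsort(-score).tolist()


# Toy 2-D "embeddings" placed on the unit circle (angles in degrees).
# Samples 0 and 1 are almost identical, sample 5 sits inside the
# label-1 cluster but carries label 0, and sample 6 is far from everything.
angles = np.deg2rad([0, 1, 10, 90, 100, 95, 200])
embeddings = np.stack([np.cos(angles), np.sin(angles)], axis=1)
labels = [0, 0, 0, 1, 1, 0, 1]

dist = cosine_distance_matrix(embeddings)
print("near-duplicate candidates:", rank_near_duplicates(dist)[:3])
print("off-topic candidates:", rank_off_topic(dist)[:3])
print("label-error candidates:", rank_label_errors(dist, labels)[:3])
```

On this toy data the top candidates are the pair (0, 1) for near duplicates, sample 6 for off-topic, and sample 5 for label errors. Each function returns a ranking rather than a hard decision, which matches the human-in-the-loop mode described above: a reviewer inspects the top of each list. A fully automated mode would additionally need a cutoff on the scores, which this sketch does not attempt.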