
The NeurIPS 2024 Preshow: Are We Measuring What We Think We Are? The Perils of Contaminated Benchmark Datasets

What's this blog post about?

The paper addresses a significant issue in machine learning research: benchmark datasets are often contaminated with errors, which leads to overestimated model performance and hinders scientific progress. The authors propose SELFCLEAN, a data cleaning method that uses self-supervised learning (SSL) to identify and mitigate data quality issues in benchmark datasets. SELFCLEAN follows a two-step process: it first learns representations of the data with SSL, then applies distance-based indicators to those representations to flag potential data quality issues. The method offers two operating modes, fully automated and human-in-the-loop, so users can choose between automatic cleaning and manual verification. Experiments demonstrate that SELFCLEAN detects off-topic samples, near duplicates, and label errors, underscoring its practical importance for accurate model evaluation and for restoring confidence in benchmark results.
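To make the second step of that process more concrete, here is a minimal sketch of what distance-based indicators over learned representations might look like. It is not SELFCLEAN's actual implementation or API. The function name `rank_by_distance`, the choice of cosine distance, and the toy embeddings are all assumptions made for illustration, and the embeddings are hard-coded rather than produced by a self-supervised encoder. Only near duplicates and off-topic samples are covered; label errors are left out.

```python
import numpy as np

def rank_by_distance(embeddings):
    """Hypothetical helper: rank candidate issues from embedding distances."""
    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    # Pairwise cosine distance; mask the diagonal so a sample
    # is never treated as its own nearest neighbour.
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, np.inf)

    # Near-duplicate candidates: all pairs, closest first.
    i, j = np.triu_indices(len(unit), k=1)
    order = np.argsort(dist[i, j], kind="stable")
    near_dup_pairs = [(int(i[k]), int(j[k])) for k in order]

    # Off-topic candidates: samples whose nearest neighbour is
    # farthest away (a simple isolation score, assumed here).
    nn_dist = dist.min(axis=1)
    off_topic_rank = [int(k) for k in np.argsort(-nn_dist, kind="stable")]
    return near_dup_pairs, off_topic_rank

# Toy stand-in for SSL embeddings (assumed values, not real data).
emb = np.array([
    [1.00, 0.00],    # sample 0
    [0.99, 0.05],    # sample 1, almost identical to sample 0
    [0.00, 1.00],    # sample 2
    [0.10, 1.00],    # sample 3
    [-1.00, -1.00],  # sample 4, far from everything else
])

pairs, off_topic = rank_by_distance(emb)
print("most likely near-duplicate pair:", pairs[0])
print("most likely off-topic sample:", off_topic[0])
```

Under these assumptions, the two operating modes would differ only in how the rankings are consumed: a human-in-the-loop workflow would have a reviewer inspect the top of each list, while a fully automated one would apply a cutoff and remove the flagged samples without review.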

Company
Voxel51

Date published
Dec. 6, 2024

Author(s)
Harpreet Sahota

Word count
1682

Language
English

Hacker News points
None found.