The NeurIPS 2024 Preshow: A Data-Centric Look at Curation Strategies for Image Classification
This paper formalizes a data curation strategy as a function that takes a cost as input and returns a set of samples drawn from a distribution over a set of plausible images, underscoring the critical role of data curation in machine learning. The authors discuss five data curation strategies: expert curation, crowdsourced labeling, schema matching, synthetic data generation, and embedding-based search, each with its own strengths and limitations. They introduce SELECT, a benchmark framework for evaluating data curation strategies, and IMAGENET++, a large-scale dataset designed to test the SELECT benchmark. The research argues for systematic evaluation of data curation methods that moves beyond reliance on base accuracy and also weighs robustness, generalization, and dataset properties when assessing curated data.
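The "strategy as a function of cost" framing can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the paper's formalism or the SELECT implementation: the names `make_random_strategy`, `pool`, and `cost_per_sample` are hypothetical, and uniform random sampling stands in for any of the five strategies.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def make_random_strategy(
    pool: Sequence[T], cost_per_sample: float, seed: int = 0
) -> Callable[[float], List[T]]:
    """Build a toy curation strategy (hypothetical, for illustration only).

    The returned function maps a cost budget to a set of samples, drawn
    uniformly without replacement from `pool`, which stands in for the
    distribution over plausible images.
    """
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")

    items = list(pool)

    def strategy(budget: float) -> List[T]:
        # How many samples the budget pays for, capped by the pool size.
        affordable = max(0, int(budget // cost_per_sample))
        n = min(len(items), affordable)
        # A fresh seeded generator makes repeated calls reproducible.
        rng = random.Random(seed)
        return rng.sample(items, n)

    return strategy


# Usage: integer ids stand in for images; each sample costs 2.0 units.
strategy = make_random_strategy(range(100), cost_per_sample=2.0)
curated = strategy(10.0)
print(len(curated))
```

Under this framing, comparing two strategies amounts to comparing the datasets they return at the same budget, which is the kind of comparison a benchmark can then score on more than base accuracy.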
Company
Voxel51
Date published
Dec. 6, 2024
Author(s)
Harpreet Sahota
Word count
2137
Language
English
Hacker News points
None found.