
The NeurIPS 2024 Preshow: A Data-Centric Look at Curation Strategies for Image Classification

What's this blog post about?

The paper discussed in this post formalizes a data curation strategy as a function that takes a cost as input and returns a set of samples drawn from a distribution over plausible images, which underscores how central data curation is to machine learning. The authors discuss five data curation strategies: expert curation, crowdsourced labeling, schema matching, synthetic data generation, and embedding-based search, each with its own strengths and limitations. They introduce SELECT, a benchmark framework for evaluating data curation strategies, and IMAGENET++, a large-scale dataset designed to test the SELECT benchmark. The research argues for systematic evaluation of data curation methods rather than reliance on base accuracy alone, and stresses that robustness, generalization, and dataset properties should all be considered when assessing curated data.
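As a rough illustration of the "strategy as a function of cost" framing, the sketch below models a curation strategy as a callable that maps a cost budget to a list of samples. Everything in it is an assumption made for illustration: the names `CurationStrategy` and `make_random_strategy`, the flat per-sample cost, and the uniform random draw are hypothetical and are not taken from the paper or from SELECT.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical type alias: a curation strategy maps a cost budget to samples.
CurationStrategy = Callable[[float], List[str]]


def make_random_strategy(
    pool: Sequence[str], cost_per_sample: float, seed: int = 0
) -> CurationStrategy:
    """Build a toy strategy that draws uniformly from `pool` within a budget.

    Assumes a flat, positive cost per sample (an illustrative simplification).
    """
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")

    def strategy(cost: float) -> List[str]:
        # Number of samples the budget can pay for, capped by the pool size.
        affordable = max(0, int(cost // cost_per_sample))
        n = min(len(pool), affordable)
        # A fresh seeded generator makes repeated calls reproducible.
        rng = random.Random(seed)
        return rng.sample(list(pool), n)

    return strategy


if __name__ == "__main__":
    images = [f"img_{i}.jpg" for i in range(10)]
    strategy = make_random_strategy(images, cost_per_sample=2.0)
    print(len(strategy(7.0)))    # budget of 7.0 at 2.0 per sample
    print(len(strategy(100.0)))  # budget exceeds what the pool can supply
```

Real strategies such as expert curation or embedding-based search would replace the random draw with their own selection logic and cost model; the point of the sketch is only the shared interface, which lets different strategies be compared at the same budget.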

Company
Voxel51

Date published
Dec. 6, 2024

Author(s)
Harpreet Sahota

Word count
2137

Language
English

Hacker News points
None found.

