This paper formalizes a data curation strategy as a function that takes a cost as input and returns a set of samples drawn from a distribution over a set of plausible images, and it argues that data curation plays a critical role in machine learning. The authors discuss five data curation strategies (expert curation, crowdsourced labeling, schema matching, synthetic data generation, and embedding-based search), each with its own strengths and limitations. They introduce SELECT, a benchmark for evaluating data curation strategies, and IMAGENET++, a large-scale dataset designed to exercise that benchmark. The work calls for systematic evaluation of data curation methods that goes beyond base accuracy alone and also accounts for robustness, generalization, and the properties of the curated dataset itself.
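
The functional view of a curation strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation. The names `Strategy` and `make_uniform_strategy`, the fixed per-sample cost, and the uniform sampling rule are all assumptions chosen for illustration; the only idea taken from the summary is that a strategy maps a cost to a set of samples.

```python
import random
from typing import Callable, List, Sequence

# Assumed type: a curation strategy maps a cost budget to a list of samples.
Strategy = Callable[[float], List[str]]


def make_uniform_strategy(
    pool: Sequence[str], cost_per_sample: float, seed: int = 0
) -> Strategy:
    """Hypothetical strategy: sample uniformly without replacement from `pool`.

    The fixed `cost_per_sample` is an illustrative cost model, not one
    taken from the paper.
    """
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")

    def strategy(budget: float) -> List[str]:
        # A fresh seeded generator makes each call reproducible.
        rng = random.Random(seed)
        # Number of samples the budget can pay for, capped at the pool size.
        affordable = max(0, int(budget // cost_per_sample))
        n = min(len(pool), affordable)
        return rng.sample(list(pool), n)

    return strategy


if __name__ == "__main__":
    # Placeholder identifiers standing in for plausible images.
    pool = [f"img_{i}" for i in range(10)]
    curate = make_uniform_strategy(pool, cost_per_sample=2.0)
    print(curate(5.0))  # two identifiers drawn from the pool
```

Under this framing, two strategies can be compared at equal cost by calling each with the same budget and evaluating the sets they return, which is the kind of comparison a benchmark for curation strategies needs.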