The NeurIPS 2024 Preshow: Creating SPIQA: Addressing the Limitations of Existing Datasets for Scientific VQA
SPIQA is a large-scale dataset for scientific paper comprehension, built to address the limitations of existing scientific VQA datasets. It incorporates visual data from approximately 26,000 computer science research papers, including some 150,000 figures and 117,000 tables, and is designed specifically for interpreting complex figures and tables within the context of the full paper, giving a robust platform for training more comprehensive AI systems. Its question generation follows a dual approach that combines automated methods with manual curation, aiming for both scale and quality control. Creating SPIQA meant overcoming several challenges in data curation and generation, including balancing automation with human expertise, handling domain-specific considerations, and choosing evaluation metrics.
Company
Voxel51
Date published
Dec. 6, 2024
Author(s)
Harpreet Sahota
Word count
1117
Language
English
Hacker News points
None found.