
NeurIPS 2023 Survival Guide

What's this blog post about?

The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) took place in New Orleans, LA from December 10th to 16th, 2023. With 3,584 accepted papers and numerous tutorials and workshops, the conference covered a wide range of topics in machine learning. Among these were advances in multimodal machine learning: techniques that enable machines to process and understand information from multiple types of data simultaneously, such as text, images, and audio. The ten most notable papers on multimodal machine learning at NeurIPS 2023 are:

1. Chameleon: This paper presents a technique for extending pure language models to challenging multimodal tasks by leveraging the general reasoning capabilities of large language models in conjunction with tools for web search, mathematical analysis, and visual understanding.

2. Cheap and Quick: The authors introduce a new technique called Mixture-of-Modality Adaptation (MMA) that fine-tunes vision-language models more efficiently by training lightweight adapters instead of the entire model (a minimal adapter sketch appears after this list).

3. DataComp: This paper presents a competition and benchmark for evaluating novel filtering strategies for constructing multimodal datasets, as well as a state-of-the-art 1 billion sample multimodal dataset.

4. Holistic Evaluation of Text-to-Image Models (HEIM): The authors propose the first holistic evaluation benchmark for text-to-image (T2I) models, which covers 12 aspects of performance and incorporates both computational metrics and crowd-sourced human evaluations.

5. ImageReward: This paper presents a general-purpose reward model for human preferences on T2I generated images, as well as an approach for aligning T2I models with these preferences using Reinforcement Learning from Human Feedback (RLHF); a sketch of the pairwise preference loss used to train such reward models also follows the list.

6. InstructBLIP: The authors apply instruction tuning to vision-language models and present a family of state-of-the-art zero-shot models, as well as fine-tuned models that achieve SOTA on specific tasks.

7. LAMM: This paper presents a dataset and benchmark for evaluating 2D and 3D visual reasoning in multimodal large language models.

8. MagicBrush: The authors present a high-quality hand-crafted image editing dataset, which includes both single-turn and multi-turn edits, and demonstrate its utility in improving performance on instruction-guided image editing tasks.

9. OBELICS: This paper presents the first web-scale dataset of natural multimodal documents with interleaved images and text, and demonstrates its utility in training a new 80 billion parameter model that is competitive with Flamingo.

10. Pick-a-Pic: The authors present a large-scale dataset of prompts, generated images, and human preferences for T2I models, which can be used to train models to better align with user preferences.
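
To make the adapter idea behind "Cheap and Quick" concrete, here is a minimal sketch of a generic bottleneck adapter in PyTorch. This is not the paper's Mixture-of-Modality Adaptation implementation; the module name, hidden size, and layer count are illustrative assumptions, but the core idea is the same: freeze the large backbone and train only small residual modules.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Residual bottleneck adapter: down-project, nonlinearity, up-project.
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        # Start the up-projection at zero so the adapter is initially an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: freeze a (hypothetical) pretrained backbone, train only the adapters.
# for p in backbone.parameters():
#     p.requires_grad = False
# adapters = nn.ModuleList(BottleneckAdapter(hidden_dim=768) for _ in range(12))
```

Because only the adapter parameters receive gradients, the trainable parameter count drops by orders of magnitude relative to full fine-tuning, which is the source of the efficiency gains the paper reports.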
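
ImageReward and Pick-a-Pic both rest on reward models trained from pairwise human preferences. The sketch below shows the standard Bradley-Terry-style preference loss such models typically optimize; `reward_model` is a hypothetical scorer mapping an image (conditioned on its prompt) to a scalar, not the papers' actual architectures.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    # Score both images in a labeled preference pair.
    r_pref = reward_model(preferred)   # shape: (batch,)
    r_rej = reward_model(rejected)     # shape: (batch,)
    # Maximize the log-probability that the preferred image outranks the
    # rejected one: loss = -log(sigmoid(r_pref - r_rej)), the Bradley-Terry objective.
    return -F.logsigmoid(r_pref - r_rej).mean()
```

A reward model trained this way can then score new generations, either to rerank samples or to serve as the reward signal in RLHF-style fine-tuning of the generator.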

Company
Voxel51

Date published
Dec. 9, 2023

Author(s)
Jacob Marks

Word count
2837

Hacker News points
None found.

Language
English

