Dragonfly: A large vision-language model with multi-resolution zoom

Post Details

Company

Together AI

Date Published

June 6, 2024

Author

Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

Word Count

1,061

Language

English

Hacker News Points

143

Source URL

www.together.ai/blog/dragonfly-v1

Summary

Dragonfly is an instruction-tuned vision-language architecture that improves fine-grained visual understanding and reasoning about image regions through a multi-resolution zoom-and-select strategy. Processing an image at several resolutions and selecting among the resulting regions gives the model detailed yet efficient coverage of complex image data in specialized domains such as biomedical imaging. The model achieves competitive performance on general vision-language benchmarks such as commonsense visual QA and image captioning, and it outperforms prior models, including Med-Gemini, on multiple medical imaging tasks. The authors attribute these results to the model's ability to focus on the details of individual image regions, which supports both better commonsense reasoning and closer reading of high-resolution images.
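
The zoom-and-select idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scale factors, the crop size, and especially the pixel-variance score used to pick regions are assumptions chosen for illustration, since the summary does not say how regions are selected. The sketch resizes an image to several resolutions, tiles each into fixed-size crops, and keeps only the highest-scoring crops at the finest resolution.

```python
import numpy as np


def resize_nearest(image, size):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]


def split_into_crops(image, crop):
    """Split a square image into non-overlapping crop x crop tiles (row-major)."""
    n = image.shape[0] // crop
    return [
        image[r * crop:(r + 1) * crop, c * crop:(c + 1) * crop]
        for r in range(n)
        for c in range(n)
    ]


def zoom_and_select(image, base=8, scales=(1, 2, 4), keep=4):
    """Build crops at each scale; keep only the top `keep` crops at the finest one.

    Hypothetical helper: a real model would feed each crop to a vision encoder
    and would likely score regions with learned features, not raw pixels.
    """
    pyramid = {}
    for s in scales:
        resized = resize_nearest(image, base * s)
        pyramid[s] = split_into_crops(resized, base)

    finest = max(scales)
    # Stand-in relevance score (an assumption): pixel variance of each crop.
    scores = np.array([c.astype(np.float64).var() for c in pyramid[finest]])
    # Indices of the highest-scoring crops, returned in row-major order.
    top = np.sort(np.argsort(scores)[::-1][:keep])
    pyramid[finest] = [pyramid[finest][i] for i in top]
    return pyramid, top
```

With the default arguments, a 32x32 input yields 1 crop at the lowest scale, 4 at the middle scale, and 4 selected out of 16 at the finest scale, so the number of regions passed downstream grows far more slowly than the pixel count.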