Dragonfly is an instruction-tuned vision-language architecture that improves fine-grained visual understanding and reasoning about image regions through a multi-resolution zoom-and-select strategy. This approach supports detailed yet efficient processing of complex images in specialized domains such as biomedical imaging. The model achieves competitive performance on general vision-language benchmarks, including commonsense visual QA and image captioning, and outperforms prior models, including Med-Gemini, on multiple medical imaging tasks. These results are attributed to its ability to focus on the details of individual image regions, which supports both commonsense reasoning and the understanding of high-resolution images.
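As a rough illustration of the zoom-and-select idea, the sketch below builds a low-resolution global view of an image, tiles the full-resolution image into crops, and keeps only the highest-scoring crops. It is a toy example under stated assumptions, not Dragonfly's actual implementation: the text above does not say how regions are scored or selected, so pixel variance is used purely as a stand-in, and every function name and parameter here (`downsample`, `split_into_crops`, `select_crops`, `zoom_and_select`, `crop`, `k`, `factor`) is hypothetical.

```python
from statistics import pvariance


def downsample(image, factor):
    """Average-pool a 2D grayscale image (list of rows) by an integer factor."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h - h % factor, factor):
        row = []
        for c in range(0, w - w % factor, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def split_into_crops(image, crop):
    """Zoom step: tile the image into non-overlapping crop x crop patches,
    keyed by the (row, col) of each patch's top-left corner."""
    h, w = len(image), len(image[0])
    crops = {}
    for r in range(0, h - crop + 1, crop):
        for c in range(0, w - crop + 1, crop):
            crops[(r, c)] = [row[c:c + crop] for row in image[r:r + crop]]
    return crops


def select_crops(crops, k):
    """Select step: keep the k patches with the highest pixel variance.
    Variance is an assumed placeholder score, not the model's criterion."""
    def score(patch):
        return pvariance([p for row in patch for p in row])
    ranked = sorted(crops, key=lambda pos: score(crops[pos]), reverse=True)
    return ranked[:k]


def zoom_and_select(image, crop=4, k=1, factor=2):
    """Return a low-resolution global view plus the positions of the
    k selected full-resolution crops."""
    return {
        "global": downsample(image, factor),
        "selected": select_crops(split_into_crops(image, crop), k),
    }


# Toy 8x8 image: flat background with a checkerboard in the bottom-right 4x4.
image = [[0] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        image[r][c] = 255 if (r + c) % 2 == 0 else 0

result = zoom_and_select(image)
print(result["selected"])  # [(4, 4)]
```

In this toy case only the bottom-right crop contains any detail, so it is the one retained at full resolution, while the rest of the image is represented only by the 4x4 global view. A real system would presumably score regions with learned features rather than raw pixel statistics.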