MM1.5 is an upgraded multimodal large language model (MLLM) that scales efficiently and excels at fine-grained image and text tasks. It comes in both dense and mixture-of-experts (MoE) variants and takes a data-centric approach to improving performance in areas such as OCR, image comprehension, image captioning, and video processing. Two specialized variants are also offered: MM1.5-Video for video understanding and MM1.5-UI for mobile UI analysis. The model demonstrates strong few-shot learning capabilities and remains competitive even at smaller scales. These enhanced multimodal capabilities make it suitable for diverse applications, from document processing to augmented reality.