MM1 (Multimodal Large Language Model) is Apple's family of multimodal large language models that combine text and image understanding. The family scales up to 30 billion parameters and spans both dense and mixture-of-experts (MoE) variants (a minimal sketch of MoE routing follows below); it interprets interleaved image and text inputs and generates text. Large-scale pre-training yields state-of-the-art pre-training metrics, and the models achieve competitive performance after supervised fine-tuning on a range of multimodal benchmarks.

MM1 demonstrates strong in-context learning, particularly in its largest configuration: it makes predictions grounded in the context of a given input, supporting multi-image reasoning, chain-of-thought reasoning, few-shot learning with instruction tuning, visual question answering, and captioning. The performance evaluation covers scaling via mixture-of-experts, supervised fine-tuning experiments, the impact of image resolution, the effects of pre-training choices, and qualitative analysis. Apple also designed MM1 to respect user privacy, reduce bias, be transparent about its capabilities, ensure fairness, avoid harm, and maintain human oversight.
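To make the mixture-of-experts idea mentioned above concrete, here is a minimal sketch of a top-k routed MoE feed-forward layer in PyTorch. This is an illustration of the general technique only, not Apple's implementation (which is not public); all class names, dimensions, and hyperparameters are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed mixture-of-experts feed-forward layer.

    Illustrative sketch of the general MoE technique; MM1's actual
    architecture and hyperparameters are not public.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route each token to its top-k experts
        # and combine their outputs, weighted by the router's scores.
        logits = self.router(x)                         # (batch, seq, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)      # (batch, seq, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny forward pass with random activations to show the shapes.
layer = TopKMoE(d_model=64, d_hidden=256)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

The appeal of this design, and the reason it appears in MM1's scaling experiments, is that each token activates only k of the experts, so parameter count (and model capacity) grows with the number of experts while per-token compute stays roughly constant.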