MM1 (Multimodal Large Language Model) is Apple's family of multimodal large language models that combine text and image understanding. The family scales up to 30 billion parameters and spans both dense and mixture-of-experts (MoE) variants (a minimal sketch of MoE routing follows below); it interprets interleaved image and text inputs and generates text. Large-scale pre-training yields state-of-the-art pre-training metrics, and the models achieve competitive performance after supervised fine-tuning on a range of multimodal benchmarks.

MM1 demonstrates strong in-context learning, particularly in its largest configuration: it makes predictions grounded in the context of a given input, supporting multi-image reasoning, chain-of-thought reasoning, few-shot learning with instruction tuning, visual question answering, and captioning. The performance evaluation covers scaling via mixture-of-experts, supervised fine-tuning experiments, the impact of image resolution, the effects of pre-training choices, and qualitative analysis. Apple also designed MM1 to respect user privacy, reduce bias, be transparent about its capabilities, ensure fairness, avoid harm, and maintain human oversight.
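To make the mixture-of-experts idea mentioned above concrete, here is a minimal sketch of a top-k routed MoE feed-forward layer in PyTorch. This is an illustration of the general technique only, not Apple's implementation (which is not public); all class names, dimensions, and hyperparameters are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed mixture-of-experts feed-forward layer.

    Illustrative sketch of the general MoE technique; MM1's actual
    architecture and hyperparameters are not public.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route each token to its top-k experts
        # and combine their outputs, weighted by the router's scores.
        logits = self.router(x)                         # (batch, seq, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)      # (batch, seq, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny forward pass with random activations to show the shapes.
layer = TopKMoE(d_model=64, d_hidden=256)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

The appeal of this design, and the reason it appears in MM1's scaling experiments, is that each token activates only k of the experts, so parameter count (and model capacity) grows with the number of experts while per-token compute stays roughly constant.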