MM1: Methods, Analysis, and Insights from Multimodal LLM Pre-training, from researchers at Apple, explores which architectural components and data-selection strategies matter when building efficient multimodal models. By combining different kinds of pre-training data, they improve few-shot learning performance across a range of benchmarks, and the 30B-parameter dense model beats prior state-of-the-art (SOTA) results on visual question answering (VQA) and captioning tasks.

HyperLLaVA is a framework that dynamically tunes both the projector and the LLM parameters, using a training methodology that first aligns visual-language features and then refines language-model tuning with multimodal instructions. The approach shows strong results on MLLM benchmarks, pointing toward AI systems that are more nuanced, adaptable, and capable of handling complex multimodal data (an illustrative sketch of the dynamic-projector idea appears at the end of this section).

Google's video-gaming companion, the Scalable Instructable Multiworld Agent (SIMA), is an AI agent trained on a dataset of video games to follow instructions and interact with the environment in real time through a generic, human-like interface. Mora is a multi-agent framework designed for generalist video generation, integrating several visual AI agents into a cohesive system.

Developer resources include Gemini 1.5 Pro API support and 15 GitHub repositories for image segmentation.
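For the Gemini 1.5 Pro API support mentioned above, a minimal usage sketch with the google-generativeai Python SDK might look like the following; the model identifier, the local file name, and the prompt are assumptions for illustration and may need adjusting to what the API currently exposes.

```python
# Minimal sketch of calling Gemini 1.5 Pro through the google-generativeai SDK.
# The model name below is an assumption; check the API docs for current identifiers.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Multimodal prompt: an uploaded image plus a text question.
image = genai.upload_file("frame.png")  # assumed local file for illustration
response = model.generate_content([image, "Describe what is happening in this frame."])
print(response.text)
```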
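To make the "dynamically tunes the projector" idea concrete, here is a minimal, illustrative PyTorch sketch of an input-conditioned projector: a small hypernetwork produces a low-rank, per-sample adjustment on top of a static vision-to-LLM projection. The module and parameter names (DynamicProjector, expert_dim, etc.) and the specific factorization are assumptions for illustration, not HyperLLaVA's actual implementation.

```python
import torch
import torch.nn as nn

class DynamicProjector(nn.Module):
    """Illustrative sketch (not HyperLLaVA's code): a vision-to-LLM projector
    whose weights are modulated per sample by a small hypernetwork."""

    def __init__(self, vis_dim=1024, llm_dim=4096, expert_dim=64):
        super().__init__()
        # Static base projection, as in a standard LLaVA-style projector.
        self.base = nn.Linear(vis_dim, llm_dim)
        # Hypernetwork "visual expert": maps a pooled visual summary to the
        # two factors of a low-rank weight update for the projector.
        self.hyper_a = nn.Linear(vis_dim, expert_dim * vis_dim)
        self.hyper_b = nn.Linear(vis_dim, llm_dim * expert_dim)
        self.expert_dim, self.vis_dim, self.llm_dim = expert_dim, vis_dim, llm_dim

    def forward(self, vis_tokens):
        # vis_tokens: (batch, num_patches, vis_dim)
        summary = vis_tokens.mean(dim=1)                                  # (batch, vis_dim)
        a = self.hyper_a(summary).view(-1, self.expert_dim, self.vis_dim) # (batch, r, vis_dim)
        b = self.hyper_b(summary).view(-1, self.llm_dim, self.expert_dim) # (batch, llm_dim, r)
        delta_w = torch.bmm(b, a)                                         # (batch, llm_dim, vis_dim)
        # Static projection plus an input-conditioned ("dynamic") adjustment.
        static = self.base(vis_tokens)
        dynamic = torch.einsum("bov,bnv->bno", delta_w, vis_tokens)
        return static + dynamic
```

The low-rank factorization keeps the hypernetwork's output to a manageable size; the actual visual and language experts in HyperLLaVA differ in detail.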