Company
Date Published
May 30, 2024
Author
Stephen Oladele
Word count
1692
Language
English
Hacker News points
None

Summary

Llama 3-V is an open-source multimodal AI model that delivers performance comparable to GPT-4V at a fraction of the size and training cost. Developed by researchers Aksh Garg and Mustafa Aljadery, Llama 3-V combines Meta's Llama 3 8B language model with the SigLIP-SO400M vision model to enable joint understanding of images and text. Its compact size sets it apart: it is roughly 100 times smaller than GPT-4V, yet achieves 10-20% better performance on benchmarks and costs only around $500 to train, making it a highly efficient and accessible alternative to large proprietary models.

The model's open-source nature aligns with the broader trend of democratizing AI, enabling researchers and developers worldwide to access, use, and build upon state-of-the-art models. Its training approach combines precomputed embeddings from SigLIP with a two-stage process of pretraining and supervised fine-tuning on a large dataset of image-text pairs. This methodology aligns the visual and textual modalities effectively while remaining computationally efficient.

Llama 3-V's performance has been demonstrated across various benchmarks, where it rivals and in some cases surpasses significantly larger models. Potential applications include healthcare, agriculture, content creation, visual question answering, and autonomous vehicles, among others.
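
To make the architecture description above more concrete, the sketch below shows one common way such a fusion can be wired up: a small projection module maps precomputed (frozen) SigLIP patch embeddings into the language model's hidden size, and the projected image "tokens" are prepended to the text token embeddings before they enter the Llama 3 decoder. This is a minimal illustrative sketch; the `VisionProjector` class, the MLP shape, and the dummy tensor dimensions are assumptions for illustration, not the authors' published implementation.

```python
# Minimal sketch (PyTorch) of fusing precomputed SigLIP image embeddings
# with Llama 3 text token embeddings via a learned projection.
# Module name and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Projects frozen SigLIP patch embeddings into the LLM's hidden size."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (batch, num_patches, vision_dim), precomputed by SigLIP
        return self.proj(image_embeds)


# Prepend projected image "tokens" to the text token embeddings before
# feeding the combined sequence to the Llama 3 decoder.
projector = VisionProjector()
image_embeds = torch.randn(1, 729, 1152)  # dummy SigLIP-SO400M patch features
text_embeds = torch.randn(1, 32, 4096)    # dummy Llama 3 8B token embeddings
fused = torch.cat([projector(image_embeds), text_embeds], dim=1)
print(fused.shape)  # torch.Size([1, 761, 4096])
```

In a setup like this, the two-stage recipe the summary mentions would typically train only the projection during pretraining on image-text pairs, then unfreeze more of the language model during supervised fine-tuning, which is one way the approach stays computationally cheap.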