LLaVA (Large Language and Vision Assistant) is an open-source project developed by researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University. It aims to build a novel, end-to-end trained large multimodal model that can compete with even the largest proprietary models, such as GPT-4. The LLaVA team created roughly 150K image-instruction pairs from images in the COCO Train2017 dataset, using GPT-4 to generate conversations about each image cheaply and efficiently. For training, they paired the widely used CLIP ViT-L/14 visual encoder with Vicuna, an LLM based on Llama 2. In the authors' evaluation, LLaVA achieved an overall relative score of about 85% compared with GPT-4. The training data has since been expanded beyond COCO to additional sources, and the instruction-tuning mixture now contains over 665K conversations.
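
To get a feel for how these pieces fit together in practice, here is a minimal sketch of running an image and a question through a LLaVA-1.5 checkpoint. It assumes the community-hosted `llava-hf/llava-1.5-7b-hf` weights and the Hugging Face `transformers` integration (rather than the project's own inference scripts), plus an example image from COCO; treat it as an illustration, not the official usage.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed community-hosted LLaVA-1.5 checkpoint (CLIP ViT-L/14 vision encoder + Vicuna-7B LLM).
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Any RGB image works; this example URL points to a COCO val2017 image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 uses a Vicuna-style chat prompt with an <image> placeholder token.
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Under the hood, a checkpoint like this connects the CLIP ViT-L/14 image features to Vicuna's token embedding space through a lightweight projection module, which is the core architectural idea behind LLaVA's end-to-end training.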