
Understanding LLaVA: Large Language and Vision Assistant

What's this blog post about?

LLaVA (Large Language and Vision Assistant) is an open-source project from researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University. It aims to build a novel, end-to-end trained large multimodal model that can compete with even the largest models, such as GPT-4. The LLaVA team created roughly 150K image-instruction pairs from images in the COCO Train2017 dataset, using GPT-4 to generate conversations about each image cheaply and efficiently. For training, they paired the widely used CLIP ViT-L/14 visual encoder with Vicuna, an LLM based on Llama 2. In evaluations, LLaVA achieved an overall relative score of about 85% compared to GPT-4. The instruction dataset has since been expanded beyond COCO to include additional sources, and now contains over 665K conversations.
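As a rough illustration of this design, the sketch below (PyTorch, not the authors' code) shows how patch features from a CLIP ViT-L/14 encoder can be projected into the language model's embedding space and prepended to the text tokens; the class name, variable names, and dimensions are assumptions chosen for clarity.

    import torch
    import torch.nn as nn

    class VisionLanguageConnector(nn.Module):
        # Illustrative sketch of LLaVA-style feature projection:
        # a linear layer maps CLIP ViT-L/14 patch features (dim 1024)
        # into the LLM's token-embedding space (e.g. 4096 for a 7B Vicuna).
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, image_features, text_embeddings):
            # image_features: (batch, num_patches, vision_dim) from the vision encoder
            # text_embeddings: (batch, num_text_tokens, llm_dim) from the LLM's embedding table
            image_tokens = self.proj(image_features)
            # Prepend projected image tokens so the LLM attends to them like a prefix.
            return torch.cat([image_tokens, text_embeddings], dim=1)

    # Example shapes: one image with 256 patches and a 32-token prompt.
    connector = VisionLanguageConnector()
    img_feats = torch.randn(1, 256, 1024)
    txt_embeds = torch.randn(1, 32, 4096)
    fused = connector(img_feats, txt_embeds)   # shape: (1, 288, 4096)

Later LLaVA versions swap the single linear layer for a small MLP, but the core idea of mapping visual features into the language model's input space stays the same.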

Company
Voxel51

Date published
Dec. 11, 2023

Author(s)
Dan Gural

Word count
1584

Language
English

Hacker News points
None found.
