Understanding LLaVA: Large Language and Vision Assistant
LLaVA (Large Language and Vision Assistant) is an open-source project developed by researchers at the University of Wisconsin, Microsoft Research, and Columbia University. It aims to build a novel, end-to-end trained large multimodal model that can compete with even the largest models, such as GPT-4. The LLaVA team created 150K image-instruction pairs using images from the COCO Train2017 dataset, leveraging GPT-4 to generate conversations about the images cheaply and efficiently. They paired the widely popular CLIP ViT-L/14 visual encoder with Vicuna, an LLM based on Llama 2, for model training. The results show that LLaVA achieved an overall relative score of 85% compared to GPT-4. The training data has since been expanded beyond COCO with additional datasets and now comprises over 665K conversations.
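As a rough illustration of how a model like this can be queried, the sketch below assumes the community `llava-hf/llava-1.5-7b-hf` checkpoint and the `LlavaForConditionalGeneration`/`AutoProcessor` classes from Hugging Face `transformers` (version 4.36 or later); it is not taken from the original post, and the image path is hypothetical.

```python
# Minimal sketch (assumptions: llava-hf/llava-1.5-7b-hf checkpoint,
# transformers >= 4.36 with LLaVA support, a local example image).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # hypothetical path
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"

# The processor interleaves the image features (produced by the CLIP ViT-L/14
# encoder) with the text tokens consumed by the Vicuna language model.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```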
Company
Voxel51
Date published
Dec. 11, 2023
Author(s)
Dan Gural
Word count
1584
Language
English
Hacker News points
None found.