Multimodal Retrieval Augmented Generation (RAG)
Humans learn differently from current state-of-the-art models such as Large Language Models (LLMs). Although LLMs are trained on trillions of tokens, they still lack a grounded understanding of causal relationships in the world. Humans, by contrast, efficiently form coherent models of the world by integrating multiple sensory inputs into multimodal representations of information.

One approach to training models that understand multimodal data is contrastive learning: separate models are trained for each modality, and their representations are then unified through contrastive training so that matching pairs (for example, an image and its caption) land close together in a shared embedding space.

Another technique is any-to-any search and retrieval, which uses multimodal embedding models to search across modalities and scales those embeddings to production workloads with vector databases.

Multimodal Retrieval Augmented Generation (MM-RAG) augments generation from Large Multimodal Models (LMMs) with multimodal retrieval of images and other media. The process involves two steps: retrieving information from a multimodal knowledge base, and generating content with a large multimodal model grounded in the retrieved context.

In summary, understanding how humans learn can help improve AI models' grasp of causal relationships and their use of multimodal data. Contrastive learning, any-to-any search and retrieval, and MM-RAG all contribute to this goal by enabling efficient integration of multiple sensory inputs and improving the accuracy and scalability of AI systems.
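To make the contrastive-training idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss, assuming separate image and text encoders that each project into a shared embedding space; the encoders, batch loader, and temperature value are placeholders, not details from the article.

```python
# A minimal sketch of a CLIP-style contrastive loss over a batch of
# N (image, text) pairs embedded into a shared space (assumed setup).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix between every image and every text in the batch.
    logits = image_emb @ text_emb.T / temperature

    # The true pair sits on the diagonal; every other entry is a negative.
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

In a training loop, `loss = contrastive_loss(image_encoder(images), text_encoder(captions))` would be backpropagated through both encoders, pulling paired samples together and pushing unpaired ones apart in the unified representation space.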
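For any-to-any search against a vector database, the sketch below assumes a locally running Weaviate instance whose collection is vectorized by a multimodal module (such as multi2vec-clip); the `Media` class, its properties, the query strings, and the file name are illustrative assumptions, shown with the v3 Python client syntax.

```python
# A sketch of any-to-any retrieval: text and image queries are embedded into
# the same vector space, so either can retrieve any indexed modality.
import weaviate

client = weaviate.Client("http://localhost:8080")

# Text query -> media objects (images, captions, ...) in the shared space.
text_results = (
    client.query
    .get("Media", ["caption", "path"])
    .with_near_text({"concepts": ["a golden retriever playing in snow"]})
    .with_limit(3)
    .do()
)

# Image query -> related media; the client encodes the local file and the
# multimodal module embeds it into the same vector space as the text.
image_results = (
    client.query
    .get("Media", ["caption", "path"])
    .with_near_image({"image": "query.jpg"})
    .with_limit(3)
    .do()
)
```

The same index serves both queries because every object, regardless of modality, is stored as a vector in one shared embedding space.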
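The two MM-RAG steps can be sketched as follows: the retrieved captions and image URLs (step 1, e.g. the results of a query like the one above) are passed as grounding context to a large multimodal model (step 2). GPT-4 Vision via the OpenAI Python client is used here purely as an example LMM, and the shape of the `docs` list is an assumption.

```python
# A minimal MM-RAG sketch: ground an example LMM (GPT-4 Vision) in
# multimodal context retrieved from the knowledge base in step 1.
from openai import OpenAI

def answer_with_mm_rag(question: str, docs: list[dict]) -> str:
    """docs is the retrieved multimodal context, e.g.
    [{"caption": "...", "image_url": "https://..."}, ...] (assumed shape)."""
    # Build one user message interleaving the question, retrieved captions,
    # and retrieved images.
    content = [{"type": "text",
                "text": f"Answer using only the context below.\nQuestion: {question}"}]
    for doc in docs:
        content.append({"type": "text", "text": doc["caption"]})
        content.append({"type": "image_url",
                        "image_url": {"url": doc["image_url"]}})

    # Step 2: generation grounded in the retrieved multimodal context.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=300,
    )
    return response.choices[0].message.content
```

Because the model only sees the retrieved captions and images alongside the question, its output stays grounded in the multimodal knowledge base rather than relying solely on what it memorized during training.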
Company
Weaviate
Date published
Dec. 5, 2023
Author(s)
Zain Hasan
Word count
2023
Hacker News points
None found.
Language
English