This tutorial demonstrates how to build a multimodal Retrieval Augmented Generation (RAG) system, one that can draw on several types of data, such as images, audio, video, and text. The system uses OpenAI CLIP to relate images and text, Milvus Standalone to manage large-scale embeddings efficiently, Ollama to run Llama 3 locally on a laptop, and LlamaIndex as the query engine, with Milvus as its vector store. The accompanying code examples are available on GitHub, and the tutorial explains how to run queries that involve both text and images.
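To make the retrieval idea concrete, here is a minimal sketch of the step that the vector store performs: ranking stored items by similarity to a query embedding. It is not the tutorial's code and does not call CLIP, Milvus, Ollama, or LlamaIndex. The three-dimensional vectors, the item names, and the helper functions `cosine_similarity` and `top_k` are all hypothetical stand-ins, written by hand to show how images and text that share an embedding space can be returned by a single query.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the ids of the k items most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), item_id)
        for item_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical hand-written embeddings. In a real system these would come
# from an embedding model and live in a vector database.
toy_index = {
    "image:cat.jpg": [0.9, 0.1, 0.0],
    "text:cats are small felines": [0.8, 0.2, 0.1],
    "image:car.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text query about cats.
query = [1.0, 0.0, 0.0]

results = top_k(query, toy_index, k=2)
print(results)
```

Because the image and text entries sit in the same vector space, one query can return both kinds of item. In a full system the retrieved items would then be passed to the language model as context for generating the answer.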