Multimodal architectures are gaining prominence in Generative AI (GenAI) as organizations increasingly build solutions using multimodal models such as GPT-4V and Gemini Pro Vision. These models can semantically embed and interpret various data types, making them more versatile and effective than traditional large language models across a broader range of applications. However, challenges arise in ensuring their reliability and accuracy due to hallucinations, where models produce incorrect or irrelevant outputs. Multimodal Retrieval Augmented Generation (RAG) addresses these limitations by grounding models in relevant contextual information retrieved from external sources. Evaluation tools like TruLens help developers monitor performance, test reliability, and identify areas for improvement in multimodal RAG systems, ensuring accuracy and relevance while minimizing hallucinations.
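To make the RAG idea concrete, the sketch below shows the core retrieve-then-ground loop in miniature: embed a query, rank stored items by cosine similarity, and prepend the best match to the prompt before generation. All names, documents, and embedding vectors here are hypothetical toy values; a real system would use a multimodal embedding model and a vector store.

```python
import numpy as np

# Hypothetical toy corpus: in practice these embeddings would come from
# a multimodal embedding model applied to images, charts, and text.
corpus = {
    "chart_caption": np.array([0.9, 0.1, 0.0]),
    "product_photo_alt_text": np.array([0.2, 0.8, 0.1]),
    "faq_answer": np.array([0.1, 0.2, 0.9]),
}
documents = {
    "chart_caption": "Q3 revenue rose 12% quarter over quarter.",
    "product_photo_alt_text": "Red running shoe, side view.",
    "faq_answer": "Returns are accepted within 30 days.",
}

def retrieve(query_vec, k=1):
    """Return the top-k document keys ranked by cosine similarity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(corpus, key=lambda key: cos(query_vec, corpus[key]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, query_vec):
    """Ground the question in retrieved context before generation."""
    context = "\n".join(documents[key] for key in retrieve(query_vec))
    return f"Context:\n{context}\n\nQuestion: {question}"

# A query embedding close to the chart caption pulls in that context,
# so the downstream model answers from evidence rather than guessing.
prompt = build_prompt("How did revenue change in Q3?",
                      np.array([0.85, 0.15, 0.05]))
```

Grounding the generation step in retrieved context like this is what reduces hallucination: the model is asked to answer from supplied evidence rather than from parametric memory alone.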