Multimodal Large Language Models (MLLMs) are reshaping how we process and integrate text, images, audio, and video. To build, evaluate, and monitor an MLLM effectively, it's essential to understand how these models are architected. Most follow one of two primary approaches: alignment-focused or early-fusion. The alignment architecture connects a pretrained vision encoder to a pretrained LLM through specialized alignment layers, while the early-fusion architecture processes mixed visual and text tokens together in a single unified transformer.

MLLMs have advanced rapidly, with both closed and open-source models pushing the boundaries of what's possible. However, challenges such as hallucinations and data quality, along with sound monitoring strategies, remain critical for reliable real-world performance. Evaluating multimodal LLMs effectively requires specialized metrics for cross-modal performance, consistency, and bias detection, which can be handled by platforms like Galileo's Luna Evaluation Foundation Models.
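To make the alignment approach concrete, here is a minimal sketch of the idea: features from a frozen vision encoder are projected into the LLM's token-embedding space so image patches can be consumed alongside text tokens. The class name, MLP design, and dimensions (1024-dim vision features, 4096-dim LLM embeddings) are illustrative assumptions, not the implementation of any specific model.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Toy alignment layer: maps vision-encoder patch features into the
    LLM's embedding space (hypothetical dimensions for illustration)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP is one common choice for the alignment layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(patch_features)  # -> (batch, num_patches, llm_dim)


if __name__ == "__main__":
    projector = VisionLanguageProjector()
    image_patches = torch.randn(2, 256, 1024)   # stand-in for vision-encoder output
    text_embeddings = torch.randn(2, 32, 4096)  # stand-in for LLM token embeddings

    # Project visual features, then prepend them to the text sequence
    # before feeding the combined sequence into the LLM.
    visual_tokens = projector(image_patches)
    llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
    print(llm_input.shape)  # torch.Size([2, 288, 4096])
```

An early-fusion model, by contrast, would skip the separate projection step and train a single transformer directly on an interleaved vocabulary of visual and text tokens.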