LLM evaluation refers to ensuring that the outputs of Large Language Models (LLMs) align with human expectations, covering ethical and safety considerations as well as correctness and relevancy. LLM systems are built from multiple components that make them more capable, but this layered architecture also makes them harder to evaluate. Offline evaluations test an LLM system in a local development setup, while real-time evaluations run on production data and feed back into the benchmark datasets.

To evaluate an LLM system, it is essential to choose the right metrics, such as correctness, answer relevancy, and contextual recall; each metric is either reference-based (scored against an expected output) or reference-less (scored from the input and output alone). Benchmarks are custom-made for each use case, pairing an evaluation dataset with metrics that reflect the specific architecture of the LLM system. Improving the benchmark dataset over time is crucial, and real-time evaluations in production are what make that improvement possible. By understanding how to evaluate LLM systems effectively, developers can ensure their applications produce accurate and relevant outputs.
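As a concrete illustration, here is a minimal sketch of such an evaluation harness in Python, assuming a tiny hand-built benchmark, a reference-based correctness metric, and a reference-less answer-relevancy metric. The `TestCase`, `correctness`, `answer_relevancy`, and `evaluate` names and the token-overlap scoring are illustrative assumptions, not any particular library's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TestCase:
    input: str                                      # prompt sent to the LLM system
    actual_output: str                              # what the system produced
    expected_output: Optional[str] = None           # reference answer (for reference-based metrics)
    retrieval_context: Optional[list[str]] = None   # retrieved chunks, if the system uses RAG


def correctness(case: TestCase) -> float:
    """Reference-based: compares actual_output against expected_output.
    A token-overlap heuristic stands in here for an LLM-as-judge or embedding score."""
    if case.expected_output is None:
        raise ValueError("correctness needs an expected_output reference")
    expected = set(case.expected_output.lower().split())
    actual = set(case.actual_output.lower().split())
    return len(expected & actual) / max(len(expected), 1)


def answer_relevancy(case: TestCase) -> float:
    """Reference-less: only needs the input and the actual output."""
    question = set(case.input.lower().split())
    answer = set(case.actual_output.lower().split())
    return len(question & answer) / max(len(question), 1)


def evaluate(benchmark: list[TestCase],
             metrics: dict[str, Callable[[TestCase], float]],
             threshold: float = 0.5) -> None:
    """Runs every metric over every test case and reports pass/fail per metric."""
    for name, metric in metrics.items():
        scores = [metric(case) for case in benchmark]
        passed = sum(score >= threshold for score in scores)
        print(f"{name}: {passed}/{len(scores)} passed "
              f"(avg score {sum(scores) / len(scores):.2f})")


if __name__ == "__main__":
    benchmark = [
        TestCase(
            input="What is the capital of France?",
            actual_output="The capital of France is Paris.",
            expected_output="Paris is the capital of France.",
        ),
    ]
    evaluate(benchmark, {"correctness": correctness,
                         "answer_relevancy": answer_relevancy})
```

In practice the heuristic scorers would be replaced by LLM-as-judge or embedding-based scoring, and failing cases observed in production can be folded back into the benchmark, which is how real-time evaluation improves the dataset over time.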