LLM evaluation metrics are essential for building robust Large Language Model (LLM) applications. They score an LLM system's output against criteria you care about and make it possible to quantify and compare the performance of different LLM systems. Common metrics include answer correctness, semantic similarity, hallucination, contextual relevancy, responsible-AI metrics such as bias and toxicity, task-specific metrics like summarization quality, and fine-tuning metrics that assess the underlying LLM itself. Scorers broadly fall into statistical scorers (e.g., BLEU and ROUGE), model-based scorers, and use case-specific metrics. LLM-based scorers such as G-Eval, Prometheus, SelfCheckGPT, and QAG are among the most accurate because they draw on the reasoning capabilities of strong evaluator LLMs, and frameworks like DeepEval package many of these metrics ready to use. The right choice of metrics depends on your use case and how the LLM application is implemented, with RAG and fine-tuning metrics being a great starting point. G-Eval is particularly useful for defining custom, use case-specific criteria, and pairing it with few-shot prompting further improves the accuracy of its judgments.
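To make this concrete, here is a minimal sketch of a G-Eval-style custom metric using the open-source DeepEval library. It assumes DeepEval's documented `GEval` interface (a metric name, plain-language criteria, and the test-case fields to evaluate) plus an evaluator LLM configured behind the scenes; the criteria string and the example test case are illustrative, not taken from the original article.

```python
# A minimal sketch of a custom "correctness" metric with DeepEval's GEval.
# Assumes `pip install deepeval` and an evaluator LLM (e.g., an OpenAI API key) is configured.
# The criteria text and test case below are hypothetical examples.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a use case-specific metric in plain language; G-Eval turns the
# criteria into evaluation steps carried out by the evaluator LLM.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually correct "
        "with respect to the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Score a single LLM output against a reference answer.
test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    actual_output="The Eiffel Tower was completed in 1887.",
    expected_output="The Eiffel Tower was completed in 1889.",
)

correctness.measure(test_case)
print(correctness.score)   # score between 0 and 1 from the evaluator LLM
print(correctness.reason)  # natural-language justification for the score
```

The same pattern extends to other criteria (e.g., tone, coherence, or domain-specific requirements) simply by changing the `criteria` string and the evaluation parameters.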