There has been tremendous progress in the world of Large Language Models (LLMs), with blockbuster models like GPT-3, GPT-4, Falcon, MPT, and Llama pushing the state of the art. However, evaluating these models is challenging, not least because of their tendency to hallucinate. To address this, companies are developing evaluation metrics that support data-driven decisions instead of relying solely on human judgment. These include context adherence measures, correctness metrics, log-probability-based metrics, prompt perplexity, and safety checks such as PII, toxicity, tone, sexism, and prompt injection detection. By using these metrics, companies can identify potential issues with their LLMs, optimize their performance, and ensure they are generating high-quality outputs that meet the needs of their users.
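
As a rough illustration of what a log-probability-based metric such as prompt perplexity looks like in practice, the sketch below computes perplexity from per-token log probabilities. The function name and the example values are purely illustrative assumptions, not part of any specific product's API; many model APIs can return per-token log probabilities that would feed a calculation like this.

```python
import math

def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """Perplexity is the exponential of the average negative log probability
    per token; lower values mean the model found the text less surprising."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Illustrative natural-log probabilities for a 4-token prompt.
example_logprobs = [-0.21, -1.35, -0.08, -2.40]
print(f"Prompt perplexity: {perplexity_from_logprobs(example_logprobs):.2f}")
```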