Large language models (LLMs) are difficult to evaluate because they are non-deterministic: the same input can produce many different outputs, which makes it hard to define what counts as an "appropriate" response. LLM applications such as chatbots and code assistants often rely on proprietary data to improve performance, so evaluation is essential to confirm that they actually produce the desired outputs. There are several ways to evaluate LLM outputs, including traditional NLP-based machine learning models and LLM-as-a-judge approaches that use state-of-the-art models like GPT-4 with frameworks such as G-Eval. In Python, open-source packages such as ragas and DeepEval provide evaluation frameworks for measuring how well an application handles its task. The article concludes by highlighting the importance of evaluating LLM applications and pointing to resources for further learning.
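
As a minimal sketch of the LLM-as-a-judge idea, the snippet below uses DeepEval's GEval metric to score a single response against a plain-language correctness criterion. It assumes DeepEval is installed and an OpenAI API key is configured (GEval calls an OpenAI model as the judge by default); exact class and parameter names may differ between DeepEval versions, and the example question and answer are illustrative.

```python
# Minimal sketch: scoring one LLM response with DeepEval's G-Eval metric.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the judging criterion in plain language; the judge LLM turns this
# into evaluation steps and produces a score.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output answers the input accurately.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# A single test case: the prompt sent to the application and the response it returned.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```

ragas follows a similar pattern but evaluates a dataset of questions, answers, and retrieved contexts with metrics such as faithfulness and answer relevancy, which makes it a natural fit for retrieval-augmented applications.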