Today's Large Language Models (LLMs) range from 7 billion to over 100 billion parameters, each more capable than the last, yet they share flawed behaviors such as producing incoherent outputs and not always being factually correct. To confidently assert that one LLM is superior to another, a standardized benchmarking system is needed, one that evaluates models for both ethical reliability and factual accuracy.

Existing research frameworks for benchmarking LLMs include Language Model Evaluation Harness, Stanford HELM, PromptBench, and ChatArena, each with its own strengths and limitations. However, these systems involve many moving parts that can be difficult to manage, and their naming conventions lack standardization.

Best practices for LLM benchmarking cover both pre-production evaluation, through prompt engineering, RAG, fine-tuning, and experimentation, and post-production evaluation, through continuous monitoring, explicit feedback, and continuous fine-tuning. These practices can be implemented with DeepEval, an open-source evaluation infrastructure that provides a robust framework for LLM benchmarking.
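As an illustration, the sketch below shows what a single pre-production evaluation might look like with DeepEval, assuming its `LLMTestCase`, `AnswerRelevancyMetric`, and `evaluate` interfaces; the prompt, response, and threshold are hypothetical placeholders, not values from this article.

```python
# Minimal DeepEval sketch: score one LLM response for answer relevancy.
# Assumes `pip install deepeval` and an API key for the metric's judge model.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical input/output pair, e.g. pulled from application logs or a test set.
test_case = LLMTestCase(
    input="What is the return policy for opened items?",
    actual_output="Opened items can be returned within 30 days for store credit.",
)

# The test case passes when its relevancy score meets or exceeds the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)

# Run the metric against the test case and report pass/fail results.
evaluate([test_case], [metric])
```

The same pattern scales to batches of test cases and additional metrics, which is how pre-production experimentation and post-production monitoring can share one evaluation setup.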