This article discusses best practices for selecting the right model for Large Language Model (LLM)-as-a-judge evaluations. Using an LLM to evaluate the outputs of other models saves significant time and effort as an application scales. The process involves starting with a golden dataset, choosing the evaluation model, analyzing the results, adding explanations for transparency, and monitoring performance in production. GPT-4 emerged as the top performer in recent evaluations, achieving 81% accuracy, though other models such as GPT-3.5 Turbo or Claude 3.5 Sonnet may also be suitable depending on specific needs. The article suggests using Arize's Phoenix library, which provides pre-built prompt templates and resources for running LLM-as-a-judge evaluations.
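As a rough illustration of that workflow, the sketch below runs a judge model over a small golden dataset with Phoenix's evals module and compares the judge's labels against the golden labels. It is a minimal sketch, assuming the phoenix.evals API (llm_classify, OpenAIModel, and the built-in hallucination template and rails); the dataset file name and the "golden_label" column are illustrative, not from the article.

```python
# pip install arize-phoenix-evals pandas openai
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Golden dataset with human-verified labels; the "input", "reference",
# and "output" columns match the variables used by Phoenix's
# hallucination template. "golden_label" is an assumed column name.
golden_df = pd.read_csv("golden_dataset.csv")

# The judge model: GPT-4 scored highest (81% accuracy) in the
# evaluations the article cites; temperature 0 keeps verdicts stable.
judge = OpenAIModel(model="gpt-4", temperature=0.0)

# Classify each row; rails constrain the judge's output to the
# template's allowed labels, and provide_explanation asks the judge
# to justify each verdict for transparency.
results = llm_classify(
    dataframe=golden_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=judge,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# Compare the judge against the golden labels to measure accuracy,
# the same metric used to rank candidate judge models.
accuracy = (results["label"] == golden_df["golden_label"]).mean()
print(f"Judge accuracy vs. golden labels: {accuracy:.0%}")
print(results[["label", "explanation"]].head())
```

Once a judge model clears the accuracy bar on the golden dataset, the same template can be run periodically over sampled production traffic, which is one way to carry out the monitoring step the article describes.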