Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
This paper evaluates how well various LLMs perform as judges on a TriviaQA benchmark. The researchers measure alignment between each judge model's verdicts and human annotations, finding that only the best-performing models (GPT-4 Turbo and Llama-3 70B) achieve high alignment with humans, which underscores the importance of using top-tier models when deploying LLMs as judges. Larger models tend to outperform smaller ones, although the gap is not always significant. The paper also finds that prompt optimization and careful handling of under-specified answers can improve judge performance. Because the study is conducted in a controlled setting, its findings may not generalize directly to real-world use cases. The authors recommend Cohen's Kappa, which accounts for agreement by chance, as the metric for measuring alignment between human evaluators and LLM judges.
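To make the recommended metric concrete, here is a minimal sketch of Cohen's Kappa computed between human annotations and LLM-judge verdicts. The binary labels and helper function below are illustrative assumptions for this summary, not data or code from the paper; the point is simply that Kappa discounts the agreement two raters would reach by chance, which raw percent agreement does not.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n)
              for k in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 1 = answer judged correct, 0 = judged incorrect.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

percent_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"Percent agreement: {percent_agreement:.2f}")          # 0.80
print(f"Cohen's Kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")  # lower, chance-corrected
```

Because Kappa subtracts the agreement expected from each rater's label frequencies, a lenient judge that marks nearly everything correct can score high percent agreement yet low Kappa.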
Company
Arize
Date published
Aug. 16, 2024
Author(s)
Sarah Welsh
Word count
7858
Language
English
Hacker News points
None found.