Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
This paper evaluates how well various LLMs perform as judges on a TriviaQA benchmark. The researchers measure alignment between each judge model's verdicts and human annotations, finding that only the best-performing models (GPT-4 Turbo and Llama-3 70B) achieve high alignment with humans, which underscores the importance of using top-tier models when deploying LLMs as judges. Larger models tend to outperform smaller ones, although the gap is not always significant. The paper also finds that prompt optimization and careful handling of under-specified answers can improve judge performance. Because the study is conducted in a controlled setting, its findings may not generalize directly to real-world use cases. The authors recommend Cohen's Kappa, which accounts for agreement by chance, as the metric for measuring alignment between human evaluators and LLM judges.
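To make the recommended metric concrete, here is a minimal sketch of Cohen's Kappa computed between human annotations and LLM-judge verdicts. The binary labels and helper function below are illustrative assumptions for this summary, not data or code from the paper; the point is simply that Kappa discounts the agreement two raters would reach by chance, which raw percent agreement does not.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n)
              for k in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 1 = answer judged correct, 0 = judged incorrect.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

percent_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"Percent agreement: {percent_agreement:.2f}")          # 0.80
print(f"Cohen's Kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")  # lower, chance-corrected
```

Because Kappa subtracts the agreement expected from each rater's label frequencies, a lenient judge that marks nearly everything correct can score high percent agreement yet low Kappa.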
Company
Arize
Date published
Aug. 16, 2024
Author(s)
Sarah Welsh
Word count
7858
Language
English
Hacker News points
None found.