
How "Correct" are LLM Evaluators?

What's this blog post about?

LangChain benchmarked its LLM-assisted evaluators on common tasks, using the QAEvalChain, the chain-of-thought ("CoT") evaluator, and the Criteria evaluator to grade whether a predicted output is "correct" relative to a reference label. GPT-4 graded accurately across all tasks, including those requiring structured reasoning, while GPT-3.5 and Claude-2 were reliable on simpler tasks such as translation and web Q&A but faltered when additional reasoning was needed. The default "qa" prompt produced the expected grades most consistently across tasks. Future enhancements may include more flexible grading scales, few-shot examples in the prompts, and function calling for GPT-3.5 to generate more reliable structured output.
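The evaluators named above ship with LangChain's evaluation module. As a rough sketch (not code from the post itself), the snippet below shows how a prediction could be graded against a reference label with the default "qa" evaluator, assuming the langchain API as of the post's September 2023 date, an OPENAI_API_KEY in the environment, and a made-up example question:

```python
# Minimal sketch of LLM-assisted correctness grading with LangChain's
# evaluation helpers, circa the Sept. 2023 langchain API.
# The question/answer pair below is hypothetical, not data from the post.
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# GPT-4 was the most reliable grader in LangChain's tests.
llm = ChatOpenAI(model="gpt-4", temperature=0)

# "qa" loads the QAEvalChain with its default prompt; "cot_qa" and
# "labeled_criteria" load the chain-of-thought and Criteria evaluators.
evaluator = load_evaluator("qa", llm=llm)

result = evaluator.evaluate_strings(
    input="What is the capital of France?",
    prediction="The capital of France is Paris.",
    reference="Paris",
)
print(result)  # e.g. {"reasoning": "...", "value": "CORRECT", "score": 1}
```

Swapping "qa" for "cot_qa" asks the grading model to reason step by step before issuing a verdict, which is where GPT-4's advantage over GPT-3.5 and Claude-2 showed up in the post's results.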

Company
LangChain

Date published
Sept. 28, 2023

Author(s)
-

Word count
1442

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.