The Align AI Research Review covers recent research on using large language models (LLMs) as reference-free metrics for evaluating natural language generation (NLG). The Microsoft Cognitive Services Research team proposed G-EVAL, a framework that prompts large LLMs with chain-of-thought (CoT) reasoning and a form-filling paradigm to evaluate NLG outputs. The framework has three main components: a prompt that specifies the evaluation task and criteria, chain-of-thought instructions, and a scoring function. The researchers also investigated whether LLMs exhibit a preference for their own outputs over human-written summaries. The findings suggest that while LLMs can process large volumes of data efficiently, they are not yet reliable enough to serve as sole evaluators. Align AI therefore recommends a hybrid approach that combines LLMs' computational power with human expert judgment.
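To make the three components concrete, here is a minimal sketch of a G-EVAL-style scorer. The prompt wording, the evaluation steps, and the stubbed score probabilities are illustrative assumptions, not the paper's exact implementation; only the probability-weighted scoring function follows the form described in the research (sum of each candidate score times its token probability).

```python
def build_prompt(criterion, definition, cot_steps, source, summary):
    """Assemble the three G-EVAL elements: task/criteria prompt,
    chain-of-thought steps, and the form to fill with a score."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(cot_steps))
    return (
        f"You will be given a summary written for a source text.\n"
        f"Evaluation criterion: {criterion} (1-5). {definition}\n"
        f"Evaluation steps:\n{steps}\n\n"
        f"Source text:\n{source}\n\nSummary:\n{summary}\n\n"
        f"{criterion} (1-5):"  # the form the LLM fills in
    )

def g_eval_score(score_probs):
    """Scoring function: probability-weighted sum over candidate scores,
    i.e. sum_i p(s_i) * s_i, which yields a fine-grained continuous score."""
    return sum(p * s for s, p in score_probs.items())

# Stubbed probabilities for the score tokens "1".."5" -- in practice these
# would come from the evaluator LLM's token log-probabilities (assumption).
probs = {1: 0.02, 2: 0.08, 3: 0.25, 4: 0.45, 5: 0.20}
print(round(g_eval_score(probs), 2))  # -> 3.73
```

The weighted sum is what lets G-EVAL output, say, 3.73 instead of an integer rating, reducing ties between candidate outputs.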