The development of large language models (LLMs) has been marked by rapid progress in generating coherent and capable responses. However, hallucinations (inaccurate or unsupported claims) remain a persistent challenge, motivating automated metrics that detect hallucinations in LLM outputs. ChainPoll is a newly proposed hallucination-detection methodology that substantially outperforms existing alternatives, and RealHall is a carefully curated suite of benchmark datasets for evaluating hallucination-detection metrics. RealHall was assembled by critically reviewing the tasks and datasets used in prior work on hallucination detection and selecting four that remain challenging and relevant for modern LLMs. In a comparison on RealHall against a broad set of alternative metrics, ChainPoll achieves superior performance, with an aggregate AUROC of 0.781, while being cheaper to compute and more explainable. Two further metrics, Adherence and Correctness, are also proposed to quantify distinct kinds of LLM hallucination: Adherence measures whether a response stays faithful to the documents and context it was given, while Correctness captures general logical and reasoning-based mistakes.
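The passage above does not spell out how ChainPoll produces its score, but a ChainPoll-style metric can be sketched as repeatedly polling a judge LLM with a chain-of-thought prompt that asks whether a completion contains hallucinations, then averaging the binary verdicts. The sketch below is a minimal illustration under that assumption: the `ask_judge` callable, the prompt wording, the number of polls, and the verdict parsing are all hypothetical choices, not the paper's exact implementation.

```python
from typing import Callable


def chainpoll_score(
    question: str,
    completion: str,
    ask_judge: Callable[[str], str],  # hypothetical: sends a prompt to a judge LLM, returns its reply
    num_polls: int = 5,
) -> float:
    """Fraction of judge polls that flag the completion as hallucinated.

    A score near 1.0 suggests a likely hallucination; near 0.0 suggests the
    completion is likely grounded. Prompt wording and vote parsing here are
    illustrative assumptions.
    """
    prompt = (
        "Does the following answer contain hallucinations, i.e. claims that are "
        "inaccurate or unsupported? Think step by step, then end with a single "
        "line reading 'VERDICT: yes' or 'VERDICT: no'.\n\n"
        f"Question: {question}\n\nAnswer: {completion}"
    )
    votes = 0
    for _ in range(num_polls):
        reply = ask_judge(prompt)
        # Count this poll as a hallucination vote if the judge's final verdict line says "yes".
        lines = reply.strip().splitlines()
        last_line = lines[-1].lower() if lines else ""
        if "yes" in last_line:
            votes += 1
    return votes / num_polls
```

Because such a score is a continuous value in [0, 1], it can be compared against binary hallucination labels with a threshold-free measure such as AUROC, which is the form in which the aggregate 0.781 figure above is reported.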