Company:
Date Published:
Author: Brad Nikkel
Word count: 835
Language: English
Hacker News points: None

Summary

HellaSwag is a large language model (LLM) benchmark introduced by Zellers et al. in 2019 to evaluate commonsense reasoning. The dataset tests commonsense natural language inference (NLI) about everyday physical situations: given a context, a model must pick the most plausible continuation from four candidate endings. Adversarial filtering is used to generate deceptive, machine-written incorrect endings that make the multiple-choice task difficult for models while remaining easy for humans. When HellaSwag was released, state-of-the-art models such as BERT showed weak commonsense reasoning: human accuracy exceeded 95%, while these cutting-edge models scored below 50%. Since its release, HellaSwag has pushed the field to evolve its benchmarks and improve LLM performance.
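
To make the evaluation setup concrete, below is a minimal sketch of how a HellaSwag-style multiple-choice item can be scored with a causal language model: the model assigns a likelihood to each candidate ending, and the highest-scoring ending is taken as its answer. The Hugging Face dataset name, the field names ("ctx", "endings", "label"), and the use of GPT-2 are assumptions for illustration, not details from the article.

```python
# Sketch of HellaSwag-style multiple-choice scoring with a small causal LM.
# Assumes the Hugging Face "hellaswag" dataset schema ("ctx", "endings", "label").
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def ending_log_likelihood(context: str, ending: str) -> float:
    """Sum of token log-probabilities the model assigns to the ending."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict token t+1, so shift and score only the ending tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    ending_start = ctx_ids.shape[1] - 1
    return log_probs[ending_start:].gather(
        1, targets[ending_start:].unsqueeze(1)
    ).sum().item()

dataset = load_dataset("hellaswag", split="validation[:20]")
correct = 0
for example in dataset:
    scores = [ending_log_likelihood(example["ctx"], e) for e in example["endings"]]
    prediction = max(range(len(scores)), key=lambda i: scores[i])
    correct += int(prediction == int(example["label"]))
print(f"Accuracy on 20 validation items: {correct / 20:.2%}")
```

Reported accuracies for HellaSwag are typically computed this way over the full validation split; the 20-item slice here only keeps the sketch quick to run.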