There has been tremendous progress in the world of Large Language Models (LLMs), with blockbuster models like GPT-3, GPT-4, Falcon, MPT, and Llama pushing the state of the art. However, evaluating these models is challenging, not least because of their tendency to hallucinate. To address this, companies are developing evaluation metrics that support data-driven decisions instead of relying solely on human judgment. These include context adherence measures, correctness metrics, log-probability-based metrics, prompt perplexity, and safety checks such as PII, toxicity, tone, sexism, and prompt injection detection. By using these metrics, companies can identify potential issues with their LLMs, optimize their performance, and ensure they are generating high-quality outputs that meet the needs of their users.
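
As a rough illustration of what a log-probability-based metric such as prompt perplexity looks like in practice, the sketch below computes perplexity from per-token log probabilities. The function name and the example values are purely illustrative assumptions, not part of any specific product's API; many model APIs can return per-token log probabilities that would feed a calculation like this.

```python
import math

def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """Perplexity is the exponential of the average negative log probability
    per token; lower values mean the model found the text less surprising."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Illustrative natural-log probabilities for a 4-token prompt.
example_logprobs = [-0.21, -1.35, -0.08, -2.40]
print(f"Prompt perplexity: {perplexity_from_logprobs(example_logprobs):.2f}")
```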