How I Built Deterministic LLM Evaluation Metrics for DeepEval

Company

Confident AI

Date Published

Feb. 9, 2025

Author

Jeffrey Ip

Word count

2335

Language

English

Hacker News points

None

URL

www.confident-ai.com/blog/how-i-built-deterministic-llm-evaluation-metrics-for-deepeval

Summary

The author noticed a divide among DeepEval users: some were happy with the out-of-the-box metrics, while others were not. For the second group, the built-in metrics did not fit their use case and were not deterministic enough, so they ended up writing hundreds of lines of code to tweak evaluation logic. To address this, the author introduced a new metric called DAG (Deep Acyclic Graph), which is structured as an LLM-powered decision tree and gives evaluations both customizability and determinism. The DAG metric breaks the evaluation of an LLM test case into atomic units using four core node types: Task nodes, Binary Judgment nodes, Non-Binary Judgment nodes, and Verdict nodes. Each unit is easy for even smaller models to handle, and users can build DAGs directly within DeepEval. DAGs built this way also benefit from optimized parallel execution, efficient cost management, built-in caching, and error handling. The author concludes that the DAG metric solves the core problem of traditional metrics, a lack of control, and provides a transparent, efficient, and adaptable evaluation process.
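To make the decision-tree idea in the summary concrete, here is a minimal sketch in plain Python. It is not DeepEval's API: the class names only mirror the four node types named above, and `evaluate`, `stub_judge`, and the heading-order example are hypothetical. In a real setup an LLM would answer each question; the stub here is a deterministic stand-in so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# Hypothetical stand-in for an LLM call: (instruction, text) -> answer string.
Judge = Callable[[str, str], str]


@dataclass
class VerdictNode:
    """Leaf node: carries the final score."""
    score: float


@dataclass
class BinaryJudgmentNode:
    """Asks a yes/no question and follows one of two branches."""
    criteria: str
    if_true: "Node"
    if_false: "Node"


@dataclass
class NonBinaryJudgmentNode:
    """Asks a question with several named outcomes."""
    criteria: str
    branches: Dict[str, "Node"]


@dataclass
class TaskNode:
    """Transforms the text (e.g. extraction) before passing it on."""
    instructions: str
    child: "Node"


Node = Union[VerdictNode, BinaryJudgmentNode, NonBinaryJudgmentNode, TaskNode]


def evaluate(node: Node, text: str, judge: Judge) -> float:
    """Walk the graph from the root until a verdict is reached."""
    while not isinstance(node, VerdictNode):
        if isinstance(node, TaskNode):
            text = judge(node.instructions, text)
            node = node.child
        elif isinstance(node, BinaryJudgmentNode):
            node = node.if_true if judge(node.criteria, text) == "yes" else node.if_false
        else:
            node = node.branches[judge(node.criteria, text)]
    return node.score


EXPECTED = ["intro", "body", "conclusion"]


def stub_judge(instruction: str, text: str) -> str:
    """Deterministic replacement for an LLM judge (assumption for this sketch)."""
    if instruction == "extract headings":
        lines = [l for l in text.splitlines() if l.startswith("#")]
        return ",".join(l.strip("# ").lower() for l in lines)
    if instruction == "all headings present?":
        return "yes" if set(text.split(",")) >= set(EXPECTED) else "no"
    if instruction == "heading order":
        return "correct" if text.split(",") == EXPECTED else "shuffled"
    raise ValueError(f"unknown instruction: {instruction}")


# Task -> binary check -> non-binary check -> verdict.
dag = TaskNode(
    "extract headings",
    BinaryJudgmentNode(
        "all headings present?",
        if_true=NonBinaryJudgmentNode(
            "heading order",
            {"correct": VerdictNode(1.0), "shuffled": VerdictNode(0.5)},
        ),
        if_false=VerdictNode(0.0),
    ),
)

print(evaluate(dag, "# Intro\n...\n# Body\n...\n# Conclusion\n...", stub_judge))
```

Each node asks one narrow question, so the path to a score is explicit and repeatable. That is the property the summary attributes to the DAG metric, although this sketch leaves out parallel execution, caching, and error handling.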