Company
Braintrust
Date Published
April 17, 2024
Author
Ankur Goyal
Word count
1002
Language
English
Hacker News points
None

Summary

In AI engineering, establishing a set of real-world examples, known as "evals", is crucial to understanding how changes will impact end users. However, finding great eval data, identifying interesting cases in production, and tying user feedback back to evals are challenging problems. Connecting real-world log data to evals addresses these issues: it lets teams evaluate new and interesting cases found in the wild, drive improvements, and avoid regressions. This is achieved by structuring evals as a function of data, prompts/code, and scoring functions, using tools like Braintrust's Eval function to streamline the process. By capturing and reusing logs, teams can power their evals with real-world examples, making it easier to identify interesting cases and improve AI products. As teams scale, filtering logs down to the most interesting ones becomes critical; applying filters, tracking user feedback, or running online scores can surface test cases that need improvement. Braintrust's solution provides a unified UI for exploring logs and evals, automates code reuse, and stores datasets in a cloud environment.
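To illustrate the "data, prompts/code, and scoring functions" structure described above, here is a minimal sketch using Braintrust's Python SDK and the autoevals scoring library. The project name, example records, and trivial task are hypothetical placeholders; in practice the data would come from captured production logs rather than an inline list.

```python
from braintrust import Eval
from autoevals import Levenshtein


def task(input: str) -> str:
    # Stand-in for the prompt/code under test (e.g. an LLM call).
    return "Hi " + input


Eval(
    "Say Hi Bot",  # hypothetical project name
    # Data: real-world examples; swap this inline list for a dataset built from logs.
    data=lambda: [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hi Bar"},
    ],
    # Task: the function being evaluated.
    task=task,
    # Scores: functions that grade each output against the expected value.
    scores=[Levenshtein],
)
```

Because the eval is just a function of these three inputs, pointing the data argument at a dataset of interesting production cases is what lets real-world usage drive improvements and catch regressions.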