In March 2024, the team shipped several improvements to their evaluation system: the ability to repeat LLM generations and evaluations multiple times, versioning for custom evaluators, improved support for large test sets, a test case search feature, and an integration with Haystack. The release also adds support for new models such as Mistral 7B and Mixtral, improved model comparison pages, and enhanced global evaluator assignment. Together, these updates are intended to give users more confidence in their evaluation results and better tools for working with large datasets, making the evaluation system more robust and user-friendly overall.
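
The value of repeating generations and evaluations is easiest to see with a small sketch. The snippet below is a generic illustration only, not the product's actual API: `generate`, `evaluate`, and `repeated_evaluation` are hypothetical names standing in for a model call and an evaluator. Because LLM sampling is stochastic, aggregating scores over several repeats yields a more trustworthy estimate than any single run.

```python
import statistics
from typing import Callable, Dict, List


def repeated_evaluation(
    generate: Callable[[str], str],          # hypothetical model-call function
    evaluate: Callable[[str, str], float],   # hypothetical evaluator returning a score
    prompt: str,
    num_repeats: int = 5,
) -> Dict[str, object]:
    """Run generation and evaluation num_repeats times and aggregate the scores."""
    scores: List[float] = []
    for _ in range(num_repeats):
        output = generate(prompt)                 # sample a fresh generation each time
        scores.append(evaluate(prompt, output))   # score this particular generation
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
    }


if __name__ == "__main__":
    # Stub functions so the sketch runs standalone; real code would call an LLM and an evaluator.
    import random

    def fake_generate(prompt: str) -> str:
        return prompt + " answer"

    def fake_evaluate(prompt: str, output: str) -> float:
        return random.uniform(0.6, 0.9)

    print(repeated_evaluation(fake_generate, fake_evaluate, "What is 2 + 2?"))
```

Reporting the mean alongside the standard deviation is what makes the repeats useful: a low spread across runs indicates the evaluation result is stable rather than an artifact of one lucky or unlucky sample.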