Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5
The article highlights the importance of reliable model evaluation in MLOps and LLMOps, particularly for prompt selection with large language models (LLMs). It demonstrates that relying solely on observed test accuracy can lead to suboptimal prompt choices when the test annotations are noisy. Using a binary classification variant of the Stanford Politeness Dataset, the study finds that the FLAN-T5 LLM performs better with certain prompts when assessed on cleaner test data, which more closely reflects performance in actual deployment. It emphasizes the need for high-quality evaluation data and suggests using software like Cleanlab to verify label quality before making critical decisions based on observed test accuracy.
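The workflow the article describes can be illustrated with a minimal sketch: compare candidate prompts for FLAN-T5 by observed test accuracy, then score the test labels themselves before trusting that accuracy. The prompt templates, the tiny in-line dataset, and the hard 0/1 probability proxy below are illustrative assumptions, not the authors' code; only the Hugging Face `text2text-generation` pipeline and cleanlab's `get_label_quality_scores` are real APIs.

```python
# Sketch (assumed setup, not the article's exact code): prompt selection by test
# accuracy for FLAN-T5, plus a label-quality check on the test annotations.
import numpy as np
from transformers import pipeline
from cleanlab.rank import get_label_quality_scores

# Hypothetical politeness test set (1 = polite, 0 = impolite); the article uses
# a binary variant of the Stanford Politeness Dataset instead.
texts = [
    "Could you please take a look when you have a moment?",
    "Fix this now.",
]
labels = np.array([1, 0])

generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Candidate prompt templates to compare (illustrative, not from the article).
prompts = {
    "A": "Is the following text polite? Answer yes or no.\nText: {text}",
    "B": "Politeness check. Respond 'yes' if polite, 'no' otherwise.\n{text}",
}

def predict(template: str, text: str) -> int:
    """Run FLAN-T5 on a filled-in prompt and map its answer to a binary label."""
    out = generator(template.format(text=text), max_new_tokens=5)[0]["generated_text"]
    return 1 if "yes" in out.lower() else 0

# Observed test accuracy per prompt -- the (potentially misleading) selection
# criterion when the test labels are noisy.
for name, template in prompts.items():
    preds = np.array([predict(template, t) for t in texts])
    print(f"Prompt {name} observed accuracy: {(preds == labels).mean():.2f}")

# Before trusting that accuracy, score the quality of the test labels.
# get_label_quality_scores expects per-class predicted probabilities; here a
# crude hard 0/1 proxy built from one prompt's predictions stands in for them.
preds = np.array([predict(prompts["A"], t) for t in texts])
pred_probs = np.stack([1 - preds, preds], axis=1).astype(float)
quality = get_label_quality_scores(labels, pred_probs)
print("Test examples to review first (lowest label quality):", np.argsort(quality)[:5])
```

Low-scoring test examples are candidates for re-annotation; re-measuring accuracy on the cleaned test set is what the article argues should drive the final prompt choice.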
Company: Cleanlab
Date published: June 29, 2023
Author(s): Chris Mauck, Jonas Mueller
Word count: 1366
Language: English
Hacker News points: 66