Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5
The article highlights the importance of reliable model evaluation in MLOps and LLMOps, particularly for prompt selection with large language models (LLMs). It demonstrates that relying solely on observed test accuracy can lead to suboptimal prompt choices when the test annotations are noisy. Using a binary classification variant of the Stanford Politeness Dataset, the study finds that the FLAN-T5 LLM performs better with certain prompts when assessed on cleaner test data, which more closely reflects performance in actual deployment. It emphasizes the need for high-quality evaluation data and suggests using software like Cleanlab to verify label quality before making critical decisions based on observed test accuracy.
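The workflow the article describes can be illustrated with a minimal sketch: compare candidate prompts for FLAN-T5 by observed test accuracy, then score the test labels themselves before trusting that accuracy. The prompt templates, the tiny in-line dataset, and the hard 0/1 probability proxy below are illustrative assumptions, not the authors' code; only the Hugging Face `text2text-generation` pipeline and cleanlab's `get_label_quality_scores` are real APIs.

```python
# Sketch (assumed setup, not the article's exact code): prompt selection by test
# accuracy for FLAN-T5, plus a label-quality check on the test annotations.
import numpy as np
from transformers import pipeline
from cleanlab.rank import get_label_quality_scores

# Hypothetical politeness test set (1 = polite, 0 = impolite); the article uses
# a binary variant of the Stanford Politeness Dataset instead.
texts = [
    "Could you please take a look when you have a moment?",
    "Fix this now.",
]
labels = np.array([1, 0])

generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Candidate prompt templates to compare (illustrative, not from the article).
prompts = {
    "A": "Is the following text polite? Answer yes or no.\nText: {text}",
    "B": "Politeness check. Respond 'yes' if polite, 'no' otherwise.\n{text}",
}

def predict(template: str, text: str) -> int:
    """Run FLAN-T5 on a filled-in prompt and map its answer to a binary label."""
    out = generator(template.format(text=text), max_new_tokens=5)[0]["generated_text"]
    return 1 if "yes" in out.lower() else 0

# Observed test accuracy per prompt -- the (potentially misleading) selection
# criterion when the test labels are noisy.
for name, template in prompts.items():
    preds = np.array([predict(template, t) for t in texts])
    print(f"Prompt {name} observed accuracy: {(preds == labels).mean():.2f}")

# Before trusting that accuracy, score the quality of the test labels.
# get_label_quality_scores expects per-class predicted probabilities; here a
# crude hard 0/1 proxy built from one prompt's predictions stands in for them.
preds = np.array([predict(prompts["A"], t) for t in texts])
pred_probs = np.stack([1 - preds, preds], axis=1).astype(float)
quality = get_label_quality_scores(labels, pred_probs)
print("Test examples to review first (lowest label quality):", np.argsort(quality)[:5])
```

Low-scoring test examples are candidates for re-annotation; re-measuring accuracy on the cleaned test set is what the article argues should drive the final prompt choice.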
Company: Cleanlab
Date published: June 29, 2023
Author(s): Chris Mauck, Jonas Mueller
Word count: 1366
Language: English
Hacker News points: 66