In this post, we explore preference tuning of LLMs through a practical case study on summarization, using Ray and Anyscale as our compute platform. We applied Direct Preference Optimization (DPO) to the Mistral-7B-Instruct-v0.1 model to produce high-quality summaries of CNN articles. Our results show that DPO is effective for domains such as summarization, where there is no single ground-truth response, and that it can achieve much higher win rates than supervised fine-tuning or prompting GPT-4o. We also found that both β and the learning rate are critical to performance and may require a thorough hyperparameter search. Finally, we demonstrate that regenerating the preference training data with the newly tuned model and applying additional rounds of DPO yields further gains in performance.
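For readers less familiar with DPO, β is the coefficient in the DPO objective (Rafailov et al., 2023) that trades off the preference margin against drift from the frozen reference model, which is why it interacts so strongly with the learning rate. Below is a minimal, illustrative sketch of that loss; the function and argument names are ours and are not taken from the training code used in this post.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected summary under the policy being tuned or the frozen reference
    model. `beta` scales the implicit reward: larger values keep the tuned
    policy closer to the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```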