In this post, we explore preference tuning of LLMs through a practical case study on summarization, using Ray and Anyscale as our compute platform. We applied Direct Preference Optimization (DPO) to the Mistral-7B-Instruct-v0.1 model to produce high-quality summaries of CNN articles. Our results show that DPO is effective for domains such as summarization, where there is no single ground-truth response, and that it can achieve much higher win rates than supervised fine-tuning or prompting GPT-4o. We also found that both β and the learning rate are critical to performance and may require a thorough hyperparameter search. Finally, we demonstrate that regenerating the preference training data with the newly tuned model and applying additional rounds of DPO yields further gains in performance.
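For readers less familiar with DPO, β is the coefficient in the DPO objective (Rafailov et al., 2023) that trades off the preference margin against drift from the frozen reference model, which is why it interacts so strongly with the learning rate. Below is a minimal, illustrative sketch of that loss; the function and argument names are ours and are not taken from the training code used in this post.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected summary under the policy being tuned or the frozen reference
    model. `beta` scales the implicit reward: larger values keep the tuned
    policy closer to the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```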