Company
OpenPipe
Date Published
Oct. 1, 2024
Author
Kyle Corbitt
Word count
740
Language
English
Hacker News points
1

Summary

OpenPipe has added support for Direct Preference Optimization (DPO), letting users align models more closely with their specific requirements. DPO is an advanced fine-tuning method that lets a model learn directly from preference data, making it useful whenever users have a source of preference pairs to draw on. The technique is particularly effective when combined with user-defined criteria, and initial tests have shown promising results, such as a 77% reduction in responses exceeding word limits and a 76% reduction in hallucinated information. To get started with DPO on OpenPipe, users prepare their preference data, upload it to the platform, select the DPO option when configuring a fine-tuning job, and launch the training run. The company is also working on integrating DPO into an online learning workflow to enable continual learning.
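
As a rough illustration of the "prepare your preference data" step, the sketch below assembles preference pairs into a JSONL file. The field names ("prompt", "chosen", "rejected") follow the common DPO convention and the example content is invented; the exact schema OpenPipe expects may differ, so treat this as an assumption rather than the platform's documented format.

```python
import json

# Hypothetical preference pairs: for each prompt, a preferred ("chosen")
# response and a dispreferred ("rejected") one. In practice these would
# come from user feedback, evaluations, or user-defined criteria.
preference_pairs = [
    {
        "prompt": "Summarize this support ticket in under 50 words.",
        "chosen": "Customer reports a billing error on their March invoice and asks for a corrected copy.",
        "rejected": "The customer, who has been with us since 2019, wrote a long message describing in detail...",
    },
]

# Write one JSON object per line, the usual JSONL layout for fine-tuning uploads.
with open("preference_data.jsonl", "w") as f:
    for pair in preference_pairs:
        f.write(json.dumps(pair) + "\n")
```

From there, the resulting file would be uploaded to OpenPipe and the DPO option selected when configuring the fine-tuning job, per the steps in the summary above.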