Company:
Date Published:
Author: Reid Mayo
Word count: 2048
Language: English
Hacker News points: 3

Summary

Fine-tuning large language models (LLMs) requires high-quality training data to achieve optimal performance. The ideal training data matches the real-world input domain as closely as possible, with adequate coverage of the tasks and inputs the model will see. Ways to collect such data include leveraging pre-existing production application logs, using OpenPipe's SDK for automatic logging, or pulling relevant data from databases. Human-generated data can be effective when produced by domain experts, but manual creation by non-specialists is not recommended due to quality issues. Synthetic data on the input side tends to cause model quality issues, whereas synthetic data on the output side can work well when paired with human review and patching. Agentic approaches, such as OpenPipe's Mixture-of-Agents technology, can generate high-quality outputs that exceed those of SOTA models. The amount of training data required varies with task complexity and base model size, but as a general rule, larger base models need less training data to reach improved performance.
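To make the log-collection idea concrete, here is a minimal sketch of capturing production request/response pairs as JSONL fine-tuning examples. This is not the OpenPipe SDK; the function name, record shape (OpenAI-style chat `messages`), and file path are illustrative assumptions.

```python
import json

def log_training_example(messages, completion, path="training_data.jsonl"):
    """Append one production request/response pair as a JSONL
    fine-tuning example (hypothetical helper, not the OpenPipe SDK).

    messages:   list of {"role": ..., "content": ...} dicts sent to the model
    completion: the assistant's response text observed in production
    """
    record = {
        "messages": messages + [{"role": "assistant", "content": completion}]
    }
    # Append one JSON object per line so the file stays valid JSONL.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: log a single production interaction.
log_training_example(
    [{"role": "user", "content": "Summarize this ticket: printer offline"}],
    "The user's printer is offline; suggest checking the network connection.",
)
```

Because every record already matches the chat format most fine-tuning APIs expect, the accumulated file can later be filtered (e.g. by human review) and uploaded as-is.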