This post discusses using reinforcement learning from human feedback (RLHF) to improve how well a Large Language Model (LLM) predicts the upvote counts of Hacker News (HN) stories. The author, Kyle Corbitt, founder of OpenPipe, explains how they built a reward model that predicts a story's upvote count from its title, URL, date, and content. The model is trained on a dataset of 114K HN stories with their corresponding upvote counts, and training takes around 1.5 hours on an H100 GPU at a cost of about $4.05. The model achieves a root mean-square error (RMSE) of 1.11; since scores are modeled in log space, this means predictions are typically off by a factor of about e^1.11 ≈ 3. The author then runs the model against the entire corpus of HN stories and finds that it consistently overestimates scores at the low end and underestimates them at the high end. Despite this, the model surfaces some great HN stories and offers interesting insights into what makes a story successful on HN. The author concludes that RLHF provides a powerful set of techniques for improving post quality, which they will cover in the next post in the series.
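
To make the RMSE figure concrete, here is a minimal sketch (with hypothetical numbers, not taken from the post) of how a log-space RMSE can be computed and read as a multiplicative error factor on the raw upvote scale:

```python
import numpy as np

def log_rmse(predicted_upvotes: np.ndarray, actual_upvotes: np.ndarray) -> float:
    """Root mean-square error between predicted and actual scores, measured in log space."""
    errors = np.log(predicted_upvotes) - np.log(actual_upvotes)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy data for illustration only (not from the post): a model that is roughly 3x off.
actual = np.array([10, 50, 200, 500])
predicted = np.array([30, 17, 600, 160])

rmse = log_rmse(predicted, actual)
print(f"log-space RMSE: {rmse:.2f}")                  # ~1.1 on this toy data
print(f"typical error factor: {np.exp(rmse):.1f}x")   # e^RMSE ≈ 3
```

Under this reading, an RMSE of 1.11 means a story the model predicts at 100 points might plausibly land anywhere from roughly 33 to 300 points.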