Company: OpenPipe
Date Published:
Author: Kyle Corbitt
Word count: 2044
Language: English
Hacker News points: 217

Summary

This post discusses using reinforcement learning from human feedback (RLHF) to improve the performance of a Large Language Model (LLM) at predicting the upvote count of Hacker News (HN) stories. The author, Kyle Corbitt, founder of OpenPipe, explains how they built a reward model that predicts a story's upvote count from its title, URL, date, and content. The model is trained on a dataset of 114K HN stories with their corresponding upvote counts, and the training run takes around 1.5 hours on an H100 GPU for $4.05. The model achieves a root mean-square error (RMSE) of 1.11 in log space, meaning its predictions are typically off by a multiplicative factor of about e^1.11 ≈ 3. The author then runs the model against the entire corpus of HN stories and finds that it consistently over-estimates scores at the low end and under-estimates them at the high end. Despite this, the model surfaces some great HN stories and provides interesting insights into what makes a story successful on HN. The author concludes that RLHF gives them a powerful set of techniques for improving post quality, which they will cover in the next post in the series.
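The "RMSE of 1.11 translates to a factor of about 3" claim follows from measuring error on log-transformed upvote counts. The sketch below illustrates that arithmetic; the upvote and prediction values are made up for illustration, and the assumption that the reward model is scored on log(upvotes) is inferred from the e^1.11 interpretation in the summary, not taken from the original code.

```python
import numpy as np

# Hypothetical values for illustration only; the real dataset is the
# 114K-story HN corpus described in the post.
actual_upvotes = np.array([3, 17, 250, 1, 42])
predicted_upvotes = np.array([5, 9, 90, 2, 60])

# Assume the reward model is evaluated in log space, so the error for
# each story is the difference of log scores.
log_errors = np.log(predicted_upvotes) - np.log(actual_upvotes)
rmse = np.sqrt(np.mean(log_errors ** 2))

# A log-space RMSE of r means predictions are typically off by a
# multiplicative factor of about e^r; the post reports r = 1.11,
# i.e. predictions within roughly a 3x factor of the true score.
typical_factor = np.exp(rmse)
print(f"log-space RMSE: {rmse:.2f}, typical error factor: {typical_factor:.1f}x")
```

Under this reading, an RMSE of 1.11 does not mean "off by 1 upvote"; it means a story predicted at 30 points might plausibly land anywhere from roughly 10 to 90.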