This post discusses using reinforcement learning from human feedback (RLHF) to improve how well a Large Language Model (LLM) predicts the upvote counts of Hacker News (HN) stories. The author, Kyle Corbitt, founder of OpenPipe, explains how they built a reward model that predicts a story's upvote count from its title, URL, date, and content. The model is trained on a dataset of 114K HN stories with their corresponding upvote counts, and training takes around 1.5 hours on an H100 GPU at a cost of about $4.05. The model achieves a root mean-square error (RMSE) of 1.11; since scores are modeled in log space, this means predictions are typically off by a factor of about e^1.11 ≈ 3. The author then runs the model against the entire corpus of HN stories and finds that it consistently overestimates scores at the low end and underestimates them at the high end. Despite this, the model surfaces some great HN stories and offers interesting insights into what makes a story successful on HN. The author concludes that RLHF provides a powerful set of techniques for improving post quality, which they will cover in the next post in the series.
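
To make the RMSE figure concrete, here is a minimal sketch (with hypothetical numbers, not taken from the post) of how a log-space RMSE can be computed and read as a multiplicative error factor on the raw upvote scale:

```python
import numpy as np

def log_rmse(predicted_upvotes: np.ndarray, actual_upvotes: np.ndarray) -> float:
    """Root mean-square error between predicted and actual scores, measured in log space."""
    errors = np.log(predicted_upvotes) - np.log(actual_upvotes)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy data for illustration only (not from the post): a model that is roughly 3x off.
actual = np.array([10, 50, 200, 500])
predicted = np.array([30, 17, 600, 160])

rmse = log_rmse(predicted, actual)
print(f"log-space RMSE: {rmse:.2f}")                  # ~1.1 on this toy data
print(f"typical error factor: {np.exp(rmse):.1f}x")   # e^RMSE ≈ 3
```

Under this reading, an RMSE of 1.11 means a story the model predicts at 100 points might plausibly land anywhere from roughly 33 to 300 points.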