Company: Anyscale
Date Published:
Author: Justin Olsson, Waleed Kadous
Word count: 1195
Language: English
Hacker News points: 1

Summary

The results of a blind test conducted by Anyscale showed that human legislative interns slightly outperformed Llama 2 70b when summarizing bills, while GPT-4 significantly outperformed both. The test involved generating summaries for 28 legal bills from the BillSum dataset and scoring them on a scale of 1 to 5. The analysis suggested that GPT-4's superior performance stemmed from its ability to infer what the user wanted, as well as its training on legislation, which may have exposed it to relevant information not available to Llama 2 70b. To improve Llama 2 70b's performance, Anyscale modified its prompt to focus on the most important aspects of the bill and to describe the bill using active verbs, mirroring GPT-4's approach. The tweaked prompt produced improved summaries with more "summary-like" features, suggesting that a more powerful model like GPT-4 can be used to refine prompts for less capable models.
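The prompt-tweaking step described above can be sketched as a simple prompt template. The exact wording Anyscale used is not given in this summary, so the instruction text and the `build_prompt` helper below are illustrative assumptions, not the actual prompt:

```python
# Hypothetical sketch of a tweaked summarization prompt in the spirit of
# the approach described above: focus on key provisions, use active verbs.
# The instruction wording and function name are assumptions for illustration.

def build_prompt(bill_text: str) -> str:
    """Build a summarization prompt emphasizing key points and active verbs."""
    instructions = (
        "Summarize the following legislative bill. "
        "Focus on the most important provisions, and describe what the bill "
        "does using active verbs (e.g. 'requires', 'establishes', 'amends')."
    )
    return f"{instructions}\n\nBill text:\n{bill_text}\n\nSummary:"

# The resulting string would then be sent to the model (e.g. Llama 2 70b).
prompt = build_prompt("A bill to require annual reporting of ...")
```

The idea is that steering a weaker model with a more specific prompt, informed by what a stronger model does by default, can recover some of the quality gap.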