Company: Anyscale
Date Published:
Author: Justin Olsson, Waleed Kadous
Word count: 1195
Language: English
Hacker News points: 1

Summary

The results of a blind test conducted by Anyscale showed that human legislative interns slightly outperformed Llama 2 70b when summarizing bills, while GPT-4 significantly outperformed both. The test involved generating summaries for 28 legal bills from the BillSum dataset and scoring them on a scale of 1 to 5. The analysis suggested that GPT-4's superior performance stemmed from its ability to infer what the user wanted, as well as its training on legislation, which may have exposed it to relevant information not available to Llama 2 70b. To improve Llama 2 70b's performance, Anyscale modified its prompt to focus on the most important aspects of the bill and to describe the bill using active verbs, mirroring GPT-4's approach. The tweaked prompt produced improved summaries with more "summary-like" features, suggesting that a more powerful model like GPT-4 can be used to refine prompts for less capable models.
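The prompt-tweaking step described above can be sketched as a simple prompt template. The exact wording Anyscale used is not given in this summary, so the instruction text and the `build_prompt` helper below are illustrative assumptions, not the actual prompt:

```python
# Hypothetical sketch of a tweaked summarization prompt in the spirit of
# the approach described above: focus on key provisions, use active verbs.
# The instruction wording and function name are assumptions for illustration.

def build_prompt(bill_text: str) -> str:
    """Build a summarization prompt emphasizing key points and active verbs."""
    instructions = (
        "Summarize the following legislative bill. "
        "Focus on the most important provisions, and describe what the bill "
        "does using active verbs (e.g. 'requires', 'establishes', 'amends')."
    )
    return f"{instructions}\n\nBill text:\n{bill_text}\n\nSummary:"

# The resulting string would then be sent to the model (e.g. Llama 2 70b).
prompt = build_prompt("A bill to require annual reporting of ...")
```

The idea is that steering a weaker model with a more specific prompt, informed by what a stronger model does by default, can recover some of the quality gap.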