Llama 3.1: Same model, different results. The impact of a percentage point.

Company

Together AI

Date Published

July 31, 2024

Author

Together AI

Word count

5632

Language

English

Hacker News points

None

URL

www.together.ai/blog/llama-31-quality

Summary

Llama 3.1, an open model that rivals top models, has sparked discussion on Twitter about how providers differ in their implementation decisions, optimizations, and quality testing processes. A quick evaluation of Llama-3.1-405B showed significant variation across inference services: some providers ranked high on GSM8K, while others struggled on benchmarks such as AlpacaEval 2.0. These differences can be substantial, since a single percentage point can decide whether an application task succeeds or fails. To address this, Together AI has developed a five-step quality testing approach: reference matching, perplexity, analytic capability testing, generative capability testing, and qualitative testing. Its flagship implementation, Together Turbo, currently uses FP8 quantization and delivers faster performance at lower cost, with near-negligible quality differences from the reference implementation.
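
The first two steps of the testing approach, reference matching and perplexity, can be illustrated with a minimal Python sketch. The summary does not describe how Together AI computes either metric, so everything here is an assumption made for illustration: the function names `perplexity` and `match_rate` are hypothetical, perplexity uses the standard definition (the exponential of the negative mean per-token log-probability), and reference matching is approximated as position-wise token agreement between a reference output and a candidate output.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability.

    `token_logprobs` holds one log-probability per token, as many inference
    APIs can return. Lower values mean the model found the text less surprising.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Hypothetical reference-matching score (an assumption, not the source's metric).

    Returns the fraction of positions where the candidate token ID equals the
    reference token ID, normalized by the longer sequence so that extra or
    missing tokens count as mismatches.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0  # two empty outputs agree trivially
    agreements = sum(r == c for r, c in zip(reference, candidate))
    return agreements / longest


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    logprobs = [math.log(0.5)] * 4
    print(perplexity(logprobs))
    print(match_rate([1, 2, 3, 4], [1, 2, 9, 4]))
```

With every token assigned probability 0.5, the perplexity comes out at approximately 2, and one mismatch in four positions gives a match rate of 0.75. Comparing these two numbers between a reference implementation and an optimized one (for example, an FP8-quantized deployment) is one plausible way to quantify the kind of percentage-point gap the summary describes.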