Comparing tokens per second (TPS) across Large Language Models (LLMs) is essential for evaluating inference performance, but raw TPS figures can be misleading. Tokenizers differ widely in efficiency: given the same human-readable input, one model's tokenizer may produce far fewer tokens than another's, and some tokenizers are better suited to particular content such as code or prose. When comparing two LLMs, TPS metrics should therefore be adjusted for each model's tokenizer so the comparison reflects how much useful text is actually generated per second, not just how many tokens are emitted. This adjustment is what makes estimates of latency, throughput, and cost meaningful when switching between open source models, because it grounds the calculation in real-world usage and in the relative value of each generated token. With tokenizer-adjusted numbers, developers can set accurate performance targets and optimize their deployments accordingly.
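As a minimal sketch of one way to do this adjustment, the snippet below converts raw TPS into characters per second using the number of tokens each model's tokenizer produces for the same representative corpus. The corpus size, token counts, and TPS values are hypothetical placeholders, and characters per second is just one possible normalization (words per second, or tokens under a single reference tokenizer, would work the same way).

```python
def normalized_throughput(tps: float, tokens_for_corpus: int, chars_in_corpus: int) -> float:
    """Convert raw tokens/sec into chars/sec so models with different
    tokenizers can be compared on how much text they actually produce."""
    chars_per_token = chars_in_corpus / tokens_for_corpus
    return tps * chars_per_token

# Hypothetical measurements on the same 100,000-character reference corpus:
# Model A's tokenizer splits it into 25,000 tokens; Model B's into 32,000.
corpus_chars = 100_000
model_a = normalized_throughput(tps=80.0, tokens_for_corpus=25_000, chars_in_corpus=corpus_chars)
model_b = normalized_throughput(tps=95.0, tokens_for_corpus=32_000, chars_in_corpus=corpus_chars)

print(f"Model A: {model_a:.0f} chars/sec")  # 80 * 4.0   = 320 chars/sec
print(f"Model B: {model_b:.0f} chars/sec")  # 95 * 3.125 ~= 297 chars/sec
```

In this illustrative case, Model B reports a higher raw TPS, but Model A delivers more text per second because its tokenizer is more efficient on the chosen corpus, which is exactly the kind of difference a naive TPS comparison would hide.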