Company: Anyscale
Date Published:
Author: Cade Daniel, Chen Shen, Eric Liang, Richard Liaw
Word count: 3568
Language: English
Hacker News points: 110

Summary

Large language models (LLMs) dominate compute cost for most real-world applications due to their large GPU memory footprint and heavy compute requirements. Traditional static batching policies are inefficient, however, especially when sequence lengths vary widely across requests. Continuous batching, also known as dynamic batching or iteration-level scheduling, is proposed as a solution. By leveraging vLLM, users can achieve up to 23x higher LLM inference throughput while also reducing p50 latency. Continuous batching improves memory efficiency by allocating GPU memory dynamically and reducing waste, and it outperforms traditional static batching on both throughput and latency. The gains are largest when sequence lengths have high variance, and the performance gap widens further when continuous batching is combined with advanced memory management techniques such as PagedAttention. Continuous batching has been implemented in vLLM, Hugging Face's text-generation-inference, and Ray Serve, demonstrating its potential to significantly improve LLM inference efficiency.
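
To make the idea concrete, below is a minimal, hypothetical sketch of an iteration-level scheduling loop in Python. It is not vLLM's actual scheduler: MAX_BATCH_SIZE, forward_step, and is_finished are assumed stand-ins for the number of sequence slots that fit in GPU memory, a single-token decode step over the active batch, and a stopping check (end-of-sequence token or maximum length).

# Toy illustration of continuous (iteration-level) batching.
# Not vLLM internals; all names below are placeholders for this sketch.
from collections import deque

MAX_BATCH_SIZE = 8  # assumed GPU slot capacity

def serve_with_continuous_batching(waiting, forward_step, is_finished):
    """waiting: deque of incoming requests; forward_step(batch): runs one
    decode iteration, appending one token to every active sequence;
    is_finished(seq): True when a sequence hits EOS or its max length."""
    active, finished = [], []
    while waiting or active:
        # Iteration-level scheduling: admit new requests the moment a slot
        # frees up, instead of waiting for the whole batch to drain as
        # static batching would.
        while waiting and len(active) < MAX_BATCH_SIZE:
            active.append(waiting.popleft())

        # One iteration: every active sequence generates one more token.
        forward_step(active)

        # Retire finished sequences immediately; their slots become
        # available on the very next iteration.
        still_active = []
        for seq in active:
            if is_finished(seq):
                finished.append(seq)
            else:
                still_active.append(seq)
        active = still_active
    return finished

# Example usage with trivial stand-in request objects:
# requests = deque({"prompt": p, "tokens": []} for p in prompts)
# serve_with_continuous_batching(requests, my_decode_step, my_stop_check)

In vLLM, a loop of this shape is paired with PagedAttention, which allocates KV-cache memory in small blocks so that memory freed by finished sequences can be reused immediately rather than sitting idle until the whole batch completes.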