
Speculative decoding for high-throughput long-context inference

What's this blog post about?

This post reevaluates speculative decoding for high-throughput long-context inference and shows that, contrary to conventional wisdom, it can improve both throughput and latency in this regime. The key observation is that as sequence lengths and batch sizes grow, decoding shifts from being compute-bound to memory-bound, dominated by loading the KV cache, which makes speculative decoding increasingly effective. The post proposes two algorithmic innovations that exploit this shift: MagicDec, which gives the draft model a fixed context window so drafting stays fast regardless of sequence length, and adaptive Sequoia trees, which adaptively choose the speculation tree size that maximizes speedup. Together these techniques deliver speedups of up to 2x for LLaMA-2-7B-32K and 1.84x for LLaMA-3.1-8B on 8 A100 GPUs, making speculative decoding an important component of throughput-oriented serving systems for long-context workloads.
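
To make the mechanics concrete, below is a minimal, self-contained sketch of the idea behind MagicDec, not the authors' implementation: a standard speculative-sampling loop in which the draft model attends only to a fixed window of the most recent tokens, so drafting cost stays constant as the context grows. The toy_target and toy_draft functions are hypothetical stand-ins for real language models, and VOCAB, WINDOW, and GAMMA are illustrative parameters.

import numpy as np

VOCAB, WINDOW, GAMMA = 32, 8, 4   # toy vocab size, draft KV window, tokens drafted per step
rng = np.random.default_rng(0)

def toy_target(ctx):
    # Next-token distribution conditioned on the FULL context (expensive in practice).
    r = np.random.default_rng(hash(tuple(ctx)) % (2**32))
    p = r.random(VOCAB)
    return p / p.sum()

def toy_draft(ctx):
    # Next-token distribution conditioned on only the last WINDOW tokens, so its
    # cost does not grow with context length (the MagicDec idea, in toy form).
    r = np.random.default_rng(hash(tuple(ctx[-WINDOW:])) % (2**32))
    q = r.random(VOCAB)
    return q / q.sum()

def speculative_step(ctx):
    # Draft GAMMA tokens cheaply, then verify them with the target model.
    drafted, q_dists = [], []
    for _ in range(GAMMA):
        q = toy_draft(ctx + drafted)
        drafted.append(int(rng.choice(VOCAB, p=q)))
        q_dists.append(q)
    # Verification: in a real system the target scores all drafted positions
    # in ONE forward pass; we loop here purely for clarity.
    accepted = []
    for tok, q in zip(drafted, q_dists):
        p = toy_target(ctx + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                      # draft token accepted
        else:
            residual = np.maximum(p - q, 0.0)         # rejected: resample from residual
            residual = residual / residual.sum() if residual.sum() > 0 else p
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted
    # Every draft token accepted: the target's last pass yields one bonus token.
    accepted.append(int(rng.choice(VOCAB, p=toy_target(ctx + accepted))))
    return accepted

ctx = [int(t) for t in rng.integers(0, VOCAB, size=100)]  # stand-in for a long prompt
print(speculative_step(ctx))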
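
And here is a back-of-the-envelope sketch of the selection problem that adaptive Sequoia trees address, simplified from a full speculation tree to a single draft chain: given a per-token acceptance rate and measured draft/verify costs, pick the speculation length that maximizes tokens per second. The values of alpha, t_draft, and t_verify are assumed measurements chosen for illustration, not numbers from the post.

def expected_tokens(alpha: float, gamma: int) -> float:
    # Expected tokens generated per step when each draft token is accepted
    # independently with probability alpha (standard result, assumes alpha < 1):
    # 1 + alpha + alpha^2 + ... + alpha^gamma.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def tokens_per_second(alpha: float, gamma: int, t_draft: float, t_verify: float) -> float:
    # Decoding speed = expected accepted tokens / wall-clock time of one step.
    return expected_tokens(alpha, gamma) / (gamma * t_draft + t_verify)

alpha, t_draft, t_verify = 0.8, 0.002, 0.030   # assumed: 2 ms per draft token, 30 ms verify
best = max(range(1, 17), key=lambda g: tokens_per_second(alpha, g, t_draft, t_verify))
baseline = 1.0 / t_verify                      # plain autoregressive decoding speed
print(f"best draft length: {best}, "
      f"speedup: {tokens_per_second(alpha, best, t_draft, t_verify) / baseline:.2f}x")

The same logic extends to trees: a larger speculation tree raises the expected number of accepted tokens per step but also raises draft and verification cost, and the optimal size depends on batch size and sequence length, which is why the post argues for choosing it adaptively.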

Company
Together AI

Date published
Sept. 5, 2024

Author(s)
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Yunho Jin, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Beidi Chen

Word count
2002

Language
English

Hacker News points
2

