
Speculative decoding for high-throughput long-context inference

What's this blog post about?

This post reevaluates speculative decoding for high-throughput long-context inference and shows that, contrary to conventional wisdom, it can improve both throughput and latency in this regime. The key observation is that as sequence lengths and batch sizes grow, decoding shifts from being compute-bound to memory-bound, dominated by loading the KV cache, which makes speculative decoding increasingly effective. The post proposes two algorithmic innovations that exploit this shift: MagicDec, which gives the draft model a fixed context window so drafting stays fast regardless of sequence length, and adaptive Sequoia trees, which adaptively choose the speculation tree size that maximizes speedup. Together these techniques deliver speedups of up to 2x for LLaMA-2-7B-32K and 1.84x for LLaMA-3.1-8B on 8 A100 GPUs, making speculative decoding an important component of throughput-oriented serving systems for long-context workloads.
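
To make the mechanics concrete, below is a minimal, self-contained sketch of the idea behind MagicDec, not the authors' implementation: a standard speculative-sampling loop in which the draft model attends only to a fixed window of the most recent tokens, so drafting cost stays constant as the context grows. The toy_target and toy_draft functions are hypothetical stand-ins for real language models, and VOCAB, WINDOW, and GAMMA are illustrative parameters.

import numpy as np

VOCAB, WINDOW, GAMMA = 32, 8, 4   # toy vocab size, draft KV window, tokens drafted per step
rng = np.random.default_rng(0)

def toy_target(ctx):
    # Next-token distribution conditioned on the FULL context (expensive in practice).
    r = np.random.default_rng(hash(tuple(ctx)) % (2**32))
    p = r.random(VOCAB)
    return p / p.sum()

def toy_draft(ctx):
    # Next-token distribution conditioned on only the last WINDOW tokens, so its
    # cost does not grow with context length (the MagicDec idea, in toy form).
    r = np.random.default_rng(hash(tuple(ctx[-WINDOW:])) % (2**32))
    q = r.random(VOCAB)
    return q / q.sum()

def speculative_step(ctx):
    # Draft GAMMA tokens cheaply, then verify them with the target model.
    drafted, q_dists = [], []
    for _ in range(GAMMA):
        q = toy_draft(ctx + drafted)
        drafted.append(int(rng.choice(VOCAB, p=q)))
        q_dists.append(q)
    # Verification: in a real system the target scores all drafted positions
    # in ONE forward pass; we loop here purely for clarity.
    accepted = []
    for tok, q in zip(drafted, q_dists):
        p = toy_target(ctx + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                      # draft token accepted
        else:
            residual = np.maximum(p - q, 0.0)         # rejected: resample from residual
            residual = residual / residual.sum() if residual.sum() > 0 else p
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted
    # Every draft token accepted: the target's last pass yields one bonus token.
    accepted.append(int(rng.choice(VOCAB, p=toy_target(ctx + accepted))))
    return accepted

ctx = [int(t) for t in rng.integers(0, VOCAB, size=100)]  # stand-in for a long prompt
print(speculative_step(ctx))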
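
And here is a back-of-the-envelope sketch of the selection problem that adaptive Sequoia trees address, simplified from a full speculation tree to a single draft chain: given a per-token acceptance rate and measured draft/verify costs, pick the speculation length that maximizes tokens per second. The values of alpha, t_draft, and t_verify are assumed measurements chosen for illustration, not numbers from the post.

def expected_tokens(alpha: float, gamma: int) -> float:
    # Expected tokens generated per step when each draft token is accepted
    # independently with probability alpha (standard result, assumes alpha < 1):
    # 1 + alpha + alpha^2 + ... + alpha^gamma.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def tokens_per_second(alpha: float, gamma: int, t_draft: float, t_verify: float) -> float:
    # Decoding speed = expected accepted tokens / wall-clock time of one step.
    return expected_tokens(alpha, gamma) / (gamma * t_draft + t_verify)

alpha, t_draft, t_verify = 0.8, 0.002, 0.030   # assumed: 2 ms per draft token, 30 ms verify
best = max(range(1, 17), key=lambda g: tokens_per_second(alpha, g, t_draft, t_verify))
baseline = 1.0 / t_verify                      # plain autoregressive decoding speed
print(f"best draft length: {best}, "
      f"speedup: {tokens_per_second(alpha, best, t_draft, t_verify) / baseline:.2f}x")

The same logic extends to trees: a larger speculation tree raises the expected number of accepted tokens per step but also raises draft and verification cost, and the optimal size depends on batch size and sequence length, which is why the post argues for choosing it adaptively.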

Company
Together AI

Date published
Sept. 5, 2024

Author(s)
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Yunho Jin, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Beidi Chen

Word count
2002

Language
English

Hacker News points
2

