Speculative decoding for high-throughput long-context inference
Speculative decoding for high-throughput long-context inference has been reevaluated, showing that it can significantly improve throughput and reduce latency. The analysis finds that as sequence lengths increase, the decoding bottleneck shifts from compute-bound to memory-bound, which makes speculative decoding more effective. Two algorithmic innovations, MagicDec and adaptive Sequoia trees, are proposed to exploit this shift. MagicDec speeds up drafting by giving the draft model a fixed-size context window (and hence a constant-size KV cache that does not grow with sequence length), while adaptive Sequoia trees choose the speculation tree size that maximizes speedup. These techniques achieve significant speedups, up to 2x for LLaMA-2-7B-32K and 1.84x for LLaMA-3.1-8B on 8 A100 GPUs, making them an essential part of throughput optimization systems for long-context workloads.
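To make the mechanics concrete, below is a minimal, self-contained sketch of one greedy speculative decoding step in which the draft model attends only to a fixed window of recent tokens, in the spirit of the fixed-context drafting described above. This is an illustrative sketch, not the MagicDec or Sequoia implementation: the function names (`draft_logits`, `target_logits_batch`, `speculative_step`), the `window` and `VOCAB` constants, and the random-logit "models" are all hypothetical placeholders; only the control flow matters.

```python
# Minimal sketch of greedy speculative decoding with a fixed-window draft.
# Both "models" are random-logit stand-ins; real models would go here.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100  # placeholder vocabulary size


def draft_logits(tokens, window=8):
    """Hypothetical draft model: attends only to the last `window` tokens,
    so its per-step cost (and KV cache) stays constant as the context grows."""
    _ = tokens[-window:]  # fixed-size context / KV cache
    return rng.standard_normal(VOCAB)


def target_logits_batch(tokens, k):
    """Hypothetical target model: scores the last k drafted positions in a
    single forward pass. In a real system this one pass is what amortizes
    the memory-bound KV-cache reads of long-context decoding."""
    _ = tokens
    return rng.standard_normal((k, VOCAB))


def speculative_step(tokens, k=4):
    """Draft k tokens cheaply, verify them with one target pass, and keep
    the longest prefix the target agrees with (greedy verification)."""
    drafted, ctx = [], list(tokens)
    for _ in range(k):
        nxt = int(np.argmax(draft_logits(ctx)))
        drafted.append(nxt)
        ctx.append(nxt)

    verified = target_logits_batch(ctx, k)  # one pass verifies all k drafts
    accepted = []
    for i, tok in enumerate(drafted):
        target_tok = int(np.argmax(verified[i]))
        if target_tok == tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)  # take the target's correction and stop
            break
    return tokens + accepted  # at least one token accepted per target pass


if __name__ == "__main__":
    seq = [1, 2, 3]
    for _ in range(5):
        seq = speculative_step(seq)
    print(len(seq), "tokens after 5 target passes")
```

The point of the structure is that a single target forward pass verifies several drafted tokens at once, so the expensive KV-cache reads that dominate long-context decoding are amortized over multiple output tokens, while the draft's fixed window keeps drafting cheap regardless of sequence length.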
Company
Together AI
Date published
Sept. 5, 2024
Author(s)
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Yunho Jin, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Beidi Chen
Word count
2002
Language
English
Hacker News points
2