Company
Date Published
June 18, 2024
Author
Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
Word count
1308
Language
English
Hacker News points
None

Summary

We introduce SpecExec, a new speculative decoding method for interactive LLM inference on consumer devices that achieves 4-6 tokens per second with 4-bit quantization, or 2-3 tokens per second with 16-bit weights. The approach applies the classical idea of "speculative execution" to LLM inference and exploits the spikiness of token probability distributions in modern large language models: a powerful draft model deterministically constructs a large draft tree of the most likely continuations of the input text, which the target model then verifies. SpecExec is particularly suited to the offloading regime, targeting large language models that cannot fit in consumer GPU memory, where it yields relative speedups of 4.6x to 18.7x over autoregressive decoding with offloading on various consumer GPUs. It also outperforms other speculative decoding methods such as SpecInfer, reaching higher speeds at larger draft budgets, and shows promise in making LLMs more accessible and usable by a broader audience.
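To make the draft-then-verify idea concrete, here is a minimal, self-contained Python sketch of the control flow described above. It is an illustration, not the SpecExec implementation: the tiny VOCAB, the toy_distribution, draft_model, and target_model functions, and the k/depth parameters are all hypothetical stand-ins. In a real deployment the draft would be a small fast LLM, the target would be the large offloaded LLM, and the whole tree would be verified in a single batched forward pass rather than node by node.

```python
import random

# Toy stand-ins for the two models (hypothetical, for illustration only).
# Each maps a token sequence to a probability distribution over a tiny vocab.
VOCAB = list("abcde")

def toy_distribution(seq, seed):
    rng = random.Random(hash((tuple(seq), seed)))
    weights = [rng.random() ** 3 for _ in VOCAB]  # cubing makes the distribution "spiky"
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_model(seq):
    return toy_distribution(seq, seed=1)

def target_model(seq):
    # A slightly different model, so draft tokens are only sometimes accepted.
    return toy_distribution(seq, seed=2)

def build_draft_tree(prefix, k=2, depth=3):
    """Deterministically expand the k most likely draft continuations
    at each node, up to the given depth. Returns a nested dict tree."""
    if depth == 0:
        return {}
    dist = draft_model(prefix)
    top_k = sorted(dist, key=dist.get, reverse=True)[:k]
    return {tok: build_draft_tree(prefix + [tok], k, depth - 1) for tok in top_k}

def verify(prefix, tree):
    """Walk the tree with the target model: at each node, take the target's
    most likely token; if it is in the tree, accept it and descend, otherwise
    emit it as a correction and stop. Always yields at least one new token."""
    accepted = []
    node = tree
    while True:
        dist = target_model(prefix + accepted)
        best = max(dist, key=dist.get)
        accepted.append(best)
        if best not in node:   # target diverged from the draft tree
            return accepted
        node = node[best]
        if not node:           # reached a leaf; stop this round
            return accepted

seq = list("ab")
for _ in range(4):
    tree = build_draft_tree(seq)
    new_tokens = verify(seq, tree)
    seq += new_tokens
    print(f"accepted {len(new_tokens)} token(s): {''.join(new_tokens)}")
print("final:", "".join(seq))
```

Even this toy version preserves the key property: because the token distributions are spiky, a small tree of top-k continuations captures the target model's own choices often enough that several tokens are accepted per expensive target-model call, which is what amortizes the cost of offloaded inference.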