Company
Date Published
June 18, 2024
Author
Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
Word count
1308
Language
English
Hacker News points
None

Summary

We introduce SpecExec, a new speculative decoding method for interactive LLM inference on consumer devices that achieves 4-6 tokens per second with 4-bit quantization, or 2-3 tokens per second with 16-bit weights. The approach applies the classical idea of "speculative execution" to LLM inference and exploits the spikiness of token probability distributions in modern large language models: a powerful draft model deterministically constructs a large draft tree of the most likely continuations of the input text, which the target model then verifies. SpecExec is particularly suited to the offloading regime, targeting large language models that cannot fit in consumer GPU memory, where it yields relative speedups of 4.6x to 18.7x over autoregressive decoding with offloading on various consumer GPUs. It also outperforms other speculative decoding methods such as SpecInfer, reaching higher speeds at larger draft budgets, and shows promise in making LLMs more accessible and usable by a broader audience.
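To make the draft-then-verify idea concrete, here is a minimal, self-contained Python sketch of the control flow described above. It is an illustration, not the SpecExec implementation: the tiny VOCAB, the toy_distribution, draft_model, and target_model functions, and the k/depth parameters are all hypothetical stand-ins. In a real deployment the draft would be a small fast LLM, the target would be the large offloaded LLM, and the whole tree would be verified in a single batched forward pass rather than node by node.

```python
import random

# Toy stand-ins for the two models (hypothetical, for illustration only).
# Each maps a token sequence to a probability distribution over a tiny vocab.
VOCAB = list("abcde")

def toy_distribution(seq, seed):
    rng = random.Random(hash((tuple(seq), seed)))
    weights = [rng.random() ** 3 for _ in VOCAB]  # cubing makes the distribution "spiky"
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_model(seq):
    return toy_distribution(seq, seed=1)

def target_model(seq):
    # A slightly different model, so draft tokens are only sometimes accepted.
    return toy_distribution(seq, seed=2)

def build_draft_tree(prefix, k=2, depth=3):
    """Deterministically expand the k most likely draft continuations
    at each node, up to the given depth. Returns a nested dict tree."""
    if depth == 0:
        return {}
    dist = draft_model(prefix)
    top_k = sorted(dist, key=dist.get, reverse=True)[:k]
    return {tok: build_draft_tree(prefix + [tok], k, depth - 1) for tok in top_k}

def verify(prefix, tree):
    """Walk the tree with the target model: at each node, take the target's
    most likely token; if it is in the tree, accept it and descend, otherwise
    emit it as a correction and stop. Always yields at least one new token."""
    accepted = []
    node = tree
    while True:
        dist = target_model(prefix + accepted)
        best = max(dist, key=dist.get)
        accepted.append(best)
        if best not in node:   # target diverged from the draft tree
            return accepted
        node = node[best]
        if not node:           # reached a leaf; stop this round
            return accepted

seq = list("ab")
for _ in range(4):
    tree = build_draft_tree(seq)
    new_tokens = verify(seq, tree)
    seq += new_tokens
    print(f"accepted {len(new_tokens)} token(s): {''.join(new_tokens)}")
print("final:", "".join(seq))
```

Even this toy version preserves the key property: because the token distributions are spiky, a small tree of top-k continuations captures the target model's own choices often enough that several tokens are accepted per expensive target-model call, which is what amortizes the cost of offloaded inference.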