Speculative decoding is an inference optimization technique designed to reduce LLM inference latency by coordinating two models on a single model server: a larger target model (e.g., Llama 70B) and a smaller draft model (e.g., Llama 8B). The draft model proposes several tokens cheaply, and the target model verifies them in a single forward pass.

To support speculative decoding in production, the authors had to tackle inefficient batching, high time-to-first-token (TTFT), crashes, and general unreliability. They implemented a mechanism that synchronizes execution of the draft and target models so that only one runs on the GPU at a time; combined with scheduled, queued worker execution, this unlocks batching and improves TTFT and stability. The authors also fixed issues related to chunked prefill and KV cache reuse in TensorRT-LLM's request scheduling mechanism.

With these improvements, speculative decoding is now production-ready, supporting streaming output, structured output, request termination, and OpenAI spec compatibility. Benchmark results show that speculative decoding can reduce p50 latency by up to 90% on code generation tasks, with some runs also showing lower TTFT and faster overall generation. The authors plan to continue improving the performance and stability of speculative decoding on TensorRT-LLM while contributing bugfixes back to the maintainers.
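To make the draft-and-verify loop and the "one model on the GPU at a time" scheduling concrete, here is a minimal Python sketch of greedy speculative decoding under a single lock. It illustrates the general technique only, not TensorRT-LLM's implementation; the `speculative_step` function, the model callables, `gpu_lock`, and the toy models in `__main__` are all hypothetical stand-ins.

```python
import threading
from typing import Callable, List

# Illustrative type aliases, not TensorRT-LLM APIs: the draft model returns one
# next token for a context; the target model returns its next-token prediction
# for every position of the sequence in a single pass.
DraftModel = Callable[[List[int]], int]
TargetModel = Callable[[List[int]], List[int]]

# One lock so the draft and target models never execute at the same time,
# mirroring the mutual-exclusion scheduling described above.
gpu_lock = threading.Lock()


def speculative_step(tokens: List[int], draft: DraftModel,
                     target: TargetModel, k: int = 4) -> List[int]:
    """One round of speculative decoding: draft k tokens, verify with the target."""
    n = len(tokens)

    # Draft phase: the small model proposes k tokens autoregressively.
    with gpu_lock:
        ctx = list(tokens)
        draft_tokens = []
        for _ in range(k):
            nxt = draft(ctx)
            draft_tokens.append(nxt)
            ctx.append(nxt)

    # Verify phase: the large model scores prompt + drafts in one forward pass.
    with gpu_lock:
        target_preds = target(tokens + draft_tokens)

    # Accept the longest prefix of drafted tokens the target agrees with.
    out = list(tokens)
    for i, tok in enumerate(draft_tokens):
        expected = target_preds[n + i - 1]  # target's choice after `out` so far
        if tok == expected:
            out.append(tok)
        else:
            out.append(expected)  # replace the first rejected draft with the target's token
            return out
    out.append(target_preds[n + k - 1])  # all drafts accepted: take the bonus token
    return out


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end: both models predict
    # "previous token + 1", but with different wrap-around values, so the
    # target eventually rejects a drafted token.
    def draft(ctx: List[int]) -> int:
        return (ctx[-1] + 1) % 10

    def target(seq: List[int]) -> List[int]:
        return [(t + 1) % 8 for t in seq]

    tokens = [0]
    for _ in range(3):
        tokens = speculative_step(tokens, draft, target, k=4)
        print(tokens)
```

In a real server the two models would be separate engine instances or worker processes rather than in-process callables, and the lock stands in for the scheduled, queued worker execution described above: whichever phase holds it occupies the GPU, so draft and verify batches never contend for it concurrently.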