How we built production-ready speculative decoding with TensorRT-LLM
Speculative decoding is an inference optimization technique that improves LLM inference latency by coordinating two models on a single model server: a larger target model (e.g., Llama 70B) and a smaller draft model (e.g., Llama 8B). To make speculative decoding production-ready, the authors had to tackle inefficient batching, high time-to-first-token (TTFT), crashes, and general unreliability. They implemented a mechanism that synchronizes the execution of the draft and target models so that only one runs on the GPU at a time; combined with scheduled and queued worker execution, this unlocks proper batching and improves TTFT and stability. The authors also fixed issues with chunked prefill and KV cache re-use in TensorRT-LLM's request scheduling. With these improvements, speculative decoding is now production-ready, with streaming output, structured output, request termination, and OpenAI spec compatibility. Benchmark results show that speculative decoding can reduce p50 latency by up to 90% for code generation tasks, with some tests also showing improved TTFT and faster overall generation. The authors plan to continue improving the performance and stability of speculative decoding on TensorRT-LLM while contributing bugfixes back to the maintainers.
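The sketch below illustrates the draft/verify loop described above, with a lock ensuring the draft and target models never execute on the GPU at the same time. It is a minimal illustration only: the model classes, method names, and acceptance rule are hypothetical placeholders, not the TensorRT-LLM or Baseten APIs.

```python
import threading
from typing import List

gpu_lock = threading.Lock()  # only one model may run on the GPU at a time

class DummyModel:
    """Stand-in for a draft or target LLM; returns canned token ids."""
    def __init__(self, name: str):
        self.name = name

    def generate(self, tokens: List[int], n: int) -> List[int]:
        # Placeholder: a real model would run a forward pass here.
        return [len(tokens) + i for i in range(n)]

    def count_accepted(self, tokens: List[int], proposed: List[int]) -> int:
        # Placeholder acceptance rule: a real target model would verify the
        # draft tokens against its own distribution and return the length of
        # the accepted prefix.
        return max(len(proposed) - 1, 0)  # pretend the last draft token is rejected

def speculative_step(draft: DummyModel, target: DummyModel,
                     tokens: List[int], num_draft: int = 4) -> List[int]:
    # 1. The cheap draft model proposes a short continuation.
    with gpu_lock:
        proposed = draft.generate(tokens, num_draft)
    # 2. The target model verifies all proposed tokens in one forward pass.
    with gpu_lock:
        accepted = target.count_accepted(tokens, proposed)
    # 3. Keep the accepted prefix: the target pays for one forward pass but may
    #    emit several tokens, which is where the latency win comes from.
    return tokens + proposed[:accepted]

if __name__ == "__main__":
    draft = DummyModel("llama-8b-draft")
    target = DummyModel("llama-70b-target")
    seq = [1, 2, 3]
    for _ in range(3):
        seq = speculative_step(draft, target, seq)
    print(seq)
```

Serializing GPU access this way mirrors the coordination problem the post describes: without it, draft and target requests contend for the device and batching breaks down.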
Company
Baseten
Date published
Dec. 20, 2024
Author(s)
Pankaj Gupta, Justin Yi, Philip Kiely
Word count
2729
Language
English
Hacker News points
None found.