Speculative decoding is an inference optimization technique designed to reduce LLM inference latency by coordinating two models on a single model server: a larger target model (e.g., Llama 70B) and a smaller draft model (e.g., Llama 8B). The draft model proposes several tokens cheaply, and the target model verifies them in a single forward pass.

To support speculative decoding in production, the authors had to tackle inefficient batching, high time-to-first-token (TTFT), crashes, and general unreliability. They implemented a mechanism that synchronizes execution of the draft and target models so that only one runs on the GPU at a time; combined with scheduled, queued worker execution, this unlocks batching and improves TTFT and stability. The authors also fixed issues related to chunked prefill and KV cache reuse in TensorRT-LLM's request scheduling mechanism.

With these improvements, speculative decoding is now production-ready, supporting streaming output, structured output, request termination, and OpenAI spec compatibility. Benchmark results show that speculative decoding can reduce p50 latency by up to 90% on code generation tasks, with some runs also showing lower TTFT and faster overall generation. The authors plan to continue improving the performance and stability of speculative decoding on TensorRT-LLM while contributing bugfixes back to the maintainers.
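To make the draft-and-verify loop and the "one model on the GPU at a time" scheduling concrete, here is a minimal Python sketch of greedy speculative decoding under a single lock. It illustrates the general technique only, not TensorRT-LLM's implementation; the `speculative_step` function, the model callables, `gpu_lock`, and the toy models in `__main__` are all hypothetical stand-ins.

```python
import threading
from typing import Callable, List

# Illustrative type aliases, not TensorRT-LLM APIs: the draft model returns one
# next token for a context; the target model returns its next-token prediction
# for every position of the sequence in a single pass.
DraftModel = Callable[[List[int]], int]
TargetModel = Callable[[List[int]], List[int]]

# One lock so the draft and target models never execute at the same time,
# mirroring the mutual-exclusion scheduling described above.
gpu_lock = threading.Lock()


def speculative_step(tokens: List[int], draft: DraftModel,
                     target: TargetModel, k: int = 4) -> List[int]:
    """One round of speculative decoding: draft k tokens, verify with the target."""
    n = len(tokens)

    # Draft phase: the small model proposes k tokens autoregressively.
    with gpu_lock:
        ctx = list(tokens)
        draft_tokens = []
        for _ in range(k):
            nxt = draft(ctx)
            draft_tokens.append(nxt)
            ctx.append(nxt)

    # Verify phase: the large model scores prompt + drafts in one forward pass.
    with gpu_lock:
        target_preds = target(tokens + draft_tokens)

    # Accept the longest prefix of drafted tokens the target agrees with.
    out = list(tokens)
    for i, tok in enumerate(draft_tokens):
        expected = target_preds[n + i - 1]  # target's choice after `out` so far
        if tok == expected:
            out.append(tok)
        else:
            out.append(expected)  # replace the first rejected draft with the target's token
            return out
    out.append(target_preds[n + k - 1])  # all drafts accepted: take the bonus token
    return out


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end: both models predict
    # "previous token + 1", but with different wrap-around values, so the
    # target eventually rejects a drafted token.
    def draft(ctx: List[int]) -> int:
        return (ctx[-1] + 1) % 10

    def target(seq: List[int]) -> List[int]:
        return [(t + 1) % 8 for t in seq]

    tokens = [0]
    for _ in range(3):
        tokens = speculative_step(tokens, draft, target, k=4)
        print(tokens)
```

In a real server the two models would be separate engine instances or worker processes rather than in-process callables, and the lock stands in for the scheduled, queued worker execution described above: whichever phase holds it occupies the GPU, so draft and verify batches never contend for it concurrently.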