With the introduction of our Speculative Decoding Engine Builder integration, developers can add speculative decoding to their production LLM deployments as part of a streamlined TensorRT-LLM Engine Builder flow, enabling ultra-low-latency inference. The integration is particularly useful for latency-sensitive LLM applications, such as live translation, chatbots, and coding assistants, where best-in-class performance is required without compromising output quality. By starting from our pre-optimized config files, or tuning the settings further to their needs, developers can apply state-of-the-art model performance optimizations to mission-critical production AI workloads. The integration has been shown to halve latencies with no effect on output quality, and its two-tiered approach balances ease of use with control over parameters, making it well suited to applications running large models in production.
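To make the latency claim concrete, the sketch below illustrates the general speculative decoding idea in its simplest (greedy) form: a small draft model cheaply proposes several tokens, and the large target model verifies them, so accepted tokens cost roughly one target-model pass instead of one pass per token. This is a minimal, self-contained illustration of the technique, not the Engine Builder or TensorRT-LLM implementation; the model functions are hypothetical stand-ins.

```python
from typing import Callable, List

def speculative_decode_step(
    prompt: List[int],
    draft_next_token: Callable[[List[int]], int],   # cheap draft model (greedy)
    target_next_token: Callable[[List[int]], int],  # large target model (greedy)
    num_draft_tokens: int = 4,
) -> List[int]:
    """Propose num_draft_tokens tokens with the draft model, then keep the prefix
    the target model agrees with, plus one target-chosen token. Greedy verification
    like this reproduces exactly what the target model would have generated on its
    own, which is why output quality is unaffected."""
    # 1. Draft phase: generate candidate tokens autoregressively with the cheap model.
    draft = []
    ctx = list(prompt)
    for _ in range(num_draft_tokens):
        t = draft_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: the target model checks each candidate position.
    #    (In a real engine this is a single batched forward pass over all candidates,
    #    which is where the latency win comes from.)
    accepted = []
    ctx = list(prompt)
    for t in draft:
        expected = target_next_token(ctx)
        if expected == t:
            accepted.append(t)          # draft token matches: accepted essentially for free
            ctx.append(t)
        else:
            accepted.append(expected)   # mismatch: take the target's token and stop
            break
    else:
        # All draft tokens accepted; the same verification pass yields one bonus token.
        accepted.append(target_next_token(ctx))
    return accepted


if __name__ == "__main__":
    # Toy "models" over integer tokens, purely for demonstration.
    draft = lambda ctx: (ctx[-1] + 1) % 100
    target = lambda ctx: (ctx[-1] + 1) % 100 if len(ctx) % 7 else (ctx[-1] + 2) % 100
    print(speculative_decode_step([1, 2, 3], draft, target))
```

When the draft model's guesses usually match the target model's choices, several tokens are accepted per target-model pass, which is how end-to-end latency can drop substantially while the output remains identical to standard decoding.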