The Together Inference Engine 2.0 introduces new Turbo and Lite endpoints, delivering faster decoding throughput than leading commercial solutions while preserving model quality. Together Turbo and Together Lite give enterprises flexibility across performance, quality, and price, so applications can scale without trading one of these off against the others. With this release, developers can build Generative AI applications at production scale on the fastest engine for NVIDIA GPUs, with accurate and cost-efficient serving.

The engine achieves over 400 tokens per second on Meta Llama 3 8B by combining FlashAttention-3, faster GEMM and MHA kernels, quality-preserving quantization, and speculative decoding. The new endpoints are available starting today for Llama 3 models, and will roll out to other models soon.
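To give a feel for how the new endpoints are consumed, here is a minimal sketch of a request against Together's OpenAI-compatible chat completions API. The endpoint URL follows Together's published API base, but the exact Turbo model identifier is an assumption for illustration; check the model catalog for the name available in your account.

```python
# Minimal sketch: calling a Turbo endpoint through Together's
# OpenAI-compatible chat completions API.
import os
import requests

API_URL = "https://api.together.xyz/v1/chat/completions"
# Assumed Turbo model ID for illustration; verify against the model catalog.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct-Turbo"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 128,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, moving between Turbo and Lite variants, or between model sizes, should only require changing the model string rather than rewriting application code.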