The Together Inference Engine 2.0 introduces new Turbo and Lite endpoints, delivering faster decoding throughput than leading commercial solutions while preserving model quality. Together Turbo and Together Lite give enterprises flexibility across performance, quality, and price, so applications can scale without trading one of these off against the others. With this release, developers can build Generative AI applications at production scale on the fastest engine for NVIDIA GPUs, with accurate and cost-efficient serving.

The engine achieves over 400 tokens per second on Meta Llama 3 8B by combining FlashAttention-3, faster GEMM and MHA kernels, quality-preserving quantization, and speculative decoding. The new endpoints are available starting today for Llama 3 models, and will roll out to other models soon.
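To give a feel for how the new endpoints are consumed, here is a minimal sketch of a request against Together's OpenAI-compatible chat completions API. The endpoint URL follows Together's published API base, but the exact Turbo model identifier is an assumption for illustration; check the model catalog for the name available in your account.

```python
# Minimal sketch: calling a Turbo endpoint through Together's
# OpenAI-compatible chat completions API.
import os
import requests

API_URL = "https://api.together.xyz/v1/chat/completions"
# Assumed Turbo model ID for illustration; verify against the model catalog.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct-Turbo"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 128,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, moving between Turbo and Lite variants, or between model sizes, should only require changing the model string rather than rewriting application code.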