Company
Baseten
Date Published
Jan. 14, 2025
Author
Pankaj Gupta
Word count
1530
Language
English
Hacker News points
None

Summary

In 2024, Baseten's Model Performance Team made significant breakthroughs in optimizing latency, scalability, quality, cost, functionality, and ease of use for high-volume real-world workloads. They adopted TensorRT-LLM as their core framework, leveraging its SOTA CUDA kernels and features like Flash Attention, paged attention, and in-flight batching. The team also explored NVIDIA's Hopper architecture, particularly the H100 GPU, whose large, high-bandwidth onboard memory and strong compute profile delivered exceptional performance. On top of that foundation, they built featureful inference servers offering guaranteed structured output, function calling, LoRA inference support, and innovations like Writing in the Margins for improved long-context retrieval accuracy. They also developed automated tooling like Engine Builder to streamline engine creation and deployment, cutting manual effort. Notable achievements included optimizing custom LLMs, real-time AI phone calls, Whisper ASR, and DeepSeek V3, plus bringing large-model cold starts under a minute. Looking ahead to 2025, the team plans to broaden and deepen its work on speculative decoding, embedding models, the Blackwell GPU architecture, FP4 quantization, disaggregated serving, and more.
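
To make the paged attention and in-flight batching mentions concrete, here is a minimal, hypothetical sketch of the bookkeeping idea behind paged KV caches: memory is carved into fixed-size blocks that sequences claim on demand, rather than each request reserving a contiguous max-length buffer. The class and method names (`PagedKVCache`, `append_token`) are invented for illustration and are not TensorRT-LLM's actual API.

```python
# Toy paged KV-cache allocator in the spirit of paged attention: the cache is
# split into fixed-size blocks that growing sequences claim one at a time.
# This sketches the bookkeeping only, not the attention kernel that reads it.

class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.seq_lens = {}       # sequence id -> number of tokens cached

    def append_token(self, seq_id):
        """Reserve a slot for one new KV entry; allocate a block lazily."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("cache exhausted; preempt or shrink batch")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        block = self.block_tables[seq_id][n // self.block_size]
        return block, n % self.block_size  # physical slot for this token

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

# Two requests share one pool; neither reserves worst-case memory up front.
cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("req-A")
for _ in range(3):
    cache.append_token("req-B")
print(cache.block_tables)  # {'req-A': [7, 6], 'req-B': [5]}
cache.free("req-A")        # blocks 7 and 6 return to the pool immediately
```

Because blocks return to the shared pool the moment a request finishes, this style of allocation is also what makes in-flight batching practical: new requests can join a running batch as soon as memory frees up, instead of waiting for a whole batch to drain.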
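
Guaranteed structured output, one of the inference-server features above, generally works by masking the model's logits at every step so that only tokens keeping the output valid remain sampleable. Below is a toy, character-level sketch of that idea, assuming a hypothetical enum constraint and a stand-in `fake_logits` function in place of a real model; production systems instead compile a JSON schema or grammar into an equivalent token-level mask.

```python
import math

# Toy constrained decoding: at each step, tokens that would break the required
# format are excluded, so the model can only emit valid continuations. The
# "format" here is a hypothetical enum of allowed answers; the "tokens" are
# single characters to keep the example self-contained.
ALLOWED = ["yes", "no", "maybe"]
VOCAB = sorted(set("".join(ALLOWED)))

def fake_logits(prefix):
    # Stand-in for a model forward pass: arbitrary but deterministic scores.
    return {ch: math.sin(len(prefix) + i) for i, ch in enumerate(VOCAB)}

def valid_next_chars(prefix):
    # Characters that keep the output a prefix of some allowed answer.
    return {a[len(prefix)] for a in ALLOWED
            if a.startswith(prefix) and len(a) > len(prefix)}

def constrained_generate():
    out = ""
    while out not in ALLOWED:
        logits = fake_logits(out)
        legal = valid_next_chars(out)
        # Mask: illegal tokens can never be emitted, whatever the model scores.
        out += max(legal, key=lambda ch: logits[ch])
    return out

print(constrained_generate())  # always one of ALLOWED, by construction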
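```

Speculative decoding, one of the 2025 focus areas, pairs a cheap draft model with the large target model: the draft proposes several tokens, the target verifies them in a single pass, and the longest matching prefix is accepted. The sketch below uses hypothetical deterministic stand-ins for both models so the control flow runs end to end; it shows the simplified greedy-verification variant, not any particular production implementation.

```python
# Toy speculative decoding: a cheap draft proposes k tokens; the expensive
# target verifies them and accepts everything up to the first mismatch.
VOCAB = list("abcdefgh")

def target_model(prefix):
    # Expensive, authoritative model: a simple deterministic rule for the demo.
    return VOCAB[len(prefix) % len(VOCAB)]

def draft_model(prefix):
    # Cheap, fast, imperfect: follows the target's rule ~80% of the time.
    wrong = (len(prefix) * 2654435761) % 10 >= 8
    return VOCAB[(len(prefix) + (1 if wrong else 0)) % len(VOCAB)]

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy-verification speculative decoding.

    Output is identical to decoding with the target alone; the speedup comes
    from the target scoring k proposed positions in one parallel pass instead
    of k sequential passes (simulated here by the verification loop).
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(tokens + proposal))
        # 2. Target verifies all k positions; accept until the first mismatch,
        #    substituting the target's own token at that position.
        accepted = []
        for i, tok in enumerate(proposal):
            expected = target_model(tokens + proposal[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return "".join(tokens[: len(prompt) + num_tokens])

print(speculative_decode(list("ab"), num_tokens=10))  # matches target-only output
```

The win depends on the acceptance rate: when the draft agrees with the target most of the time, several tokens land per expensive target pass, cutting per-token latency without changing the output distribution.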