Company
Baseten
Date Published
Jan. 14, 2025
Author
Pankaj Gupta
Word count
1530
Language
English
Hacker News points
None

Summary

In 2024, Baseten's Model Performance Team made significant breakthroughs in optimizing latency, scalability, quality, cost, functionality, and ease of use for high-volume real-world workloads. They adopted TensorRT-LLM as their core framework, leveraging its SOTA CUDA kernels and features like Flash Attention, paged attention, and in-flight batching. The team also explored NVIDIA's Hopper architecture, particularly the H100 GPU, whose large, high-bandwidth onboard memory and strong compute profile delivered exceptional performance. On top of that foundation, they built featureful inference servers offering guaranteed structured output, function calling, LoRA inference support, and innovations like Writing in the Margins for improved long-context retrieval accuracy. They also developed automated tooling like Engine Builder to streamline engine creation and deployment, cutting manual effort. Notable achievements included optimizing custom LLMs, real-time AI phone calls, Whisper ASR, and DeepSeek V3, plus bringing large-model cold starts under a minute. Looking ahead to 2025, the team plans to broaden and deepen its work on speculative decoding, embedding models, the Blackwell GPU architecture, FP4 quantization, disaggregated serving, and more.
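
To make the paged attention and in-flight batching mentions concrete, here is a minimal, hypothetical sketch of the bookkeeping idea behind paged KV caches: memory is carved into fixed-size blocks that sequences claim on demand, rather than each request reserving a contiguous max-length buffer. The class and method names (`PagedKVCache`, `append_token`) are invented for illustration and are not TensorRT-LLM's actual API.

```python
# Toy paged KV-cache allocator in the spirit of paged attention: the cache is
# split into fixed-size blocks that growing sequences claim one at a time.
# This sketches the bookkeeping only, not the attention kernel that reads it.

class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.seq_lens = {}       # sequence id -> number of tokens cached

    def append_token(self, seq_id):
        """Reserve a slot for one new KV entry; allocate a block lazily."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("cache exhausted; preempt or shrink batch")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        block = self.block_tables[seq_id][n // self.block_size]
        return block, n % self.block_size  # physical slot for this token

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

# Two requests share one pool; neither reserves worst-case memory up front.
cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("req-A")
for _ in range(3):
    cache.append_token("req-B")
print(cache.block_tables)  # {'req-A': [7, 6], 'req-B': [5]}
cache.free("req-A")        # blocks 7 and 6 return to the pool immediately
```

Because blocks return to the shared pool the moment a request finishes, this style of allocation is also what makes in-flight batching practical: new requests can join a running batch as soon as memory frees up, instead of waiting for a whole batch to drain.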
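
Guaranteed structured output, one of the inference-server features above, generally works by masking the model's logits at every step so that only tokens keeping the output valid remain sampleable. Below is a toy, character-level sketch of that idea, assuming a hypothetical enum constraint and a stand-in `fake_logits` function in place of a real model; production systems instead compile a JSON schema or grammar into an equivalent token-level mask.

```python
import math

# Toy constrained decoding: at each step, tokens that would break the required
# format are excluded, so the model can only emit valid continuations. The
# "format" here is a hypothetical enum of allowed answers; the "tokens" are
# single characters to keep the example self-contained.
ALLOWED = ["yes", "no", "maybe"]
VOCAB = sorted(set("".join(ALLOWED)))

def fake_logits(prefix):
    # Stand-in for a model forward pass: arbitrary but deterministic scores.
    return {ch: math.sin(len(prefix) + i) for i, ch in enumerate(VOCAB)}

def valid_next_chars(prefix):
    # Characters that keep the output a prefix of some allowed answer.
    return {a[len(prefix)] for a in ALLOWED
            if a.startswith(prefix) and len(a) > len(prefix)}

def constrained_generate():
    out = ""
    while out not in ALLOWED:
        logits = fake_logits(out)
        legal = valid_next_chars(out)
        # Mask: illegal tokens can never be emitted, whatever the model scores.
        out += max(legal, key=lambda ch: logits[ch])
    return out

print(constrained_generate())  # always one of ALLOWED, by construction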
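```

Speculative decoding, one of the 2025 focus areas, pairs a cheap draft model with the large target model: the draft proposes several tokens, the target verifies them in a single pass, and the longest matching prefix is accepted. The sketch below uses hypothetical deterministic stand-ins for both models so the control flow runs end to end; it shows the simplified greedy-verification variant, not any particular production implementation.

```python
# Toy speculative decoding: a cheap draft proposes k tokens; the expensive
# target verifies them and accepts everything up to the first mismatch.
VOCAB = list("abcdefgh")

def target_model(prefix):
    # Expensive, authoritative model: a simple deterministic rule for the demo.
    return VOCAB[len(prefix) % len(VOCAB)]

def draft_model(prefix):
    # Cheap, fast, imperfect: follows the target's rule ~80% of the time.
    wrong = (len(prefix) * 2654435761) % 10 >= 8
    return VOCAB[(len(prefix) + (1 if wrong else 0)) % len(VOCAB)]

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy-verification speculative decoding.

    Output is identical to decoding with the target alone; the speedup comes
    from the target scoring k proposed positions in one parallel pass instead
    of k sequential passes (simulated here by the verification loop).
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(tokens + proposal))
        # 2. Target verifies all k positions; accept until the first mismatch,
        #    substituting the target's own token at that position.
        accepted = []
        for i, tok in enumerate(proposal):
            expected = target_model(tokens + proposal[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return "".join(tokens[: len(prompt) + num_tokens])

print(speculative_decode(list("ab"), num_tokens=10))  # matches target-only output
```

The win depends on the acceptance rate: when the draft agrees with the target most of the time, several tokens land per expensive target pass, cutting per-token latency without changing the output distribution.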