Company:
Date Published:
Author: Baseten
Word count: 1086
Language: English
Hacker News points: None

Summary

The NVIDIA GH200 Grace Hopper Superchip is a unique datacenter hardware offering that combines an NVIDIA Hopper GPU with an ARM-based Grace CPU via NVLink-C2C. This architecture is promising for AI inference workloads that require large KV cache allocations, thanks to the high-speed interconnect between the CPU and GPU. The GH200 has higher memory bandwidth than the H100 GPU, which improves generation speed, and it also offers advantages over the H100 in prefill, such as offloading KV cache to the abundant CPU memory. In experiments with Llama 3.3 70B on a single 96GB GH200 Superchip, the GH200 outperformed the H100 by 32%, mainly due to its ability to hold and access a larger KV cache. This makes the GH200 an interesting processor for high-performance inference and model serving, particularly for large models that wouldn't fit on a standalone GPU with a similar VRAM profile.
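To make the memory argument concrete, here is a rough back-of-envelope sketch (not taken from the post) of how much KV cache fits on-GPU versus with CPU offload. The layer and head counts are Llama 3 70B's published architecture; the ~70 GB weight footprint (roughly 8-bit weights) and the ~480 GB of Grace LPDDR5X are illustrative assumptions, not figures reported in the article.

```python
# Back-of-envelope KV cache sizing for Llama 3.3 70B.
# Architecture (grouped-query attention): 80 layers, 8 KV heads, head dim 128.
num_layers = 80
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # FP16/BF16 KV cache

# Keys + values, every layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~320 KiB

# Assumed memory budget on a 96 GB GH200 with ~70 GB of weights resident on-GPU.
gpu_memory_gb = 96
weight_gb = 70          # assumption: roughly 8-bit weights
kv_budget_gb = gpu_memory_gb - weight_gb

tokens_on_gpu = kv_budget_gb * 1024**3 // kv_bytes_per_token
print(f"Tokens of KV cache that fit on-GPU: ~{tokens_on_gpu:,}")

# The Grace CPU's LPDDR5X pool (assumed ~480 GB) sits behind NVLink-C2C,
# giving a much larger spill target than a PCIe-attached host would.
cpu_memory_gb = 480
tokens_with_offload = (kv_budget_gb + cpu_memory_gb) * 1024**3 // kv_bytes_per_token
print(f"Tokens of KV cache with CPU offload: ~{tokens_with_offload:,}")
```

Under these assumptions, on-GPU headroom covers only a few tens of thousands of cached tokens, while spilling into CPU memory over NVLink-C2C raises the ceiling by more than an order of magnitude, which is the effect the post attributes the GH200's advantage to.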