Company:
Date Published:
Author: Baseten
Word count: 1086
Language: English
Hacker News points: None

Summary

The NVIDIA GH200 Grace Hopper Superchip is a unique datacenter hardware offering that combines an NVIDIA Hopper GPU with an ARM-based Grace CPU via NVLink-C2C. This architecture is promising for AI inference workloads that require large KV cache allocations, thanks to the high-speed interconnect between the CPU and GPU. The GH200 has higher memory bandwidth than the H100 GPU, which improves generation speed, and it also offers advantages over the H100 in prefill, such as offloading KV cache to the abundant CPU memory. In experiments with Llama 3.3 70B on a single 96GB GH200 Superchip, the GH200 outperformed the H100 by 32%, mainly due to its ability to hold and access a larger KV cache. This makes the GH200 an interesting processor for high-performance inference and model serving, particularly for large models that wouldn't fit on a standalone GPU with a similar VRAM profile.
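To make the memory argument concrete, here is a rough back-of-envelope sketch (not taken from the post) of how much KV cache fits on-GPU versus with CPU offload. The layer and head counts are Llama 3 70B's published architecture; the ~70 GB weight footprint (roughly 8-bit weights) and the ~480 GB of Grace LPDDR5X are illustrative assumptions, not figures reported in the article.

```python
# Back-of-envelope KV cache sizing for Llama 3.3 70B.
# Architecture (grouped-query attention): 80 layers, 8 KV heads, head dim 128.
num_layers = 80
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # FP16/BF16 KV cache

# Keys + values, every layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~320 KiB

# Assumed memory budget on a 96 GB GH200 with ~70 GB of weights resident on-GPU.
gpu_memory_gb = 96
weight_gb = 70          # assumption: roughly 8-bit weights
kv_budget_gb = gpu_memory_gb - weight_gb

tokens_on_gpu = kv_budget_gb * 1024**3 // kv_bytes_per_token
print(f"Tokens of KV cache that fit on-GPU: ~{tokens_on_gpu:,}")

# The Grace CPU's LPDDR5X pool (assumed ~480 GB) sits behind NVLink-C2C,
# giving a much larger spill target than a PCIe-attached host would.
cpu_memory_gb = 480
tokens_with_offload = (kv_budget_gb + cpu_memory_gb) * 1024**3 // kv_bytes_per_token
print(f"Tokens of KV cache with CPU offload: ~{tokens_with_offload:,}")
```

Under these assumptions, on-GPU headroom covers only a few tens of thousands of cached tokens, while spilling into CPU memory over NVLink-C2C raises the ceiling by more than an order of magnitude, which is the effect the post attributes the GH200's advantage to.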