Company
Date Published
Author
Chuan Li
Word count
434
Language
English
Hacker News points
None

Summary

The NVIDIA GH200 Grace Hopper Superchip, a combined CPU-GPU system, was benchmarked with ZeRO-Inference to demonstrate its potential for serving large AI models. The results show that ZeRO-Inference on the GH200 handles LLMs of up to 176 billion parameters and delivers significantly higher inference throughput than standalone GPUs such as the H100 or A100 Tensor Core GPU. The superchip's high-bandwidth NVLink-C2C interconnect and Address Translation Services give the CPU and GPU coherent access to each other's memory, which makes CPU offload efficient. ZeRO-Inference lowers the cost of running large AI models by offloading model weights to CPU memory or NVMe storage, making advanced models more accessible. The benchmark results demonstrate the benefit of leveraging the GH200's high bandwidth for CPU offload: larger batch sizes produce higher throughput, and models too large to fit in GPU memory alone can now be served. Overall, this combination marks a major leap in AI inference technology, democratizing access to advanced AI models and opening new possibilities for computational efficiency and scalability.
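For readers who want to see what the described weight-offloading setup looks like in practice, below is a minimal sketch of ZeRO-Inference using DeepSpeed's Hugging Face Transformers integration. It is not the benchmark's actual script; the model name, prompt, and batch size are illustrative placeholders, and the config simply follows the publicly documented ZeRO stage 3 parameter-offload pattern the article refers to.

    # Minimal ZeRO-Inference sketch: offload model weights to CPU memory
    # (or NVMe) and stream them to the GPU during generation.
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.integrations import HfDeepSpeedConfig

    model_name = "bigscience/bloom"  # placeholder for a 176B-parameter LLM

    ds_config = {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,                  # ZeRO stage 3 partitions parameters
            "offload_param": {
                "device": "cpu",         # "nvme" (plus nvme_path) offloads to disk instead
                "pin_memory": True,      # pinned host memory speeds up CPU-GPU transfers
            },
        },
        "train_micro_batch_size_per_gpu": 1,
    }

    # Must exist before from_pretrained() so weights stream directly to the
    # offload target rather than being materialized on the GPU.
    dschf = HfDeepSpeedConfig(ds_config)

    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16
    )
    engine = deepspeed.initialize(model=model, config=ds_config)[0]
    engine.module.eval()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = engine.module.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

On a GH200, the same script would benefit from the NVLink-C2C bandwidth between the Grace CPU's memory and the Hopper GPU, which is the effect the benchmark measures; on a PCIe-attached GPU the CPU-offload path is correspondingly slower.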