Author: Thomas Bordes
Word count: 870
Language: English
Hacker News points: None

Summary

The NVIDIA GH200 Grace Hopper Superchip is a compelling alternative to the NVIDIA H100 SXM Tensor Core GPU for large language model (LLM) inference, delivering superior cost efficiency on single-GPU instances. Running Llama 3.1 70B inference, it achieves a 7.6x increase in throughput and an 8x reduction in cost per token compared to the H100 SXM. The GH200 is particularly well suited to cost-conscious deployments and to applications that need low-latency responses or high pipeline throughput. However, users should be aware of some gotchas with the novel infrastructure: certain libraries and tools, such as PyTorch, may need to be compiled for the ARM (aarch64) architecture. The NVIDIA GH200 is available on-demand on Lambda's Public Cloud at $3.19 per hour, making it an attractive option for optimizing LLM inference workloads.
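To make the cost-efficiency comparison concrete, here is a minimal sketch of how cost per token follows from hourly instance price and sustained throughput. The $3.19/hr figure is from the article; the throughput values are hypothetical placeholders, not measured numbers:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Cost in dollars to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# GH200 on-demand price from the article; throughput is an assumed example value.
gh200_cost = cost_per_million_tokens(price_per_hour=3.19, tokens_per_second=1000)
print(f"GH200 (assumed 1000 tok/s): ${gh200_cost:.3f} per million tokens")
```

At a fixed hourly price, cost per token is inversely proportional to throughput, which is why the 7.6x throughput gain translates (together with the price difference) into the roughly 8x cost-per-token reduction cited above.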