Company:
Date Published:
Author: Luke Miles
Word count: 710
Language: English
Hacker News points: None

Summary

DeepSeek-R1 and V3 are being served on NVIDIA GH200 Grace Hopper Superchip instances, reaching a throughput of 400 tokens per second. This is achieved with 12 or 16 GPUs, depending on the required throughput. The inference engine vLLM currently works better than Aphrodite for DeepSeek, and a recent update improved its inference speed by roughly 40%. A script is provided that creates the instances, sets up NFS caching, installs Python 3.11, creates a virtual environment, downloads the models, installs vLLM, and serves the model with Ray. The guide includes a video showing inference speed with 64 parallel queries.
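The setup steps listed above can be sketched as a shell script. This is a minimal illustration, not the guide's actual script: the instance count, cache path, parallelism sizes, and worker addresses here are assumptions, and the model download step is shown for a single node.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the serving setup described in the summary.
# Paths, node addresses, and parallelism values are placeholder assumptions.
set -euo pipefail

# 1. Shared NFS cache so all nodes reuse one copy of the model weights
#    (assumes an NFS export is already mounted at /mnt/nfs).
export HF_HOME=/mnt/nfs/hf-cache

# 2. Python 3.11 virtual environment
python3.11 -m venv ~/vllm-env
source ~/vllm-env/bin/activate

# 3. Install vLLM and download the model into the shared cache
pip install vllm
huggingface-cli download deepseek-ai/DeepSeek-R1

# 4. Start a Ray cluster: head on this node, workers join it.
#    Each worker node would run: ray start --address=<head-ip>:6379
ray start --head --port=6379

# 5. Serve the model across nodes; TP x PP = total GPUs
#    (e.g. 8 x 2 = 16 GPUs -- placeholder values).
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

In a multi-node Ray deployment like this, vLLM schedules model shards across every GPU the cluster exposes, which is how a model too large for one GH200 node can be served as a single endpoint.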