Company:
Date Published:
Author: Luke Miles
Word count: 710
Language: English
Hacker News points: None

Summary

DeepSeek-R1 and V3 are being served on NVIDIA GH200 Grace Hopper Superchip instances, reaching a throughput of 400 tokens per second. This is achieved with 12 or 16 GPUs, depending on the required throughput. The inference engine vLLM currently works better than Aphrodite for DeepSeek, and a recent update improved its inference speed by roughly 40%. A script is provided that creates the instances, sets up NFS caching, installs Python 3.11, creates a virtual environment, downloads the models, installs vLLM, and serves the model with Ray. The guide includes a video showing inference speed with 64 parallel queries.
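The setup steps listed above can be sketched as a shell script. This is a minimal illustration, not the guide's actual script: the instance count, cache path, parallelism sizes, and worker addresses here are assumptions, and the model download step is shown for a single node.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the serving setup described in the summary.
# Paths, node addresses, and parallelism values are placeholder assumptions.
set -euo pipefail

# 1. Shared NFS cache so all nodes reuse one copy of the model weights
#    (assumes an NFS export is already mounted at /mnt/nfs).
export HF_HOME=/mnt/nfs/hf-cache

# 2. Python 3.11 virtual environment
python3.11 -m venv ~/vllm-env
source ~/vllm-env/bin/activate

# 3. Install vLLM and download the model into the shared cache
pip install vllm
huggingface-cli download deepseek-ai/DeepSeek-R1

# 4. Start a Ray cluster: head on this node, workers join it.
#    Each worker node would run: ray start --address=<head-ip>:6379
ray start --head --port=6379

# 5. Serve the model across nodes; TP x PP = total GPUs
#    (e.g. 8 x 2 = 16 GPUs -- placeholder values).
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

In a multi-node Ray deployment like this, vLLM schedules model shards across every GPU the cluster exposes, which is how a model too large for one GH200 node can be served as a single endpoint.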