Benchmarking vLLM, SGLang, and TensorRT for a Llama 3.1 API
In this tutorial, Michael Louis of Cerebrium benchmarks vLLM, SGLang, and TensorRT serving a Llama 3.1 API on a single H100 GPU. The goal is to compare Time To First Token (TTFT) and throughput across a range of batch sizes. vLLM delivered the lowest TTFT at 123 ms, while SGLang achieved the highest throughput, 460 tokens per second at a batch size of 64. The right framework therefore depends on whether the application is constrained by latency or by throughput.
Company
Cerebrium
Date published
Oct. 10, 2024
Author(s)
Michael Louis
Word count
643
Language
English
Hacker News points
None found.
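The two metrics compared in the benchmark, TTFT and throughput, can both be derived from the timestamps of a streaming response. The sketch below is a minimal, framework-agnostic illustration (not the author's benchmark harness): it assumes the server returns tokens as an iterable stream, and times the gap to the first token plus the overall token rate.

```python
import time


def measure_stream(stream, start_time):
    """Compute TTFT (ms) and throughput (tokens/sec) from a token stream.

    `stream` is any iterable yielding generated tokens; `start_time` is the
    time.perf_counter() value captured when the request was sent.
    This is an illustrative sketch, not the benchmark code from the article.
    """
    first_token_time = None
    n_tokens = 0
    for _token in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # latency to the very first token
        n_tokens += 1
    end_time = time.perf_counter()

    ttft_ms = (first_token_time - start_time) * 1000.0
    throughput = n_tokens / (end_time - start_time)
    return ttft_ms, throughput
```

In a real benchmark, `stream` would be the token iterator returned by the serving framework's streaming API (for example, an OpenAI-compatible `stream=True` chat completion), and the measurement would be repeated per batch size and averaged.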