
Benchmarking vLLM, SGLang and TensorRT for Llama 3.1 API

What's this blog post about?

In this tutorial, Michael Louis from Cerebrium benchmarked vLLM, SGLang, and TensorRT serving a Llama 3.1 API on a single H100 GPU. The goal was to compare Time To First Token (TTFT) and throughput across various batch sizes. Results showed that vLLM had the lowest TTFT at 123ms, while SGLang achieved the highest throughput of 460 tokens per second at a batch size of 64. The right framework therefore depends on whether the application prioritizes low latency or high throughput, along with the user's other constraints.
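As an illustration of the kind of measurement described above, here is a minimal sketch of how TTFT and per-request throughput could be measured against an OpenAI-compatible streaming endpoint, which both vLLM and SGLang expose. The endpoint URL, model name, and prompt are placeholders, not the post's actual benchmark harness, and the sketch does not reproduce the batch-size sweep from the benchmark.

```python
# Minimal sketch: measure Time To First Token (TTFT) and rough throughput
# for a single streamed request against an OpenAI-compatible server
# (vLLM / SGLang). Base URL, model name, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_request(prompt: str, max_tokens: int = 256):
    start = time.perf_counter()
    first_token_time = None
    n_chunks = 0

    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            n_chunks += 1
    end = time.perf_counter()

    ttft_ms = (first_token_time - start) * 1000 if first_token_time else None
    # Chunk count only approximates generated tokens; a tokenizer would be exact.
    tokens_per_s = n_chunks / (end - start)
    return ttft_ms, tokens_per_s

if __name__ == "__main__":
    ttft, tps = measure_request("Explain the benefits of batching in LLM serving.")
    print(f"TTFT: {ttft:.1f} ms, throughput: {tps:.1f} tokens/s")
```

A full benchmark like the one in the post would additionally issue many such requests concurrently at each batch size and aggregate TTFT and tokens-per-second across them.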

Company
Cerebrium

Date published
Oct. 10, 2024

Author(s)
Michael Louis

Word count
643

Language
English

Hacker News points
None found.
