In this tutorial, Michael Louis of Cerebrium benchmarked vLLM, SGLang, and TensorRT serving a Llama 3.1 API on a single H100 GPU. The goal was to compare Time To First Token (TTFT) and throughput across a range of batch sizes. In the results, vLLM had the lowest TTFT at 123 ms, while SGLang achieved the highest throughput, 460 tokens per second at batch size 64. The right framework therefore depends on the user's constraints: whether the application prioritizes low latency or high throughput.
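As a rough sketch of what these two metrics mean, TTFT and throughput can be derived from the wall-clock times at which streamed tokens arrive from the server. The helper below is illustrative (not code from the tutorial), and the timestamps are synthetic, chosen to mirror the reported 123 ms TTFT and a 64-token response:

```python
def stream_metrics(start: float, token_arrivals: list[float]) -> tuple[float, float]:
    """Compute Time To First Token (seconds) and throughput (tokens/sec)
    from the wall-clock times at which streamed tokens arrived."""
    ttft = token_arrivals[0] - start          # delay until the first token
    elapsed = token_arrivals[-1] - start      # total time to stream all tokens
    throughput = len(token_arrivals) / elapsed
    return ttft, throughput

# Synthetic example: first token at 123 ms, 64 tokens spread
# evenly until the 1.0 s mark -> 64 tokens/sec overall.
arrivals = [0.123 + i * (0.877 / 63) for i in range(64)]
ttft, tput = stream_metrics(0.0, arrivals)
print(round(ttft, 3), round(tput, 1))  # → 0.123 64.0
```

In a real benchmark these timestamps would be recorded per chunk while consuming a streaming completion endpoint, and the per-request numbers aggregated across batch sizes.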