Company
Baseten
Date Published
Author
Michael Feil, Philip Kiely
Word count
2035
Language
English
Hacker News points
None

Summary

We built Baseten Embedding Inference (BEI), an optimized inference runtime that leverages TensorRT-LLM to significantly boost throughput and minimize latency for embedding, reranker, and classification models. BEI outperforms the competition by a large margin, delivering double the throughput of previous industry standards for batch inference and lower latency for real-time queries. The core engine for BEI is TensorRT-LLM, which offers exceptional performance and consistent throughput without the risk of out-of-memory (OOM) errors. BEI has four main components: the model server, the tokenizer, the batch manager, and the TensorRT-LLM inference engine. The runtime benefits from engine-level optimizations such as the XQA kernel and layer fusion, as well as quantization to FP8, which provides a throughput gain of 50% or more while retaining >99% cosine similarity to the outputs of non-quantized models. BEI also supports traffic-based autoscaling, deployment across multiple clouds and regions, and reduced communication overhead between models.
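
To make the quantization-quality claim concrete, here is a minimal sketch of how the >99% cosine similarity check could be run. The vectors below are synthetic stand-ins: in practice, the two inputs would be the embeddings returned by the FP8-quantized and non-quantized deployments for the same text, and the deployment calls (not shown) are an assumption, not part of the article.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for the two model outputs. In a real check, both
# vectors would come from embedding the same input text with the
# non-quantized (reference) and FP8-quantized deployments.
rng = np.random.default_rng(0)
reference_embedding = rng.standard_normal(1024)                           # non-quantized output
fp8_embedding = reference_embedding + 0.01 * rng.standard_normal(1024)    # FP8-quantized output

score = cosine_similarity(reference_embedding, fp8_embedding)
print(f"cosine similarity: {score:.4f}")  # the article reports > 0.99 for FP8 vs. non-quantized
```

Running this comparison over a representative set of inputs and averaging the scores is one way to confirm that FP8 quantization preserves embedding quality while delivering the throughput gains described above.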