Company
Baseten
Date Published
Author
Michael Feil, Philip Kiely
Word count
2035
Language
English
Hacker News points
None

Summary

We built Baseten Embedding Inference (BEI), an optimized inference runtime that leverages TensorRT-LLM to significantly boost throughput and minimize latency for embedding, reranker, and classification models. BEI outperforms the competition by a large margin, delivering double the throughput of previous industry standards for batch inference and lower latency for real-time queries. The core engine for BEI is TensorRT-LLM, which offers exceptional performance and consistent throughput without the risk of out-of-memory (OOM) errors. BEI has four main components: the model server, the tokenizer, the batch manager, and the TensorRT-LLM inference engine. The runtime benefits from engine-level optimizations such as the XQA kernel and layer fusion, as well as quantization to FP8, which provides a throughput gain of 50% or more while retaining >99% cosine similarity to the outputs of non-quantized models. BEI also supports traffic-based autoscaling, deployment across multiple clouds and regions, and reduced communication overhead between models.
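
To make the quantization-quality claim concrete, here is a minimal sketch of how the >99% cosine similarity check could be run. The vectors below are synthetic stand-ins: in practice, the two inputs would be the embeddings returned by the FP8-quantized and non-quantized deployments for the same text, and the deployment calls (not shown) are an assumption, not part of the article.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for the two model outputs. In a real check, both
# vectors would come from embedding the same input text with the
# non-quantized (reference) and FP8-quantized deployments.
rng = np.random.default_rng(0)
reference_embedding = rng.standard_normal(1024)                           # non-quantized output
fp8_embedding = reference_embedding + 0.01 * rng.standard_normal(1024)    # FP8-quantized output

score = cosine_similarity(reference_embedding, fp8_embedding)
print(f"cosine similarity: {score:.4f}")  # the article reports > 0.99 for FP8 vs. non-quantized
```

Running this comparison over a representative set of inputs and averaging the scores is one way to confirm that FP8 quantization preserves embedding quality while delivering the throughput gains described above.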