Introducing automatic LLM optimization with TensorRT-LLM Engine Builder

Company

Baseten

Date Published

Aug. 1, 2024

Author

Abu Qader, Philip Kiely

Word count

939

Language

English

Hacker News points

URL

www.baseten.co/blog/automatic-llm-optimization-with-tensorrt-llm-engine-builder

Summary

The TensorRT-LLM Engine Builder automates building optimized model serving engines for open-source and fine-tuned large language models (LLMs), cutting a process that previously took hours of manual work down to minutes. It uses the TensorRT-LLM performance optimization toolbox to create inference servers with low latency and high throughput, and it supports over 50 LLMs and similar models. The Engine Builder is built into Truss, an open-source model packaging framework, and gives users full control over the model server, including autoscaling, logging, and metrics, along with secure and compliant inference. Depending on the user's goals, such as supporting many concurrent requests or minimizing latency, it can build inference engines optimized for latency, throughput, cost, or a balance of the three.
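
The trade-off between latency and throughput mentioned in the summary can be illustrated with a toy model. This is a minimal sketch, not the Engine Builder's logic or any TensorRT-LLM API: the `StepCost` class, its linear cost model, and all numbers are hypothetical assumptions chosen only to show why a larger batch size tends to raise throughput while also raising per-token latency.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepCost:
    """Hypothetical cost of one decoding step, in milliseconds (assumed linear)."""
    fixed_ms: float    # overhead paid once per step, regardless of batch size
    per_seq_ms: float  # extra time added by each sequence in the batch


def step_latency_ms(cost: StepCost, batch_size: int) -> float:
    """Time each request waits for its next token at this batch size."""
    return cost.fixed_ms + cost.per_seq_ms * batch_size


def throughput_tokens_per_s(cost: StepCost, batch_size: int) -> float:
    """Tokens produced per second across the whole batch."""
    return batch_size * 1000.0 / step_latency_ms(cost, batch_size)


def largest_batch_within(
    cost: StepCost, latency_budget_ms: float, candidates: Sequence[int]
) -> Optional[int]:
    """Pick the largest candidate batch size that stays within the latency budget."""
    feasible = [b for b in candidates if step_latency_ms(cost, b) <= latency_budget_ms]
    return max(feasible) if feasible else None


if __name__ == "__main__":
    cost = StepCost(fixed_ms=20.0, per_seq_ms=0.5)  # made-up numbers
    for batch in (1, 8, 16, 32, 64):
        print(
            f"batch={batch:>2}  "
            f"latency={step_latency_ms(cost, batch):.1f} ms/token  "
            f"throughput={throughput_tokens_per_s(cost, batch):.0f} tokens/s"
        )
    print("largest batch under 30 ms:", largest_batch_within(cost, 30.0, (1, 8, 16, 32, 64)))
```

Under these assumed costs, a latency-focused deployment would favor a small batch, while a throughput- or cost-focused one would favor the largest batch that still meets its latency budget. Real engines behave less linearly, so any actual choice should rest on measured benchmarks.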