Date Published
May 16, 2024
Author
Michael Louis
Word count
1410
Language
English

Summary

The tutorial walks the reader through using the TensorRT-LLM framework on the Cerebrium platform to serve the Llama 3 8B model, optimizing it for inference and achieving significant performance gains. The process involves setting up a Cerebrium account, installing the required packages, and writing the initial code to download the model, convert it to the TensorRT-LLM checkpoint format, build the engine, and deploy the application. The reader can achieve roughly 1,700 output tokens per second on a single Nvidia A10 instance, with further improvements possible through speculative sampling or FP8 quantization.
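The convert-and-build steps summarized above can be sketched with TensorRT-LLM's standard tooling. This is a minimal illustration, not the tutorial's exact code: the local directory names (`./llama-3-8b`, `./tllm_checkpoint`, `./engine`) are assumptions, and the conversion script path follows the layout of the TensorRT-LLM examples repository. Running it requires an NVIDIA GPU, the `tensorrt_llm` package, and the downloaded model weights.

```shell
# Fetch the TensorRT-LLM repo for the Llama checkpoint-conversion script
git clone https://github.com/NVIDIA/TensorRT-LLM.git

# Convert the downloaded Hugging Face weights (assumed to be in
# ./llama-3-8b) into the TensorRT-LLM checkpoint format
python TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir ./llama-3-8b \
    --output_dir ./tllm_checkpoint \
    --dtype float16

# Build the optimized inference engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16
```

The resulting engine directory is what the deployed Cerebrium application loads at inference time; rebuilding with FP8 quantization (on hardware that supports it) is one of the follow-on optimizations the summary mentions.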