You can now serve thousands of fine-tuned LLMs from a single GPU by swapping LoRAs (Low-Rank Adaptation adapters) with TensorRT-LLM on Baseten, while maintaining low time to first token (TTFT) and high tokens per second (TPS). This makes inference and model management efficient enough to serve many fine-tuned models from a single deployment.

LoRA swapping is compatible with in-flight batching, has no significant impact on latency, and scales to thousands of active and cached LoRAs. Adapters are cached either in GPU VRAM or in system memory, so load times range from effectively instant (already in VRAM) to about 2 milliseconds (loaded from system memory). At inference time, you specify which fine-tune to use with a three-part format consisting of task_id, weights, and config parameters, as sketched below.
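As a rough illustration of that three-part format, here is a minimal client-side sketch. The endpoint URL, header, payload field names (`lora_task_id`, `lora_weights`, `lora_config`), and the `generate` helper are all assumptions for illustration, not Baseten's or TensorRT-LLM's exact request schema:

```python
import requests

# Hypothetical endpoint and key; the real request shape may differ.
ENDPOINT = "https://model-<model_id>.api.baseten.co/production/predict"
API_KEY = "YOUR_API_KEY"

def generate(prompt: str, lora_request: dict | None = None) -> dict:
    """Send a generation request, optionally targeting a specific LoRA."""
    payload = {"prompt": prompt, "max_new_tokens": 256}
    if lora_request:
        payload.update(lora_request)  # merge in task_id / weights / config
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# First request for a given fine-tune: supply all three parts so the
# server can load the adapter into its LoRA cache.
first = generate(
    "Summarize this support ticket...",
    lora_request={
        "lora_task_id": 42,  # unique ID identifying this fine-tune
        "lora_weights": "<serialized adapter weights>",  # placeholder
        "lora_config": "<adapter config: target modules, rank>",  # placeholder
    },
)

# Subsequent requests: once the adapter is cached (in VRAM or system
# memory), the task_id alone should be enough to select it.
later = generate("Summarize another ticket...", lora_request={"lora_task_id": 42})
```

The design intent is that the weights and config travel with the first request only; after that, the cached adapter is addressed by its task_id, which is what keeps per-request overhead in the instant-to-2-millisecond range described above.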