You can now serve thousands of fine-tuned LLMs from a single GPU by swapping LoRAs (Low-Rank Adaptation adapters) with TensorRT-LLM on Baseten, while maintaining low time to first token (TTFT) and high tokens per second (TPS). This makes inference and model management efficient enough to serve many fine-tuned models from a single deployment.

LoRA swapping is compatible with in-flight batching, has no significant impact on latency, and scales to thousands of active and cached LoRAs. Adapters are cached either in GPU VRAM or in system memory, so load times range from effectively instant (already in VRAM) to about 2 milliseconds (loaded from system memory). At inference time, you specify which fine-tune to use with a three-part format consisting of task_id, weights, and config parameters, as sketched below.
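As a rough illustration of that three-part format, here is a minimal client-side sketch. The endpoint URL, header, payload field names (`lora_task_id`, `lora_weights`, `lora_config`), and the `generate` helper are all assumptions for illustration, not Baseten's or TensorRT-LLM's exact request schema:

```python
import requests

# Hypothetical endpoint and key; the real request shape may differ.
ENDPOINT = "https://model-<model_id>.api.baseten.co/production/predict"
API_KEY = "YOUR_API_KEY"

def generate(prompt: str, lora_request: dict | None = None) -> dict:
    """Send a generation request, optionally targeting a specific LoRA."""
    payload = {"prompt": prompt, "max_new_tokens": 256}
    if lora_request:
        payload.update(lora_request)  # merge in task_id / weights / config
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# First request for a given fine-tune: supply all three parts so the
# server can load the adapter into its LoRA cache.
first = generate(
    "Summarize this support ticket...",
    lora_request={
        "lora_task_id": 42,  # unique ID identifying this fine-tune
        "lora_weights": "<serialized adapter weights>",  # placeholder
        "lora_config": "<adapter config: target modules, rank>",  # placeholder
    },
)

# Subsequent requests: once the adapter is cached (in VRAM or system
# memory), the task_id alone should be enough to select it.
later = generate("Summarize another ticket...", lora_request={"lora_task_id": 42})
```

The design intent is that the weights and config travel with the first request only; after that, the cached adapter is addressed by its task_id, which is what keeps per-request overhead in the instant-to-2-millisecond range described above.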