Company
Cloudflare
Date Published
April 2, 2024
Author
Michelle Chen, Logan Grasby
Word count
2415
Language
English

Summary

Inference from fine-tuned LLMs with LoRAs is now in open beta on the Workers AI platform. Low-Rank Adaptation (LoRA) is a fine-tuning method that can be applied to various model architectures, not just LLMs. It keeps the fine-tuned weights separate from the pre-trained model, so the base model itself remains unchanged. Because the original base weights are never modified, new fine-tunes can be trained with relatively little time and compute. The small set of additional trained parameters produced by this process, referred to as a LoRA adapter, is far easier to distribute than a full set of model weights, and serving fine-tuned inference with LoRA adds only milliseconds to total inference time.
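
The summary doesn't include code, but as a rough illustration of the adapter idea, here is a minimal sketch of creating a LoRA adapter with the Hugging Face peft library. The library, model id, and hyperparameters are assumptions for illustration, not details from the post:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained base model; its weights stay frozen (model id is illustrative)
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Attach rank-8 LoRA adapters to the attention query/value projections
config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # adapter params are a tiny fraction of the base

# After training, save only the adapter; the base model on disk is untouched,
# which is what makes LoRA fine-tunes cheap to store and distribute
model.save_pretrained("./my-lora-adapter")
```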
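
To see why serving only adds milliseconds, it helps to look at the math behind a LoRA layer: the frozen weight matrix W is augmented with a low-rank product B·A, and only A and B are trained. A toy NumPy sketch, with illustrative dimensions:

```python
import numpy as np

d, k, r = 4096, 4096, 8           # layer dims and LoRA rank (illustrative)
W = np.random.randn(d, k)         # frozen pre-trained weight, never modified
A = np.random.randn(r, k) * 0.01  # adapter factor A: trained
B = np.zeros((d, r))              # adapter factor B: starts at zero, so the
                                  # adapter is a no-op before training
alpha = 16

def forward(x):
    # Base path plus low-rank correction: W x + (alpha / r) * B (A x).
    # The extra cost is two small matmuls, hence only ms of added latency.
    return W @ x + (alpha / r) * (B @ (A @ x))

y = forward(np.random.randn(k))

# The adapter holds r*(d+k) parameters vs d*k for the full matrix
print(A.size + B.size, "adapter params vs", W.size, "full-rank params")
```

With these numbers the adapter is about 65K parameters against roughly 16.8M for the full matrix, which is why the adapter can be trained and shipped independently of the base model.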