Running fine-tuned models on Workers AI with LoRAs
Inference from fine-tuned LLMs with LoRAs is now in open beta on the Workers AI platform. Low-Rank Adaptation (LoRA) is a fine-tuning method that can be applied to various model architectures, not just LLMs. It keeps the fine-tuned weights separate from the pre-trained model, so the base model itself remains unchanged. Because the original base weights are preserved, new fine-tuned weights can be trained with relatively little time and compute; these additional trained parameters are referred to as a LoRA adapter. Adapters are small and easy to distribute, and serving fine-tuned inference with LoRA adds only milliseconds of latency to total inference time.
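As a rough sketch of how this looks in practice, the Worker below runs inference against a LoRA-capable base model and passes a `lora` parameter naming a previously uploaded adapter. The model slug, adapter name, and prompt are illustrative placeholders, not values taken from the article.

```typescript
// Minimal Worker sketch: inference with a LoRA adapter on Workers AI.
// Assumes an AI binding named "AI" is configured for this Worker.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const result = await env.AI.run(
      "@cf/mistral/mistral-7b-instruct-v0.2-lora", // example LoRA-capable base model
      {
        prompt: "Summarize the quarterly report in three bullet points.",
        raw: true,               // skip the default chat template so the adapter's template can apply
        lora: "my-lora-adapter", // placeholder name of a previously uploaded LoRA adapter
      }
    );
    return Response.json(result);
  },
};
```

Because the adapter is referenced by name at request time, the same base model can serve many different fine-tunes without reloading or duplicating the base weights.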
Company
Cloudflare
Date published
April 2, 2024
Author(s)
Michelle Chen, Logan Grasby
Word count
2415
Language
English
Hacker News points
None found.