Company
Cloudflare
Date Published
April 2, 2024
Author
Michelle Chen, Logan Grasby
Word count
2415
Language
English

Summary

Inference from fine-tuned LLMs with LoRAs is now in open beta on the Workers AI platform. Low-Rank Adaptation (LoRA) is a fine-tuning method that can be applied to various model architectures, not just LLMs. It keeps the fine-tuned weights separate from the pre-trained model, so the base model itself remains unchanged. Because the original base weights are never modified, new fine-tunes can be trained with relatively little time and compute. The small set of additional trained parameters produced by this process, referred to as a LoRA adapter, is far easier to distribute than a full set of model weights, and serving fine-tuned inference with LoRA adds only milliseconds to total inference time.
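
The summary doesn't include code, but as a rough illustration of the adapter idea, here is a minimal sketch of creating a LoRA adapter with the Hugging Face peft library. The library, model id, and hyperparameters are assumptions for illustration, not details from the post:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained base model; its weights stay frozen (model id is illustrative)
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Attach rank-8 LoRA adapters to the attention query/value projections
config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # adapter params are a tiny fraction of the base

# After training, save only the adapter; the base model on disk is untouched,
# which is what makes LoRA fine-tunes cheap to store and distribute
model.save_pretrained("./my-lora-adapter")
```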
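
To see why serving only adds milliseconds, it helps to look at the math behind a LoRA layer: the frozen weight matrix W is augmented with a low-rank product B·A, and only A and B are trained. A toy NumPy sketch, with illustrative dimensions:

```python
import numpy as np

d, k, r = 4096, 4096, 8           # layer dims and LoRA rank (illustrative)
W = np.random.randn(d, k)         # frozen pre-trained weight, never modified
A = np.random.randn(r, k) * 0.01  # adapter factor A: trained
B = np.zeros((d, r))              # adapter factor B: starts at zero, so the
                                  # adapter is a no-op before training
alpha = 16

def forward(x):
    # Base path plus low-rank correction: W x + (alpha / r) * B (A x).
    # The extra cost is two small matmuls, hence only ms of added latency.
    return W @ x + (alpha / r) * (B @ (A @ x))

y = forward(np.random.randn(k))

# The adapter holds r*(d+k) parameters vs d*k for the full matrix
print(A.size + B.size, "adapter params vs", W.size, "full-rank params")
```

With these numbers the adapter is about 65K parameters against roughly 16.8M for the full matrix, which is why the adapter can be trained and shipped independently of the base model.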