Company
Date Published
Author
Yiren Lu
Word count
757
Language
English
Hacker News points
None

Summary

Fine-tuning large language models (LLMs) is computationally expensive, but techniques such as LoRA and QLoRA make it far more efficient by shrinking the number of parameters that must be updated. LoRA (Low-Rank Adaptation) freezes the pre-trained weights and trains small "adapter" matrices that represent a low-rank update to the base model, which requires significantly less VRAM than full fine-tuning. QLoRA (Quantized LoRA) goes further by quantizing the frozen base model weights to 4-bit precision while keeping the adapters in higher precision, cutting memory usage roughly 4x compared to standard LoRA. Both techniques can lead to some loss of knowledge relative to full fine-tuning, though QLoRA's quantization may actually reduce overfitting. Choosing between the two comes down to available hardware: LoRA is recommended when the model fits within 16GB of VRAM, while QLoRA suits smaller devices or setups with limited memory.
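For concreteness, here is a minimal sketch of how the two setups are commonly configured with the Hugging Face transformers, peft, and bitsandbytes libraries. The model name, rank, and target modules below are illustrative assumptions, not values from the article:

```python
# Illustrative LoRA vs. QLoRA setup (assumes transformers, peft, bitsandbytes are installed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical example model

# Shared adapter config: only these small low-rank matrices are trained.
lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# --- Plain LoRA: frozen base weights kept in 16-bit precision ---
base_fp16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
lora_model = get_peft_model(base_fp16, lora_config)

# --- QLoRA: frozen base weights quantized to 4-bit, adapters stay in higher precision ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls computed in bf16
    bnb_4bit_use_double_quant=True,
)
base_4bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
qlora_model = get_peft_model(base_4bit, lora_config)

# In both cases only the adapter parameters require gradients.
lora_model.print_trainable_parameters()
qlora_model.print_trainable_parameters()
```

In this sketch the adapter configuration is identical in both cases; the only difference is whether the frozen base model is loaded in 16-bit or 4-bit precision, which is where QLoRA's memory savings come from.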