Company
Date Published
Author
Yiren Lu
Word count
757
Language
English
Hacker News points
None

Summary

Fine-tuning large language models (LLMs) is computationally expensive, but techniques such as LoRA and QLoRA make it far more efficient by shrinking the number of parameters that must be updated. LoRA (Low-Rank Adaptation) freezes the pre-trained weights and trains small "adapter" matrices that represent a low-rank update to the base model, which requires significantly less VRAM than full fine-tuning. QLoRA (Quantized LoRA) goes further by quantizing the frozen base model weights to 4-bit precision while keeping the adapters in higher precision, cutting memory usage roughly 4x compared to standard LoRA. Both techniques can lead to some loss of knowledge relative to full fine-tuning, though QLoRA's quantization may actually reduce overfitting. Choosing between the two comes down to available hardware: LoRA is recommended when the model fits within 16GB of VRAM, while QLoRA suits smaller devices or setups with limited memory.
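For concreteness, here is a minimal sketch of how the two setups are commonly configured with the Hugging Face transformers, peft, and bitsandbytes libraries. The model name, rank, and target modules below are illustrative assumptions, not values from the article:

```python
# Illustrative LoRA vs. QLoRA setup (assumes transformers, peft, bitsandbytes are installed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical example model

# Shared adapter config: only these small low-rank matrices are trained.
lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# --- Plain LoRA: frozen base weights kept in 16-bit precision ---
base_fp16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
lora_model = get_peft_model(base_fp16, lora_config)

# --- QLoRA: frozen base weights quantized to 4-bit, adapters stay in higher precision ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls computed in bf16
    bnb_4bit_use_double_quant=True,
)
base_4bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
qlora_model = get_peft_model(base_4bit, lora_config)

# In both cases only the adapter parameters require gradients.
lora_model.print_trainable_parameters()
qlora_model.print_trainable_parameters()
```

In this sketch the adapter configuration is identical in both cases; the only difference is whether the frozen base model is loaded in 16-bit or 4-bit precision, which is where QLoRA's memory savings come from.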