A Guide to Quantization in LLMs
Quantization is a model compression technique that shrinks Large Language Models (LLMs) by converting their weights and activations from higher-precision to lower-precision data types (for example, from 32-bit floats to 8-bit integers), making the models more portable and scalable. This enables LLMs to run on a wider range of devices, including single GPUs or even CPUs, while reducing memory consumption, storage requirements, energy use, and inference time. Quantization techniques fall into two broad categories: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Popular LLM quantization methods include QLoRA, PRILoRA, GPTQ, GGML/GGUF, and AWQ. By lowering memory requirements and broadening the hardware on which models can run, these techniques help drive wider adoption of LLMs.
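To make the idea concrete, below is a minimal sketch of one simple form of post-training quantization: symmetric, per-tensor int8 quantization of a weight matrix. It is illustrative only and is not the specific algorithm used by QLoRA, GPTQ, GGUF, or AWQ; the function names and the toy weight matrix are assumptions for the example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8 (illustrative sketch)."""
    # Scale maps the largest absolute weight onto the int8 range [-127, 127].
    scale = max(np.max(np.abs(weights)), 1e-8) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for use at inference time."""
    return q.astype(np.float32) * scale

# Hypothetical example: quantize a small random weight matrix and measure the error.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)
print("max abs error:", np.max(np.abs(w - w_approx)))
```

The int8 tensor plus a single scale factor takes roughly a quarter of the memory of the original float32 weights, at the cost of a small, bounded rounding error; production methods refine this basic idea with per-channel or per-group scales and calibration data.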
Company: Symbl.ai
Date published: Feb. 21, 2024
Author(s): Kartik Talamadupula
Word count: 2505
Hacker News points: 3
Language: English