Even Better, Even Faster Quantized LLMs with QTIP
QTIP (Quantization with Trellises and Incoherence Processing) is a new weight-only post-training quantization method for LLMs that achieves state-of-the-art quality and inference speed. Building on the incoherence processing framework introduced by QuIP, QTIP compresses model weights with trellis coded quantization, which achieves lower distortion on i.i.d. Gaussian sources than the vector quantization used by QuIP#. This yields significantly better quality than QuIP# while remaining 3X faster than unquantized models at inference. QTIP's bitshift trellis and compute-based codes enable fast decoding for weight-only quantization, making it practical in memory-bound inference settings.
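To make the bitshift-trellis idea concrete, here is a minimal Python sketch, not QTIP's actual implementation: the names, constants, and hash-based code below are illustrative assumptions. The trellis state is a sliding L-bit window of the compressed bitstream; each step shifts in a few new bits, and a compute-based code maps the state to a reconstruction value, so decoding needs no large stored codebook.

```python
import numpy as np

# Toy bitshift-trellis decoder (illustrative only; not QTIP's real codes).
# State = last L bits of the bitstream; each step shifts in K new bits and
# a compute-based code hashes the state to a reconstruction value.

L = 16                 # trellis window (state) size in bits
K = 2                  # bits consumed per decoded weight (rate = K bits/weight)
MASK = (1 << L) - 1

def compute_based_code(state: int) -> float:
    """Hash the L-bit state to a pseudo-random, roughly Gaussian value."""
    h = (state * 0x9E3779B1) & 0xFFFFFFFF          # cheap integer hash
    byte_vals = [(h >> s) & 0xFF for s in (0, 8, 16, 24)]
    # Sum of bytes is roughly Gaussian by the central limit theorem;
    # center and scale to approximately zero mean, unit variance.
    return (sum(byte_vals) - 4 * 127.5) / 147.8

def decode(symbols: np.ndarray, n_weights: int) -> np.ndarray:
    """Decode n_weights values from a stream of K-bit symbols."""
    state = 0
    out = np.empty(n_weights, dtype=np.float32)
    for i in range(n_weights):
        # Shift K new bits into the window; old bits fall off the top.
        state = ((state << K) | int(symbols[i])) & MASK
        out[i] = compute_based_code(state)
    return out

# Example: decode 8 weights from 8 two-bit symbols.
symbols = np.array([3, 0, 2, 1, 3, 3, 0, 2], dtype=np.uint8)
print(decode(symbols, len(symbols)))
```

Because adjacent states share L-K bits, nearby decoded weights are constrained to overlapping codewords, which is what lets trellis coded quantization approach the distortion of much larger vector quantizers while keeping decoding sequential and cheap.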
Company
Together AI
Date published
Oct. 30, 2024
Author(s)
Albert Tseng, Qingyao Sun, David Hou, Chris De Sa
Word count
3170
Language
English