Even Better, Even Faster Quantized LLMs with QTIP

What's this blog post about?

QTIP (Quantization with Trellises and Incoherence Processing) is a new weight-only post-training quantization method for LLMs that achieves state-of-the-art quality and inference speed. It compresses model weights with trellis coded quantization, significantly improving on QuIP#'s quality while running roughly 3X faster than unquantized models. QTIP builds on the incoherence processing framework introduced by QuIP and uses trellis coded quantization to achieve lower distortion on i.i.d. Gaussian sources than vector quantization. Its bitshift trellis and compute-based codes enable fast decoding, making QTIP practical for memory-bound, weight-only inference settings.
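To illustrate the idea behind a bitshift trellis, here is a minimal, hypothetical sketch: each trellis state is the last L bits of the compressed stream, advancing one step shifts a new bit into the state, and a cheap "compute-based code" (here, a seeded pseudo-random table standing in for QTIP's actual codes) maps each state to a reconstructed weight. The function name, parameters, and codebook construction are illustrative assumptions, not QTIP's real implementation.

```python
import numpy as np

def decode_bitshift_trellis(bits, L=8, seed=0):
    """Illustrative bitshift-trellis decoder (not QTIP's actual code).

    The state is the window of the last L bits; each step shifts in
    one new bit, so consecutive states overlap in L-1 bits and the
    stream needs only 1 stored bit per decoded weight.
    """
    rng = np.random.default_rng(seed)
    # Stand-in compute-based code: a fixed pseudo-random Gaussian
    # value per L-bit state (a real system would compute this on
    # the fly instead of storing a 2**L table).
    codebook = rng.standard_normal(2 ** L)
    state = 0
    out = []
    for i, b in enumerate(bits):
        state = ((state << 1) | int(b)) & (2 ** L - 1)
        if i >= L - 1:  # state window fully populated
            out.append(codebook[state])
    return np.array(out)

# Decode a 64-bit stream: yields 64 - (L - 1) = 57 weights.
bits = np.random.default_rng(1).integers(0, 2, 64)
weights = decode_bitshift_trellis(bits)
print(weights.shape)
```

Because decoding is a shift plus a cheap per-state computation, it maps well onto GPU kernels for memory-bound inference, where the cost of reconstructing weights must stay far below the cost of loading them.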

Company
Together AI

Date published
Oct. 30, 2024

Author(s)
Albert Tseng, Qingyao Sun, David Hou, Chris De Sa

Word count
3170

Language
English

Hacker News points
None found.


By Matt Makai. 2021-2024.