Even Better, Even Faster Quantized LLMs with QTIP
QTIP (Quantization with Trellises and Incoherence Processing) is a new weight-only post-training quantization method for LLMs that achieves state-of-the-art quality and inference speed. Building on the incoherence processing framework introduced by QuIP, QTIP compresses model weights with trellis coded quantization, which achieves lower distortion on i.i.d. Gaussian sources than the vector quantization used by QuIP#. This yields significantly better quality than QuIP# while remaining 3X faster than unquantized models at inference. QTIP's bitshift trellis and compute-based codes enable fast decoding for weight-only quantization, making it practical in memory-bound inference settings.
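To make the bitshift-trellis idea concrete, here is a minimal Python sketch, not QTIP's actual implementation: the names, constants, and hash-based code below are illustrative assumptions. The trellis state is a sliding L-bit window of the compressed bitstream; each step shifts in a few new bits, and a compute-based code maps the state to a reconstruction value, so decoding needs no large stored codebook.

```python
import numpy as np

# Toy bitshift-trellis decoder (illustrative only; not QTIP's real codes).
# State = last L bits of the bitstream; each step shifts in K new bits and
# a compute-based code hashes the state to a reconstruction value.

L = 16                 # trellis window (state) size in bits
K = 2                  # bits consumed per decoded weight (rate = K bits/weight)
MASK = (1 << L) - 1

def compute_based_code(state: int) -> float:
    """Hash the L-bit state to a pseudo-random, roughly Gaussian value."""
    h = (state * 0x9E3779B1) & 0xFFFFFFFF          # cheap integer hash
    byte_vals = [(h >> s) & 0xFF for s in (0, 8, 16, 24)]
    # Sum of bytes is roughly Gaussian by the central limit theorem;
    # center and scale to approximately zero mean, unit variance.
    return (sum(byte_vals) - 4 * 127.5) / 147.8

def decode(symbols: np.ndarray, n_weights: int) -> np.ndarray:
    """Decode n_weights values from a stream of K-bit symbols."""
    state = 0
    out = np.empty(n_weights, dtype=np.float32)
    for i in range(n_weights):
        # Shift K new bits into the window; old bits fall off the top.
        state = ((state << K) | int(symbols[i])) & MASK
        out[i] = compute_based_code(state)
    return out

# Example: decode 8 weights from 8 two-bit symbols.
symbols = np.array([3, 0, 2, 1, 3, 3, 0, 2], dtype=np.uint8)
print(decode(symbols, len(symbols)))
```

Because adjacent states share L-K bits, nearby decoded weights are constrained to overlapping codewords, which is what lets trellis coded quantization approach the distortion of much larger vector quantizers while keeping decoding sequential and cheap.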
Company
Together AI
Date published
Oct. 30, 2024
Author(s)
Albert Tseng, Qingyao Sun, David Hou, Chris De Sa
Word count
3170
Language
English