The authors propose distilling large-scale Transformer models into hybrid linear RNNs such as Mamba, preserving much of the teacher's generative quality while substantially improving inference efficiency. The hybrid design combines the strengths of both architectures: because linear RNN layers carry a fixed-size recurrent state instead of a key-value cache that grows with sequence length, the distilled models generate with lower memory use and higher throughput than a pure Transformer. The authors demonstrate the effectiveness of the method on several benchmarks, including the OpenLLM Leaderboard, where the distilled hybrid models outperform open-source models of comparable scale in both quality and efficiency. They also propose a speculative decoding scheme to further accelerate inference for these models.
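
To make the distillation setup concrete, here is a minimal sketch of logit-level knowledge distillation, assuming a frozen Transformer `teacher` and a hybrid attention/Mamba `student` that share a tokenizer and expose HuggingFace-style `.logits` outputs. The names and the plain KL objective are illustrative simplifications, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, input_ids, optimizer, temperature=2.0):
    """One distillation step: match the student's next-token distribution
    to the teacher's via KL divergence on temperature-softened logits."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # frozen teacher forward
    student_logits = student(input_ids).logits

    # KL(teacher || student) on softened distributions; the T**2 factor
    # keeps gradient magnitudes comparable across temperature settings.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```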
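
For the speculative decoding component, the sketch below shows the general draft-and-verify idea in its simplest greedy form, with a small `draft` model proposing tokens and the distilled hybrid acting as `verifier`. It illustrates why acceptance of drafted tokens speeds up generation; it is not the paper's hardware-aware algorithm for managing linear RNN states, and the function names are assumptions.

```python
import torch

@torch.no_grad()
def speculative_step(draft, verifier, prefix, k=4):
    """Draft k tokens greedily, verify them with one parallel pass of the
    large model, and keep the longest agreeing prefix plus one correction."""
    tokens = prefix.clone()
    # 1) Draft k candidate tokens autoregressively with the cheap model.
    for _ in range(k):
        logits = draft(tokens).logits[:, -1, :]
        tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2) Score all k candidates in a single forward pass of the verifier;
    # logits at position i predict the token at position i + 1.
    verifier_logits = verifier(tokens).logits
    preds = verifier_logits[:, prefix.shape[1] - 1:-1, :].argmax(-1)
    drafted = tokens[:, prefix.shape[1]:]

    # 3) Accept the longest matching prefix of drafted tokens, then append
    # the verifier's own token at the first mismatch (or its next token
    # if all k drafts were accepted).
    agree = (preds == drafted).cumprod(dim=-1)
    n_accept = int(agree.sum())
    next_tok = verifier_logits[:, prefix.shape[1] - 1 + n_accept, :].argmax(
        -1, keepdim=True
    )
    return torch.cat([prefix, drafted[:, :n_accept], next_tok], dim=-1)
```

Each call advances the sequence by between one and k + 1 tokens for roughly one verifier forward pass, which is where the speedup comes from when the draft model agrees with the verifier often.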