How to double tokens per second for Llama 3 with Medusa
Medusa is a technique for generating multiple tokens per forward pass during LLM inference by grafting additional decoding heads ("Medusa heads") onto the base model, which can double the tokens per second of an LLM deployment. After training and validating the Medusa heads, the modified LLM can be served in production with TensorRT-LLM. In a benchmark, Medusa doubled the tokens per second of Llama 3 8B running on an A100 in FP16 with no other major optimizations in place. However, it is crucial to validate output quality before deploying a model with Medusa to production.
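To make the core idea concrete, here is a minimal toy sketch (not the article's implementation) of how Medusa heads work: each head is a small extra projection over the base model's final hidden state that predicts a token further ahead, so one forward pass yields several candidate tokens instead of one. All names, sizes, and the random weights below are illustrative assumptions; a real deployment would train the heads and verify candidates against the base model before accepting them.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, NUM_HEADS = 32, 16, 3  # toy sizes; real Medusa setups often use ~4 heads

# Hypothetical stand-in weights; in practice these come from the trained model.
W_lm = rng.standard_normal((HIDDEN, VOCAB))  # base LM head: predicts the next token
# Each Medusa head predicts the token one position further ahead than the last
# (head 0 -> second-next token, head 1 -> third-next token, and so on).
medusa_heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(NUM_HEADS)]

def propose(hidden_state: np.ndarray) -> list[int]:
    """One forward pass over the final hidden state yields 1 + NUM_HEADS candidates."""
    next_tok = int(np.argmax(hidden_state @ W_lm))          # base model's next token
    extra = [int(np.argmax(hidden_state @ W)) for W in medusa_heads]  # lookahead tokens
    return [next_tok] + extra

# Pretend this is the base model's hidden state at the last sequence position.
h = rng.standard_normal(HIDDEN)
candidates = propose(h)
print(candidates)  # 1 + NUM_HEADS candidate token ids from a single forward pass
```

In a full system, the candidate tokens are verified in a single batched forward pass of the base model, and only the longest prefix the base model agrees with is accepted, so output quality depends on how well the heads were trained, which is why the article stresses validating quality before production.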
Company: Baseten
Date published: Aug. 20, 2024
Author(s): Abu Qader, Philip Kiely
Word count: 1462
Language: English
Hacker News points: 2