How to double tokens per second for Llama 3 with Medusa

What's this blog post about?

Medusa is a technique for generating multiple tokens per forward pass during LLM inference, which can double the tokens per second of an LLM deployment. Medusa works by grafting additional decoding heads ("Medusa heads") onto the base model; after these heads are trained and validated, the modified LLM can be deployed to production with TensorRT-LLM. In a benchmark, Medusa doubled the tokens per second of Llama 3 8B running on an A100 in FP16, with no other major optimizations in place. However, it is crucial to validate output quality before deploying a model with Medusa to production.
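To illustrate the idea in the summary above, here is a minimal, purely illustrative sketch of Medusa-style decoding. It is not the real Medusa architecture (real Medusa heads are small learned layers on the base model's hidden states, and verification happens in a single batched forward pass); the stand-in model, head, and step functions below are all hypothetical simplifications showing how several proposed tokens can be accepted per decoding step.

```python
# Toy sketch of Medusa-style decoding (illustrative only, not the real
# Medusa architecture). A "base model" greedily predicts the next token;
# "Medusa heads" cheaply guess several future tokens from the same state,
# and a verification pass accepts the longest prefix the base model agrees with.

VOCAB = 17

def base_next_token(seq):
    """Stand-in for the base LLM's greedy next-token prediction."""
    return (sum(seq) * 3 + len(seq)) % VOCAB

def medusa_heads(seq, n_heads=3):
    """Stand-in for Medusa heads: cheaply propose n_heads future tokens.
    Here they happen to simulate the base model, so proposals often match."""
    draft = list(seq)
    proposals = []
    for _ in range(n_heads):
        tok = base_next_token(draft)  # a real head is a small learned layer
        proposals.append(tok)
        draft.append(tok)
    return proposals

def medusa_step(seq):
    """One decoding step: propose with the heads, verify against the base
    model, and accept the matching prefix (always emitting at least 1 token)."""
    proposals = medusa_heads(seq)
    accepted = []
    draft = list(seq)
    for tok in proposals:
        expected = base_next_token(draft)  # done in one batched pass in practice
        if tok != expected:
            accepted.append(expected)      # keep the corrected token, then stop
            break
        accepted.append(tok)
        draft.append(tok)
    return seq + accepted

seq = [1, 2, 3]
out = medusa_step(seq)
print(out)  # several tokens emitted from a single decoding step
```

Because the toy heads perfectly imitate the toy base model, every proposal is accepted and one step emits three tokens instead of one; in a real deployment, acceptance rates depend on how well the trained heads match the base model's distribution, which is why the post stresses validating output quality.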

Company
Baseten

Date published
Aug. 20, 2024

Author(s)
Abu Qader, Philip Kiely

Word count
1462

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.