How to double tokens per second for Llama 3 with Medusa
Medusa is a technique for generating multiple tokens per forward pass during LLM inference by grafting additional decoding heads ("Medusa heads") onto the base model, which can double the tokens per second of an LLM deployment. After training and validating the Medusa heads, the modified LLM can be served in production with TensorRT-LLM. In a benchmark, Medusa doubled the tokens per second of Llama 3 8B running on an A100 in FP16 with no other major optimizations in place. However, it is crucial to validate output quality before deploying a model with Medusa to production.
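To make the core idea concrete, here is a minimal toy sketch (not the article's implementation) of how Medusa heads work: each head is a small extra projection over the base model's final hidden state that predicts a token further ahead, so one forward pass yields several candidate tokens instead of one. All names, sizes, and the random weights below are illustrative assumptions; a real deployment would train the heads and verify candidates against the base model before accepting them.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, NUM_HEADS = 32, 16, 3  # toy sizes; real Medusa setups often use ~4 heads

# Hypothetical stand-in weights; in practice these come from the trained model.
W_lm = rng.standard_normal((HIDDEN, VOCAB))  # base LM head: predicts the next token
# Each Medusa head predicts the token one position further ahead than the last
# (head 0 -> second-next token, head 1 -> third-next token, and so on).
medusa_heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(NUM_HEADS)]

def propose(hidden_state: np.ndarray) -> list[int]:
    """One forward pass over the final hidden state yields 1 + NUM_HEADS candidates."""
    next_tok = int(np.argmax(hidden_state @ W_lm))          # base model's next token
    extra = [int(np.argmax(hidden_state @ W)) for W in medusa_heads]  # lookahead tokens
    return [next_tok] + extra

# Pretend this is the base model's hidden state at the last sequence position.
h = rng.standard_normal(HIDDEN)
candidates = propose(h)
print(candidates)  # 1 + NUM_HEADS candidate token ids from a single forward pass
```

In a full system, the candidate tokens are verified in a single batched forward pass of the base model, and only the longest prefix the base model agrees with is accepted, so output quality depends on how well the heads were trained, which is why the article stresses validating quality before production.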
Company: Baseten
Date published: Aug. 20, 2024
Author(s): Abu Qader, Philip Kiely
Word count: 1462
Language: English
Hacker News points: 2