Company: Lambda
Date Published:
Author: Chuan Li
Word count: 934
Language: English
Hacker News points: None

Summary

This blog post discusses the release of FlashAttention-2, a new algorithm that accelerates the attention module at the heart of Transformer models. Building on its predecessor's success, FlashAttention-2 delivers roughly a 2x speedup over the original FlashAttention through improved parallelism and work partitioning. The authors show how to use FlashAttention-2 on Lambda Cloud and share benchmark results for training GPT-3-style models on NVIDIA A100 and H100 Tensor Core GPUs. The results demonstrate a 3x or higher speedup over the baseline implementation, with the H100 80GB SXM5 delivering more than 2x the tokens/sec of the A100 80GB SXM4. The authors also examine how FlashAttention-2 scales across multiple GPUs and estimate the time to solution for training larger models such as GPT3-175B. Overall, the release is a promising advance in accelerating attention, translating into better performance and lower cost for machine learning applications.
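
The post itself walks through running FlashAttention-2 on Lambda Cloud; the exact steps are not reproduced in this summary, so the snippet below is only a minimal sketch of calling FlashAttention-2 directly through the public flash-attn 2.x Python package (the tensor shapes and the flash_attn_func signature come from that package's documented API, not from the blog).

```python
# Minimal sketch: fused attention via the flash-attn 2.x package (pip install flash-attn).
# Assumes a CUDA GPU and fp16/bf16 inputs, as required by the FlashAttention-2 kernels.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 2048, 16, 64

# Inputs are laid out as (batch, seqlen, nheads, headdim).
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)

# Computes softmax(q @ k^T / sqrt(headdim)) @ v without materializing
# the full (seqlen x seqlen) attention matrix; causal=True applies a GPT-style mask.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```

In practice, frameworks used for GPT-style pretraining typically expose FlashAttention-2 behind a config flag rather than requiring direct calls like the one above; the low-level call is shown here only to make the drop-in nature of the kernel concrete.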