Company
Date Published
July 11, 2024
Author
Jay Shah (Colfax Research), Ganesh Bikshandi (Colfax Research), Ying Zhang (Meta), Vijay Thakkar (NVIDIA), Pradeep Ramani (NVIDIA), Tri Dao (Princeton University, Together AI)
Word count
1753
Language
English
Hacker News points
287

Summary

FlashAttention-3 is a new version of the FlashAttention algorithm designed to speed up the attention mechanism in large language models by leveraging the capabilities of modern NVIDIA Hopper GPUs. With FP16 it is 1.5-2.0x faster than its predecessor, FlashAttention-2, reaching up to 740 TFLOPS, or 75% of the H100 GPU's theoretical maximum; with FP8 it reaches close to 1.2 PFLOPS while maintaining accuracy. Built on powerful abstractions from NVIDIA's CUTLASS library, the new algorithm incorporates three main techniques: exploiting the asynchrony of the Tensor Cores and TMA to overlap computation and data movement via warp-specialization; interleaving block-wise matmul (GEMM) and softmax operations so that the softmax work is hidden behind the asynchronous GEMMs; and incoherent processing that leverages Hopper's hardware support for FP8 low precision. Together, these optimizations enable more efficient GPU utilization, better performance at lower precision, and the ability to use longer context in large language models.
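
To make the "interleaving block-wise matmul and softmax" idea concrete, here is a minimal NumPy sketch of block-wise attention with an online (running) softmax, the core structure FlashAttention builds on. It is only an illustration of the general technique, not FlashAttention-3's CUDA kernel; the function name `blockwise_attention` and its `block_size` parameter are hypothetical.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    """q, k, v: (N, d) arrays. Returns softmax(q @ k.T / sqrt(d)) @ v,
    computed one key/value tile at a time so the full N x N score matrix
    is never materialized."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale                  # block-wise matmul (GEMM)
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)     # rescale earlier partial results
        p = np.exp(s - new_max[:, None])           # block-wise softmax numerator
        out = out * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against a naive reference implementation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
ref = np.exp((q @ k.T) / np.sqrt(64))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), ref)
```

On Hopper, FlashAttention-3 goes further by pipelining these two steps, running the softmax of one tile while the Tensor Cores asynchronously compute the GEMM of the next, which is what the warp-specialization point in the summary refers to.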
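
The "incoherent processing" point can also be illustrated with a toy example: multiplying Q and K by the same random orthogonal matrix leaves Q·Kᵀ mathematically unchanged but spreads outlier features across dimensions, so low-precision rounding loses less accuracy. The sketch below uses a crude uniform quantizer and an explicit orthogonal matrix as stand-ins for FP8 rounding and the fast transform used in practice; the helper names are hypothetical and the numbers are only indicative.

```python
import numpy as np

def fake_quantize(x, levels=256):
    """Crude symmetric uniform quantizer, a stand-in for FP8 rounding."""
    scale = np.abs(x).max() / (levels / 2 - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
q = rng.standard_normal((128, 64))
k = rng.standard_normal((128, 64))
q[:, 0] *= 50.0   # inject an outlier feature, as often seen in LLM activations
k[:, 0] *= 50.0

# Shared random orthogonal matrix: (Q M)(K M)^T == Q K^T exactly.
m = np.linalg.qr(rng.standard_normal((64, 64)))[0]

exact = q @ k.T
err_plain = np.abs(fake_quantize(q) @ fake_quantize(k).T - exact).mean()
err_rotated = np.abs(fake_quantize(q @ m) @ fake_quantize(k @ m).T - exact).mean()
print(f"mean abs error without rotation: {err_plain:.3f}")
print(f"mean abs error with rotation:    {err_rotated:.3f}")  # typically noticeably smaller
```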