Company
Date Published
July 11, 2024
Author
Jay Shah (Colfax Research), Ganesh Bikshandi (Colfax Research), Ying Zhang (Meta), Vijay Thakkar (NVIDIA), Pradeep Ramani (NVIDIA), Tri Dao (Princeton University, Together AI)
Word count
1753
Language
English
Hacker News points
287

Summary

FlashAttention-3 is a new version of the FlashAttention algorithm designed to speed up the attention mechanism in large language models by leveraging the capabilities of modern NVIDIA Hopper GPUs. With FP16 it is 1.5-2.0x faster than its predecessor, FlashAttention-2, reaching up to 740 TFLOPS, or 75% of the H100 GPU's theoretical maximum; with FP8 it reaches close to 1.2 PFLOPS while maintaining accuracy. Built on powerful abstractions from NVIDIA's CUTLASS library, the new algorithm incorporates three main techniques: exploiting the asynchrony of the Tensor Cores and TMA to overlap computation and data movement via warp-specialization; interleaving block-wise matmul (GEMM) and softmax operations so that the softmax work is hidden behind the asynchronous GEMMs; and incoherent processing that leverages Hopper's hardware support for FP8 low precision. Together, these optimizations enable more efficient GPU utilization, better performance at lower precision, and the ability to use longer context in large language models.
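
To make the "interleaving block-wise matmul and softmax" idea concrete, here is a minimal NumPy sketch of block-wise attention with an online (running) softmax, the core structure FlashAttention builds on. It is only an illustration of the general technique, not FlashAttention-3's CUDA kernel; the function name `blockwise_attention` and its `block_size` parameter are hypothetical.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    """q, k, v: (N, d) arrays. Returns softmax(q @ k.T / sqrt(d)) @ v,
    computed one key/value tile at a time so the full N x N score matrix
    is never materialized."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale                  # block-wise matmul (GEMM)
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)     # rescale earlier partial results
        p = np.exp(s - new_max[:, None])           # block-wise softmax numerator
        out = out * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against a naive reference implementation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
ref = np.exp((q @ k.T) / np.sqrt(64))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), ref)
```

On Hopper, FlashAttention-3 goes further by pipelining these two steps, running the softmax of one tile while the Tensor Cores asynchronously compute the GEMM of the next, which is what the warp-specialization point in the summary refers to.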
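
The "incoherent processing" point can also be illustrated with a toy example: multiplying Q and K by the same random orthogonal matrix leaves Q·Kᵀ mathematically unchanged but spreads outlier features across dimensions, so low-precision rounding loses less accuracy. The sketch below uses a crude uniform quantizer and an explicit orthogonal matrix as stand-ins for FP8 rounding and the fast transform used in practice; the helper names are hypothetical and the numbers are only indicative.

```python
import numpy as np

def fake_quantize(x, levels=256):
    """Crude symmetric uniform quantizer, a stand-in for FP8 rounding."""
    scale = np.abs(x).max() / (levels / 2 - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
q = rng.standard_normal((128, 64))
k = rng.standard_normal((128, 64))
q[:, 0] *= 50.0   # inject an outlier feature, as often seen in LLM activations
k[:, 0] *= 50.0

# Shared random orthogonal matrix: (Q M)(K M)^T == Q K^T exactly.
m = np.linalg.qr(rng.standard_normal((64, 64)))[0]

exact = q @ k.T
err_plain = np.abs(fake_quantize(q) @ fake_quantize(k).T - exact).mean()
err_rotated = np.abs(fake_quantize(q @ m) @ fake_quantize(k @ m).T - exact).mean()
print(f"mean abs error without rotation: {err_plain:.3f}")
print(f"mean abs error with rotation:    {err_rotated:.3f}")  # typically noticeably smaller
```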