Yaofu Liu, Harry Yang
This post summarizes our ongoing exploration of INT4 FlashAttention (FA) for large-scale video diffusion models, focusing on what works today, what breaks, and what we believe is the right direction forward.
FlashAttention is already the de-facto standard for efficient attention. However, attention remains the dominant bottleneck in large video diffusion models: spatio-temporal token sequences are extremely long, and attention cost grows quadratically with sequence length.
Low-bit quantization is the most promising path to another order-of-magnitude speedup, but attention is numerically fragile—especially after softmax.
Our goal is simple to state but hard to achieve:
Use INT4 for as much FlashAttention as possible, without breaking video quality.
The core computation of FlashAttention can be written compactly as:
$$ S_{ij} = Q_i K_j^\top / \sqrt{d}\\ P_{ij} = \exp(S_{ij} - m_i)\\ O_{i} = \frac{\sum_j P_{ij} V_{j}}{\sum_j \exp(S_{ij} - m_i)} $$
where $Q_i$, $K_j$, and $V_j$ are query, key, and value blocks, $d$ is the head dimension, $m_i$ is the running row-wise maximum that keeps the softmax numerically stable, and $O_i$ is the output block.
Thus, the natural quantization targets are the inputs of the two matrix multiplications: $Q_i K_j^\top$ (quantizing Q and K) and $P_{ij} V_j$ (quantizing P and V), as illustrated in the sketch below.
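To make these targets concrete, here is a minimal NumPy sketch of the equations above, with a fake-quantization helper marking where Q/K and P/V would be quantized. The per-tensor symmetric round-to-nearest INT4 scheme, the function names, and the shapes are illustrative assumptions only; they are not the kernels or the quantization scheme discussed in this post.

```python
import numpy as np

def int4_fake_quant(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT4 quantize -> dequantize (integer range [-8, 7]).
    Illustrative assumption, not the scheme used in our kernels."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7)
    return (q * scale).astype(x.dtype)

def attention_reference(Q, K, V, quant_qk=False, quant_pv=False):
    """Dense reference for the S/P/O equations above (no tiling).

    Q, K, V: (N, d) arrays for one head. m_i is the row-wise max of S,
    which FlashAttention maintains online; here it is computed directly.
    """
    d = Q.shape[-1]
    if quant_qk:                          # quantize the inputs of the Q K^T GEMM
        Q, K = int4_fake_quant(Q), int4_fake_quant(K)
    S = Q @ K.T / np.sqrt(d)              # S_ij
    m = S.max(axis=-1, keepdims=True)     # m_i (numerical stability)
    P = np.exp(S - m)                     # P_ij
    l = P.sum(axis=-1, keepdims=True)     # sum_j exp(S_ij - m_i), kept in FP32
    if quant_pv:                          # quantize the inputs of the P V GEMM
        P, V = int4_fake_quant(P), int4_fake_quant(V)
    return (P @ V) / l                    # O_i

# Rough error check: INT4 on the QK^T GEMM only vs. on both GEMMs.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)).astype(np.float32) for _ in range(3))
ref = attention_reference(Q, K, V)
err_qk = np.abs(attention_reference(Q, K, V, quant_qk=True) - ref).max()
err_both = np.abs(attention_reference(Q, K, V, quant_qk=True, quant_pv=True) - ref).max()
print(f"max |dO|, INT4 QK: {err_qk:.4f}; INT4 QK + PV: {err_both:.4f}")
```

A real kernel fuses these steps block by block and keeps the softmax statistics ($m_i$ and the row sums) in higher precision; the sketch only mirrors the dataflow of the equations so the two quantization points are easy to see.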