Yaofu Liu, Harry Yang

This post summarizes our ongoing exploration of INT4 FlashAttention (FA) for large-scale video diffusion models, focusing on what works today, what breaks, and what we believe is the right direction forward.


0. Motivation: Why INT4 FlashAttention?

FlashAttention is already the de facto standard for efficient attention. However, attention remains the dominant bottleneck in large video diffusion models: video tokens span many frames at high spatial resolution, so sequence lengths are extremely long and the quadratic attention cost dominates runtime.

Low-bit quantization is the most promising path to another order-of-magnitude speedup, but attention is numerically fragile—especially after softmax.

Our goal is simple to state but hard to achieve:

Use INT4 for as much of the FlashAttention computation as possible, without breaking video quality.


1. Background: FlashAttention Computation

The core computation of FlashAttention can be written compactly as:

$$
\begin{aligned}
S_{ij} &= \frac{Q_i K_j^\top}{\sqrt{d}} \\
P_{ij} &= \exp(S_{ij} - m_i) \\
O_i &= \frac{\sum_j P_{ij} V_j}{\sum_j \exp(S_{ij} - m_i)}
\end{aligned}
$$

where:

- $Q_i$, $K_j$, $V_j$ are the query, key, and value tiles processed block by block,
- $d$ is the head dimension,
- $m_i$ is the running row-wise maximum used for numerically stable softmax,
- $S_{ij}$ are the attention scores, $P_{ij}$ the un-normalized probabilities, and $O_i$ the output tile.

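To make the tiled recurrence concrete, here is a minimal, non-optimized PyTorch sketch of the forward pass. The block sizes, single-head layout, and function name are illustrative assumptions, not the actual kernel configuration; the point is that all heavy compute sits in the two inner-loop matmuls, `Qi @ Kj.T` and `P_ij @ Vj`.

```python
# Reference sketch of the tiled FlashAttention forward pass (not the real kernel).
import torch

def flash_attention_reference(Q, K, V, block_m=64, block_n=64):
    """Q, K, V: [seq_len, head_dim] for a single head."""
    seq_len, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)

    for i0 in range(0, seq_len, block_m):
        Qi = Q[i0:i0 + block_m]                          # query tile
        m_i = torch.full((Qi.shape[0],), float("-inf"))  # running row max
        l_i = torch.zeros(Qi.shape[0])                   # running softmax denominator
        acc = torch.zeros_like(Qi)                       # unnormalized output accumulator

        for j0 in range(0, seq_len, block_n):
            Kj = K[j0:j0 + block_n]
            Vj = V[j0:j0 + block_n]

            S_ij = (Qi @ Kj.T) * scale                   # S_ij = Q_i K_j^T / sqrt(d)
            m_new = torch.maximum(m_i, S_ij.max(dim=-1).values)
            P_ij = torch.exp(S_ij - m_new[:, None])      # P_ij = exp(S_ij - m_i)

            alpha = torch.exp(m_i - m_new)               # rescale earlier partial results
            l_i = alpha * l_i + P_ij.sum(dim=-1)
            acc = alpha[:, None] * acc + P_ij @ Vj
            m_i = m_new

        O[i0:i0 + block_m] = acc / l_i[:, None]          # final normalization

    return O

# Sanity check against the naive attention implementation.
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(flash_attention_reference(q, k, v), ref, atol=1e-4)
```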
Thus, quantization targets are naturally: