Yaofu Liu, Harry Yang

This post summarizes our ongoing exploration of INT4 FlashAttention (FA) for large-scale video diffusion models, focusing on what works today, what breaks, and what we believe is the right direction forward.


0. Motivation: Why INT4 FlashAttention?

FlashAttention is already the de facto standard for efficient attention. However, attention remains the dominant bottleneck in large video diffusion models: video tokens span many frames at high spatial resolution, so sequence lengths are extremely long and the quadratic attention cost dominates runtime.

Low-bit quantization is the most promising path to another order-of-magnitude speedup, but attention is numerically fragile—especially after softmax.

Our goal is simple to state but hard to achieve:

Use INT4 for as much of the FlashAttention computation as possible, without breaking video quality.


1. Background: FlashAttention Computation

The core computation of FlashAttention can be written compactly as:

$$
\begin{aligned}
S_{ij} &= \frac{Q_i K_j^\top}{\sqrt{d}} \\
P_{ij} &= \exp(S_{ij} - m_i) \\
O_i &= \frac{\sum_j P_{ij} V_j}{\sum_j \exp(S_{ij} - m_i)}
\end{aligned}
$$

where:

- $Q_i$, $K_j$, $V_j$ are the query, key, and value tiles processed block by block,
- $d$ is the head dimension,
- $m_i$ is the running row-wise maximum used for numerically stable softmax,
- $S_{ij}$ are the attention scores, $P_{ij}$ the un-normalized probabilities, and $O_i$ the output tile.

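To make the tiled recurrence concrete, here is a minimal, non-optimized PyTorch sketch of the forward pass. The block sizes, single-head layout, and function name are illustrative assumptions, not the actual kernel configuration; the point is that all heavy compute sits in the two inner-loop matmuls, `Qi @ Kj.T` and `P_ij @ Vj`.

```python
# Reference sketch of the tiled FlashAttention forward pass (not the real kernel).
import torch

def flash_attention_reference(Q, K, V, block_m=64, block_n=64):
    """Q, K, V: [seq_len, head_dim] for a single head."""
    seq_len, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)

    for i0 in range(0, seq_len, block_m):
        Qi = Q[i0:i0 + block_m]                          # query tile
        m_i = torch.full((Qi.shape[0],), float("-inf"))  # running row max
        l_i = torch.zeros(Qi.shape[0])                   # running softmax denominator
        acc = torch.zeros_like(Qi)                       # unnormalized output accumulator

        for j0 in range(0, seq_len, block_n):
            Kj = K[j0:j0 + block_n]
            Vj = V[j0:j0 + block_n]

            S_ij = (Qi @ Kj.T) * scale                   # S_ij = Q_i K_j^T / sqrt(d)
            m_new = torch.maximum(m_i, S_ij.max(dim=-1).values)
            P_ij = torch.exp(S_ij - m_new[:, None])      # P_ij = exp(S_ij - m_i)

            alpha = torch.exp(m_i - m_new)               # rescale earlier partial results
            l_i = alpha * l_i + P_ij.sum(dim=-1)
            acc = alpha[:, None] * acc + P_ij @ Vj
            m_i = m_new

        O[i0:i0 + block_m] = acc / l_i[:, None]          # final normalization

    return O

# Sanity check against the naive attention implementation.
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(flash_attention_reference(q, k, v), ref, atol=1e-4)
```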
Thus, quantization targets are naturally: