Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic
attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of
attention blocks. However, prior methods often either drop the remaining blocks or rely on learned predictors to approximate them, introducing training overhead and potential
output distribution shifting.
In this paper, we show that the missing contributions can be recovered without training: after
semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized
by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free
linear compensation branch that uses these centroids to approximate skipped blocks and recover their
contributions.
contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset.
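The centroid compensation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, shapes, and the single-query formulation are assumptions. The key point is that when a cluster's keys and values are summarized by a centroid, the centroid's softmax weight must be scaled by the cluster size.

```python
import numpy as np

def centroid_compensation(q, k, v, labels, n_clusters):
    """Approximate a skipped attention block's output for one query
    using cluster centroids of its keys/values (hypothetical sketch).

    q: (d,) query; k, v: (n, d) keys/values of the skipped block;
    labels: (n,) cluster assignments from a prior semantic clustering.
    """
    d = q.shape[0]
    # Summarize keys and values by their per-cluster centroids.
    k_c = np.stack([k[labels == c].mean(axis=0) for c in range(n_clusters)])
    v_c = np.stack([v[labels == c].mean(axis=0) for c in range(n_clusters)])
    sizes = np.array([(labels == c).sum() for c in range(n_clusters)])
    # Each centroid stands in for `sizes[c]` keys, so its softmax
    # weight is multiplied by the cluster size.
    logits = k_c @ q / np.sqrt(d)
    w = sizes * np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ v_c
```

When keys and values within each cluster are identical, this recovers exact attention; the reconstruction error grows with intra-cluster spread, which is what ties the guarantees to clustering quality.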
Standard sparsification typically selects blocks by attention scores, which indicate where the model places
its attention mass, but not where the approximation error would be largest.
SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the
compensation error for each block, and the blocks with the highest error-to-cost ratio are computed
exactly while the rest are compensated. We provide theoretical guarantees that relate attention
reconstruction error to clustering quality, and empirically show that SVG-EAR improves the
quality-efficiency trade-off, achieving up to 1.77× and
1.93× speedups while maintaining PSNRs of up to 29.759 and
31.043 on Wan2.2 and HunyuanVideo, respectively.
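The error-aware routing step can be sketched as a simple budgeted selection. Everything here is a hypothetical illustration: the probe's error estimates, per-block costs, and the greedy ratio rule are assumptions about how "highest error-to-cost ratio under a compute budget" might be realized.

```python
import numpy as np

def route_blocks(est_error, cost, budget):
    """Select which attention blocks to compute exactly: rank blocks by
    estimated compensation error per unit compute cost, then spend the
    budget greedily on the highest ratios (hypothetical sketch).

    est_error: (B,) probe's error estimate per block;
    cost: (B,) compute cost per block; budget: total exact-compute budget.
    Returns the indices of blocks to compute exactly; the remaining
    blocks receive centroid compensation instead.
    """
    ratio = est_error / cost
    order = np.argsort(-ratio)          # highest error-to-cost first
    exact, spent = [], 0.0
    for b in order:
        if spent + cost[b] <= budget:   # greedily fill the budget
            exact.append(int(b))
            spent += cost[b]
    return sorted(exact)
```

This contrasts with score-based selection: a block with modest attention mass but poor centroid fit can out-rank a high-mass block that the centroids already approximate well.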