SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-Aware Routing

1 University of California, Berkeley
*Indicates Equal Contribution

arXiv:2603.08982

Gallery - All videos are generated by SVG-EAR!

TL;DR

SVG-EAR accelerates video Diffusion Transformers by up to 1.93× (PSNR ≈ 31) using parameter-free linear compensation and error-aware routing to maintain high generation fidelity without any training overhead.

SVG-EAR teaser: speedup vs PSNR on Wan2.2 and HunyuanVideo

SVG-EAR significantly accelerates video generation for Wan2.2 and HunyuanVideo, achieving 1.81× and 1.93× speedups at PSNRs of 26 and 30, respectively, on a single NVIDIA H100 GPU.

Full attention and sparse attention remain visually close across multiple Wan2.2 and HunyuanVideo samples, while the difference maps stay near-white for most regions.

Full Attention and SVG-EAR evolution clips appear as two sliding strips, with smooth progress bars and a final count comparison for the same playback window.

Overview

Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks or rely on learned predictors to approximate them, introducing training overhead and potentially shifting the output distribution.

In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses these centroids to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest.

SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off, achieving up to 1.77× and 1.93× speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
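The centroid compensation described above can be sketched in a few lines of NumPy. The head dimension, block sizes, and cluster tightness below are illustrative assumptions, not values from the paper; the point is that when a skipped block's keys and values sit near their centroids, a single centroid logit scaled by the block size recovers almost all of the block's contribution, whereas dropping the block does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy head dimension (assumption)

# Two key/value "blocks" for a single query: block A is computed exactly,
# block B is a tight semantic cluster that will be skipped and compensated.
q = rng.normal(size=d)
K_a, V_a = rng.normal(size=(8, d)), rng.normal(size=(8, d))
K_b = rng.normal(size=d) + 0.01 * rng.normal(size=(8, d))  # keys near one centroid
V_b = rng.normal(size=d) + 0.01 * rng.normal(size=(8, d))  # values near one centroid

# Reference: exact softmax attention over both blocks.
K, V = np.vstack([K_a, K_b]), np.vstack([V_a, V_b])
w = np.exp(K @ q / np.sqrt(d))
ref = w @ V / w.sum()

# Parameter-free compensation: block B collapses to a single centroid logit,
# weighted by the block size, plus a scalar-vector product with the value mean.
w_a = np.exp(K_a @ q / np.sqrt(d))
w_b = len(K_b) * np.exp(K_b.mean(0) @ q / np.sqrt(d))
approx = (w_a @ V_a + w_b * V_b.mean(0)) / (w_a.sum() + w_b)

# Dropping block B instead (what score-only sparse methods do).
dropped = w_a @ V_a / w_a.sum()

err_comp = np.linalg.norm(approx - ref)
err_drop = np.linalg.norm(dropped - ref)
print(err_comp, err_drop)  # compensation should track the reference far better
```

Because the block's keys are nearly identical, the per-key logits are nearly equal and the weighted value sum collapses to the centroid product, so the compensation error scales with the intra-cluster spread rather than with the block's attention mass.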

Limitations of Current Sparse Attention Algorithms

Previous works exploit the inherent sparsity in attention maps to accelerate DiTs. They first cluster semantically similar tokens and permute them so that tokens within each cluster are laid out contiguously, turning the attention matrix into a block structure. Then, only a subset of blocks is computed exactly while the rest are ignored.
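The clustering-and-permutation step can be illustrated with a toy example (group count, dimensions, and similarity scale below are arbitrary assumptions, not the actual pipeline): sorting tokens by cluster label concentrates the large attention entries into contiguous diagonal blocks.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, g = 24, 8, 3                           # tokens, dim, semantic groups (toy)
labels = rng.integers(0, g, size=n)          # groups interleaved in token order
X = rng.normal(size=(g, d))[labels] + 0.1 * rng.normal(size=(n, d))
A = np.exp(X @ X.T / np.sqrt(d))             # toy attention scores

perm = np.argsort(labels, kind="stable")     # lay out each cluster contiguously
Ap = A[perm][:, perm]

def diag_block_mass(M, sizes):
    # Fraction of total mass inside the diagonal blocks given by `sizes`.
    total, inside, o = M.sum(), 0.0, 0
    for s in sizes:
        inside += M[o:o + s, o:o + s].sum()
        o += s
    return inside / total

sizes = np.bincount(labels, minlength=g)
frac_orig = diag_block_mass(A, sizes)
frac_perm = diag_block_mass(Ap, sizes)
print(frac_orig, frac_perm)  # permutation concentrates mass into diagonal blocks
```

After the permutation, a block-sparse kernel only needs to decide per block whether to compute, drop, or compensate, which is the decision the rest of the method is about.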

However, existing methods face two fundamental problems:

Information Loss from Dropping Blocks. Existing sparse methods select blocks based on approximated attention scores and simply ignore low-score blocks. However, low-score blocks can still collectively carry important global context (e.g., background consistency, long-range semantic coupling). Naively discarding them incurs non-trivial information loss and perceptible quality degradation.

Score-Based Routing is Misaligned with Compensation Error. Once a linear compensation branch is introduced, conventional score-based routing becomes misaligned with the objective of controlling the final approximation error. A high-score block may be highly coherent within its cluster and thus well approximated by its centroid. Conversely, a low-score block can contain diverse key-value interactions where centroid-based compensation induces substantial error.

Motivation: existing methods fail due to dropping low-score blocks and error-unaware selection
Figure: Existing methods fail — dropping "low-score" blocks and error-unaware block selection degrade the error-density trade-off. (a) Original attention map. (b) Permuted map after semantic-aware clustering. (c) Ignoring low-score blocks causes a large, sparse attention error. (d) Linear compensation with cluster means still yields high error due to naive top-p selection. (e) Our method improves both error and density by routing based on the gap between full computation and compensation.

Methodology: SVG-EAR

SVG-EAR introduces parameter-free linear compensation combined with error-aware routing to achieve a superior quality-efficiency trade-off.

Algorithm Overview

  1. Semantic Clustering: Queries and keys are clustered using flash k-means to rearrange tokens so that semantically similar ones are positioned contiguously. This process partitions the attention map into a block structure where tokens within each cluster exhibit high inner-block similarity.
  2. Error-Aware Routing: Instead of selecting blocks based on attention scores, this method identifies blocks where centroid-based approximation would fail. It uses a lightweight probe to estimate the compensation error for each block, using query cluster centroids as proxies to reduce complexity from quadratic to near-linear. Blocks with the highest error-to-size ratio are greedily selected for exact computation under a fixed density budget.
  3. Linear Compensation: For blocks not selected for exact attention, the method applies a parameter-free linear branch that uses cluster centroids to recover lost contributions. Within these blocks, individual keys are replaced by their respective centroids, reducing the interaction to a single logit and a scalar-vector product. This approach mitigates information loss without requiring additional training or learned predictors.
Methodology overview of SVG-EAR
Figure: Static overview of the SVG-EAR pipeline, including semantic clustering, error-aware routing, and centroid-based linear compensation.
Video: Animated overview of SVG-EAR. It walks through the initial attention map, block zoom-in, probe-based attention generation, estimated vs. true error computation, dynamic error-aware selection, and the final rendered attention map.
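The three numbered steps above can be sketched end to end in NumPy. Everything here is a toy stand-in: cluster assignments are assumed given rather than produced by flash k-means, block sizes are uniform, and the probe simply measures the centroid-compensation residual at each query centroid. The construction makes half of the key clusters perfectly tight, so score-based top-p routing spends its budget on blocks that compensation already handles well, while error-aware routing spends it where compensation actually fails.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, C = 64, 16, 8      # tokens, head dim, key clusters (toy sizes, assumption)
m = n // C               # tokens per cluster

# Toy clustered keys: the first C//2 key clusters are perfectly tight (every
# key equals its centroid, so compensation is exact there); the rest are
# diffuse, so centroid compensation is lossy precisely there.
centers = rng.normal(size=(C, d))
k_lab = np.repeat(np.arange(C), m)
K = centers[k_lab].copy()
K[k_lab >= C // 2] += 2.0 * rng.normal(size=(n // 2, d))
V = rng.normal(size=(n, d))

# Queries cluster around the tight centroids (standing in for k-means output),
# so attention mass is drawn toward blocks that need no exact compute.
Cq = C // 2
q_lab = np.repeat(np.arange(Cq), n // Cq)
Q = centers[q_lab] + 0.1 * rng.normal(size=(n, d))
q_cent = np.stack([Q[q_lab == i].mean(0) for i in range(Cq)])

def block_exact(q, Kj, Vj):              # exact block contribution (num, den)
    w = np.exp(Kj @ q / np.sqrt(d))
    return w @ Vj, w.sum()

def block_comp(q, Kj, Vj):               # centroid compensation: one logit
    w = len(Kj) * np.exp(Kj.mean(0) @ q / np.sqrt(d))
    return w * Vj.mean(0), w

def assemble(selected):                  # selected: q-cluster -> exact blocks
    out = np.zeros((n, d))
    for i in range(n):
        num, den = np.zeros(d), 0.0
        for j in range(C):
            f = block_exact if j in selected[q_lab[i]] else block_comp
            nu, de = f(Q[i], K[k_lab == j], V[k_lab == j])
            num, den = num + nu, den + de
        out[i] = num / den
    return out

ref = assemble({i: set(range(C)) for i in range(Cq)})   # full attention

# Probe: estimate each block's compensation error at its query centroid only.
est = np.zeros((Cq, C))
score = np.zeros((Cq, C))
for i in range(Cq):
    for j in range(C):
        ne, de = block_exact(q_cent[i], K[k_lab == j], V[k_lab == j])
        nc, dc = block_comp(q_cent[i], K[k_lab == j], V[k_lab == j])
        est[i, j] = np.linalg.norm(ne - nc) + abs(de - dc)
        score[i, j] = dc                 # attention-mass proxy (top-p style)

budget = C // 2                          # exact blocks per query cluster
ear = {i: set(np.argsort(-est[i])[:budget]) for i in range(Cq)}
top = {i: set(np.argsort(-score[i])[:budget]) for i in range(Cq)}

mse_ear = np.mean((assemble(ear) - ref) ** 2)
mse_top = np.mean((assemble(top) - ref) ** 2)
print(mse_ear, mse_top)  # error-aware routing should sit closer to full attention
```

In this toy, score-based routing always spends one slot on the high-mass tight block matching each query cluster, leaving at least one diffuse block compensated, while error-aware routing computes exactly the diffuse blocks where the centroid approximation breaks down.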

Quantitative Evaluation

We evaluate SVG-EAR on Wan2.2-I2V/T2V-A14B and HunyuanVideo-T2V-13B at 720p resolution, comparing against SpargeAttn, SVG, and SVG2. SVG-EAR consistently outperforms all baselines across PSNR, LPIPS, and SSIM while simultaneously achieving the highest speedup.

Wan 2.2 14B — 720P Image-to-Video

| Config | PSNR↑ | SSIM↑ | LPIPS↓ | ImgQual↑ | SubCons↑ | Density↓ | FLOP↓ | Speedup↑ |
|---|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | 0.704 | 0.960 | 100% | 658.46 PF | – |
| SpargeAttn | 27.140 | 0.883 | 0.116 | 0.703 | 0.958 | 30.15% | 396.83 PF | 1.58× |
| SVG | 25.297 | 0.844 | 0.139 | 0.703 | 0.958 | 30.25% | 397.20 PF | 1.58× |
| SVG2 | 27.668 | 0.888 | 0.117 | 0.701 | 0.958 | 29.38% | 393.95 PF | 1.61× |
| SVG-EAR | 29.759 | 0.918 | 0.093 | 0.704 | 0.959 | 23.64% | 378.88 PF | 1.61× |
| SVG-EAR-Turbo | 28.344 | 0.900 | 0.108 | 0.702 | 0.958 | 20.42% | 363.85 PF | 1.77× |

Wan 2.2 14B — 720P Text-to-Video

| Config | PSNR↑ | SSIM↑ | LPIPS↓ | ImgQual↑ | SubCons↑ | Density↓ | FLOP↓ | Speedup↑ |
|---|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | 0.706 | 0.916 | 100% | 658.46 PF | – |
| SpargeAttn | 20.872 | 0.708 | 0.242 | 0.708 | 0.916 | 30.15% | 396.83 PF | 1.58× |
| SVG | 19.455 | 0.654 | 0.292 | 0.712 | 0.912 | 30.25% | 397.20 PF | 1.59× |
| SVG2 | 23.556 | 0.802 | 0.183 | 0.705 | 0.914 | 32.30% | 404.88 PF | 1.57× |
| SVG-EAR | 24.995 | 0.841 | 0.153 | 0.706 | 0.915 | 25.95% | 387.53 PF | 1.59× |
| SVG-EAR-Turbo | 23.940 | 0.814 | 0.174 | 0.705 | 0.915 | 22.25% | 370.71 PF | 1.75× |

HunyuanVideo 13B — 720P Text-to-Video

| Config | PSNR↑ | SSIM↑ | LPIPS↓ | ImgQual↑ | SubCons↑ | Density↓ | FLOP↓ | Speedup↑ |
|---|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | 0.665 | 0.904 | 100% | 612.38 PF | – |
| SpargeAttn | 24.589 | 0.796 | 0.232 | 0.629 | 0.908 | 40.09% | 389.76 PF | 1.38× |
| SVG | 27.325 | 0.880 | 0.140 | 0.665 | 0.905 | 29.92% | 351.97 PF | 1.57× |
| SVG2 | 29.445 | 0.911 | 0.112 | 0.654 | 0.901 | 26.21% | 299.02 PF | 1.89× |
| SVG-EAR | 31.043 | 0.928 | 0.092 | 0.659 | 0.903 | 22.17% | 281.86 PF | 1.93× |

Efficiency Evaluation

SVG-EAR's routing overhead is negligible in practice, accounting for only 6.5% of total end-to-end latency on Wan2.2 T2V 720p inference. This efficiency is driven by a custom Triton kernel that achieves up to 13.74× speedup over the native PyTorch implementation via fused streaming error estimation.

Generation latency breakdown: SVG-EAR vs baselines on Wan2.2 T2V 720p
(a) Generation latency breakdown during a single Wan2.2 T2V 720p inference, compared with full attention and SVG2.
Efficiency of SVG-EAR custom Triton kernel vs PyTorch baseline
(b) End-to-end latency comparison between our efficient Triton implementation and the native PyTorch version — up to 13.74× speedup.

Error Analysis

To assess the quality-efficiency trade-off, we compare the Mean Squared Error (MSE) of three strategies against full attention: top-p selection (SVG2), top-p with linear compensation, and our proposed Error-Aware Routing with linear compensation (SVG-EAR). While standard score-based selection often suffers from information loss, SVG-EAR prioritizes blocks with the highest potential approximation error for exact computation, producing attention outputs closest to the full-attention reference.

Empirical analysis validates the link between reconstruction error and clustering quality. As the number of query clusters ($Q_C$) increases, the average squared clustering error ($\delta_q^2$) and attention MSE decline in sync. This confirms that better semantic clustering directly enhances accuracy, making error estimation both predictable and theoretically sound. Thus, SVG-EAR provides a robust, high-fidelity acceleration mechanism for video diffusion.
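The clustering-quality side of this trend is easy to reproduce in miniature. The sketch below runs plain Lloyd k-means with random restarts on stand-in tokens (sizes, iteration counts, and data are arbitrary assumptions, not the paper's setup) and reports the average squared distance to the assigned centroid as a proxy for $\delta_q^2$, which shrinks as the number of query clusters grows.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(256, 16))   # stand-in for flattened query tokens (toy)

def clustering_error(X, c, iters=15, restarts=5):
    """Best mean squared distance to the nearest centroid (Lloyd + restarts)."""
    best = np.inf
    for _ in range(restarts):
        cent = X[rng.choice(len(X), c, replace=False)]
        for _ in range(iters):
            lab = ((X[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
            cent = np.stack([X[lab == k].mean(0) if (lab == k).any() else cent[k]
                             for k in range(c)])
        d2 = ((X[:, None] - cent[None]) ** 2).sum(-1).min(1).mean()
        best = min(best, d2)
    return best

deltas = [clustering_error(X, c) for c in (1, 2, 4, 8, 16)]
print(deltas)   # average squared clustering error shrinks as clusters are added
```

Under the paper's bound, the attention MSE of centroid compensation is controlled by exactly this quantity, which is why the two curves in panel (b) fall together.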

Attention MSE vs Density: error-aware routing outperforms top-p selection
(a) Attention MSE vs. Density. SVG-EAR's error-aware routing yields the attention map most consistent with full attention.
Clustering error and attention MSE decrease as Q cluster count increases
(b) Clustering error δq² and attention MSE both decrease as the number of Q clusters increases, validating our theoretical bound.

BibTeX

@article{zhou2026svgear,
  title={SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing},
  author={Xuanyi Zhou and Qiuyang Mang and Shuo Yang and Haocheng Xi and Jintao Zhang and Huanzhi Mao and Joseph E. Gonzalez and Kurt Keutzer and Ion Stoica and Alvin Cheung},
  journal={arXiv preprint arXiv:2603.08982},
  year={2026},
}