Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic
attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of
attention blocks. However, prior methods often either drop the remaining blocks or rely on learned predictors to approximate them, introducing training overhead and potential
output distribution shifting.
In this paper, we show that the missing contributions can be recovered without training: after
semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized
by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free
linear compensation branch that uses these centroids to approximate skipped blocks and recover their
contributions.
contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset.
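The centroid compensation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, shapes, and the single-query formulation are assumptions. The key point is that when a cluster's keys and values are summarized by a centroid, the centroid's softmax weight must be scaled by the cluster size.

```python
import numpy as np

def centroid_compensation(q, k, v, labels, n_clusters):
    """Approximate a skipped attention block's output for one query
    using cluster centroids of its keys/values (hypothetical sketch).

    q: (d,) query; k, v: (n, d) keys/values of the skipped block;
    labels: (n,) cluster assignments from a prior semantic clustering.
    """
    d = q.shape[0]
    # Summarize keys and values by their per-cluster centroids.
    k_c = np.stack([k[labels == c].mean(axis=0) for c in range(n_clusters)])
    v_c = np.stack([v[labels == c].mean(axis=0) for c in range(n_clusters)])
    sizes = np.array([(labels == c).sum() for c in range(n_clusters)])
    # Each centroid stands in for `sizes[c]` keys, so its softmax
    # weight is multiplied by the cluster size.
    logits = k_c @ q / np.sqrt(d)
    w = sizes * np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ v_c
```

When keys and values within each cluster are identical, this recovers exact attention; the reconstruction error grows with intra-cluster spread, which is what ties the guarantees to clustering quality.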
Standard sparsification typically selects blocks by attention scores, which indicate where the model places
its attention mass, but not where the approximation error would be largest.
SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the
compensation error for each block, and the blocks with the highest error-to-cost ratio are computed
exactly while the rest are compensated. We provide theoretical guarantees that relate attention
reconstruction error to clustering quality, and empirically show that SVG-EAR improves the
quality-efficiency trade-off, achieving up to 1.77× and
1.93× speedups while maintaining PSNRs of up to 29.759 and
31.043 on Wan2.2 and HunyuanVideo, respectively.
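The error-aware routing step can be sketched as a simple budgeted selection. Everything here is a hypothetical illustration: the probe's error estimates, per-block costs, and the greedy ratio rule are assumptions about how "highest error-to-cost ratio under a compute budget" might be realized.

```python
import numpy as np

def route_blocks(est_error, cost, budget):
    """Select which attention blocks to compute exactly: rank blocks by
    estimated compensation error per unit compute cost, then spend the
    budget greedily on the highest ratios (hypothetical sketch).

    est_error: (B,) probe's error estimate per block;
    cost: (B,) compute cost per block; budget: total exact-compute budget.
    Returns the indices of blocks to compute exactly; the remaining
    blocks receive centroid compensation instead.
    """
    ratio = est_error / cost
    order = np.argsort(-ratio)          # highest error-to-cost first
    exact, spent = [], 0.0
    for b in order:
        if spent + cost[b] <= budget:   # greedily fill the budget
            exact.append(int(b))
            spent += cost[b]
    return sorted(exact)
```

This contrasts with score-based selection: a block with modest attention mass but poor centroid fit can out-rank a high-mass block that the centroids already approximate well.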