Sparse VideoGen:
Accelerating Video Generation with Spatial-Temporal Sparse Attention by 2x with High Pixel Fidelity

1 University of California, Berkeley 2 Massachusetts Institute of Technology 3 NVIDIA 4 Tsinghua University
*Indicates Equal Contribution

Accepted at ICML 2025

TL;DR

Announcing Sparse VideoGen, a training-free method that accelerates video DiTs by 2× while maintaining high pixel fidelity (PSNR = 29). The secret? Unleashing spatial and temporal sparsity in 3D Full Attention. Dive into our paper and code to see the magic.

Sparse VideoGen maintains high pixel fidelity.

Sparse VideoGen achieves 2× speedup in video generation.


Overview

In the field of video generation, the latest and best-performing Video Diffusion Transformer models all employ 3D Full Attention. However, their substantial computational demands pose significant challenges for real-world applications. For example, HunyuanVideo takes about 30 minutes to generate a 5-second video on a single H100 GPU, which is prohibitively slow due to the O(n^2) computation of 3D Full Attention.

To speed up their inference, we introduce Sparse VideoGen (SVG), a training-free framework that leverages the inherent spatial and temporal sparsity in 3D Full Attention. Sparse VideoGen's core contributions include:

  • Identifying the spatial and temporal sparsity patterns in video diffusion models.
  • Proposing an Online Profiling Strategy to dynamically identify these patterns.
  • Implementing an end-to-end generation framework through efficient algorithm-system co-design, with a hardware-efficient layout transformation and customized kernels.

We evaluate Sparse VideoGen with HunyuanVideo and CogVideoX on an H100 with CUDA 12.8 and torch 2.5.1. Results show that Sparse VideoGen achieves up to 2× speedup while maintaining high pixel fidelity (up to PSNR = 29).
Overview Figure
Figure: Sparse VideoGen accelerates HunyuanVideo inference through algorithm-system co-design.

3D Full Attention is Extremely Slow

State-of-the-art video DiT models (such as HunyuanVideo and CogVideoX) adopt a 3D Full Attention mechanism to capture complex spatial and temporal dependencies in video data. This approach offers better generation quality than factorized 2D spatial + 1D temporal attention.

However, since the computational complexity of attention grows quadratically with the context length, inference becomes excessively slow. For example, HunyuanVideo takes 29 minutes to generate a 5-second 720p video on a single H100 GPU, with attention operations consuming over 80% of the runtime. This computational bottleneck motivates us to explore efficient attention mechanisms that accelerate DiTs while maintaining high generation quality and pixel fidelity.
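To see why, here is a rough back-of-the-envelope sketch in Python; the ~120k-token context matches the figure below, while the head dimension of 128 is an illustrative assumption.

n = 120_000        # context length in tokens -- matches the ~120k figure below
head_dim = 128     # per-head hidden size -- illustrative assumption

# Standard attention cost per head per layer: 2*n^2*d FLOPs for QK^T plus 2*n^2*d for PV.
flops_per_head = 4 * n * n * head_dim
print(f"{flops_per_head / 1e12:.1f} TFLOPs per head per layer")   # ~7.4 TFLOPs

# Doubling the context length quadruples the cost, which is why long video contexts
# are dominated by attention rather than by the linear-in-n MLP layers.
print(4 * (2 * n) ** 2 * head_dim / flops_per_head)               # 4.0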

Attention Portion Figure
Figure: The attention module in HunyuanVideo (120k context length) and CogVideoX (45k context length) takes >80% of the runtime.

Unveiling Inherent Sparsity in 3D Full Attention

We observe two distinct sparsity patterns in the attention maps of video diffusion models: spatial sparsity and temporal sparsity. Moreover, most attention heads can be distinctly classified into one of these two categories: in HunyuanVideo, 29.2% of attention heads are spatial, 66.7% are temporal, and only 4.1% are ambiguous. Check out our demo to understand our findings!

Spatial Head, Temporal Head, and Layout Transformation to speed up attention.

The Spatial Head focuses on spatially local tokens

The Spatial Head focuses on spatially local tokens within the same frame and adjacent frames, resulting in a block-wise layout of the attention map. Since the pixels of a single frame are tokenized into a contiguous sequence, the Spatial Head attends to tokens corresponding to neighboring pixels, concentrating the attention mask around the main diagonal. The Spatial Head is essential for maintaining spatial consistency and overall quality in the generated videos.

Spatial Head Figure
Figure: The attention map of a Spatial Head exhibits a block-wise pattern concentrated around the main diagonal.
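
To make the pattern concrete, here is a minimal sketch of a frame-local mask for a Spatial Head, assuming a frame-major token layout and an illustrative window of one adjacent frame; it is not our optimized kernel, just a reference construction.

import torch

def spatial_head_mask(num_frames: int, tokens_per_frame: int, window: int = 1) -> torch.Tensor:
    """Boolean (n, n) mask keeping only same-frame and adjacent-frame token pairs."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame            # frame index of each token
    # Keeping (q, k) pairs whose frames are within `window` of each other yields the
    # block-wise pattern around the main diagonal described above.
    return (frame_id[:, None] - frame_id[None, :]).abs() <= window

mask = spatial_head_mask(num_frames=8, tokens_per_frame=16)    # toy sizes
print(mask.float().mean())                                     # fraction of entries kept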

The Temporal Head focuses on the same token across frames

The Temporal Head is designed to capture relationships between tokens across different frames, facilitating the modeling of temporal dependencies. It employs a slash-wise layout with a constant stride, targeting tokens at consistent spatial locations over time. This mechanism is crucial for ensuring temporal consistency in the generated video sequences.

Temporal Head Figure
Figure: The attention map of a Temporal Head exhibits a slash-wise pattern with a constant stride.
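
A matching sketch for the Temporal Head, under the same frame-major layout assumption: keeping tokens that share a spatial position produces the slash-wise diagonals, with a stride equal to the number of tokens per frame.

import torch

def temporal_head_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean (n, n) mask keeping token pairs at the same spatial position across frames."""
    n = num_frames * tokens_per_frame
    pos_in_frame = torch.arange(n) % tokens_per_frame          # spatial position of each token
    # In the frame-major layout, equal positions recur every `tokens_per_frame` tokens,
    # so this mask is a set of slash-wise diagonals with a constant stride.
    return pos_in_frame[:, None] == pos_in_frame[None, :]

mask = temporal_head_mask(num_frames=8, tokens_per_frame=16)   # toy sizes
print(mask.float().mean())                                     # density = 1 / tokens_per_frame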

Oracle selection of sparsity patterns for each head

While the Spatial and Temporal Heads individually address spatial and temporal consistency, combining them well is essential for near-lossless video generation. Assigning each attention head the mask that yields the lower mean squared error (MSE) against full attention achieves a PSNR above 28 dB, indicating near-lossless performance.
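
Here is a reference sketch of this oracle selection for a single head, assuming boolean (n, n) masks like the ones sketched above; it recomputes full attention for clarity, which is exactly the cost the next section removes.

import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """Plain softmax attention with an optional boolean mask (True = keep)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

def oracle_pattern(q, k, v, spatial_mask, temporal_mask) -> str:
    """Pick the mask whose output is closest (lowest MSE) to full attention."""
    full = masked_attention(q, k, v)
    mse_spatial = F.mse_loss(masked_attention(q, k, v, spatial_mask), full)
    mse_temporal = F.mse_loss(masked_attention(q, k, v, temporal_mask), full)
    return "spatial" if mse_spatial < mse_temporal else "temporal"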

Achieving High-Fidelity Compression with Our Online Profiling Strategy

Our next question is: how can we efficiently select the appropriate sparsity pattern for each head? The theoretically lossless performance above does not translate into real speedup, since it requires computing the full attention output. The challenge is that the oracle sparsity pattern is not static: it varies across layers and denoising steps. This dynamic nature necessitates an adaptive and efficient method for determining the sparsity pattern on the fly.

To address this challenge, Sparse VideoGen proposes an Online Profiling Strategy that dynamically identifies and exploits these sparse attention patterns with minimal overhead. The strategy samples a subset of query tokens and selects the most appropriate sparsity pattern for each head based on the MSE computed on those sampled queries.

We find that a very small number of query tokens (64 out of ~120k) is sufficient to accurately predict the optimal sparsity pattern. Because so few queries are sampled, the overhead of the Online Profiling Strategy is negligible, making it highly efficient.
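
A minimal sketch of the idea, reusing masked_attention from the oracle sketch above; the helper name and sampling details are illustrative, not our exact implementation.

import torch

def profile_pattern(q, k, v, spatial_mask, temporal_mask, num_samples: int = 64) -> str:
    """Estimate the best pattern for a head from a small random subset of query tokens."""
    n = q.shape[-2]
    idx = torch.randperm(n)[:num_samples]                  # e.g. 64 sampled queries out of ~120k
    q_s = q[..., idx, :]
    full = masked_attention(q_s, k, v)                     # cheap: num_samples x n, not n x n
    mse_sp = (masked_attention(q_s, k, v, spatial_mask[idx]) - full).pow(2).mean()
    mse_tp = (masked_attention(q_s, k, v, temporal_mask[idx]) - full).pow(2).mean()
    return "spatial" if mse_sp < mse_tp else "temporal"

The selected pattern then drives the sparse attention computation for that head at that denoising step.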

Online Profiling Strategy Figure
Figure: Online Profiling Strategy dynamically identifies and exploits sparse attention patterns.

Hardware-Efficient Layout Transformation Enables Theoretical Speedup

While exploiting spatial and temporal sparsity improves attention efficiency, a key challenge arises from the non-contiguous memory access patterns inherent in temporal attention. Recall that temporal heads require accessing tokens at the same spatial position across multiple frames, resulting in an attention mask composed of multiple thin, slash-wise patterns.

However, these tokens are often scattered in memory due to the conventional frame-wise token arrangement. Such fragmented memory access leads to poor GPU utilization, since GPUs are optimized for contiguous memory operations. As a result, the actual speedup of the sparse attention kernel falls well short of the theoretical speedup implied by its sparsity.

To address this, we introduce a hardware-efficient layout transformation that rearranges the tensor into a token-wise order, so that the tokens required for temporal attention are stored contiguously in memory. This transformation speeds up the attention kernel by 1.7× and allows it to reach its theoretical speedup ratio.
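
A minimal sketch of the layout idea, assuming the frame-major layout from the mask sketches above; shapes and function names are illustrative rather than the framework's actual API.

import torch

def to_token_major(x: torch.Tensor, num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """(batch, num_frames * tokens_per_frame, hidden), frame-major -> token-major order."""
    b, n, d = x.shape
    x = x.view(b, num_frames, tokens_per_frame, d)
    return x.transpose(1, 2).reshape(b, n, d)      # group all frames of each spatial position

def to_frame_major(x: torch.Tensor, num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Inverse permutation, restoring the original frame-major order."""
    b, n, d = x.shape
    x = x.view(b, tokens_per_frame, num_frames, d)
    return x.transpose(1, 2).reshape(b, n, d)

In the token-major order, the tokens a Temporal Head reads sit next to each other in memory, so its slash-wise mask turns into the contiguous block pattern that GPUs handle efficiently.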

Image 1
Figure 1: Hardware-Efficient layout transformation.

Image 2
Figure 2: Temporal head achieves theoretical speedup after transformation.

BibTeX


@article{xi2025sparse,
  title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
  author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Xu, Chenfeng and Li, Muyang and Li, Xiuyu and Lin, Yujun and Cai, Han and Zhang, Jintao and Li, Dacheng and others},
  journal={arXiv preprint arXiv:2502.01776},
  year={2025}
}