Sparse VideoGen:
Accelerating Video Generation by 2× with Spatial-Temporal Sparse Attention at High Fidelity

1 University of California, Berkeley 2 Massachusetts Institute of Technology 3 NVIDIA 4 Tsinghua University
arXiv Preprint, 2025
*Indicates Equal Contribution

TL;DR

Announcing Sparse VideoGen, a training-free method that accelerates video DiTs by 2× while maintaining high fidelity (PSNR up to 29). The secret? Unleashing spatial and temporal sparsity in 3D Full Attention. Dive into our paper and code to see the magic.

Sparse VideoGen maintains high fidelity.

Sparse VideoGen achieves 2× speedup in video generation.


Overview

In the field of video generation, state-of-the-art Video Diffusion Transformer (DiT) models all employ 3D Full Attention. However, their substantial computational demands pose significant challenges for real-world applications. For example, HunyuanVideo takes roughly 30 minutes to generate a 5-second video on a single H100 GPU, which is prohibitively time-consuming due to the O(n^2) computation of 3D Full Attention.

To speed up their inference, we introduce Sparse VideoGen (SVG), a training-free framework that leverages the inherent spatial and temporal sparsity in 3D Full Attention operations. Sparse VideoGen's core contributions include:

  • Identifying the spatial and temporal sparsity patterns in video diffusion transformers.
  • Proposing an Online Profiling Strategy to dynamically identify these patterns.
  • Implementing an end-to-end generation framework through efficient algorithm-system co-design, with hardware-efficient layout transformation and customized kernels.
We evaluate Sparse VideoGen with HunyuanVideo and CogVideoX on an H100 GPU with CUDA 12.8 and torch 2.5.1. The results show that Sparse VideoGen achieves up to 2× speedup while maintaining high fidelity (PSNR up to 29).
Overview Figure
Figure: Sparse VideoGen accelerates HunyuanVideo inference through algorithm-system co-design.

3D Full Attention is Extremely Slow

State-of-the-art Video DiT models (such as HunyuanVideo and CogVideoX) adopt a 3D Full Attention mechanism to capture complex spatial and temporal dependencies in video data.

However, since the computational complexity of attention grows quadratically with the context length, inference becomes excessively slow. For example, HunyuanVideo takes 29 minutes to generate a 5-second, 720p video on a single H100 GPU, with attention operations consuming over 80% of the runtime. This computational bottleneck motivates us to explore efficient attention mechanisms that accelerate DiTs while maintaining high generation quality and fidelity.

Attention Portion Figure
Figure: The attention modules in HunyuanVideo (120k context length) and CogVideoX (45k context length) take >80% of the runtime.
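To make the quadratic cost concrete, here is a rough back-of-the-envelope estimate in Python. The sequence length matches the ~120k context quoted above, while the head count, head dimension, and layer count are illustrative placeholders, not the exact HunyuanVideo configuration.

# Rough order-of-magnitude estimate of 3D Full Attention cost per denoising step.
# seq_len follows the ~120k context above; the other sizes are assumed values,
# not the official HunyuanVideo configuration.
seq_len    = 120_000   # video tokens in the 3D Full Attention context
num_heads  = 24        # assumed number of attention heads
head_dim   = 128       # assumed head dimension
num_layers = 60        # assumed number of transformer layers

# QK^T and PV each cost about 2 * seq_len^2 * head_dim FLOPs per head.
flops_per_layer = num_heads * 2 * (2 * seq_len**2 * head_dim)
total_flops = num_layers * flops_per_layer
print(f"~{total_flops / 1e15:.0f} PFLOPs of attention per denoising step")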

Unveiling Inherent Sparsity in 3D Full Attention

We observed two distinct sparsity patterns emerging in video diffusion's attention maps: spatial sparsity and temporal sparsity. We also found that most attention heads can be distinctly classified into one of these two categories. In HunyuanVideo, 29.2% of attention heads are spatial, 66.7% are temporal, and only 4.1% are ambiguous. Check our demo to understand our findings!

Spatial Head, Temporal Head, and Layout Transformation to speed up attention.

Spatial Heads focus on spatially local tokens

The Spatial Head focuses on spatially local tokens within the same frame and adjacent frames, resulting in a block-wise layout of the attention map. Since the pixels of a single frame are tokenized into a contiguous sequence, the Spatial Head attends to tokens corresponding to neighboring pixels, concentrating the attention mask around the main diagonal. Spatial Heads are essential for maintaining video quality and spatial consistency in the generated videos.

Spatial Head Figure
Figure: The Spatial Head attends to spatially local tokens, producing a block-wise attention pattern around the main diagonal.
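As a minimal sketch of this block-wise layout (not the actual SVG kernel, which never materializes a dense mask), a spatial mask can be built by letting each query attend only to tokens in its own frame and a window of adjacent frames; num_frames, tokens_per_frame, and window below are illustrative toy sizes.

import torch

def spatial_mask(num_frames: int, tokens_per_frame: int, window: int = 1) -> torch.Tensor:
    """Boolean mask where each query token attends only to tokens in its own
    frame and in `window` adjacent frames, giving a block-wise layout around
    the main diagonal (True = keep, False = skip)."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame              # frame index of each token
    frame_dist = (frame_id[:, None] - frame_id[None, :]).abs()  # frame distance per (q, k) pair
    return frame_dist <= window

mask = spatial_mask(num_frames=8, tokens_per_frame=16)
print(mask.shape, mask.float().mean())                          # density = fraction of entries kept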

Temporal Heads focus on the same tokens across frames

The Temporal Head is designed to capture relationships between tokens across different frames, facilitating the modeling of temporal dependencies. It employs a slash-wise layout with a constant stride, targeting tokens at consistent spatial locations over time. This mechanism is crucial for ensuring temporal consistency in the generated video sequences.

Temporal Head Figure
Figure: The Temporal Head attends to the same spatial location across frames, producing a slash-wise attention pattern with a constant stride.
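A matching sketch of the slash-wise layout (again illustrative, with toy sizes): each query attends only to tokens at the same within-frame position, so the kept entries form diagonals spaced tokens_per_frame apart.

import torch

def temporal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask where each query attends to tokens at the same spatial
    position in every frame, i.e. diagonals with a constant stride equal to
    tokens_per_frame (the slash-wise layout)."""
    n = num_frames * tokens_per_frame
    pos = torch.arange(n) % tokens_per_frame                 # within-frame spatial position
    return pos[:, None] == pos[None, :]                      # keep only same-position pairs

mask = temporal_mask(num_frames=8, tokens_per_frame=16)
print(mask.shape, mask.float().mean())                       # density = 1 / tokens_per_frame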

Oracle selection of sparsity patterns for each head

While the Spatial and Temporal Heads individually address spatial and temporal consistency, combining them properly is essential for achieving lossless performance in video generation. Assigning each attention head the attention mask that yields the lower mean squared error (MSE) relative to full attention achieves a PSNR above 28 dB, indicating near-lossless performance.
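This oracle can be sketched in a few lines of PyTorch: for each head, run attention under each candidate mask, compare against the full-attention output, and keep the mask with the lower MSE. This is a reference sketch for clarity, not the optimized implementation.

import torch
import torch.nn.functional as F

def oracle_select(q, k, v, masks):
    """For one attention head, pick the sparsity pattern whose output is
    closest (in MSE) to the full-attention output.
    q, k, v: [1, 1, seq, head_dim]; masks: dict name -> [seq, seq] bool mask."""
    full = F.scaled_dot_product_attention(q, k, v)
    best_name, best_mse = None, float("inf")
    for name, mask in masks.items():
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        mse = F.mse_loss(out, full).item()
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name

# Example with the mask helpers sketched above (toy sizes):
# masks = {"spatial": spatial_mask(8, 16), "temporal": temporal_mask(8, 16)}
# q = k = v = torch.randn(1, 1, 8 * 16, 64); print(oracle_select(q, k, v, masks))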

Achieving High-Fidelity with Our Online Profiling Strategy

Our next question is: how can we efficiently select the appropriate sparsity pattern for each head? The theoretical lossless performance above does not translate into real speedup, since the oracle requires computing full attention for every head. The challenge is that the oracle sparsity pattern is not static; it varies across layers and denoising steps. This dynamic nature necessitates an adaptive and efficient method to determine the sparsity pattern on the fly.

To address this challenge, Sparse VideoGen proposes an Online Profiling Strategy to dynamically identify and exploit these sparse attention patterns with minimal overhead. The strategy samples a subset of query tokens and determines the most appropriate sparsity pattern for each head based on the MSE measured on these sampled query tokens.

We find that a very small number of query tokens (64 out of ~120k) is sufficient to accurately predict the optimal sparsity pattern. Because so few query tokens are sampled, the overhead of the Online Profiling Strategy is negligible, making it highly efficient.

Online Profiling Strategy Figure
Figure: Online Profiling Strategy dynamically identifies and exploits sparse attention patterns.
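A minimal sketch of the idea, assuming random sampling of query rows and dense candidate masks for clarity (the actual SVG implementation uses fused, mask-free kernels): compare sparse and full attention outputs only on the sampled queries and pick the pattern with the lower MSE.

import torch
import torch.nn.functional as F

def online_profile(q, k, v, masks, num_samples: int = 64):
    """Select a sparsity pattern per head from a small subset of query tokens.
    q, k, v: [1, 1, seq, head_dim]; masks: dict name -> [seq, seq] bool mask."""
    seq_len = q.shape[-2]
    idx = torch.randperm(seq_len)[:num_samples]              # e.g. 64 sampled query rows
    q_s = q[..., idx, :]                                      # queries restricted to the sample
    full = F.scaled_dot_product_attention(q_s, k, v)          # full attention on sampled queries
    best_name, best_mse = None, float("inf")
    for name, mask in masks.items():
        out = F.scaled_dot_product_attention(q_s, k, v, attn_mask=mask[idx])
        mse = F.mse_loss(out, full).item()
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name

Because only 64 of the roughly 120k query rows are profiled, the extra attention computation is a tiny fraction of a full pass.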

Hardware-Efficient Layout Transformation Enables Theoretical Speedup

While exploiting spatial and temporal sparsity improves attention efficiency, a key challenge arises from the non-contiguous memory access patterns inherent in temporal attention. Recall that temporal heads require accessing tokens at the same spatial position across multiple frames, resulting in an attention mask composed of multiple thin, slash-wise patterns.

However, these tokens are often scattered in memory due to the conventional frame-wise token arrangement. Such fragmented memory access leads to poor GPU utilization, since GPUs are optimized for contiguous memory operations. As a result, the actual speedup of the sparse attention kernel falls well short of the theoretical speedup implied by its sparsity.

To address this, we introduce a hardware-efficient layout transformation. This technique rearranges the tensor layout into a token-wise order, ensuring that the tokens required by temporal attention are stored contiguously in memory. With this layout transformation, our attention kernel achieves a 5.59× speedup, reaching its theoretical speedup ratio.
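The core of the transformation is a frame-major to token-major permutation. Below is a minimal PyTorch sketch of that reordering (the real implementation fuses it with the customized attention kernels, and the sizes here are toy values):

import torch

def to_token_major(x: torch.Tensor, num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Rearrange a frame-major token sequence [..., F*T, D] into token-major
    order [..., T*F, D], so that the tokens a temporal head needs (same spatial
    position across all frames) become contiguous in memory."""
    *lead, n, d = x.shape
    x = x.view(*lead, num_frames, tokens_per_frame, d)        # [..., F, T, D]
    x = x.transpose(-3, -2).contiguous()                      # [..., T, F, D]
    return x.view(*lead, n, d)

x = torch.randn(2, 8 * 16, 64)                                # toy frame-major sequences
y = to_token_major(x, num_frames=8, tokens_per_frame=16)
# The inverse is the same call with the two size arguments swapped.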

Image 1
Figure 1: Hardware-efficient layout transformation.

Image 2
Figure 2: Temporal head achieves theoretical speedup after transformation.

BibTeX


@misc{xi2025sparsevideogenacceleratingvideo,
  title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity}, 
  author={Haocheng Xi and Shuo Yang and Yilong Zhao and Chenfeng Xu and Muyang Li and Xiuyu Li and Yujun Lin and Han Cai and Jintao Zhang and Dacheng Li and Jianfei Chen and Ion Stoica and Kurt Keutzer and Song Han},
  year={2025},
  eprint={2502.01776},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.01776}, 
}