Announcing Sparse VideoGen, a training-free method that accelerates video DiTs by 2× while maintaining high fidelity (PSNR = 29). The secret? Unleashing the spatial and temporal sparsity in 3D Full Attention. Dive into our paper and code to see the magic.
In the field of video generation, state-of-the-art Video Diffusion Transformer (DiT) models all employ 3D Full Attention. However, their substantial computational demands pose significant challenges for real-world applications. For example, HunyuanVideo takes about 30 minutes to generate a 5-second video on a single H100 GPU, which is prohibitively time-consuming due to the O(n^2) computation of 3D Full Attention.
To speed up their inference, we introduce Sparse VideoGen (SVG), a training-free framework that leverages the inherent spatial and temporal sparsity of 3D Full Attention. Sparse VideoGen's core contributions include identifying the spatial and temporal sparsity patterns of attention heads in video DiTs, an Online Profiling Strategy that selects the right sparsity pattern for each head on the fly, and a hardware-efficient layout transformation that turns this sparsity into real speedup.
State-of-the-art video DiT models (such as HunyuanVideo and CogVideoX) adopt a 3D Full Attention mechanism to capture complex spatial and temporal dependencies in video data. However, since the computational complexity of attention grows quadratically with the context length, inference time becomes excessively long. For example, HunyuanVideo takes 29 minutes to generate a 5-second, 720p video on a single H100 GPU,
with attention operations consuming over 80% of the runtime. This computational bottleneck motivated us to
explore efficient attention mechanisms to accelerate DiTs while maintaining high generation quality and
fidelity.
We observed two distinct sparsity patterns emerging in video diffusion's attention maps: spatial sparsity and temporal sparsity. We also found that most attention heads can be distinctly classified into one of these two categories. In HunyuanVideo, 29.2% of attention heads are spatial, 66.7% are temporal, and only 4.1% are ambiguous. Check our demo to understand our findings!
The Spatial Head focuses on spatially local tokens within the same frame and adjacent frames, resulting in a block-wise layout of the attention map. Since the pixels of a single frame are tokenized into a contiguous sequence, the Spatial Head attends to tokens corresponding to neighboring pixels, concentrating the attention mask around the main diagonal. The Spatial Head is essential for maintaining video quality and spatial consistency in generated videos.
The Temporal Head is designed to capture relationships between tokens across different frames, facilitating the modeling of temporal dependencies. It employs a slash-wise layout with a constant stride, targeting tokens at consistent spatial locations over time. This mechanism is crucial for ensuring temporal consistency in the generated video sequences.
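To make the two layouts concrete, here is a small illustrative sketch of the masks for a toy frame-major token sequence. The helper names, toy sizes, and the exact mask details are our assumptions for illustration, not the masks used inside SVG's kernel.

import torch

def spatial_mask(num_frames, tokens_per_frame, neighbor_frames=1):
    # Block-wise layout: each query attends to tokens in its own frame and in a
    # few adjacent frames, forming blocks around the main diagonal.
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame      # frame index of each token
    return (frame_id[:, None] - frame_id[None, :]).abs() <= neighbor_frames

def temporal_mask(num_frames, tokens_per_frame):
    # Slash-wise layout with a constant stride: each query attends to the token at
    # the same spatial position in every frame (stride = tokens_per_frame).
    n = num_frames * tokens_per_frame
    pos_id = torch.arange(n) % tokens_per_frame         # spatial position within a frame
    return pos_id[:, None] == pos_id[None, :]

# Toy example: 4 frames x 6 tokens per frame -> 24x24 boolean masks.
print(spatial_mask(4, 6).int())   # blocks around the main diagonal
print(temporal_mask(4, 6).int())  # slashes with stride 6

The exact masks in SVG may differ in details, but the block-wise versus slash-wise structure is the point of the sketch.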
While the Spatial and Temporal Heads individually address spatial and temporal consistency, combining them properly is essential for achieving lossless performance in video generation. Assigning each attention head whichever mask yields the lower mean squared error (MSE) achieves a PSNR above 28 dB, indicating near-lossless performance.
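Conceptually, this oracle assignment can be sketched as follows: for each head, compare the masked outputs against full attention and keep whichever mask gives the lower MSE. The shapes and function name below are illustrative assumptions, not the SVG implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def oracle_pattern(q, k, v, spatial_mask, temporal_mask):
    # q, k, v: [batch, 1, seq_len, head_dim] for a single head.
    # Masks: [seq_len, seq_len] booleans (True = attend), as sketched above.
    full   = F.scaled_dot_product_attention(q, k, v)                            # reference output
    out_sp = F.scaled_dot_product_attention(q, k, v, attn_mask=spatial_mask)
    out_tp = F.scaled_dot_product_attention(q, k, v, attn_mask=temporal_mask)
    mse_sp = (out_sp - full).pow(2).mean()
    mse_tp = (out_tp - full).pow(2).mean()
    return "spatial" if mse_sp < mse_tp else "temporal"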
Our next question is: how do we efficiently select the appropriate sparsity pattern for each head? The theoretically lossless assignment above does not translate into real speedup, since it requires computing the full attention map. The challenge is that the oracle sparsity pattern is not static; it varies across layers and denoising steps. This dynamic nature necessitates an adaptive and efficient method to determine the sparsity pattern on the fly.
To address this challenge, Sparse VideoGen proposes an Online Profiling Strategy that dynamically identifies and exploits these sparse attention patterns with minimal overhead. The strategy samples a subset of query tokens and determines the most appropriate sparsity pattern for each head based on the MSE measured on these sampled queries. We find that a very small number of query tokens (64 out of 120k) is sufficient to accurately predict the optimal sparsity pattern, so the profiling overhead is negligible.
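A minimal sketch of this profiling step is shown below, assuming the spatial and temporal masks are available as boolean tensors; the sampling scheme, names, and shapes are illustrative rather than the SVG codebase's API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def profile_head(q, k, v, spatial_mask, temporal_mask, num_samples=64):
    # q, k, v: [batch, 1, seq_len, head_dim] for one head; masks: [seq_len, seq_len].
    seq_len = q.shape[-2]
    # Sample a handful of query tokens (uniformly at random here, for illustration).
    idx = torch.randperm(seq_len, device=q.device)[:num_samples]
    q_s = q[..., idx, :]                                          # [batch, 1, num_samples, head_dim]

    # Dense attention restricted to the sampled queries costs O(num_samples * seq_len),
    # which is negligible next to the O(seq_len^2) cost of full attention.
    ref    = F.scaled_dot_product_attention(q_s, k, v)
    out_sp = F.scaled_dot_product_attention(q_s, k, v, attn_mask=spatial_mask[idx])
    out_tp = F.scaled_dot_product_attention(q_s, k, v, attn_mask=temporal_mask[idx])

    mse_sp = (out_sp - ref).pow(2).mean()
    mse_tp = (out_tp - ref).pow(2).mean()
    return "spatial" if mse_sp < mse_tp else "temporal"

The selected pattern is then used for that head's sparse attention in the current denoising step.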
While exploiting spatial and temporal sparsity improves attention efficiency, a key challenge arises
from the non-contiguous memory access patterns inherent in temporal attention. Recall
that temporal heads
require accessing tokens at the same spatial position across multiple frames, resulting in an attention
mask composed of multiple thin, slash-wise patterns.
However, these tokens are often scattered in memory due to the conventional frame-wise token arrangement.
Such fragmented memory access leads to suboptimal utilization of GPUs, which are optimized for contiguous memory operations, so the actual speedup of the sparse attention kernel falls far short of the theoretical speedup implied by its sparsity.
To address this, we introduce a hardware-efficient layout transformation. This technique rearranges the tensor layout into a token-wise order, ensuring that the tokens required for temporal attention are stored contiguously in memory. With this layout transformation, our attention kernel achieves a 5.59× speedup and reaches its theoretical speedup ratio.
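The sketch below illustrates the core idea under assumed shapes (it is not the fused SVG kernel): permuting a frame-major token sequence into a token-major one, so that all frames of a given spatial position sit next to each other in memory.

import torch

def to_token_major(x, num_frames, tokens_per_frame):
    # x: [batch, num_frames * tokens_per_frame, dim], tokens stored frame by frame.
    b, _, d = x.shape
    x = x.view(b, num_frames, tokens_per_frame, d)      # split the sequence into frames
    x = x.transpose(1, 2).contiguous()                  # group tokens by spatial position
    return x.view(b, tokens_per_frame * num_frames, d)  # each position's frames are now adjacent

def to_frame_major(x, num_frames, tokens_per_frame):
    # Inverse permutation, restoring the original frame-wise order afterwards.
    b, _, d = x.shape
    x = x.view(b, tokens_per_frame, num_frames, d)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, num_frames * tokens_per_frame, d)

In token-major order, a temporal head's slash-wise pattern becomes block-diagonal with blocks of size num_frames, so the sparse kernel reads contiguous memory just as it does for spatial heads.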
@misc{xi2025sparsevideogenacceleratingvideo,
title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
author={Haocheng Xi and Shuo Yang and Yilong Zhao and Chenfeng Xu and Muyang Li and Xiuyu Li and Yujun Lin and Han Cai and Jintao Zhang and Dacheng Li and Jianfei Chen and Ion Stoica and Kurt Keutzer and Song Han},
year={2025},
eprint={2502.01776},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.01776},
}