Sparse VideoGen2 is a training-free framework that accelerates video diffusion model inference using an innovative semantic-aware permutation technique and efficient dynamic attention kernels. It offers Pareto-frontier performance, combining high fidelity with high generation speed.
While current video generation models are powerful, their 3D Full Attention mechanism is extremely slow and requires substantial computational resources. Various sparse attention methods have been proposed to mitigate the high cost of full attention, but they fall short due to two critical drawbacks.
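To get a sense of the scale: as a rough, illustrative calculation (the exact token count depends on the model's VAE and patchification), a 720p, multi-second clip can produce on the order of 10^5 latent tokens, so full attention evaluates roughly (10^5)^2 = 10^10 query-key scores per head, per layer, at every denoising step.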
First, these methods suffer from inaccurate identification. Existing sparse methods often rely on predefined, static patterns (e.g., local windows or strided attention) to select tokens. Because such patterns group tokens by position rather than by meaning, spatially adjacent but semantically unrelated tokens are aggregated together; the aggregated activations become less representative, making the selection of critical tokens inaccurate and degrading video quality.
Second, these methods waste significant computation. Even when they reduce the total number of calculations, the scattered critical tokens introduce irregular memory access patterns. Modern hardware such as GPUs cannot achieve peak performance when accessing data from non-contiguous blocks of memory.
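As a standalone illustration of this effect (not part of SVG2 itself), the PyTorch micro-benchmark below gathers the same number of rows with contiguous versus randomly scattered indices; the scattered gather is typically slower, though the exact gap depends on the GPU and tensor shapes.

```python
import time
import torch

x = torch.randn(65536, 128, device="cuda")
cont = torch.arange(16384, device="cuda")            # contiguous row indices
scat = torch.randperm(65536, device="cuda")[:16384]  # scattered row indices

def bench(idx, iters=100):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        x.index_select(0, idx)  # same op; only the access pattern differs
    torch.cuda.synchronize()
    return time.perf_counter() - t0

print(f"contiguous: {bench(cont):.4f}s  scattered: {bench(scat):.4f}s")
```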
We propose to rearrange the tokens so that semantically similar tokens sit closer together in global memory. We achieve this with a lightweight Semantic-Aware Permutation strategy. This step happens on the fly for each timestep and layer, and is crucial for the efficiency of the subsequent clustering and attention approximation stages. We apply different permutations to the query and key/value tokens to further improve the accuracy of the permutation.
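As a minimal sketch of the idea, assuming the permutation is derived from a plain k-means clustering of token features (the function below and its fixed iteration count are illustrative simplifications, not the exact implementation):

```python
import torch

def semantic_permutation(tokens, num_clusters, iters=10):
    """Cluster tokens by feature similarity (plain k-means) and return a
    permutation that places same-cluster tokens contiguously in memory.

    tokens: (N, D) tensor of query or key/value token features.
    Returns (perm, cluster_sizes), where tokens[perm] groups semantically
    similar tokens together.
    """
    N, _ = tokens.shape
    # Initialize centroids from randomly chosen tokens.
    centroids = tokens[torch.randperm(N, device=tokens.device)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each token to its nearest centroid.
        labels = torch.cdist(tokens, centroids).argmin(dim=1)
        # Recompute each centroid as the mean of its cluster's tokens.
        for c in range(num_clusters):
            mask = labels == c
            if mask.any():
                centroids[c] = tokens[mask].mean(dim=0)
    # Stable sort by cluster label -> contiguous clusters in memory.
    perm = torch.argsort(labels, stable=True)
    cluster_sizes = torch.bincount(labels, minlength=num_clusters)
    return perm, cluster_sizes

# Queries and keys/values are permuted independently:
# q_perm, q_sizes = semantic_permutation(q, num_clusters=32)
# kv_perm, kv_sizes = semantic_permutation(k, num_clusters=32)
```

Because the queries and the keys/values are permuted independently, each side gets the grouping that best matches its own feature distribution.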
Standard block-sparse attention kernels are optimized for fixed block sizes, which is inefficient for our clustered approach, where each cluster can contain a different number of tokens. To address this, we developed Efficient Dynamic Block Size Attention Kernels, supporting both the FlashAttention-2 and FlashAttention-3 algorithms. These custom CUDA kernels are designed to handle variable-sized blocks of data, allowing for highly efficient computation on the sparse, clustered attention map. This hardware-level optimization ensures that our theoretical gains from approximation translate into real-world speedups.
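The fused kernels follow the FlashAttention-2/3 online-softmax tiling; the unfused PyTorch reference below only illustrates the block-level semantics they compute (the function name and the `block_mask` interface are hypothetical):

```python
import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, q_offsets, kv_offsets, block_mask):
    """Reference (unfused) semantics of attention over variable-sized blocks.

    q, k, v: permuted token features, (N_q, D) and (N_kv, D).
    q_offsets: (num_q_blocks + 1,) long tensor of query block boundaries.
    kv_offsets: (num_kv_clusters + 1,) long tensor of cluster boundaries.
    block_mask: (num_q_blocks, num_kv_clusters) bool tensor selecting which
        (query block, key/value cluster) pairs to compute.
    """
    out = torch.zeros_like(q)
    scale = q.shape[-1] ** -0.5
    for i in range(len(q_offsets) - 1):
        cols = block_mask[i].nonzero(as_tuple=True)[0]
        if cols.numel() == 0:
            continue  # this query block attends to no cluster
        qi = q[q_offsets[i]:q_offsets[i + 1]]
        # Selected key/value clusters are contiguous slices after permutation.
        ks = torch.cat([k[kv_offsets[j]:kv_offsets[j + 1]] for j in cols])
        vs = torch.cat([v[kv_offsets[j]:kv_offsets[j + 1]] for j in cols])
        attn = F.softmax(qi @ ks.T * scale, dim=-1)
        out[q_offsets[i]:q_offsets[i + 1]] = attn @ vs
    return out
```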
Our kernel has a very loose dependency on the cluster size of key/value tokens, making it highly efficient even for small cluster sizes and enabling a large number of clusters. For query tokens, we find that a larger block size (i.e., a smaller number of blocks) is important for achieving high TFLOPS. Ablations on the number of clusters can be found below.
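Continuing the hypothetical sketch above, the offsets such a kernel would consume pair a large, fixed query block size with variable key/value cluster sizes:

```python
import torch

# Illustrative shapes; q and kv_sizes would come from the earlier sketch.
q = torch.randn(300, 64)
kv_sizes = torch.tensor([40, 7, 120, 33])  # variable, possibly small, clusters

Q_BLOCK = 128  # large fixed query block size keeps tensor-core utilization high
q_offsets = torch.arange(0, q.shape[0] + Q_BLOCK, Q_BLOCK).clamp(max=q.shape[0])
kv_offsets = torch.cat([torch.zeros(1, dtype=torch.long), kv_sizes.cumsum(0)])
print(q_offsets.tolist())   # [0, 128, 256, 300]
print(kv_offsets.tolist())  # [0, 40, 47, 167, 200]
```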
We evaluate the performance of Sparse VideoGen2 on Wan 2.1 and HunyuanVideo.
@article{yang2025sparse,
  title={Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation},
  author={Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Li, Muyang and Zhang, Jintao and Cai, Han and Lin, Yujun and Li, Xiuyu and Xu, Chenfeng and Peng, Kelly and others},
  journal={arXiv preprint arXiv:2505.18875},
  year={2025}
}