Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

1 University of California, Berkeley 2 Massachusetts Institute of Technology 3 Stanford University 4 Tsinghua University
*Indicates Equal Contribution

Accepted to NeurIPS 2025 as a Spotlight Paper

Gallery - All videos are generated by SVG2!

TL;DR

Sparse VideoGen2 is a training-free framework that accelerates video diffusion model inference with an innovative semantic-aware permutation technique and efficient dynamic attention kernels. It offers Pareto-frontier performance, delivering both high fidelity and high generation speed.

Sparse VideoGen 2 achieves state-of-the-art video quality.

Sparse VideoGen 2 achieves 2× speedup in video generation.

Sparse VideoGen2 Logo

Overview

The Sparse VideoGen2 framework is designed to address the computational bottlenecks in state-of-the-art video generation models like Wan and HunyuanVideo. Without requiring any fine-tuning of pre-trained models, it leverages the inherent sparsity within attention maps to significantly speed up the inference process.

Through a co-design of algorithms and systems, Sparse VideoGen2 provides an end-to-end solution that achieves measurable speedups on real-world hardware, offering greater efficiency and lower costs for video generation tasks.

Limitations of Current Sparse Attention Algorithms

While current video generation models are powerful, their 3D Full Attention mechanism is extremely slow and requires significant computational resources. Various sparse attention methods have been proposed to mitigate the high cost of full attention, but they fall short due to two critical drawbacks.

First, these methods suffer from inaccurate identification. Existing sparse methods often rely on predefined, static patterns (e.g., local windows or strided attention) to select tokens, aggregating activations over blocks of tokens that are not necessarily semantically related. As a result, the aggregated activations become less representative, making the selection of critical tokens inaccurate and degrading video quality.

Second, these methods lead to significant computation waste. Even when they reduce the total number of calculations, the scattered critical tokens introduce irregular memory access patterns. Modern hardware such as GPUs cannot achieve peak performance when accessing data from non-contiguous blocks of memory.

Semantic-Aware Permutation mitigates the inaccurate identification and computation waste problems
Figure: Semantic-Aware Permutation mitigates the inaccurate identification and computation waste problems.

Efficient Sparse Attention via Semantic-Aware Permutation

We propose to rearrange the tokens in a way that brings semantically similar tokens closer together in global memory. We employ a lightweight, Semantic-Aware Permutation strategy to achieve this.

This step happens on the fly at each timestep and layer, and is crucial for the efficiency of the subsequent clustering and attention-approximation stages. We apply separate permutations to the query and key/value tokens to further improve the accuracy of the permutation.

Semantic-Aware Permutation rearranges the tokens in a way that brings semantically similar tokens closer together in memory.
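
As a simplified illustration of the reordering itself, the sketch below (in PyTorch) shows how a permutation derived from per-token cluster labels, produced by the k-means step described next, can be applied so that tokens of the same cluster become contiguous, and how the inverse permutation restores the original order afterwards. The function and tensor names are ours, not the exact SVG2 interface.

import torch

def permute_by_cluster(tokens: torch.Tensor, labels: torch.Tensor):
    """Reorder tokens so that tokens sharing a cluster label become contiguous.

    tokens: [num_tokens, dim] token features (illustrative layout)
    labels: [num_tokens] integer cluster assignment per token
    """
    perm = torch.argsort(labels)                      # cluster 0 first, then cluster 1, ...
    inv_perm = torch.empty_like(perm)
    inv_perm[perm] = torch.arange(perm.numel(), device=perm.device)
    return tokens[perm], perm, inv_perm

# Toy usage: 8 tokens assigned to 3 semantic clusters.
tokens = torch.randn(8, 64)
labels = torch.tensor([2, 0, 1, 0, 2, 1, 0, 2])
permuted, perm, inv_perm = permute_by_cluster(tokens, labels)
# ... run sparse attention on the permuted, contiguous layout ...
assert torch.equal(permuted[inv_perm], tokens)        # the inverse permutation restores the order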

Identify semantically similar tokens using k-means clustering

We apply an efficient k-means clustering algorithm to determine the semantic similarity of tokens and group them into clusters, where each cluster contains a set of semantically similar tokens. The centroid of each cluster serves as the representative of that cluster.

Using more clusters groups the tokens at a finer granularity, but introduces larger overhead when running the k-means algorithm and reduces the attention kernel's performance; using fewer clusters has the opposite trade-off. We find that 100~200 clusters for query tokens and 500~1000 clusters for key/value tokens strike a good balance. Efficiency results can be found below.
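
As a minimal sketch of this step, the following PyTorch snippet runs standard Lloyd-style k-means on token features; the actual implementation uses a more optimized variant and, as described below, warm-starts from cached centroids. The function name and shapes are illustrative assumptions.

import torch

@torch.no_grad()
def kmeans(tokens: torch.Tensor, num_clusters: int, iters: int = 10):
    """Group tokens into clusters of semantically similar tokens.

    tokens: [num_tokens, dim]; returns labels [num_tokens] and centroids [num_clusters, dim].
    """
    # Initialize centroids from randomly chosen tokens.
    idx = torch.randperm(tokens.shape[0], device=tokens.device)[:num_clusters]
    centroids = tokens[idx]
    for _ in range(iters):
        # Assign each token to its nearest centroid.
        labels = torch.cdist(tokens, centroids).argmin(dim=-1)
        # Recompute each centroid as the mean of its assigned tokens.
        for c in range(num_clusters):
            members = labels == c
            if members.any():
                centroids[c] = tokens[members].mean(dim=0)
    return labels, centroids

# Illustrative budgets from the text: ~100-200 query clusters, ~500-1000 key/value clusters.
query_tokens = torch.randn(30000, 128)   # toy shapes for flattened spatio-temporal tokens
key_tokens = torch.randn(30000, 128)
q_labels, q_centroids = kmeans(query_tokens, num_clusters=150)
k_labels, k_centroids = kmeans(key_tokens, num_clusters=800)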

Approximate the attention map through Centroid-based Top-P estimation

With tokens grouped into clusters, we next determine the sparse attention pattern. Since the centroids represent their clusters, we can approximate the criticality of attention weights based on them. Exact attention is computed only between the query centroids and the key/value centroids, serving as a proxy for the full attention map.

We then use Centroid-based Top-P Estimation to identify the most important attention weights. Top-P selection preserves quality while adapting the attention budget to each query. Thus, we compute exact attention only for these important weights, significantly reducing computation.

The overall pipeline of Semantic-Aware Permutation
Figure: The overall pipeline of Semantic-Aware Permutation. We use k-means clustering to identify the semantically similar tokens, then identify the sparse attention patterns based on the centroids.
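
The sketch below illustrates one way the centroid-level estimate and Top-P selection could be implemented. Weighting the centroid attention scores by key/value cluster size and the top_p value of 0.9 are our assumptions for illustration, not the exact SVG2 procedure.

import torch

def centroid_topp_mask(q_centroids, k_centroids, kv_cluster_sizes, top_p: float = 0.9):
    """Select critical (query-cluster, key/value-cluster) pairs from centroid attention.

    q_centroids: [Cq, dim], k_centroids: [Ck, dim], kv_cluster_sizes: [Ck]
    Returns a boolean mask [Cq, Ck]: True means the block is computed exactly.
    """
    dim = q_centroids.shape[-1]
    # Attention between centroids serves as a proxy for attention between their clusters.
    scores = (q_centroids @ k_centroids.T) / dim ** 0.5
    # Weight by cluster size so that large key/value clusters contribute proportionally (assumption).
    probs = torch.softmax(scores, dim=-1) * kv_cluster_sizes.float()
    probs = probs / probs.sum(dim=-1, keepdim=True)
    # For each query cluster, keep the smallest set of key/value clusters whose
    # cumulative probability mass reaches top_p (always at least the top-1).
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    keep_sorted = sorted_probs.cumsum(dim=-1) - sorted_probs < top_p
    return torch.zeros_like(probs).scatter_(-1, order, keep_sorted.float()).bool()

# Toy usage: 150 query clusters, 800 key/value clusters, 64-dim centroids.
mask = centroid_topp_mask(torch.randn(150, 64), torch.randn(800, 64),
                          torch.randint(10, 100, (800,)))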

System Optimizations for Semantic-Aware Permutation

To further improve the performance of Semantic-Aware Permutation, we propose a Centroids Cache that reduces the overhead of k-means by exploiting the redundancy between consecutive timesteps. Compared with naively running k-means from scratch at every timestep, the Centroids Cache offers a 76× speedup.
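
Below is a minimal sketch of the Centroids Cache idea, assuming the centroids from the previous denoising timestep are reused to warm-start clustering so that only a couple of refinement iterations are needed; the cache keying and iteration counts are our assumptions, and the kmeans helper is the one sketched above.

import torch

class CentroidsCache:
    """Reuse centroids across denoising timesteps to warm-start k-means."""

    def __init__(self):
        self._cache = {}   # e.g., (layer_idx, head_idx, "q"/"kv") -> centroids from the previous timestep

    def cluster(self, key, tokens, num_clusters, refine_iters: int = 2):
        centroids = self._cache.get(key)
        if centroids is None:
            # First timestep: run full k-means from scratch (kmeans from the sketch above).
            labels, centroids = kmeans(tokens, num_clusters, iters=10)
        else:
            # Later timesteps: tokens change little between adjacent steps, so a few
            # Lloyd iterations starting from the cached centroids are sufficient.
            for _ in range(refine_iters):
                labels = torch.cdist(tokens, centroids).argmin(dim=-1)
                for c in range(num_clusters):
                    members = labels == c
                    if members.any():
                        centroids[c] = tokens[members].mean(dim=0)
        self._cache[key] = centroids
        return labels, centroids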

By introducing the semantic-aware permutation, we reach the ideal hardware efficiency on GPUs. Compared with the scattered attention pattern without permutation, our dynamic attention kernel offers a 1.7×~1.8× speedup.

Centroids Cache and Dynamic Attention Kernel optimizations
Figure: The Centroids Cache and Dynamic Attention Kernel optimizations.

Efficient Dynamic Block Size Attention kernels

Standard block-sparse attention kernels are optimized for fixed block sizes, which is inefficient for our clustered approach, where each cluster can contain a different number of tokens.

To address this, we developed Efficient Dynamic Block Size Attention Kernels, supporting both FlashAttention-2 and FlashAttention-3 algorithms. These custom CUDA kernels are designed to handle variable-sized blocks of data, allowing for highly efficient computation on the sparse, clustered attention map. This hardware-level optimization ensures that our theoretical gains from approximation translate into real-world speedups.

Our kernel's performance depends only weakly on the key/value cluster size, so it remains highly efficient even for small clusters, which in turn enables a large number of key/value clusters. For query tokens, we find that a larger block size (i.e., fewer query blocks) is important for achieving high TFLOPS. Ablations on the number of clusters can be found below.
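
To make the variable-block interface concrete, the sketch below shows one plausible metadata layout: cumulative offsets marking where each permuted cluster starts, plus the per-query-cluster list of selected key/value clusters, together with a readable reference of what the fused kernel computes. The layout and names are illustrative; the actual CUDA kernels consume equivalent but more optimized metadata.

import torch

def build_block_metadata(q_cluster_sizes, kv_cluster_sizes, block_mask):
    """Describe variable-sized clusters to a dynamic-block attention kernel.

    q_cluster_sizes:  [Cq] tokens per query cluster (after permutation)
    kv_cluster_sizes: [Ck] tokens per key/value cluster (after permutation)
    block_mask:       [Cq, Ck] boolean mask from centroid-based Top-P estimation
    """
    # Cluster start offsets in the permuted sequence, e.g. sizes [3, 5, 2] -> [0, 3, 8, 10].
    q_offsets = torch.cat([q_cluster_sizes.new_zeros(1), q_cluster_sizes.cumsum(0)])
    kv_offsets = torch.cat([kv_cluster_sizes.new_zeros(1), kv_cluster_sizes.cumsum(0)])
    # For each query cluster, the key/value clusters it attends to (Top-P keeps at least one).
    kv_lists = [row.nonzero(as_tuple=True)[0].tolist() for row in block_mask]
    return q_offsets.tolist(), kv_offsets.tolist(), kv_lists

def reference_dynamic_block_attention(q, k, v, q_offsets, kv_offsets, kv_lists):
    """Readable reference of the computation the fused kernel performs (not the fast path)."""
    out = torch.zeros_like(q)
    scale = q.shape[-1] ** -0.5
    for i, kv_clusters in enumerate(kv_lists):
        rows = slice(q_offsets[i], q_offsets[i + 1])
        # Gather only the selected key/value clusters for this query cluster.
        cols = torch.cat([torch.arange(kv_offsets[j], kv_offsets[j + 1]) for j in kv_clusters])
        attn = torch.softmax(q[rows] @ k[cols].T * scale, dim=-1)
        out[rows] = attn @ v[cols]
    return out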

Efficient Dynamic Block Size Attention Kernels performance
Figure: Efficient Dynamic Block Size Attention Kernels achieves high performance.

Quantitative Evaluation

We evaluate the performance of Sparse VideoGen 2 on Wan 2.1 and HunyuanVideo.

Quantitative Evaluation Results

BibTeX


@article{yang2025sparse,
  title={Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation},
  author={Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Li, Muyang and Zhang, Jintao and Cai, Han and Lin, Yujun and Li, Xiuyu and Xu, Chenfeng and Peng, Kelly and others},
  journal={arXiv preprint arXiv:2505.18875},
  year={2025}
}
    