Quant VideoGen:
Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

1 University of California, Berkeley 2 Massachusetts Institute of Technology 3 NVIDIA
*Indicates Equal Contribution

TL;DR

Quant VideoGen (QVG) is a training-free KV-cache quantization framework for auto-regressive video diffusion models. It exploits video's spatiotemporal redundancy through Semantic-Aware Smoothing and Progressive Residual Quantization to compress the KV-cache down to 2-bit precision with near-lossless quality. QVG reduces KV-cache memory by up to 7× with less than 4% latency overhead.

Overview

Auto-regressive video diffusion marks a paradigm shift in video generation: by enforcing temporal causality, models like LongCat-Video, HY-WorldPlay, and Self-Forcing can generate long, streaming videos incrementally. This enables applications ranging from live-streaming generation to interactive world models.

However, auto-regressive inference introduces a critical bottleneck at the intersection of systems and algorithms: KV-cache memory. Unlike bidirectional models, auto-regressive models must retain a growing KV-cache to condition on all previously generated frames. This cache grows linearly with the history length and quickly dominates GPU memory.

More critically, KV-cache is not just an efficiency bottleneck. It is also a capability bottleneck. When memory forces a truncated context window, the model's effective working memory shrinks, directly degrading long-horizon consistency in identity, layout, and motion. Retaining more history leads to substantially better generation quality.

QVG resolves this tension with two complementary techniques:

  • Semantic-Aware Smoothing exploits spatiotemporal redundancy to produce quantization-friendly residuals.
  • Progressive Residual Quantization is a coarse-to-fine multi-stage scheme that further reduces quantization error.

QVG reduces KV-cache memory by up to about 7× compared to BF16 while adding less than 4% end-to-end latency overhead, and it reduces quantization error substantially relative to prior KV-cache quantization baselines at similar compression ratios.

Qualitative comparison on LongCat-Video and HY-WorldPlay at 480p with BF16, KIVI, and QVG. QVG tracks BF16 at a much smaller KV-cache footprint; KIVI shows visible degradation at similar compression.

Why Video KV-Cache Quantization is Challenging

KV-cache quantization is well-studied for LLMs, with methods like KIVI and QuaRot achieving strong results. However, naively applying these techniques to video diffusion leads to severe quality degradation. The root cause lies in fundamentally different activation statistics:

  • Extremely large dynamic ranges: video models exhibit max |K| ~ 1e2 and max |V| ~ 1e3, far exceeding typical LLM ranges.
  • Heterogeneous outlier patterns: channels that are outliers for some tokens may not be outliers for others, because tokens correspond to diverse spatial regions and motion patterns whose relevance evolves over time.

These properties make standard per-channel or per-token quantization schemes brittle for video models, motivating a video-specific approach that explicitly leverages the spatiotemporal structure of video data.
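
To make these statistics concrete, the following minimal diagnostic (a PyTorch sketch with an assumed [num_tokens, num_channels] cache layout; the function name and layout are illustrative) reports the dynamic range and checks whether outlier channels stay stable across tokens:

import torch

def kv_cache_stats(k: torch.Tensor, v: torch.Tensor, top: int = 8):
    """k, v: [num_tokens, num_channels] slices of one layer's KV-cache."""
    print(f"max|K| = {k.abs().max().item():.1f}, max|V| = {v.abs().max().item():.1f}")
    # Compare the top-|x| channels of the first and last token: a small
    # overlap means outlier channels are token-dependent, which is exactly
    # what breaks fixed per-channel quantization schemes.
    top_k = k.abs().topk(top, dim=-1).indices
    shared = set(top_k[0].tolist()) & set(top_k[-1].tolist())
    print(f"outlier-channel overlap (first vs. last token): {len(shared)}/{top}")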

Full KV-cache can run out of memory on long rollouts; a sliding window saves memory but hurts quality; QVG preserves long-horizon visual quality with a compact quantized KV-cache.

Semantic-Aware Smoothing

The key observation behind QVG is that video KV-cache exhibits strong spatiotemporal redundancy: tokens that are spatially or temporally adjacent tend to be numerically similar in latent space. A fixed spatial patch across consecutive frames changes slowly (temporal redundancy), and nearby patches within the same frame share similar features (spatial redundancy).

Semantic-Aware Smoothing exploits this redundancy through two steps:

Step 1: Semantic-based grouping. We apply k-means to the KV-cache along the sequence dimension to partition tokens into semantically similar groups. Tokens within the same group exhibit significantly more homogeneous value distributions, since they share similar spatial and temporal characteristics.

Step 2: Centroid subtraction. For each group, we subtract its centroid (mean value) to obtain a residual tensor. Since the large-magnitude values are shared across semantically similar tokens and captured by the centroid, the resulting residuals have a much smaller magnitude and more concentrated distribution around zero — an ideal target for low-bit quantization.
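
A minimal sketch of both steps (PyTorch; plain Lloyd's k-means with illustrative initialization and iteration counts, not necessarily the paper's exact recipe):

import torch

def semantic_smooth(x: torch.Tensor, num_groups: int, iters: int = 10,
                    init: torch.Tensor = None):
    """Semantic-Aware Smoothing sketch: k-means over tokens, then centroid
    subtraction. x is a [num_tokens, dim] K or V cache slice. Returns
    (residual, centroids, assign) with x == residual + centroids[assign].
    """
    if init is not None:
        centroids = init.clone()           # warm start (see the co-design section)
    else:
        centroids = x[torch.randperm(x.shape[0])[:num_groups]].clone()
    for _ in range(iters):                 # plain Lloyd's iterations (illustrative)
        assign = torch.cdist(x, centroids).argmin(dim=-1)  # nearest centroid per token
        for g in range(num_groups):
            mask = assign == g
            if mask.any():
                centroids[g] = x[mask].mean(dim=0)
    residual = x - centroids[assign]       # zero-centered, small-magnitude residual
    return residual, centroids, assign

Only the residual goes to the quantizer; the centroids and assignment vector are kept as metadata, whose footprint is small because the number of groups is far below the number of tokens.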

This simple yet effective technique reduces quantization error by ~6.9× for keys and ~2.6× for values across all precision choices, without any training or fine-tuning.

Semantic-based grouping and centroid subtraction make the K cache regular and easy to quantize; the residual distribution is sharply peaked near zero.

Progressive Residual Quantization

Inspired by streaming video codecs that progressively encode multi-scale representations, QVG introduces Progressive Residual Quantization — a coarse-to-fine scheme that iteratively refines quantization in multiple stages.

Starting from the initial residual (the output of Semantic-Aware Smoothing), each subsequent stage applies Semantic-Aware Smoothing again to the remaining residual to capture finer-grained details. After T stages, the final residual is quantized to low-bit integers, while the centroids and assignment vectors from all stages are stored as compact metadata.

The first stage provides the dominant error reduction (5.83× MSE reduction vs. naive quantization). Subsequent stages continue to improve quality with diminishing returns, enabling a smooth trade-off between quality and compression by simply adjusting the number of stages.

For reconstruction, the process is reversed: starting from the quantized output, centroids are progressively added back from the last stage to the first, faithfully recovering the original tensor.
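
Putting the pieces together, a compact encode/decode sketch that reuses the hypothetical semantic_smooth above (per-tensor scaling and int8 storage are simplifications; a real implementation would pack 2-bit values and use finer-grained scales):

import torch

def prq_encode(x: torch.Tensor, num_groups: int, stages: int, bits: int = 2):
    """PRQ sketch: repeated smoothing stages, then symmetric low-bit quantization."""
    meta, r = [], x
    for _ in range(stages):                    # coarse-to-fine residual stages
        r, centroids, assign = semantic_smooth(r, num_groups)
        meta.append((centroids, assign))
    qmax = 2 ** (bits - 1) - 1                 # e.g. 1 for INT2: levels {-2, -1, 0, 1}
    scale = r.abs().amax() / qmax
    q = torch.clamp((r / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale, meta

def prq_decode(q, scale, meta):
    """Reverse pass: dequantize, then add centroids back, last stage first."""
    x = q.float() * scale
    for centroids, assign in reversed(meta):
        x = x + centroids[assign]
    return x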

From the original KV cache through Semantic-Aware Smoothing to low-bit quantized values; quantization error drops substantially after the pipeline.
MSE reduction vs. naive quantization as a function of PRQ stage count; the first stage dominates, with diminishing returns afterward.

Efficient Algorithm-System Co-design

QVG incorporates several system-level optimizations to minimize latency overhead:

  • Streaming centroid caching: the k-means centroids from the previous video chunk are reused to initialize clustering for the next chunk, reducing k-means overhead by 3× (sketched after this list).
  • Fused dequantization kernel: a custom CUDA kernel that dequantizes the tensor and adds back assigned centroids for all Progressive Residual Quantization stages in a single pass, keeping intermediate results in registers to avoid redundant global memory accesses.
  • FP8 scaling factors: scaling factors are stored in FP8 E4M3 format to further reduce memory overhead.
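
A sketch of the first and third optimizations, again reusing the hypothetical semantic_smooth from above (the fused CUDA kernel is omitted here; the FP8 cast assumes a PyTorch build with float8 support):

import torch

class StreamingKVQuantizer:
    """Illustrative chunk-by-chunk quantizer with warm-started clustering."""

    def __init__(self, num_groups: int):
        self.num_groups = num_groups
        self.prev_centroids = None            # carried across video chunks

    def encode_chunk(self, x: torch.Tensor):
        # Streaming centroid caching: warm-start k-means with the previous
        # chunk's centroids, so far fewer refinement sweeps are needed.
        warm = self.prev_centroids is not None
        residual, centroids, assign = semantic_smooth(
            x, self.num_groups, iters=3 if warm else 10, init=self.prev_centroids)
        self.prev_centroids = centroids
        scale = residual.abs().amax()         # INT2 symmetric range: qmax = 1
        q = torch.clamp((residual / scale).round(), -2, 1).to(torch.int8)
        # FP8 scaling factors: store scales in E4M3 to shrink metadata
        # (cast back to float for dequantization).
        return q, scale.to(torch.float8_e4m3fn), centroids, assign

Warm-starting works because consecutive chunks are temporally redundant, so the previous chunk's centroids are already close to a good clustering for the next one.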

These optimizations ensure that QVG introduces only 1.5%–4.3% end-to-end latency overhead across different models, making it practical for real-time deployment.

Key Results

QVG is evaluated on three auto-regressive video generation models — LongCat-Video-13B, HY-WorldPlay-8B, and Self-Forcing-Wan-1.3B — and consistently outperforms all baselines (RTN, KIVI, QuaRot) in both quality and compression:

Compression & Quality

  • LongCat-Video-13B: 6.94× KV-cache compression with PSNR 28.7 relative to BF16; baselines at similar compression reach only PSNR ~20–21.
  • HY-WorldPlay-8B: 7.05× compression with PSNR 29.2, while baselines achieve only PSNR ~24–25.
  • All VBench perceptual metrics (Background Consistency, Image Quality, Subject Consistency, Aesthetic Quality) remain near-lossless for QVG, while baselines degrade significantly under INT2.

Long-Horizon Stability

  • On Self-Forcing, QVG maintains near-lossless image quality even at 700+ frames, while all baselines experience sharp degradation after ~100 frames.
  • On the same hardware, QVG supports longer effective context windows; the extra retained history yields quality that can surpass BF16 under the original cache budget.

Deployment Milestone

  • QVG makes it possible to run HY-WorldPlay-8B on a single RTX 4090 for the first time, achieving PSNR >29 relative to BF16 — previously infeasible due to memory constraints.
  • End-to-end latency overhead: 2.1% (LongCat), 1.5% (HY-WorldPlay), 4.3% (Self-Forcing).

Main results on LongCat-Video-13B and HY-WorldPlay-8B at 480p (INT2 and INT4 KV cache): compression ratio vs. fidelity and VBench-style metrics.
Self-Forcing: imaging quality along long rollouts under INT2 and INT4 KV cache; QVG and QVG-Pro track the BF16 baseline while prior quantization methods degrade.

BibTeX


@article{xi2026quant,
  title={Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization},
  author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Li, Muyang and Cai, Han and Li, Xingyang and Lin, Yujun and Zhang, Zhuoyang and Zhang, Jintao and Li, Xiuyu and others},
  journal={arXiv preprint arXiv:2602.02958},
  year={2026}
}