AVIS: Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

Overview

Diffusion models provide powerful priors for zero-shot video inverse problems, but two inefficiencies hinder their real-time deployment: high initial latency, because the whole video is restored at once, and low throughput, because enforcing measurement consistency in pixel space requires multiple VAE passes. To overcome these limitations, we propose the Autoregressive Video Inverse problem Solver (AVIS).

AVIS and AVIS Flash

AVIS: autoregressive (AR) backbone + measurement-consistent initialization + guidance for every chunk.
AVIS Flash: same backbone and initialization + guidance only for the first chunk.

The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, chunk by chunk, which removes the latency bottleneck of holistic restoration. Additionally, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. While AVIS enforces measurement consistency for every video chunk, we further introduce AVIS Flash, a highly accelerated variant that enforces measurement consistency solely on the first video chunk.
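The difference between the two variants can be illustrated with a minimal streaming loop. This is only a sketch under stated assumptions: `restore_stream`, `init_from_measurement`, `denoise`, and `guide` are hypothetical names, the diffusion backbone is stubbed out, and guidance is applied once after denoising for simplicity, whereas an actual solver would likely interleave it with the sampling steps.

```python
from typing import Callable, Iterator, List

Chunk = List[float]  # stand-in for one chunk of video frames (or latents)


def restore_stream(
    measurements: List[Chunk],
    init_from_measurement: Callable[[Chunk], Chunk],
    denoise: Callable[[Chunk, List[Chunk]], Chunk],
    guide: Callable[[Chunk, Chunk], Chunk],
    flash: bool = False,
) -> Iterator[Chunk]:
    """Yield restored chunks one at a time (hypothetical interface)."""
    history: List[Chunk] = []  # previously restored chunks, used as AR context
    for i, y in enumerate(measurements):
        x = init_from_measurement(y)  # measurement-consistent starting point
        x = denoise(x, history)       # AR diffusion backbone (stubbed here)
        if not flash or i == 0:       # AVIS: every chunk; AVIS Flash: first only
            x = guide(x, y)           # measurement-consistency guidance
        history.append(x)
        yield x                       # the chunk is available without waiting


def count_guidance_calls(num_chunks: int, flash: bool) -> int:
    """Run the loop with toy stubs and count how often guidance is applied."""
    calls = 0

    def toy_guide(x: Chunk, y: Chunk) -> Chunk:
        nonlocal calls
        calls += 1
        return [0.5 * (a + b) for a, b in zip(x, y)]

    ys = [[float(i)] * 4 for i in range(num_chunks)]
    restored = list(
        restore_stream(ys, lambda y: list(y), lambda x, h: x, toy_guide, flash)
    )
    assert len(restored) == num_chunks
    return calls
```

Because the function is a generator, the first chunk can be consumed as soon as it is ready, which is the property that keeps initial latency low in a streaming design.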

Baseline Comparisons

                  VISION-XL   LVTINO   AVIS   AVIS Flash
Latency (s) ↓     167         114      4      4
Time (s) ↓        167         114      68.5   13.7
FPS (frame/s) ↑   0.49        0.71     1.18   5.91

Efficiency comparison on an 81-frame video at 480x854 resolution on a single RTX 4090 GPU.
Our proposed frameworks (AVIS and AVIS Flash) achieve significant improvements across all efficiency metrics. Bold: best, underline: second-best.
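The FPS row follows directly from the total time and the 81-frame clip length. The short check below recomputes it from the reported times; the variable and function names are illustrative only.

```python
NUM_FRAMES = 81  # clip length used in the efficiency comparison

# Total restoration time in seconds, copied from the table above.
total_time_s = {
    "VISION-XL": 167.0,
    "LVTINO": 114.0,
    "AVIS": 68.5,
    "AVIS Flash": 13.7,
}


def throughput_fps(num_frames: int, seconds: float) -> float:
    """Throughput in frames per second: frames restored divided by total time."""
    return num_frames / seconds


fps = {
    name: round(throughput_fps(NUM_FRAMES, t), 2)
    for name, t in total_time_s.items()
}

for name, value in fps.items():
    print(f"{name}: {value:.2f} frame/s")
```

The recomputed values match the reported FPS row (0.49, 0.71, 1.18, 5.91), so the throughput figures are consistent with the total times.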

Temporal Average

To view in full resolution, please download the videos.

[Video comparison, three examples. Panels: Measurement, VISION-XL, LVTINO, AVIS, AVIS Flash.]

Spatio-Temporal Average

[Video comparison, three examples. Panels: Measurement, VISION-XL, LVTINO, AVIS, AVIS Flash.]

Inpainting

[Video comparison, three examples. Panels: Measurement, VISION-XL, LVTINO, AVIS, AVIS Flash.]

Gaussian Deblur

[Video comparison, three examples. Panels: Measurement, VISION-XL, LVTINO, AVIS, AVIS Flash.]

Super Resolution

[Video comparison, three examples. Panels: Measurement, VISION-XL, LVTINO, AVIS, AVIS Flash.]

Autoregressive Propagation

Autoregressive propagation alone preserves temporal context, but gradually drifts from the desired restoration when later chunks are generated from pure noise (middle row).
AVIS mitigates this drift by initializing reverse diffusion from a measurement-consistent estimate, keeping each chunk closer to the target restoration trajectory (bottom row).
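One common way to start reverse diffusion from an estimate rather than from pure noise is to forward-diffuse that estimate to an intermediate noise level. The sketch below illustrates the idea on a toy 1-D inpainting problem. It is an assumption, not the initialization AVIS is confirmed to use: the DDPM-style noising rule, the mean-fill estimate, and the names `consistent_estimate`, `noisy_init`, and `alpha_bar` are all illustrative.

```python
import math
import random
from typing import List


def consistent_estimate(y: List[float], mask: List[int]) -> List[float]:
    """Keep observed pixels (mask == 1); fill missing ones with the observed mean.

    The result agrees with the measurement wherever a pixel was observed.
    """
    observed = [v for v, m in zip(y, mask) if m]
    fill = sum(observed) / len(observed)
    return [v if m else fill for v, m in zip(y, mask)]


def noisy_init(x_hat: List[float], alpha_bar: float, rng: random.Random) -> List[float]:
    """Forward-diffuse x_hat to the noise level given by alpha_bar in (0, 1].

    Assumed rule: x_t = sqrt(alpha_bar) * x_hat + sqrt(1 - alpha_bar) * eps.
    alpha_bar close to 1 keeps the estimate nearly intact; alpha_bar close
    to 0 approaches the pure-noise start.
    """
    a = math.sqrt(alpha_bar)
    s = math.sqrt(1.0 - alpha_bar)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x_hat]


rng = random.Random(0)
y = [0.2, 0.0, 0.6, 0.0]     # measurement with two missing pixels
mask = [1, 0, 1, 0]          # 1 = observed, 0 = missing
x_hat = consistent_estimate(y, mask)
x_start = noisy_init(x_hat, alpha_bar=0.5, rng=rng)
```

Starting from an intermediate noise level means the sampler only has to run the remaining steps, which is consistent with the stated benefit of fewer sampling steps.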

Long Video Restoration

A 1-minute, 960-frame video is restored stably by periodically re-injecting measurement consistency, which suppresses error accumulation over time.
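A periodic schedule sits between the two variants: measurement consistency is enforced on every k-th chunk instead of on every chunk or only the first. The sketch below shows which chunks such a schedule would select. The chunk length and the period are hypothetical values chosen for illustration, since neither is stated here.

```python
import math
from typing import List


def guided_chunk_indices(num_frames: int, frames_per_chunk: int, period: int) -> List[int]:
    """Return indices of the chunks on which measurement consistency is enforced.

    period == 1 selects every chunk; a period of at least the number of
    chunks selects only the first chunk.
    """
    num_chunks = math.ceil(num_frames / frames_per_chunk)
    return [i for i in range(num_chunks) if i % period == 0]


# 960 frames as in the long-video example; 12 frames per chunk and a period
# of 10 chunks are assumed values, not settings taken from the source.
schedule = guided_chunk_indices(num_frames=960, frames_per_chunk=12, period=10)
```

With these assumed values the video splits into 80 chunks and consistency is enforced on 8 of them, so most chunks run at the cheaper unguided cost while drift is corrected at regular intervals.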