Taming Diffusion Transformer for Real-Time Mobile Video Generation

1Snap Inc., 2Northeastern University
*Equal Contribution
Work done during an internship at Snap Inc.

*1024×576 samples and 288×512 samples are generated from the server model and the mobile model, respectively.

Abstract

Method

Figure 1. Overview of our method.
| VAE | PSNR | A100 (ms) | iPhone (ms) | Total | Quality | Semantic | Aesthetic | Consistency | Flickering |
|---|---|---|---|---|---|---|---|---|---|
| 4×16×16 | 33.1 | 7900 | - | 80.35 | 82.05 | 73.54 | 64.45 | 26.80 | 98.59 |
| 4×32×32 | 30.9 | 920 | 3315 | 79.95 | 82.99 | 67.83 | 61.52 | 27.07 | 97.46 |
| 8×32×32 | 30.6 | 380 | 880 | 79.80 | 82.59 | 68.66 | 61.80 | 27.17 | 97.70 |
| 8×64×64 | 28.2 | 90 | 155 | 78.40 | 81.79 | 64.86 | 55.29 | 26.11 | 97.52 |
Table 1: Scaling VAE compression ratio. VAE PSNR is measured on DAVIS with 33×512×512 resolution. Latencies are reported in ms for one denoising step, and we test with 121×576×1024 resolution for GPU and 49×384×512 for iPhone. VBench scores are also provided for each variant.
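To make the compression ratios in Table 1 more concrete, the following minimal sketch estimates the latent grid size for each VAE variant at the GPU test resolution (121×576×1024). It rests on two assumptions that are not stated on this page: the VAE is causal in time, so the first frame is encoded on its own, and each latent position becomes one transformer token with no extra patchification. The names `VaeRatio`, `latent_shape`, and `token_count` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VaeRatio:
    """Compression factors (temporal, height, width) of a video VAE."""
    t: int
    h: int
    w: int


def latent_shape(frames, height, width, ratio, causal=True):
    """Return the (T, H, W) latent grid for a video of the given size.

    ASSUMPTION: with causal=True the first frame is encoded separately,
    so T_latent = 1 + (frames - 1) // ratio.t. This is a common convention
    for causal video VAEs, not something confirmed by the table.
    """
    if height % ratio.h or width % ratio.w:
        raise ValueError("spatial size must be divisible by the VAE ratio")
    if causal:
        t_lat = 1 + (frames - 1) // ratio.t
    else:
        t_lat = frames // ratio.t
    return t_lat, height // ratio.h, width // ratio.w


def token_count(frames, height, width, ratio, causal=True):
    """Number of latent positions (ASSUMPTION: one token per position)."""
    t, h, w = latent_shape(frames, height, width, ratio, causal)
    return t * h * w


# Compression ratios listed in Table 1.
RATIOS = [VaeRatio(4, 16, 16), VaeRatio(4, 32, 32),
          VaeRatio(8, 32, 32), VaeRatio(8, 64, 64)]

# GPU test setting from the Table 1 caption: 121 frames at 576x1024.
for r in RATIOS:
    shape = latent_shape(121, 576, 1024, r)
    print(f"{r.t}x{r.h}x{r.w}: latent grid {shape}, "
          f"{token_count(121, 576, 1024, r)} positions")
```

Under these assumptions, each step down the table shrinks the number of latent positions by a factor of two to four. That is consistent in direction with the falling per-step latency, though the sketch does not model latency itself.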

Comparisons

This section compares videos generated by our method with those from the baseline models discussed in the paper.

LTX-Video-2B[1] CogVideoX-2B[2] Wan2.1-1.3B[3]

Ours

Prompt: 3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bush.
Prompt: A cat sitting at a grand piano, elegantly playing a classical piece with its paws.
Prompt: A corgi vlogging itself in tropical Maui.
Prompt: A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
Prompt: A pair of lovebirds preening each other's feathers.
Prompt: A skeleton wearing a flower hat and sunglasses dances in the wild at sunset.
[1] HaCohen, Yoav, et al. "LTX-Video: Realtime Video Latent Diffusion." https://arxiv.org/abs/2501.00103 (2024).
[2] Yang, Zhuoyi, et al. "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." ICLR (2025).
[3] Team Wan, et al. "Wan: Open and Advanced Large-Scale Video Generative Models." https://arxiv.org/abs/2503.20314 (2025).
*Videos are generated using the Diffusers v0.33.1 implementation.

Quantitative Comparison

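| Model | Params (B) | A100 (FPS)* | iPhone (FPS)* | VBench ↑ |
|---|---|---|---|---|
| Wan2.1 | 1.3 | 0.2 | - | 83.33 |
| CogVideoX-2B | 1.6 | 0.45 | - | 80.91 |
| LTX-Video | 1.8 | 6.1 | - | 80.00 |
| Ours (Server) | 2.0 | 6.4 | - | 83.09 |
| Ours (Mobile) | 0.9 | 151.3 | 12.2 | 81.45 |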

*: Latency is benchmarked on an NVIDIA A100-SXM4-80GB GPU and an iPhone 16 Pro Max, per denoising step.

BibTeX