Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds

Yushu Wu^1,2,*,+, Yanyu Li^1,*, Anil Kag¹, Ivan Skorokhodov¹, Willi Menapace¹, Ke Ma¹, Arpit Sahni¹, Ju Hu¹, Aliaksandr Siarohin¹, Dhritiman Sagar¹, Yanzhi Wang², Sergey Tulyakov¹

¹Snap Inc., ²Northeastern University

^*Equal Contribution

^+Work done during an internship at Snap Inc.


*1024×576 samples and 288×512 samples are generated from the server model and the mobile model, respectively.

Abstract

Method

Figure 1. Overview of our method.

Figure 2. Illustration of tiled GEMM for a single token, and latency benchmark for tiled GEMM in the FFN.
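
This page does not spell out the tiling scheme behind Figure 2, so the following is only a generic sketch of what tiled GEMM means for a single token: the weight matrix is processed one tile at a time and partial sums are accumulated, giving the same result as the dense product. The function names, the tile sizes, and the choice to tile over both the input and output dimensions are assumptions made for illustration, not the authors' kernel.

```python
def matvec(x, w):
    """Dense reference: x is a length-K vector, w is a K x N matrix (list of rows)."""
    n_dim = len(w[0])
    return [sum(x[k] * w[k][n] for k in range(len(x))) for n in range(n_dim)]


def tiled_matvec(x, w, tile_k, tile_n):
    """Same product, computed one (tile_k x tile_n) weight tile at a time.

    Hypothetical illustration: tile sizes and loop order are assumptions.
    """
    k_dim, n_dim = len(w), len(w[0])
    out = [0.0] * n_dim
    for n0 in range(0, n_dim, tile_n):          # tiles along the output dimension
        for k0 in range(0, k_dim, tile_k):      # tiles along the input dimension
            for n in range(n0, min(n0 + tile_n, n_dim)):
                acc = 0.0
                for k in range(k0, min(k0 + tile_k, k_dim)):
                    acc += x[k] * w[k][n]
                out[n] += acc                   # accumulate the tile's partial sum
    return out


# Small integer example so the comparison is exact.
x = [1, 2, 3, 4]
w = [[(k + 1) * (n + 1) for n in range(6)] for k in range(4)]
dense = matvec(x, w)
tiled = tiled_matvec(x, w, tile_k=2, tile_n=4)
print(dense == tiled)
```

Because each tile touches only a small block of the weights, an implementation of this kind can keep its working set small; whether that matches the behavior benchmarked in Figure 2 depends on details not given here.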

Table 1: Scaling the VAE compression ratio. VAE PSNR is measured on DAVIS at 33×512×512 resolution. Latencies are reported in ms for one denoising step at 121×576×1024 resolution on an NVIDIA A100 GPU. VBench scores are also provided for each variant.
VAE	PSNR	Latency (ms)	Total	Quality	Semantic	Aesthetic	Consistency	Flickering
4×16×16	33.1	7900	80.35	82.05	73.54	64.45	26.80	98.59
4×32×32	30.9	920	79.95	82.99	67.83	61.52	27.07	97.46
8×32×32	30.6	380	79.80	82.59	68.66	61.80	27.17	97.70
8×64×64	28.2	90	78.40	81.79	64.86	55.29	26.11	97.52
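
To give a rough sense of why the compression ratio in Table 1 matters so much for latency, the sketch below counts the latent positions the denoiser would process for a 121×576×1024 video under each ratio. It assumes the latent size is the ceiling of each input dimension divided by the corresponding ratio and that every latent position becomes one token; the actual VAE may treat the first frame or padding differently, and the transformer may apply an extra patchify step. The helper names are hypothetical.

```python
import math


def latent_shape(frames, height, width, ratio):
    """Latent grid size under a (temporal, height, width) compression ratio.

    Assumption: plain ceiling division per axis.
    """
    rt, rh, rw = ratio
    return (math.ceil(frames / rt), math.ceil(height / rh), math.ceil(width / rw))


def token_count(frames, height, width, ratio):
    """Number of latent positions, assuming one token per position."""
    t, h, w = latent_shape(frames, height, width, ratio)
    return t * h * w


# Compression ratios listed in Table 1, applied to the benchmark resolution.
RATIOS = [(4, 16, 16), (4, 32, 32), (8, 32, 32), (8, 64, 64)]
for ratio in RATIOS:
    shape = latent_shape(121, 576, 1024, ratio)
    print(ratio, shape, token_count(121, 576, 1024, ratio))
```

Under these assumptions the token count drops from 71,424 at 4×16×16 to 2,304 at 8×64×64, which goes in the same direction as the latency column in Table 1.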

Comparisons

This section provides video comparisons between our method and the baselines discussed in the paper.

LTX-Video-2B^[1]	CogVideoX-2B^[2]	Wan2.1-1.3B^[3]	Ours

Prompt: 3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bush.

Prompt: A cat sitting at a grand piano, elegantly playing a classical piece with its paws.

Prompt: A corgi vlogging itself in tropical Maui.

Prompt: A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.

Prompt: A pair of lovebirds preening each other's feathers.

Prompt: A skeleton wearing a flower hat and sunglasses dances in the wild at sunset.

[1] HaCohen, Yoav, et al. "LTX-Video: Realtime Video Latent Diffusion." https://arxiv.org/abs/2501.00103 (2024).
[2] Yang, Zhuoyi, et al. "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." ICLR (2025).
[3] Team Wan, et al. "Wan: Open and Advanced Large-Scale Video Generative Models." https://arxiv.org/abs/2503.20314 (2025).

*Videos are generated using the Diffusers v0.33.1 implementation.

Quantitative Comparison

Model	Params (B)	A100 (FPS)*	iPhone (FPS)*	VBench ↑
Wan2.1	1.3	0.2	✗	83.33
CogVideoX-2B	1.6	0.45	✗	80.91
LTX-Video	1.8	6.1	✗	80.00
Ours (Server)	2.0	6.4	✗	83.09
Ours (Mobile)	0.9	151.3	~15	81.45

*: Latency is benchmarked per denoising step on an NVIDIA A100-SXM4-80GB GPU and an iPhone 16 Pro Max.

BibTeX