SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

1Snap Inc., 2Northeastern University, 3Rutgers University
*Equal Contribution

Abstract

We have witnessed unprecedented success in diffusion-based video generation over the past year. Recently proposed models from the community can generate cinematic, high-resolution videos with smooth motion from arbitrary input prompts. However, as a task that subsumes image generation, video generation requires more computation, so these models are mostly hosted on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework that brings the power of large-scale video diffusion models to the hands of edge users. On the network-architecture side, we initialize from a compact image backbone and search for the design and arrangement of temporal layers that maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the number of denoising steps to 4. Our model, with only 0.6B parameters, generates a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.

Method

Our objective is high-fidelity, temporally consistent video generation on mobile devices. Current text-to-video diffusion models face two key challenges in reaching this goal: (a) their memory and computation requirements exceed the capability of even the most powerful mobile chips, e.g., the Apple A18 Pro, and (b) denoising over dozens of steps to produce a single output further slows down generation. To address these challenges, we propose a three-stage framework to accelerate video diffusion models on mobile platforms. First, we prune a pre-trained text-to-image diffusion model to obtain an efficient spatial backbone. Second, we inflate the spatial backbone with a novel combination of temporal modules selected by a search guided by our mobile-oriented metrics. Finally, through adversarial training, our efficient model attains the capability to generate high-quality videos in only 4 steps.
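The project page does not ship reference code for the temporal inflation, so the PyTorch sketch below is only an illustration of the idea: a hypothetical TemporalMixer (per-pixel temporal self-attention) is wrapped around a pruned 2D spatial block. The layer choices, channel counts, and placement are our own assumptions, not the searched SnapGen-V architecture.

```python
# Hypothetical sketch of "inflating" a pruned 2D spatial block with a temporal
# module; module names and shapes are illustrative assumptions, not the
# released SnapGen-V architecture.
import torch
import torch.nn as nn


class TemporalMixer(nn.Module):
    """Lightweight temporal self-attention applied independently per spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed)[0]
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)


class InflatedBlock(nn.Module):
    """Wraps a pruned 2D spatial block (from a text-to-image model) and adds a temporal mixer."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block          # pruned text-to-image block
        self.temporal = TemporalMixer(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Run the 2D block on every frame independently, then mix across time.
        # (For simplicity the sketch assumes the spatial block preserves shape.)
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return self.temporal(x)
```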

Figure: Overview of our method.
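The final 4-step generation stage can likewise be pictured with a minimal sampling loop. The denoiser and decode_video interfaces, the timestep schedule, and the linear (flow-style) re-noising between steps below are illustrative assumptions rather than the exact distilled sampler used in SnapGen-V.

```python
# Minimal 4-step sampling sketch; the timestep schedule, re-noising rule, and
# the `denoiser` / `decode_video` interfaces are assumptions for illustration.
import torch


@torch.no_grad()
def sample_video(denoiser, decode_video, text_emb, shape, steps=(999, 749, 499, 249)):
    """Generate a video latent in four denoising steps and decode it to frames."""
    latents = torch.randn(shape)                      # (B, T, C, H, W) latent noise
    for i, t in enumerate(steps):
        t_cur = torch.full((shape[0],), t, dtype=torch.long)
        # Assume the distilled model predicts a clean-latent estimate directly.
        x0_pred = denoiser(latents, t_cur, text_emb)
        if i + 1 < len(steps):
            # Re-noise the prediction to the next (smaller) noise level
            # with a simple linear interpolation between signal and noise.
            t_next = steps[i + 1] / 1000.0
            noise = torch.randn_like(latents)
            latents = (1.0 - t_next) * x0_pred + t_next * noise
        else:
            latents = x0_pred
    return decode_video(latents)                      # pixel-space frames
```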

Comparisons

In this section, we provide video comparisons between our method and the other approaches discussed in the paper.

Video panels: OpenSora-v1.2, CogVideoX-2B, Ours.

Quantitative Comparison

Model            | Type | Steps | Params (B) | A100 (s)* | iPhone (s)* | VBench ↑
OpenSora-v1.2    | DiT  | 30    | 1.2        | 31.00     | -           | 79.76
CogVideoX-2B     | DiT  | 50    | 1.6        | 54.09     | -           | 80.91
AnimateDiff-V2   | UNet | 25    | 1.2        | 9.04      | -           | 80.27
AnimateDiff-LCM  | UNet | 4     | 1.2        | 1.77      | -           | 79.42
SnapGen-V        | UNet | 4     | 0.6        | 0.47      | 4.12        | 81.14

*: Latency is benchmarked on an NVIDIA A100-SXM4-80GB GPU and an iPhone 16 Pro Max.
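The exact measurement protocol is not spelled out on this page; a common way to time GPU inference, sketched below assuming a callable pipe that runs one full generation, is to warm up, synchronize, and average wall-clock time over several runs.

```python
# Illustrative GPU latency measurement (not necessarily the authors' protocol):
# warm up, synchronize, then average wall-clock time over several runs.
import time
import torch


def benchmark(pipe, prompt, warmup=3, runs=10):
    for _ in range(warmup):                  # warm-up to amortize compilation/caching
        pipe(prompt)
    torch.cuda.synchronize()                 # make sure warm-up kernels have finished
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt)
    torch.cuda.synchronize()                 # wait for all timed kernels to finish
    return (time.perf_counter() - start) / runs
```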

Mobile Demo on iPhone 16 Pro Max

BibTeX

@misc{wu2024snapgenvgeneratingfivesecondvideo,
  title={SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device},
  author={Yushu Wu and Zhixing Zhang and Yanyu Li and Yanwu Xu and Anil Kag and Yang Sui and Huseyin Coskun and Ke Ma and Aleksei Lebedev and Ju Hu and Dimitris Metaxas and Yanzhi Wang and Sergey Tulyakov and Jian Ren},
  year={2024},
  eprint={2412.10494},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2412.10494},
}