SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

1Snap Inc., 2Northeastern University, 3Rutgers University
*Equal Contribution

Abstract

We have witnessed unprecedented success in diffusion-based video generation over the past year. Recently proposed models from the community can generate cinematic, high-resolution videos with smooth motion from arbitrary input prompts. However, as a task that subsumes image generation, video generation requires more computation, so these models are mostly hosted on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework that brings the power of large-scale video diffusion models to the hands of edge users. On the network-architecture side, we initialize from a compact image backbone and search for the design and arrangement of temporal layers that maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the number of denoising steps to 4. Our model, with only 0.6B parameters, generates a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.

Method

Our objective is high-fidelity, temporally consistent video generation on mobile devices. Current text-to-video diffusion models face two key challenges in reaching this goal: (a) their memory and computation requirements exceed the capability of even the most powerful mobile chips, e.g., the Apple A18 Pro, and (b) denoising over dozens of steps to produce a single output further slows down generation. To address these challenges, we propose a three-stage framework to accelerate video diffusion models on mobile platforms. First, we prune a pre-trained text-to-image diffusion model to obtain an efficient spatial backbone. Second, we inflate the spatial backbone with a novel combination of temporal modules selected by a search guided by our mobile-oriented metrics. Finally, through adversarial training, our efficient model attains the capability to generate high-quality videos in only 4 steps.
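The project page does not ship reference code for the temporal inflation, so the PyTorch sketch below is only an illustration of the idea: a hypothetical TemporalMixer (per-pixel temporal self-attention) is wrapped around a pruned 2D spatial block. The layer choices, channel counts, and placement are our own assumptions, not the searched SnapGen-V architecture.

```python
# Hypothetical sketch of "inflating" a pruned 2D spatial block with a temporal
# module; module names and shapes are illustrative assumptions, not the
# released SnapGen-V architecture.
import torch
import torch.nn as nn


class TemporalMixer(nn.Module):
    """Lightweight temporal self-attention applied independently per spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed)[0]
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)


class InflatedBlock(nn.Module):
    """Wraps a pruned 2D spatial block (from a text-to-image model) and adds a temporal mixer."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block          # pruned text-to-image block
        self.temporal = TemporalMixer(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Run the 2D block on every frame independently, then mix across time.
        # (For simplicity the sketch assumes the spatial block preserves shape.)
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return self.temporal(x)
```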

Figure: Overview of our method.
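The final 4-step generation stage can likewise be pictured with a minimal sampling loop. The denoiser and decode_video interfaces, the timestep schedule, and the linear (flow-style) re-noising between steps below are illustrative assumptions rather than the exact distilled sampler used in SnapGen-V.

```python
# Minimal 4-step sampling sketch; the timestep schedule, re-noising rule, and
# the `denoiser` / `decode_video` interfaces are assumptions for illustration.
import torch


@torch.no_grad()
def sample_video(denoiser, decode_video, text_emb, shape, steps=(999, 749, 499, 249)):
    """Generate a video latent in four denoising steps and decode it to frames."""
    latents = torch.randn(shape)                      # (B, T, C, H, W) latent noise
    for i, t in enumerate(steps):
        t_cur = torch.full((shape[0],), t, dtype=torch.long)
        # Assume the distilled model predicts a clean-latent estimate directly.
        x0_pred = denoiser(latents, t_cur, text_emb)
        if i + 1 < len(steps):
            # Re-noise the prediction to the next (smaller) noise level
            # with a simple linear interpolation between signal and noise.
            t_next = steps[i + 1] / 1000.0
            noise = torch.randn_like(latents)
            latents = (1.0 - t_next) * x0_pred + t_next * noise
        else:
            latents = x0_pred
    return decode_video(latents)                      # pixel-space frames
```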

Comparisons

In this section, we provide video comparisons between our method and the other approaches discussed in the paper.

Video panels: OpenSora-v1.2, CogVideoX-2B, Ours.

Quantitative Comparison

Model            | Type | Steps | Params (B) | A100 (s)* | iPhone (s)* | VBench ↑
OpenSora-v1.2    | DiT  | 30    | 1.2        | 31.00     | -           | 79.76
CogVideoX-2B     | DiT  | 50    | 1.6        | 54.09     | -           | 80.91
AnimateDiff-V2   | UNet | 25    | 1.2        | 9.04      | -           | 80.27
AnimateDiff-LCM  | UNet | 4     | 1.2        | 1.77      | -           | 79.42
SnapGen-V        | UNet | 4     | 0.6        | 0.47      | 4.12        | 81.14

*: Latency is benchmarked on an NVIDIA A100-SXM4-80GB GPU and an iPhone 16 Pro Max.
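The exact measurement protocol is not spelled out on this page; a common way to time GPU inference, sketched below assuming a callable pipe that runs one full generation, is to warm up, synchronize, and average wall-clock time over several runs.

```python
# Illustrative GPU latency measurement (not necessarily the authors' protocol):
# warm up, synchronize, then average wall-clock time over several runs.
import time
import torch


def benchmark(pipe, prompt, warmup=3, runs=10):
    for _ in range(warmup):                  # warm-up to amortize compilation/caching
        pipe(prompt)
    torch.cuda.synchronize()                 # make sure warm-up kernels have finished
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt)
    torch.cuda.synchronize()                 # wait for all timed kernels to finish
    return (time.perf_counter() - start) / runs
```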

Mobile Demo on iPhone 16 Pro Max

BibTeX

@misc{wu2024snapgenvgeneratingfivesecondvideo,
  title={SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device},
  author={Yushu Wu and Zhixing Zhang and Yanyu Li and Yanwu Xu and Anil Kag and Yang Sui and Huseyin Coskun and Ke Ma and Aleksei Lebedev and Ju Hu and Dimitris Metaxas and Yanzhi Wang and Sergey Tulyakov and Jian Ren},
  year={2024},
  eprint={2412.10494},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2412.10494},
}