Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Willi Menapace1,2,* Aliaksandr Siarohin1 Ivan Skorokhodov1 Ekaterina Deyneka1 Tsai-Shien Chen1,3,*
Anil Kag1 Yuwei Fang1 Aleksei Stoliar1 Elisa Ricci2,4 Jian Ren1 Sergey Tulyakov1

1 Snap Inc., 2 University of Trento, 3 UC Merced, 4 Fondazione Bruno Kessler
* Work performed while interning at Snap Inc.


Our Stories

Snap Video can assist designers in creating long stories. We use an LLM to generate a story plot, video prompts for the individual scenes, and scripts for the audio narration. We generate all video assets with our model, tuning the video prompts until we obtain the desired visuals, and then synthesize the audio narration.

Postproduction software is used to assemble the final video. The generated video assets are trimmed and composed into a sequence to form the video track, to which text overlays are added. Background music is inserted and the synthesized audio narration is aligned to the video content to generate the final result.
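The assembly step described above can be pictured as simple timeline bookkeeping. The following is only a minimal sketch under stated assumptions, not the tooling used for these stories: the names `Clip`, `Placement`, `build_video_track`, and `align_narration` are hypothetical, clips are assumed to be laid back to back with no transitions, and each narration segment is assumed to start together with its scene.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """A generated video asset and the portion kept after trimming (seconds)."""
    name: str
    trim_start: float
    trim_end: float

    @property
    def duration(self) -> float:
        return self.trim_end - self.trim_start


@dataclass
class Placement:
    """Where a trimmed clip sits on the final video track (seconds)."""
    name: str
    start: float
    end: float


def build_video_track(clips: list[Clip]) -> list[Placement]:
    """Compose trimmed clips into one sequence, placed back to back."""
    track: list[Placement] = []
    cursor = 0.0
    for clip in clips:
        track.append(Placement(clip.name, cursor, cursor + clip.duration))
        cursor += clip.duration
    return track


def align_narration(
    track: list[Placement], narration_seconds: dict[str, float]
) -> list[tuple[str, float, bool]]:
    """Start each narration segment with its scene (an assumption of this sketch).

    Returns (scene name, narration start time, fits) where `fits` is False
    when the narration is longer than the scene it accompanies.
    """
    cues = []
    for placement in track:
        length = narration_seconds.get(placement.name, 0.0)
        fits = length <= placement.end - placement.start
        cues.append((placement.name, placement.start, fits))
    return cues


# Hypothetical example data: two trimmed scenes and their narration lengths.
clips = [Clip("scene_1", 0.5, 4.5), Clip("scene_2", 1.0, 3.0)]
track = build_video_track(clips)
cues = align_narration(track, {"scene_1": 3.0, "scene_2": 2.5})
```

In this sketch a cue whose `fits` flag is False marks a scene where the narration runs longer than the trimmed clip, so either the clip would need a longer trim or the script would need shortening.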

Invasion

Detective Whiskers

Medieval Rom-com

Breakfast Burrito
