Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Willi Menapace¹,²,*, Aliaksandr Siarohin¹, Ivan Skorokhodov¹, Ekaterina Deyneka¹, Tsai-Shien Chen¹,³,*,
Anil Kag¹, Yuwei Fang¹, Aleksei Stoliar¹, Elisa Ricci²,⁴, Jian Ren¹, Sergey Tulyakov¹

¹Snap Inc. ²University of Trento ³UC Merced ⁴Fondazione Bruno Kessler
*Work performed while interning at Snap Inc.

Our Stories

Snap Video can assist designers in generating long stories. We use an LLM to generate a story plot, video prompts for the different scenes, and scripts for the audio narration. We generate all video assets with our model, tuning the video prompts to obtain the desired visuals, and then synthesize the audio narration.

Postproduction software is used to assemble the final video. The generated video assets are trimmed and composed into a sequence to form the video track, to which text overlays are added. Background music is inserted and the synthesized audio narration is aligned to the video content to generate the final result.

Invasion

Detective Whiskers

Medieval Rom-com

Breakfast Burrito
