Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Willi Menapace1,2,* Aliaksandr Siarohin1 Ivan Skorokhodov1 Ekaterina Deyneka1 Tsai-Shien Chen1,3,*
Anil Kag1 Yuwei Fang1 Aleksei Stoliar1 Elisa Ricci2,4 Jian Ren1 Sergey Tulyakov1

Snap Inc.1 University of Trento2 UC Merced3 Fondazione Bruno Kessler4 Work performed while interning at Snap Inc.*


Hierarchical Generation

We devise a hierarchical generation strategy to increase video duration and framerate. To condition the video generator on previously generated frames, we adopt the reconstruction guidance method of "Video Diffusion Models". We define a hierarchy of progressively increasing framerates and first autoregressively generate a video of the desired length at the lowest framerate, using the last generated frame as the conditioning at each step. Then, for each successive framerate in the hierarchy, we autoregressively generate a video of the same length, conditioning the model on all frames that have already been generated at the lower framerates.
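The order of generator calls described above can be sketched as a small scheduling routine. This is only an illustration under stated assumptions, not the authors' implementation: the function name `hierarchical_schedule` and the parameters `strides` (frame spacing at each level of the hierarchy, coarsest first) and `window` (frames handled per generator call) are hypothetical, and consecutive windows are assumed to overlap by exactly one frame. The sketch only plans which frames are conditioning and which are new; the reconstruction-guided sampling itself is not shown.

```python
from typing import List, Tuple


def hierarchical_schedule(
    num_frames: int, strides: List[int], window: int
) -> List[Tuple[List[int], List[int]]]:
    """Plan generator calls as (conditioning_frames, new_frames) pairs.

    Frame indices refer to the final, highest-framerate timeline.
    `strides` lists the frame spacing per hierarchy level, coarsest first
    (hypothetical parameterization, e.g. [4, 2, 1]).
    """
    if window < 2:
        raise ValueError("window must cover at least two frames")

    generated = set()  # frames produced so far, at any level
    steps = []
    for stride in strides:
        # Frames that exist at this framerate level.
        level_frames = list(range(0, num_frames, stride))
        start = 0
        while True:
            chunk = level_frames[start:start + window]
            # Frames already produced (earlier windows or lower framerates)
            # act as conditioning; the remaining frames are generated now.
            cond = [i for i in chunk if i in generated]
            new = [i for i in chunk if i not in generated]
            if new:
                steps.append((cond, new))
                generated.update(new)
            if start + window >= len(level_frames):
                break
            # Assumed overlap of one frame, so the next window is
            # conditioned on the last frame of the current one.
            start += window - 1
    return steps


if __name__ == "__main__":
    for cond, new in hierarchical_schedule(17, [4, 2, 1], 3):
        print(f"condition on {cond} -> generate {new}")
```

With 17 frames, strides of 4, 2 and 1, and a window of 3, the first call generates frames 0, 4 and 8 without conditioning, and the second generates frames 12 and 16 conditioned on frame 8. Every later call fills in frames between ones that already exist at a lower framerate.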

We show a selection of 32-frame videos sampled at 12 fps.

The prompt for each video is listed below.

A chihuahua in astronaut suit floating in space, photo realistic, 8k, cinematic lighting, hd, atmospheric, hyperdetailed, photography, glow effect.
In a high-tech control room, otters operate an imaginary spaceship console, embarking on an interstellar adventure. Cinematic lighting effects enhance the futuristic setting, and the camera executes quick cuts to showcase the excitement of their space journey.
In Macro len style, a photograph of a knight in shining armor holding a basketball
A corgi ice skating in winter wonderland, photorealistic
In a potter's studio, skilled hands mold clay into a delicate sculpture. Utilize sweeping arcs to highlight the shaping process, emphasizing the intricate details emerging from the artist's touch.