4Real-Video
Learning Generalizable Photo-Realistic 4D Video Diffusion

1 Snap Inc., 2 UMass Amherst, 3 KAUST
*main contributor, project lead

4Real-Video is a diffusion model that generates 4D video -- a grid of video frames with both time and viewpoint axes.

Each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint.
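The grid layout described above can be sketched as a small indexing example. This is only an illustration under stated assumptions: the grid sizes, the string frame labels, and the helper names (`make_grid`, `freeze_time_video`, `fixed_view_video`) are hypothetical and do not come from the 4Real-Video code.

```python
# Hypothetical sizes, chosen only for illustration.
NUM_TIMESTEPS = 3  # number of rows (one row per timestep)
NUM_VIEWS = 4      # number of columns (one column per viewpoint)


def make_grid(num_timesteps, num_views):
    """Build a grid of frame labels; grid[t][v] is the frame at timestep t, viewpoint v."""
    return [
        [f"frame(t={t}, v={v})" for v in range(num_views)]
        for t in range(num_timesteps)
    ]


def freeze_time_video(grid, t):
    """Return one row: every viewpoint at the single timestep t."""
    return list(grid[t])


def fixed_view_video(grid, v):
    """Return one column: every timestep seen from the single viewpoint v."""
    return [row[v] for row in grid]


grid = make_grid(NUM_TIMESTEPS, NUM_VIEWS)
print(freeze_time_video(grid, 0))  # camera moves, time is frozen
print(fixed_view_video(grid, 0))   # time advances, camera is fixed
```

Reading along a row gives a freeze-time video, and reading down a column gives a fixed-viewpoint video.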

Animate Real-World Scene in 4D

Given a casually captured real-world scene, 4Real-Video can transform it into 4D animations driven by text prompts.


"a ghostly toy dancing."

Animate 3D Assets

4Real-Video can also animate 3D assets seamlessly across multiple views.


Generate 4D Videos from Text

Finally, 4Real-Video can create 4D videos directly from text input.

"A panda eating ice-cream."
"A bulldog wearing a black pirate hat eating candy."

4Real-Video Overview. Left: we initialize the grid of frames with a (generated or real) fixed-viewpoint video in the first row and a freeze-time video in the first column. Middle: our architecture consists of two parallel token streams. The top stream processes \(\mathbf{x}_l^\text{v}\) with viewpoint updates, and the bottom stream processes \(\mathbf{x}_l^\text{t}\) with temporal updates. A synchronization layer then computes the new tokens \(\mathbf{x}_{l+1}^\text{v}\) and \(\mathbf{x}_{l+1}^\text{t}\) for the next layer of the diffusion transformer. Right: we propose two implementations of the synchronization layer: hard and soft synchronization.
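The two-stream layer in the overview can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the page does not define the update blocks or the two synchronization variants, so the stand-in updates, the averaging used for `hard_sync`, the mixing weight `alpha` used for `soft_sync`, and all function names are hypothetical and should not be read as the model's actual implementation.

```python
def viewpoint_update(tokens):
    """Stand-in for the viewpoint-update block (hypothetical: adds 1.0 to each token)."""
    return [x + 1.0 for x in tokens]


def temporal_update(tokens):
    """Stand-in for the temporal-update block (hypothetical: doubles each token)."""
    return [x * 2.0 for x in tokens]


def hard_sync(xv, xt):
    """Assumed 'hard' variant: both streams are replaced by their average."""
    merged = [(a + b) / 2.0 for a, b in zip(xv, xt)]
    return merged, list(merged)


def soft_sync(xv, xt, alpha=0.25):
    """Assumed 'soft' variant: each stream moves toward the other by weight alpha."""
    new_v = [(1 - alpha) * a + alpha * b for a, b in zip(xv, xt)]
    new_t = [(1 - alpha) * b + alpha * a for a, b in zip(xv, xt)]
    return new_v, new_t


def two_stream_layer(xv, xt, sync):
    """One layer: update each stream in parallel, then synchronize them."""
    return sync(viewpoint_update(xv), temporal_update(xt))


x_v = [1.0, 2.0]  # toy tokens for the viewpoint stream
x_t = [1.0, 2.0]  # toy tokens for the temporal stream
print(two_stream_layer(x_v, x_t, hard_sync))
print(two_stream_layer(x_v, x_t, soft_sync))
```

In this toy version, the hard variant forces both streams to carry identical tokens after every layer, while the soft variant lets them stay distinct and only exchanges part of their content.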


More Text-to-4D Video Generation Results

Comparison with MotionCtrl and SV4D

MotionCtrl struggles to generate temporally coherent videos because its freeze-time frames are produced independently, neglecting temporal dependencies. Additionally, it often generates minimal camera motion even when provided with high input speed. Conversely, SV4D produces poor visual quality as realistic scene generation falls outside its training domain. In contrast, our method generates realistic and temporally coherent frame grids, ensuring high-quality video outputs.

4Real-Video (Ours)

MotionCtrl

SV4D

Acknowledgement

We extend our gratitude to Heng Yu, Moayed Haji Ali, Sherwin Bahmani, Jiahao Luo, and Guochen Qian for their valuable assistance with data preparation and model pretraining.