4Real-Video
Learning Generalizable Photo-Realistic 4D Video Diffusion

Chaoyang Wang¹*†, Peiye Zhuang¹*, Tuan Duc Ngo¹,²*, Willi Menapace¹, Aliaksandr Siarohin¹, Michael Vasilkovsky¹, Ivan Skorokhodov¹, Sergey Tulyakov¹, Peter Wonka¹,³, Hsin-Ying Lee¹

¹Snap Inc., ²UMass Amherst, ³KAUST

*main contributor, †project lead

4Real-Video is a diffusion model that generates 4D video -- a grid of video frames with both time and viewpoint axes.
Each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint.
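
The grid layout described above can be sketched as a small array. This is only an illustration: the sizes, the `grid` name, and the choice of NumPy are assumptions made here, not details of the model.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
T, V = 4, 3          # number of timesteps (rows) and viewpoints (columns)
H, W, C = 2, 2, 3    # tiny frame size

# grid[t, v] is the frame at timestep t seen from viewpoint v.
grid = np.zeros((T, V, H, W, C), dtype=np.float32)
for t in range(T):
    for v in range(V):
        grid[t, v] = 10 * t + v  # tag each frame so slices are easy to check

# A row shares one timestep: a freeze-time video that sweeps over viewpoints.
freeze_time_video = grid[2]

# A column shares one viewpoint: a fixed-view video that plays over time.
fixed_view_video = grid[:, 1]

print(freeze_time_video.shape)  # (3, 2, 2, 3)
print(fixed_view_video.shape)   # (4, 2, 2, 3)
```

Reading along a row gives a freeze-time video and reading down a column gives a fixed-view video, the two kinds of slices shown in the results.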

Animate Real-World Scenes in 4D
Given a casually captured real-world scene, 4Real-Video can transform it into 4D animations driven by text prompts.

"a yellow toy building brick turn into a cat."

"move the truck to the right."

"a cat pillow dancing."

"a ghostly toy dancing."

Generate 4D Videos from Text
4Real-Video can also create 4D videos directly from text input.

"A panda eating ice-cream."

"A bulldog wearing a black pirate hat eating candy."

4Real-Video Overview: Left: we initialize the grid of frames with a (generated or real) fixed-viewpoint video in the first row and a freeze-time video in the first column. Middle: our architecture consists of two parallel token streams. The top stream processes \(\mathbf{x}_l^\text{v}\) with viewpoint updates, and the bottom stream processes \(\mathbf{x}_l^\text{t}\) with temporal updates. A synchronization layer then computes the new tokens \(\mathbf{x}_{l+1}^\text{v}\) and \(\mathbf{x}_{l+1}^\text{t}\) for the next layer of the diffusion transformer. Right: we propose two implementations of the synchronization layer: hard and soft synchronization.
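
To make the data flow in the overview concrete, here is a minimal NumPy sketch. It is not the actual architecture: the update functions are simple averaging stand-ins for the viewpoint and temporal layers, and the formulas used for `hard_sync` and `soft_sync` (with their hypothetical parameters `w` and `alpha`) are assumptions made for illustration, since the exact definitions are not given here. Arrays are indexed as [time, view], with no assumption about which axis the figure draws as rows.

```python
import numpy as np

def init_grid(fixed_view_video, freeze_time_video):
    """Place the two input videos into an otherwise empty [time, view] grid.

    fixed_view_video:  (T, D) tokens, every timestep of the reference viewpoint.
    freeze_time_video: (V, D) tokens, every viewpoint of the reference timestep.
    Returns the grid and a boolean mask marking the cells that are given.
    """
    T, D = fixed_view_video.shape
    V = freeze_time_video.shape[0]
    grid = np.zeros((T, V, D), dtype=np.float32)
    known = np.zeros((T, V), dtype=bool)
    grid[:, 0] = fixed_view_video    # all timesteps at view 0
    grid[0, :] = freeze_time_video   # all views at timestep 0 (cell [0, 0] is shared)
    known[:, 0] = True
    known[0, :] = True
    return grid, known

def view_update(x):
    # Stand-in for a viewpoint update: mix tokens along the view axis.
    return 0.5 * x + 0.5 * x.mean(axis=1, keepdims=True)

def time_update(x):
    # Stand-in for a temporal update: mix tokens along the time axis.
    return 0.5 * x + 0.5 * x.mean(axis=0, keepdims=True)

def hard_sync(x_v, x_t, w=0.5):
    # Assumed form: both streams leave the layer with the same blended tokens.
    merged = w * x_v + (1.0 - w) * x_t
    return merged, merged.copy()

def soft_sync(x_v, x_t, alpha=0.25):
    # Assumed form: each stream moves part-way toward the other but stays distinct.
    return x_v + alpha * (x_t - x_v), x_t + alpha * (x_v - x_t)

def layer(x_v, x_t, sync):
    # One layer: update the two streams in parallel, then synchronize them.
    return sync(view_update(x_v), time_update(x_t))

rng = np.random.default_rng(0)
grid, known = init_grid(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
hard_v, hard_t = layer(grid, grid, hard_sync)
soft_v, soft_t = layer(grid, grid, soft_sync)
```

Under these assumed definitions, the hard variant forces the two streams to agree after every layer, while the soft variant only narrows the gap between them.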

Comparison with MotionCtrl and SV4D

MotionCtrl struggles to generate temporally coherent videos because its freeze-time frames are produced independently, ignoring temporal dependencies. It also tends to generate minimal camera motion even when given a high input camera speed. SV4D produces poor visual quality because realistic scene generation falls outside its training domain. In contrast, our method generates realistic and temporally coherent frame grids, yielding high-quality videos.

(Videos compared: 4Real-Video (Ours), MotionCtrl, SV4D.)

Animate 3D Assets
4Real-Video can also animate 3D assets seamlessly across multiple views.

Interactive 4D Viewer

The reconstructed 4D scenes, represented as deformable 3D Gaussian Splats, can be explored directly in the browser, powered by Brush.

More Text-to-4D Video Generation Results

(Video gallery: eight results, each shown as fixed-view videos and freeze-time videos.)

Acknowledgement

We extend our gratitude to Heng Yu, Moayed Haji Ali, Sherwin Bahmani, Jiahao Luo, and Guochen Qian for their valuable assistance with data preparation and model pretraining. The interactive 4D viewer is borrowed from the CAT4D website.
