4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Snap Inc., Carnegie Mellon University
A hamster wearing a virtual reality headset is a DJ in a disco.
A bulldog wearing a black pirate hat eating candy.
A cat eating chicken and watching TV.
A crocodile playing drums.
Time lapse of a flower blooming open.

4Real generates 4D Gaussian splats from input text prompts and renders dynamic scenes that can be viewed from different viewpoints.


Abstract

Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline for photorealistic text-to-4D scene generation that removes the dependency on multi-view generative models and instead fully utilizes video generative models trained on diverse real-world datasets. Our method begins by generating a reference video with a video generation model. We then learn the canonical 3D representation of the scene from a freeze-time video, carefully generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation that models these imperfections. We then learn a temporal deformation on top of the canonical representation to capture the dynamics of the reference video. This pipeline yields dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.
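For readers who prefer a procedural view, the sketch below restates the pipeline from the abstract as code. It is a minimal illustration only: the function and class names (text_to_4d, GaussianScene, fit_canonical, fit_temporal, and the sampler/reconstructor arguments) are hypothetical placeholders, not the released implementation, and the stages simply mirror the prose above.

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class GaussianScene:
        canonical: Any          # canonical 3D Gaussian splats
        per_frame_deform: Any   # absorbs inconsistencies in the freeze-time video
        temporal_deform: Any    # models the scene dynamics over time

    def text_to_4d(prompt, video_model, freeze_time_sampler, reconstructor):
        """Hypothetical outline of the 4Real stages described in the abstract."""
        # 1. Reference video containing the full scene dynamics.
        reference_video = video_model.sample(prompt)

        # 2. Freeze-time video: the scene is (approximately) static while the
        #    camera viewpoint changes, derived from the reference video.
        freeze_time_video = freeze_time_sampler(reference_video)

        # 3. Canonical 3D Gaussians, fitted jointly with a small per-frame
        #    deformation that soaks up the freeze-time video's imperfections.
        canonical, per_frame_deform = reconstructor.fit_canonical(freeze_time_video)

        # 4. Temporal deformation of the canonical Gaussians so that renderings
        #    reproduce the motion observed in the reference video.
        temporal_deform = reconstructor.fit_temporal(canonical, reference_video)

        return GaussianScene(canonical, per_frame_deform, temporal_deform)

Novel space-time views are then obtained by deforming the canonical Gaussians to the requested time and splatting them from the requested camera.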

Space-time view synthesis from generated 4D Gaussian splats

Comparison with 4Dfy


(Side-by-side videos: Ours vs. 4Dfy.)

Comparison with Dream-in-4D


(Side-by-side videos: Ours vs. Dream-in-4D.)

Ablation study