4Real-Video-V2 computes a 4D spatio-temporal grid of video frames, together with 3D Gaussian particles for each time step, using a feed-forward architecture. The architecture has two main components: a 4D video diffusion model and a feed-forward reconstruction model.
This is a major upgrade over 4Real-Video, introducing a new 4D video diffusion model architecture that adds no parameters to the base video model. The key to the new design is a sparse attention pattern in which tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. This design scales easily to large pre-trained video models, is efficient to train, and generalizes well.
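To make the attention pattern concrete, the sketch below builds such a mask for a toy token grid. This is a minimal illustration, not the released implementation: the view-major token layout, the toy sizes, and the helper name build_sparse_4d_mask are assumptions made for the example.

```python
import torch

def build_sparse_4d_mask(V: int, T: int, S: int) -> torch.Tensor:
    """Boolean mask of shape (N, N), with N = V*T*S tokens.

    Entry (i, j) is True when token i may attend to token j, i.e. when the
    two tokens share a viewpoint, share a timestamp, or (as a special case
    of both) lie in the same frame. Layout assumption: view-major, then
    time, then spatial patches.
    """
    view = torch.arange(V).repeat_interleave(T * S)        # view index per token
    time = torch.arange(T).repeat_interleave(S).repeat(V)  # timestamp per token
    same_view = view[:, None] == view[None, :]
    same_time = time[:, None] == time[None, :]
    return same_view | same_time                           # same frame satisfies both

# Usage: pass the mask to scaled dot-product attention.
V, T, S = 4, 5, 16                    # toy sizes: 4 views, 5 timestamps, 16 patches per frame
mask = build_sparse_4d_mask(V, T, S)  # (320, 320) boolean mask
q = k = v = torch.randn(1, 8, V * T * S, 64)  # (batch, heads, tokens, dim)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Because the mask only reshapes which tokens see each other, it can be dropped into a pre-trained video model's attention layers without adding parameters, which is the property the design relies on.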
Explore videos generated by the 4D video diffusion model. Click on a thumbnail to view the corresponding fixed-view and frozen-time video demonstrations.
Fixed View:
Frozen Time:
Visual comparison of different architectures on sample Objaverse scenes: ours vs. 4Real-Video, each shown as fixed-view and frozen-time videos.