OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Abstract

Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% in multiview NVS LLFF dataset, 60% in dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model.

Teaser

Watch our comprehensive demonstration showcasing OmniView's capabilities across all supported tasks. This video presents conditioning inputs alongside generated results for dynamic novel view synthesis, static novel view synthesis, image-to-video generation, keyframe interpolation, and video-to-video translation.

Dynamic Novel View Synthesis

Camera-controlled video synthesis from input video. Given a video sequence captured from one viewpoint, our model generates the same scene from novel camera perspectives while preserving temporal dynamics. This task requires maintaining 3D consistency across viewpoints and temporal coherence across frames, ensuring that moving objects and camera motion are properly disentangled and rendered from new angles.

Conditioning

Generated

Play and Freeze Time

A unique capability where temporal and spatial controls are seamlessly combined. The video initially plays forward in time, then freezes at a specific moment while the camera moves to explore the scene from different viewpoints, and finally resumes temporal progression. This demonstrates our model's ability to decouple time and space, allowing independent control over when and where to observe the scene.

Conditioning

Generated

Image-to-Video Generation w/ Camera Control

Transform static images into dynamic videos with full camera control. Starting from a single input image, our model generates temporal dynamics while simultaneously allowing camera movement through the scene. This task requires hallucinating plausible 3D structure from 2D observations and animating the scene with realistic object motion, lighting changes, and camera trajectories.

Conditioning

Generated

Video-to-Video Generation w/ Camera Control

Extend short video clips into longer sequences with camera control. Given an input video segment, our model predicts future frames while allowing the camera trajectory to be specified independently. This enables creative control over both the temporal evolution of the scene and the viewing perspective, useful for applications like cinematography and content creation where specific camera movements are desired.

Conditioning

Generated

Keyframe Interpolation

Generate smooth video sequences between specified keyframe images with camera control. Given two or more keyframe images that define the start and end states, our model creates natural temporal transitions while maintaining 3D consistency. This is particularly useful for animation workflows where artists specify key poses or states, and the model fills in the intermediate frames with realistic motion and camera movement.

First Keyframe

Generated

Last Keyframe

Method

Our approach unifies multiple 4D consistency tasks through a shared framework that separately models space, time, and camera viewpoint conditions.

Citation

@misc{fan2025omniviewallseeingdiffusionmodel,
      title={OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis}, 
      author={Xiang Fan and Sharath Girish and Vivek Ramanujan and Chaoyang Wang and Ashkan Mirzaei and Petr Sushko and Aliaksandr Siarohin and Sergey Tulyakov and Ranjay Krishna},
      year={2025},
      eprint={2512.10940},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10940}, 
}