Wonderland: Navigating 3D Scenes from a Single Image


Given a single image, Wonderland generates a 3D scene from the latent space of a camera-guided video diffusion model in a feed-forward manner. The videos shown here are rendered from the generated 3DGS models.

Abstract

This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as the need for multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting (3DGS) representations of scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.

3D Scene Generation in One Shot

Given a single image, Wonderland generates a high-fidelity, wide-scope 3D scene in one shot. The demos are rendered from the generated 3DGS models.


Extensive Navigation over 3D Scenes by Autoregressive Generation

Given a single image, Wonderland enables extensive navigation over 3D scenes through autoregressive generation. The demos are rendered from the generated 3DGS models.


Camera-guided Video Generation

From a single image, the proposed camera-guided video diffusion model generates videos that precisely follow specified camera trajectories. The demos show videos generated from different image prompts using the same camera trajectories.


Extensive Exploration over Scenes using Multiple Camera Trajectories

From a single image, the proposed camera-guided video diffusion model extensively explores the scene while precisely following specified camera trajectories. The demos show videos generated conditioned on a single image and multiple camera trajectories.


Method

Given a single image, the camera-guided video diffusion model follows a specified camera trajectory and generates a 3D-aware video latent, which the latent-based large reconstruction model (LaLRM) lifts into a 3D scene in a feed-forward manner. The video diffusion model uses dual-branch camera conditioning to achieve precise pose control, while the LaLRM operates directly in the latent space to efficiently reconstruct a wide-scope, high-fidelity 3D scene.
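To make the two-stage design concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline: a stand-in for the camera-guided video diffusion model that fuses per-frame poses into the video latent through two conditioning branches, followed by a feed-forward reconstruction head that regresses 3DGS parameters directly from that latent. All class names, tensor shapes, the FiLM-style conditioning, and the 14-channel Gaussian parameterization are illustrative assumptions, not the released Wonderland implementation.

```python
# Hypothetical sketch of the two-stage pipeline described above.
import torch
import torch.nn as nn


class CameraGuidedVideoDiffusion(nn.Module):
    """Toy stand-in for the camera-guided video diffusion model.

    Consumes an image latent plus per-frame camera poses and produces a
    "3D-aware" video latent of shape (B, T, C, H, W).
    """

    def __init__(self, latent_dim=16, pose_dim=12):
        super().__init__()
        # Dual-branch camera conditioning (assumption): one branch adds a pose
        # embedding, the other modulates features with FiLM-style scale/shift.
        self.pose_embed = nn.Linear(pose_dim, latent_dim)
        self.pose_film = nn.Linear(pose_dim, 2 * latent_dim)
        self.denoiser = nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, image_latent, poses):
        # image_latent: (B, C, H, W); poses: (B, T, pose_dim)
        b, c, h, w = image_latent.shape
        t = poses.shape[1]
        x = image_latent.unsqueeze(1).expand(b, t, c, h, w).clone()

        # Branch 1: additive pose embedding broadcast over spatial dims.
        x = x + self.pose_embed(poses)[..., None, None]
        # Branch 2: FiLM-style modulation driven by the same poses.
        scale, shift = self.pose_film(poses).chunk(2, dim=-1)
        x = x * (1 + scale[..., None, None]) + shift[..., None, None]

        # (B, T, C, H, W) -> (B, C, T, H, W) for the 3D-conv "denoiser".
        x = self.denoiser(x.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)
        return x  # 3D-aware video latent


class LatentLargeReconstructionModel(nn.Module):
    """Toy stand-in for the LaLRM: maps video latents to per-pixel Gaussians."""

    def __init__(self, latent_dim=16, gaussian_dim=14):
        super().__init__()
        # Per Gaussian: xyz (3) + scale (3) + rotation quat (4) + opacity (1) + RGB (3).
        self.head = nn.Conv2d(latent_dim, gaussian_dim, kernel_size=1)

    def forward(self, video_latent):
        b, t, c, h, w = video_latent.shape
        gaussians = self.head(video_latent.reshape(b * t, c, h, w))
        return gaussians.reshape(b, t, -1, h, w)  # feed-forward 3DGS parameters


# Single image -> 3DGS in one feed-forward pass (random tensors as stand-ins).
diffusion = CameraGuidedVideoDiffusion()
lalrm = LatentLargeReconstructionModel()
image_latent = torch.randn(1, 16, 32, 32)      # encoded input image
poses = torch.randn(1, 16, 12)                  # flattened 3x4 camera matrices
video_latent = diffusion(image_latent, poses)   # stage 1: 3D-aware video latent
gaussians = lalrm(video_latent)                 # stage 2: 3DGS parameters
print(gaussians.shape)                          # torch.Size([1, 16, 14, 32, 32])
```

The key design point the sketch tries to convey is that the reconstruction model never decodes the video to pixels: it consumes the compressed, camera-conditioned latent directly, which is what makes wide-scope scene reconstruction feed-forward and efficient.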

Citation

Acknowledgements

We would like to thank Hsin-Ying Lee, Chaoyang Wang, Peiye Zhuang, Yu Hong, Yujing Duan, and Siyu Zhang for their valuable discussions and assistance in the development of this work.