AC3D: Analyzing and Improving 3D Camera Control
in Video Diffusion Transformers

Sherwin Bahmani*1,2,3   Ivan Skorokhodov*3   Guocheng Qian3     Aliaksandr Siarohin3  
Willi Menapace3   Andrea Tagliasacchi1,4   David B. Lindell1,2   Sergey Tulyakov3  
1University of Toronto 2Vector Institute 3Snap Inc. 4SFU
* equal contribution

arXiv 2024


"Three fluffy sheep sit side by side at a rustic wooden table, each eagerly digging into their bowls of spaghetti."

Abstract

In this work, we analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust the train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that it implicitly performs camera pose estimation under the hood, and that only a sub-portion of its layers contains the camera information. This led us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, yielding a 4x reduction in training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera-control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate camera motion from scene motion and improves the dynamics of generated pose-conditioned videos. We combine these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.
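For intuition, the sketch below illustrates the two conditioning restrictions mentioned above: injecting camera information only into a subset of layers and only over part of the denoising schedule. The specific layer indices, timestep thresholds, and function names are hypothetical illustrations, not values taken from the paper or its released code.

```python
import torch

# Hypothetical hyper-parameters (illustrative only): which normalized denoising
# timesteps and which backbone layers receive camera conditioning.
COND_TIMESTEP_RANGE = (0.6, 1.0)   # condition only on the noisier, low-frequency part of the schedule
COND_LAYERS = set(range(0, 8))     # inject camera features into the first few blocks only

def maybe_add_camera_conditioning(hidden, layer_idx, t, camera_feat):
    """Add camera features to a block's hidden states only when both the current
    layer and the normalized timestep t in [0, 1] are eligible."""
    in_schedule = COND_TIMESTEP_RANGE[0] <= t <= COND_TIMESTEP_RANGE[1]
    if layer_idx in COND_LAYERS and in_schedule:
        hidden = hidden + camera_feat
    return hidden

# Usage with dummy tensors shaped like one block's video tokens:
h = torch.randn(1, 1024, 4096)    # hidden states of a backbone block
cam = torch.randn(1, 1024, 4096)  # camera features already projected to the backbone width
h = maybe_add_camera_conditioning(h, layer_idx=3, t=0.8, camera_feat=cam)
```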


Method

The VDiT-CC model adds ControlNet-style camera conditioning on top of VDiT. Video synthesis is performed by the large 4,096-dimensional DiT-XL blocks of the frozen VDiT backbone, while VDiT-CC processes and injects the camera information through lightweight 128-dimensional DiT-XS blocks (FC stands for fully-connected layers).

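The description above can be made concrete with a minimal PyTorch sketch of a ControlNet-style camera branch. Everything here is an assumption for illustration: the class names, the camera-token input width, the block count, and the attention/MLP internals are ours; only the 128-dimensional branch, the 4,096-dimensional frozen backbone, and the FC projections come from the caption.

```python
import torch
import torch.nn as nn

class DiTXSBlock(nn.Module):
    """Lightweight stand-in for a 128-dimensional DiT-XS block (illustrative only)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class CameraControlBranch(nn.Module):
    """ControlNet-style branch: small DiT-XS blocks process camera tokens, and
    fully-connected (FC) projections inject them into the frozen 4,096-dim backbone."""
    def __init__(self, cam_in_dim=6, dim=128, backbone_dim=4096, num_blocks=8):
        super().__init__()
        self.embed = nn.Linear(cam_in_dim, dim)  # FC: raw per-token camera encoding -> 128-dim tokens
        self.blocks = nn.ModuleList(DiTXSBlock(dim) for _ in range(num_blocks))
        self.to_backbone = nn.ModuleList(nn.Linear(dim, backbone_dim) for _ in range(num_blocks))

    def forward(self, cam_tokens):
        x = self.embed(cam_tokens)
        residuals = []
        for block, proj in zip(self.blocks, self.to_backbone):
            x = block(x)
            residuals.append(proj(x))  # one residual per conditioned backbone block
        return residuals               # added to the frozen VDiT hidden states

# Usage with dummy camera tokens (batch of 1, 1024 tokens, 6-dim encoding):
branch = CameraControlBranch()
residuals = branch(torch.randn(1, 1024, 6))
```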

Our Results

We visualize a sequence of 8 different camera trajectories (40 seconds total) shared across all prompts.
 
In a sophisticated art studio, a cat wearing a beret sits at an easel, delicately painting on a tiny canvas.
In a futuristic kitchen, an astronaut expertly cooks with a pan over a small, controlled flame. There is a pond with a group of curious ducks that swim nearby.
A teddy bear diligently washes dishes in a cozy kitchen.
A golden retriever, sitting on the sand at a tropical beach, eagerly devours an ice cream cone. The sun sets in the background, casting a golden hue over the calm waves.
A squirrel sits contentedly on a park bench, nibbling on a juicy burger with its tiny paws. The park around it is filled with trees and flowers in full bloom, and a few curious birds watch from nearby branches.
An otter, expertly operating an espresso machine in a cozy, warmly lit café, moves its tiny paws with great precision as it grinds fresh coffee beans and steams milk.
In a chic urban kitchen, a cat wearing a small chef's hat expertly kneads dough on a sleek marble countertop.
An astronaut cooking with a pan in the kitchen.
A cyborg koala, wearing a pair of headphones and standing in front of a high-tech turntable, DJs on a rooftop in a futuristic, neon-lit Tokyo. The rain falls in sheets around it, creating a shimmering effect as it mixes beats.
Cats, dressed in formal attire, sit around an elaborate chessboard, each pondering their next strategic move in the tense match.
Amidst the ruined remnants of a once-thriving city, a lone robot scavenger sifts through the debris, its metallic fingers reaching through broken concrete and twisted metal in search of valuable salvage.
A mouse dressed in Renaissance attire, holding a slice of cheese delicately between its paws and eating it.

Citation

@article{bahmani2024ac3d,
  author = {Bahmani, Sherwin and Skorokhodov, Ivan and Qian, Guocheng and Siarohin, Aliaksandr and Menapace, Willi and Tagliasacchi, Andrea and Lindell, David B. and Tulyakov, Sergey},
  title = {AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers},
  journal = {arXiv preprint arXiv:2411.18673},
  year = {2024},
}