Unsupervised Volumetric Animation


Aliaksandr Siarohin¹, Willi Menapace²*, Ivan Skorokhodov³*, Kyle Olszewski¹, Hsin-Ying Lee¹, Jian Ren¹, Menglei Chai¹, Sergey Tulyakov¹
¹Snap Inc. ²University of Trento ³KAUST
*Work done while interning at Snap.

Abstract


We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos, and can decompose them into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable PnP algorithm, our model learns the underlying object geometry and parts decomposition in an entirely unsupervised manner. This allows it to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. We primarily evaluate the framework on two video datasets: VoxCeleb 256² and TEDXPeople 256². In addition, on the Cats 256² image dataset, we show that it learns compelling 3D geometry even from still images. Finally, we show that our model can obtain animatable 3D objects from a single image or a few images.

Animation Results:

Here we show animation results rendered under novel views within the [-15°, 15°] range on VoxCeleb and TEDXPeople. Our method uses a single image and a driving sequence to synthesize animations. Additionally, we show the depth, normals, and parts (LBS weights) predicted by our method in an unsupervised way. Note that our method successfully renders a wide range of pose changes and diverse object shapes.
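The parts visualized here correspond to linear blend skinning (LBS) weights. As a rough illustration of how such weights combine per-part rigid transforms, here is a minimal PyTorch sketch; the tensor names, shapes, and point-based formulation are our assumptions, not the actual implementation, which operates on a canonical volumetric representation.

import torch

def lbs_deform(points, weights, rotations, translations):
    """Blend per-part rigid transforms of canonical 3D points using LBS weights.

    points:       (N, 3)    canonical 3D points
    weights:      (N, P)    per-point part weights (each row sums to 1)
    rotations:    (P, 3, 3) rotation of each part
    translations: (P, 3)    translation of each part
    returns:      (N, 3)    deformed points
    """
    # Apply every part's rigid transform to every point: (P, N, 3).
    per_part = torch.einsum('pij,nj->pni', rotations, points) + translations[:, None, :]
    # Blend the per-part results with the skinning weights: (N, 3).
    return torch.einsum('np,pni->ni', weights, per_part)

# Toy usage with identity transforms: the output equals the input points.
N, P = 1024, 10
points = torch.randn(N, 3)
weights = torch.softmax(torch.randn(N, P), dim=-1)
rotations = torch.eye(3).expand(P, 3, 3)
translations = torch.zeros(P, 3)
deformed = lbs_deform(points, weights, rotations, translations)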


Comparison with SOTA methods:

Here we compare our method with two state-of-the-art 2D animation methods, LIA and MRAA. We show typical examples of animations generated by each method. Since our method uses a 3D representation for animation, it better preserves face proportions and head poses. MRAA significantly alters the shape and does not faithfully convey expressions. LIA slightly changes the proportions, while smoothing the image and altering its colors.


Novel view synthesis:

Here we show a reconstruction obtained from a single image and rotate it about the y-axis, covering a wide range of novel views. We also show the corresponding depth, normals, and LBS weights.
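As a small illustration of how such a sweep of novel views could be set up, the sketch below builds y-axis (yaw) rotation matrices for a range of angles; the angle range and the hypothetical render call are our assumptions, not part of the released code.

import numpy as np

def y_rotation(angle_deg):
    """Rotation matrix about the y-axis by angle_deg degrees."""
    a = np.deg2rad(angle_deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

# Sweep of yaw angles around the reconstructed object; each rotation would be
# handed to the volumetric renderer (a hypothetical render(R) call, omitted here).
yaw_sweep = [y_rotation(a) for a in np.linspace(-90.0, 90.0, num=19)]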


Novel view synthesis on image dataset:

Here the model is trained on images only. Despite this, due to the 3D inductive bias provided by PnP, our method discovers meaningful geometry even in this challenging case.


Comparison of direct pose prediction and PnP-based:

We argue that the proposed framework involving differentiable PnP favors the discovery of correct 3D geometry. To demonstrate this, we provide qualitative samples from a model using PnP and from one that predicts the pose of each part directly with a neural network. In this experiment, we use the result of the G-phase, where only a single part is learned. The Direct baseline learns flat geometry, while our PnP-based method produces plausible geometry with fine details, including hair and wrinkles.
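In the PnP-based variant, the network predicts 2D keypoints for learned canonical 3D points, and the pose is recovered from these correspondences, whereas the Direct variant regresses the pose with a network head. Below is a minimal, simplified stand-in for the differentiable solve: a DLT estimate of a 3×4 projection matrix via SVD in PyTorch, through which a reprojection loss can back-propagate into the keypoint predictor. The names and shapes are our assumptions, and the paper uses a dedicated differentiable PnP solver rather than this plain DLT.

import torch

def dlt_projection(points_3d, points_2d):
    """Estimate a 3x4 projection matrix from >= 6 2D-3D correspondences via DLT.

    points_3d: (N, 3) canonical 3D keypoints
    points_2d: (N, 2) predicted 2D keypoints
    The SVD-based null-space solve is differentiable, so a reprojection loss on
    the result can back-propagate into the keypoint predictor.
    """
    n = points_3d.shape[0]
    X = torch.cat([points_3d, points_3d.new_ones(n, 1)], dim=1)  # (N, 4) homogeneous
    u, v = points_2d[:, :1], points_2d[:, 1:]
    zeros = torch.zeros_like(X)
    rows_u = torch.cat([X, zeros, -u * X], dim=1)                # (N, 12)
    rows_v = torch.cat([zeros, X, -v * X], dim=1)                # (N, 12)
    A = torch.cat([rows_u, rows_v], dim=0)                       # (2N, 12)
    # Null-space solution: the right singular vector with the smallest singular value.
    _, _, vh = torch.linalg.svd(A)
    return vh[-1].reshape(3, 4)

def reproject(P, points_3d):
    """Project 3D points with P and dehomogenize to 2D."""
    X = torch.cat([points_3d, points_3d.new_ones(points_3d.shape[0], 1)], dim=1)
    x = X @ P.T
    return x[:, :2] / x[:, 2:]

In this simplified setup, a reprojection loss between reproject(dlt_projection(points_3d, points_2d), points_3d) and points_2d would provide the training signal.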


Random generation:

Our framework learns a canonical latent space, allowing us to synthesize new, unseen identities. Here we sample random identities from the latent space (i.e., use standard normal noise as the embedding) and animate them using driving sequences from the test set. Note that our framework is auto-decoder-based; hence, the texture quality is inferior to that of GAN-based methods. Recall that our framework is non-adversarial and is trained using reconstruction losses only. Despite this limitation, the method generates reasonable geometry even for random samples.
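Concretely, sampling a new identity amounts to drawing the identity embedding from a standard normal distribution instead of looking up a learned per-video code. The sketch below illustrates this; the embedding size and the decoder call are hypothetical placeholders rather than the actual interface.

import torch

embedding_dim = 512               # assumed latent size, not necessarily the paper's value
num_samples = 4

# Random identities: standard normal noise in place of learned per-identity embeddings.
z = torch.randn(num_samples, embedding_dim)

# canonical_objects = decoder(z)  # hypothetical call: decode each embedding into a canonical
#                                 # volumetric object, which is then driven by poses estimated
#                                 # from a test driving sequence.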


BibTeX

@article{siarohin2023unsupervised,
    author  = {Siarohin, Aliaksandr and Menapace, Willi and Skorokhodov, Ivan and Olszewski, Kyle and Lee, Hsin-Ying and Ren, Jian and Chai, Menglei and Tulyakov, Sergey},
    title   = {Unsupervised Volumetric Animation},
    journal = {arXiv preprint arXiv:2301.11326},
    year    = {2023},
}