T2Bs: Text-to-Character Blendshapes via Video Generation

ICCV 2025
Jiahao Luo, Chaoyang Wang, Michael Vasilkovsky, Vladislav Shakhrai, Di Liu, Peiye Zhuang, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee, James Davis, Jian Wang
Snap Inc., University of California, Santa Cruz

Text-to-character blendshapes (T2Bs) creates animatable blendshapes that synthesize diverse expressions for a virtual character generated solely from text prompts.

Abstract

We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack motion synthesis, while video diffusion models generate motion with temporal and multi-view geometric inconsistencies. T2Bs bridges this gap by leveraging deformable 3D Gaussian splatting to align static 3D assets with video outputs. By constraining motion with static geometry and employing a view-dependent deformation MLP, T2Bs (i) outperforms existing 4D generation methods in accuracy and expressiveness while reducing video artifacts and view inconsistencies, and (ii) reconstructs smooth, coherent, fully registered 3D geometries designed to scale for building morphable models with diverse, realistic facial motions. This enables synthesizing expressive, animatable character heads that surpass current 4D generation techniques.

Method

In the first part, we generate multi-view videos from text prompts. A static 3D mesh is first created with an off-the-shelf text-to-3D generator~\cite{trellis3d}, followed by rendering a fixed-time video with the camera moving along a circular path. We define a canonical view (v=0) and use an augmented prompt to generate a fixed-view video. A 4D video generation method is then applied to produce multi-view videos.

In the second part, starting from the static 3D asset, we define static Gaussians $G_{0, 0}$, control points $p$, and blending weights $w$. During deformation, we predict view-dependent transformations of the control points to model local non-rigid deformations, along with a global transformation to capture overall pose changes. Gaussian positions and orientations are interpolated with Linear Blend Skinning (LBS), and rendering is optimized by minimizing an image-space loss. After training, we extract a mesh for each frame, defined in the canonical view (v=0). We repeat this process with multiple prompts and build a blendshape model from hundreds of samples.
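To make the deformation step concrete, the sketch below shows an LBS-style blend of per-control-point rigid transforms applied to Gaussian centers and orientations. It is a minimal NumPy illustration, not the paper's implementation: the view-dependent MLP that predicts the control-point transforms, the image-space optimization, and all function and variable names here are assumptions for exposition, and rotations are blended by matrix averaging with re-orthonormalization as a simplification.

```python
# Minimal sketch of LBS deformation of 3D Gaussians driven by control points.
# Names and the rotation-blending scheme are illustrative assumptions.
import numpy as np

def lbs_deform(x, q, p, w, R_ctrl, t_ctrl, R_glob, t_glob):
    """Deform Gaussian centers and orientations with Linear Blend Skinning.

    x      : (N, 3)    Gaussian centers in the canonical (static) frame
    q      : (N, 3, 3) Gaussian orientations as rotation matrices
    p      : (K, 3)    control-point positions
    w      : (N, K)    blending weights (each row sums to 1)
    R_ctrl : (K, 3, 3) local rotation per control point (e.g. predicted per view)
    t_ctrl : (K, 3)    local translation per control point
    R_glob : (3, 3)    global rotation for the overall pose change
    t_glob : (3,)      global translation
    """
    # Rigid motion of each Gaussian as seen from every control point:
    # rotate about the control point, then translate it.
    local = np.einsum('kij,nkj->nki', R_ctrl, x[:, None, :] - p[None]) \
            + p[None] + t_ctrl[None]               # (N, K, 3)
    x_def = np.einsum('nk,nki->ni', w, local)      # LBS blend of positions

    # Blend the control-point rotations and re-orthonormalize via SVD
    # (a simplification; quaternion blending is a common alternative).
    R_blend = np.einsum('nk,kij->nij', w, R_ctrl)  # (N, 3, 3)
    U, _, Vt = np.linalg.svd(R_blend)
    R_blend = U @ Vt
    q_def = R_blend @ q                            # rotate Gaussian frames

    # Apply the global transform for the overall pose.
    x_def = x_def @ R_glob.T + t_glob
    q_def = R_glob[None] @ q_def
    return x_def, q_def
```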
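Because every extracted mesh shares the canonical topology, a blendshape model can be assembled from the registered per-frame geometries. The sketch below uses PCA over stacked vertex positions, which is the standard construction for linear morphable models; whether T2Bs builds its model exactly this way is not specified here, so treat the formulation and names as assumptions.

```python
# Illustrative PCA-based blendshape model over registered meshes (assumption).
import numpy as np

def build_blendshape_model(vertices, n_components=50):
    """vertices: (S, V, 3) registered meshes (same topology) from S samples."""
    S, V, _ = vertices.shape
    X = vertices.reshape(S, V * 3)
    mean = X.mean(axis=0)                                  # mean / neutral shape
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]                              # (C, 3V) blendshape directions
    return mean.reshape(V, 3), basis.reshape(-1, V, 3), s[:n_components]

def synthesize(mean, basis, coeffs):
    """New expression = mean shape + weighted sum of blendshape directions."""
    return mean + np.tensordot(coeffs, basis, axes=1)      # (V, 3)
```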

BibTeX

@inproceedings{luo2025t2bs,
  title     = {T2Bs: Text-to-Character Blendshapes via Video Generation},
  author    = {Luo, Jiahao and Wang, Chaoyang and Vasilkovsky, Michael and Shakhrai, Vladislav and Liu, Di and Zhuang, Peiye and Tulyakov, Sergey and Wonka, Peter and Lee, Hsin-Ying and Davis, James and Wang, Jian},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}