(Supplementary) T2Bs: Text-to-Character Blendshapes via Video Generation

1. T2Bs models

1.1 Blendshape Gallery

Text-to-character blendshapes (T2Bs) creates animatable blendshapes that synthesize diverse expressions of a virtual character generated solely from text prompts. We show the geometry of sample eigen expressions of 4 characters in the following videos, and include a minimal sketch of the linear blendshape evaluation after the video list.

Dog
Frog
Donkey
Fox
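
For concreteness, the blendshape evaluation can be sketched as a linear combination of the neutral geometry and per-expression vertex offsets; the array names, shapes, and random placeholders below are illustrative, not our actual implementation.

    import numpy as np

    def synthesize_expression(neutral_vertices, blendshape_deltas, weights):
        """Linear blendshape model: the neutral mesh plus a weighted sum of
        per-expression vertex offsets (eigen expressions).

        neutral_vertices:  (V, 3) rest-pose vertex positions
        blendshape_deltas: (K, V, 3) offsets of the K eigen expressions
        weights:           (K,) expression coefficients
        """
        w = np.asarray(weights).reshape(-1, 1, 1)                # (K, 1, 1)
        return neutral_vertices + (w * blendshape_deltas).sum(axis=0)

    # Illustrative usage with 100 eigen expressions, the number used later for
    # evaluation and retargeting; the geometry here is a random placeholder.
    V, K = 5000, 100
    neutral = np.zeros((V, 3))
    deltas = 0.01 * np.random.randn(K, V, 3)
    w = np.zeros(K); w[0] = 0.7                                  # first eigen expression
    expression_mesh = synthesize_expression(neutral, deltas, w)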

1.2 Web Demo

We show demos of the first 10 eigen expressions of sample virtual character models, as visualized in the following video. Note that for later model evaluation and retargeting we use 100 eigen expressions.





1.3 Model Expressiveness

A robust statistical expression model should generalize well to new data while remaining closely aligned with the specific object it represents. We fit the T2Bs model to captures outside the model's training set, compare the rendered Gaussians and the mesh geometry against the capture, and compute the errors shown below. For each identity, the first column is the model fitting, the second column is a held-out video capture, and the third column shows the image-space and object-space reconstruction error maps. The blue-to-yellow color scale represents the RGB error, ranging from 0 to 0.5, while the green-to-red scale denotes the 3D point-to-point error, ranging from 0 to 1/500 of the bounding box size. The bounding box size is approximately the maximum possible distance within the geometry. The learned blendshapes faithfully reconstruct meshes with held-out expressions.

Model fitting (Fig. 10)
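
The two error maps can be computed as sketched below; the array names are hypothetical, and the normalization and clipping simply mirror the color-scale ranges described above rather than our exact evaluation code.

    import numpy as np

    def image_space_error(rendered_rgb, captured_rgb):
        """Per-pixel RGB error between the rendered Gaussians and the held-out
        capture; inputs are (H, W, 3) arrays in [0, 1], and the map is clipped
        to the [0, 0.5] range of the blue-to-yellow color scale."""
        err = np.abs(rendered_rgb - captured_rgb).mean(axis=-1)
        return np.clip(err, 0.0, 0.5)

    def object_space_error(fitted_vertices, reference_vertices):
        """Point-to-point 3D error, normalized by the bounding-box diagonal
        (approximately the maximum possible distance within the geometry) and
        clipped to the [0, 1/500] range of the green-to-red color scale."""
        extent = reference_vertices.max(axis=0) - reference_vertices.min(axis=0)
        bbox_size = np.linalg.norm(extent)
        err = np.linalg.norm(fitted_vertices - reference_vertices, axis=-1)
        return np.clip(err / bbox_size, 0.0, 1.0 / 500.0)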

1.4 Retargeting

We further show the expressiveness of our model with a simple landmark-based retargeting approach, in which we align the human (FLAME) landmarks with 20 annotated animal landmarks in the eye and mouth regions. We show retargeting results both with the eigen expressions of a human model (FLAME) and with real facial captures from videos; a sketch of the landmark-based fit is included after the videos.

Model retargeting (FLAME eigen expressions)
Model retargeting (real capture)
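
The landmark-based fit can be sketched as a small regularized least-squares solve for the blendshape weights that best reproduce the displacements of the annotated eye/mouth landmarks; the formulation and variable names below are illustrative assumptions rather than our exact retargeting code.

    import numpy as np

    def retarget_expression(source_landmark_deltas, target_landmark_basis, reg=1e-3):
        """Map a source (e.g. FLAME) expression onto the character blendshapes.

        source_landmark_deltas: (L, 3) displacements of the L annotated
                                eye/mouth landmarks in the source expression,
                                after aligning the two landmark sets.
        target_landmark_basis:  (K, L, 3) displacements of the same landmarks
                                under each of the K target eigen expressions.
        Returns (K,) blendshape weights from a regularized least-squares fit.
        """
        K = target_landmark_basis.shape[0]
        A = target_landmark_basis.reshape(K, -1).T               # (3L, K)
        b = source_landmark_deltas.reshape(-1)                   # (3L,)
        # Tikhonov regularization keeps the weights small and the solve stable.
        return np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)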





2. View-conditioned Deformable Gaussian Splatting (VCDGS)


2.1 Comparison with baseline methods in novel view synthesis

Qualitative comparison of 4D generation methods. We compare the 4D generation results of our method with DreamGaussian4D (DG4D) [38], SV4D [64], and 4Real-Video [53]. All methods take a monocular video as input. In addition, DG4D incorporates our accurate static Gaussian representation, whereas SV4D and 4Real-Video rely on freeze-time renderings. Our method takes the results of 4Real-Video as input and further enhances novel view synthesis. Notably, none of the baseline methods produces high-quality 3D geometry, whereas ours does. Viewpoints are shown at ±60 degrees relative to the original perspective (frontal view) used to generate the monocular video. We also show the rendering of the static mesh from the same viewpoint as a 'static reference'. Among all methods, our approach achieves the most visually consistent and appealing results.


Monocular Input DG4D SV4D 4Real-Video Ours Static Reference






2.2 Ablation Studies

2.2.1 Ablation study on view dependency

A qualitative comparison between VCDGS with and without camera view information as input. The model without view dependency produces noticeable artifacts, as it struggles to regress a coherent geometry from the 3D inconsistencies across views in the multiview videos used during 4D generation.
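
As a rough illustration of what "camera view information as input" means here, the sketch below concatenates the viewing direction to the per-point input of a toy deformation MLP; layer sizes and names are placeholders and do not reflect the actual VCDGS architecture.

    import torch
    import torch.nn as nn

    class ViewConditionedDeformer(nn.Module):
        """Toy deformation network: predicts a per-control-point offset from the
        canonical position, a time/expression code, and the camera viewing
        direction. Dropping view_dir corresponds to the w/o-view-dependency
        ablation."""

        def __init__(self, code_dim=16, hidden=128):
            super().__init__()
            in_dim = 3 + code_dim + 3                # position + code + view dir
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3),                # 3D offset per point
            )

        def forward(self, points, code, view_dir):
            # points: (N, 3), code: (code_dim,), view_dir: (3,) unit vector
            n = points.shape[0]
            feat = torch.cat(
                [points, code.expand(n, -1), view_dir.expand(n, -1)], dim=-1)
            return points + self.net(feat)

Intuitively, conditioning on the view lets the deformation absorb view-specific inconsistencies in the generated multiview videos instead of forcing a single geometry to explain all of them.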

2.2.2 Ablation study on source of multi-view video

We optimize VCDGS using multiple videos generated by 4Real-Video. To evaluate its effectiveness, we conduct an ablation study using relatively lower-quality 4D videos as guidance. Specifically, we train VCDGS on the output of SV4D while keeping all other experimental settings unchanged. Figure 7 shows the improvements achieved by incorporating VCDGS into SV4D. In our application, SV4D tends to produce blurry novel views. Our model not only remains robust to blurry inputs but also generates high-quality renderings with well-defined geometry.


Ablation study on view dependency (Fig. 6)

Input Ours w/o view dependency
Ablation study on source of multi-view video (Fig. 7)

Input SV4D SV4D + VCDGS




2.2.3 Ablation study on number of control points

We use pre-defined control points from the static asset rather than jointly optimizing them with all expression videos, for scalability. This allows new expression videos to be incorporated without re-optimizing previously processed videos, and the modular approach avoids computationally expensive joint optimization across an ever-growing dataset. We use 2000 control points obtained by uniformly sampling the static mesh. We show a parameter analysis on the number of control points in the figure below, followed by a sketch of one possible sampling strategy. 2000 control points capture fine-grained motions, such as tongue (1st row) and eyelid (2nd row) movements, better than 200, 500, or 1000 control points.

Ablation study on number of control points

Input 200 500 1000 2000
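
For illustration, approximately uniform control points can be obtained with greedy farthest-point sampling over the static mesh vertices; this sketch assumes vertex-level sampling is sufficient and is not necessarily the exact sampling used in our pipeline.

    import numpy as np

    def farthest_point_sampling(vertices, n_points=2000, seed=0):
        """Pick n_points approximately uniformly spread control points from the
        static mesh vertices via greedy farthest-point sampling."""
        rng = np.random.default_rng(seed)
        chosen = np.empty(n_points, dtype=np.int64)
        chosen[0] = rng.integers(vertices.shape[0])
        # Distance from every vertex to its closest already-chosen control point.
        dist = np.linalg.norm(vertices - vertices[chosen[0]], axis=1)
        for i in range(1, n_points):
            chosen[i] = int(dist.argmax())           # farthest remaining vertex
            dist = np.minimum(
                dist, np.linalg.norm(vertices - vertices[chosen[i]], axis=1))
        return vertices[chosen]

    # Placeholder mesh; compare the control-point counts used in the ablation.
    verts = np.random.rand(20000, 3)
    controls = {k: farthest_point_sampling(verts, n_points=k)
                for k in (200, 500, 1000, 2000)}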