Text-to-character blendshapes (T2Bs) creates animatable blendshapes that synthesize diverse expressions of a virtual character generated solely from text prompts. We show the geometry of sample eigen expressions of four characters in the following videos.
We visualize the first 10 eigen expressions of sample virtual character models in the following video. Note that for later model evaluation and retargeting we use 100 eigen expressions.
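For context, the convention assumed here is the standard linear blendshape model: an expression mesh is the neutral geometry plus a weighted sum of eigen-expression offsets. The sketch below is a minimal illustration of that convention; the function name and array shapes are our own assumptions, not part of any released code.

```python
import numpy as np

def blend_expression(neutral, eigen_expressions, weights):
    """Combine eigen expressions into a single expressive mesh.

    neutral:            (V, 3) vertex positions of the neutral (static) mesh
    eigen_expressions:  (K, V, 3) per-vertex offsets of K eigen expressions
    weights:            (K,) blend weights, one per eigen expression
    """
    # Weighted sum of the K offset fields -> a single (V, 3) offset field.
    offsets = np.tensordot(weights, eigen_expressions, axes=1)
    return neutral + offsets
```

For example, activating only the k-th eigen expression (a one-hot weight vector) reproduces the individual eigen expressions shown in the video above.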
A robust statistical expression model should generalize well to new data while remaining closely aligned with the specific object it represents. We fit the T2Bs model to captures outside the model's training set, compare the rendered Gaussians and the mesh geometry, and report the errors below. For each identity, the first column shows the model fitting, the second column a held-out video capture, and the third column the image-space and object-space reconstruction error maps. The blue-to-yellow color scale represents the RGB error, ranging from 0 to 0.5, while the green-to-red scale denotes the 3D point-to-point error, ranging from 0 to 1/500 of the bounding box size, which approximates the maximum possible distance within the geometry. The learned blendshapes faithfully reconstruct meshes with held-out expressions.
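For clarity, the two error maps can be computed as follows. This is a minimal sketch matching the normalization described above; the function name and array layout are our own assumptions.

```python
import numpy as np

def reconstruction_errors(rendered_rgb, captured_rgb, fitted_verts, gt_verts):
    """Image-space RGB error and object-space point-to-point error."""
    # Image-space error: mean absolute RGB difference per pixel, in [0, 1].
    rgb_error = np.abs(rendered_rgb - captured_rgb).mean(axis=-1)

    # Object-space error: distance between corresponding vertices, normalized
    # by the bounding-box diagonal of the ground-truth geometry.
    bbox_size = np.linalg.norm(gt_verts.max(axis=0) - gt_verts.min(axis=0))
    p2p_error = np.linalg.norm(fitted_verts - gt_verts, axis=-1) / bbox_size

    # Clip to the color-scale ranges used in the error maps:
    # RGB error in [0, 0.5], point-to-point error in [0, 1/500] of bbox size.
    return np.clip(rgb_error, 0.0, 0.5), np.clip(p2p_error, 0.0, 1.0 / 500.0)
```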
We further demonstrate the expressiveness of our model with a simple landmark-based retargeting approach, in which we align the human (FLAME) landmarks with 20 annotated animal landmarks on the eye and mouth regions. We show retargeting results both with the eigen expressions of a human model (FLAME) and with real facial captures from videos.
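One simple way to realize such landmark-based retargeting is a regularized least-squares fit: solve for blend weights whose induced landmark displacements best match the aligned source (FLAME) landmark displacements. The sketch below illustrates that idea; the ridge regularizer, function name, and array layout are our own assumptions rather than the exact solver used here.

```python
import numpy as np

def retarget_weights(src_lm_offsets, char_lm_eigen, l2_reg=1e-3):
    """Solve for blendshape weights so the character's landmark displacements
    match the aligned source landmark displacements.

    src_lm_offsets: (L, 3) displacements of the L=20 aligned source landmarks
                    relative to the neutral pose
    char_lm_eigen:  (K, L, 3) landmark displacements produced by each of the
                    K character eigen expressions
    """
    K = char_lm_eigen.shape[0]
    A = char_lm_eigen.reshape(K, -1).T      # (3L, K) linear system
    b = src_lm_offsets.reshape(-1)          # (3L,)
    # Ridge-regularized least squares keeps the weights well-behaved.
    return np.linalg.solve(A.T @ A + l2_reg * np.eye(K), A.T @ b)
```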
Qualitative comparison of 4D generation methods. We compare the 4D generation results of our method with DreamGaussian4D (DG4D) [38], SV4D [64], and 4Real-Video [53]. All methods take a monocular video as input. In addition, DG4D incorporates our accurate static Gaussian representation, whereas SV4D and 4Real-Video rely on freeze-time renderings. Our method takes the results of 4Real-Video as input and further enhances novel view synthesis. Notably, none of the baseline methods produce high-quality 3D geometry, whereas ours does. Viewpoints are displayed at ±60 degrees relative to the original perspective (frontal view) used for generating the monocular video. We also show the rendering of the static mesh from the same viewpoint as a 'static reference'. Among all methods, our approach achieves the most visually consistent and appealing results.
A qualitative comparison between VCDGS with and without camera view information as input. Without view dependency, the model produces noticeable artifacts, as it struggles to regress a coherent geometry from the 3D inconsistencies across views in the multiview videos used during the 4D generation process.
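As one concrete reading of "camera view information as input", view dependency can be added by concatenating an encoding of the camera viewing direction to the per-Gaussian features fed to the deformation network, so view-specific inconsistencies in the guidance videos can be absorbed by the view branch instead of corrupting the shared geometry. The snippet below is a toy sketch under that assumption; the class name, layer sizes, and raw view-direction encoding are placeholders, not VCDGS's actual architecture.

```python
import torch
import torch.nn as nn

class ViewConditionedDeform(nn.Module):
    """Toy deformation head with an optional view-direction input."""
    def __init__(self, feat_dim=64, view_dim=3, use_view=True):
        super().__init__()
        self.use_view = use_view
        in_dim = feat_dim + (view_dim if use_view else 0)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 3),  # per-Gaussian position offset
        )

    def forward(self, feats, view_dirs=None):
        # With use_view=True, each per-Gaussian feature is concatenated with
        # the camera viewing direction before regressing the deformation.
        x = torch.cat([feats, view_dirs], dim=-1) if self.use_view else feats
        return self.mlp(x)
```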
We optimize VCDGS using multiple videos generated by 4Real-Video. To evaluate its effectiveness, we conduct an ablation study using relatively lower-quality 4D videos as guidance. Specifically, we train VCDGS on the output of SV4D while keeping all other experimental settings unchanged. Figure~\ref{fig:sv4d} demonstrates the improvements achieved by incorporating VCDGS into SV4D. In our application, SV4D tends to produce blurry novel views. Our model not only remains robust to these blurry inputs but also generates high-quality renderings with well-defined geometry.
For scalability, we use pre-defined control points from the static asset rather than jointly optimizing them with all expression videos. This allows new expression videos to be incorporated without re-optimizing previously processed videos; the modular approach avoids computationally expensive joint optimization across an ever-growing dataset. We use 2000 control points obtained by uniformly sampling the static mesh. We show a parameter analysis of the number of control points in figure X. With 2000 control points, fine-grained motions such as tongue (1st row) and eyelid (2nd row) movements are captured better than with 200, 500, or 1000 control points.
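An approximately uniform set of control points can be obtained, for example, with farthest-point sampling over the mesh vertices; the sketch below shows one such implementation. Farthest-point sampling, the function name, and the NumPy layout are our assumptions for illustration, not necessarily the sampling scheme used here.

```python
import numpy as np

def sample_control_points(vertices, num_points=2000, seed=0):
    """Farthest-point sampling over mesh vertices as one way to pick an
    approximately uniform set of control points on the static mesh.

    vertices:   (V, 3) mesh vertex positions
    num_points: number of control points to select (2000 in our setup)
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(vertices)))]
    # Distance from every vertex to the closest selected control point so far.
    dist = np.linalg.norm(vertices - vertices[selected[0]], axis=1)
    for _ in range(num_points - 1):
        idx = int(dist.argmax())  # vertex farthest from the current selection
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(vertices - vertices[idx], axis=1))
    return np.asarray(selected)   # indices of the selected control points
```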