SnapMoGen: Human Motion Generation from Expressive Texts

1 Snap Inc.  2 Seoul National University
Project Lead


Abstract


Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (versus 12 words in HumanML3D). Importantly, the motion clips preserve the temporal continuity of the original long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both the HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to handle casual user prompts by employing an LLM to reformat inputs so they align with the expressivity and narration style of SnapMoGen.



Gallery of SnapMoGen Dataset

* Motion clips are temporally continuous.

Approach Overview
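
The core idea, as described in the abstract, is to tokenize motion into multi-scale token sequences and generate all of them with a single masked transformer. As a rough, hypothetical sketch of the tokenization step (not the released MoMask++ implementation), the snippet below quantizes an encoder latent coarse-to-fine, with each scale encoding the residual left over by the coarser scales; the codebook shape, scale schedule, and linear interpolation are all our assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_quantize(latent, codebook, scales):
    """Coarse-to-fine residual quantization of a motion latent (sketch only).
    latent: (C, T) encoder output; codebook: (K, C); scales: increasing
    temporal lengths with scales[-1] == T."""
    C, T = latent.shape
    residual = latent.clone()
    token_ids = []
    for t in scales:
        # Downsample the current residual to this scale's temporal length.
        r = F.interpolate(residual.unsqueeze(0), size=t, mode="linear",
                          align_corners=False).squeeze(0)        # (C, t)
        # Nearest-neighbour codebook lookup.
        ids = torch.cdist(r.T, codebook).argmin(dim=1)           # (t,)
        q = codebook[ids].T                                      # (C, t)
        # Upsample the quantized approximation and peel it off the residual,
        # so the next (finer) scale only encodes what is still missing.
        q_full = F.interpolate(q.unsqueeze(0), size=T, mode="linear",
                               align_corners=False).squeeze(0)   # (C, T)
        residual = residual - q_full
        token_ids.append(ids)
    return token_ids  # one id sequence per scale, shortest to longest
```

In such a scheme the per-scale id sequences can be concatenated into one sequence, consistent with the abstract's description of a single generative masked transformer learning to generate all tokens.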


MoMask++ Generation Results

* Unless otherwise mentioned, prompts are rewritten into expressive text descriptions before being fed into MoMask++.
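
Since MoMask++ is trained on SnapMoGen's long, expressive captions, casual prompts are first rewritten by an LLM. The page does not specify which model or instruction is used, so the following is only one plausible way to wire this up (shown here against the OpenAI chat API; the model name and system instruction are placeholders of our own making):

```python
from openai import OpenAI  # any instruction-tuned LLM would do; this is one choice

# Hypothetical rewriting instruction, not the authors' actual prompt.
REWRITE_INSTRUCTION = (
    "Rewrite the user's motion prompt into a detailed, expressive description "
    "of full-body movement (roughly 40-60 words), matching the narration style "
    "of SnapMoGen captions: concrete body parts, dynamics, and timing."
)

def rewrite_prompt(casual_prompt: str) -> str:
    """Sketch of the prompt-rewriting step."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": casual_prompt},
        ],
    )
    return resp.choices[0].message.content

# e.g. rewrite_prompt("Walking like a robot.") -> an expressive ~50-word caption
```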

Ablation Analysis

Impact on VQ Reconstruction

We investigate how the number of residual layers and the number of tokens affect VQ reconstruction quality, and compare our method against the 6-layer RVQ used in MoMask. Token counts are computed for the encoding of a 320-frame motion sequence. The results show that our approach captures high-fidelity motion details as layers and tokens are added, modeling holistic motion patterns better than MoMask's RVQ.
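
For concreteness, the arithmetic behind such token counts might look as follows, under assumed numbers: a 4x temporal downsampling in the VQ encoder (so 320 frames yield 80 latent steps) and an illustrative halving scale schedule; neither figure is taken from the paper.

```python
FRAMES, DOWNSAMPLE = 320, 4          # assumed 4x temporal downsampling
base_len = FRAMES // DOWNSAMPLE      # 80 tokens at the finest scale

# 6-layer residual VQ (as in MoMask): every layer runs at full length.
rvq_tokens = 6 * base_len            # 6 * 80 = 480 tokens

# Multi-scale VQ (illustrative schedule): each coarser scale halves in length.
scales = [base_len >> k for k in reversed(range(5))]   # [5, 10, 20, 40, 80]
msvq_tokens = sum(scales)            # 155 tokens

print(f"RVQ: {rvq_tokens} tokens, multi-scale: {msvq_tokens} tokens")
```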

Generation

We analyze the effect of residual tokens, multi-scale quantization, and prompt rewriting on final motion generation quality. As shown below, using only a single VQ token sequence (w/o residual VQ) or multiple full-scale token sequences of the same length (w/o multi-scale VQ) limits the model's understanding of nuanced text prompts. Moreover, when casual user prompts are fed to the model directly (w/o prompt rewriting), generation exhibits significant semantic degradation. A minimal sketch of the masked decoding loop shared by all these variants follows the examples below.

"Someone pretends to be a bird taking flight."
"The person crouches low with knees bent and arms extended sideways like wings. They begin with small hops, gradually increasing height and breadth of their arm flaps. Their torso leans forward as they simulate taking off, rising onto the balls of their feet and stretching limbs outward. Movements are fluid and soaring, embodying the effort and grace of flight."
Ours
w/o prompt rewriting
w/o residual vq
w/o multi-scale vq

"Walking like a robot."
"The person walks in rigid, mechanical fashion. Each leg lifts unnaturally high and plants down flat. Arms swing stiffly at 90-degree angles, pausing slightly between each step. Their torso remains upright with minimal rotation. Occasionally, they make jerky turns or freeze mid-step, mimicking the exact, unnatural cadence of a malfunctioning robot."
Ours
w/o prompt rewriting
w/o residual vq
w/o multi-scale vq
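
All of the above variants are decoded by the same generative masked transformer. For readers unfamiliar with this decoding style, here is a minimal confidence-based iterative unmasking loop in the spirit of MaskGIT-style generative masked modeling; the transformer call signature, mask-token handling, and cosine schedule are generic placeholders rather than the released MoMask++ code.

```python
import math
import torch

@torch.no_grad()
def masked_decode(transformer, text_emb, seq_len, num_codes, steps=10):
    """Confidence-based iterative unmasking (sketch). `transformer` is any
    model mapping (token ids, text embedding) -> per-token logits."""
    MASK = num_codes                         # reserve one extra id for [MASK]
    tokens = torch.full((1, seq_len), MASK, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = transformer(tokens, text_emb)          # (1, seq_len, num_codes)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence
        pred = torch.where(tokens == MASK, pred, tokens)  # keep committed ids
        conf = torch.where(tokens == MASK, conf, torch.full_like(conf, math.inf))
        # Cosine schedule: the number of still-masked positions shrinks to 0.
        n_mask = int(seq_len * math.cos(math.pi / 2 * step / steps))
        if n_mask > 0:
            # Re-mask the n_mask least-confident positions; commit the rest.
            low_conf = conf.topk(n_mask, largest=False).indices  # (1, n_mask)
            pred[0, low_conf[0]] = MASK
        tokens = pred
    return tokens                            # (1, seq_len) generated token ids
```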

Comparisons


We show one example (#1) from the SnapMoGen test set and two examples using in-the-wild user prompts (#2, #3). For the latter two cases, all models take the rewritten prompts as input.

Real-world Application


SnapMoGen has led to the launch of a text2motion feature in LensStudio (v5.11.0) of Snap VR.

Limitations


"A person is dramatically dodging laser beams while crawling forward."

"The person drops low, crawling forward on hands and knees with urgency. They weave their torso and duck their head side to side as if narrowly avoiding invisible laser beams. Arms stretch out to maintain balance while legs push powerfully, body tense with alertness. Their movements are fluid but deliberate, moving forward cautiously with sharp, sudden dodges."
"Motion artifacts persist (e.g., sliding, jittering)."
"Skipping forward while juggling imaginary balls."

"The person skips forward energetically, bouncing on alternating feet with light, rhythmic hops. Their arms move in circular patterns as if juggling several invisible balls, tossing them from hand to hand. Their torso sways rhythmically, and they occasionally look upward or to the side to track the imaginary objects, ending the sequence with a playful spin."
"Missing semantic cues (e.g., junggling balls)."
"A yoga sun salutation."

"Standing tall, the person reaches both arms toward the sky with a deep inhale. They bend forward slowly at the waist, touching the ground with fingertips. Then they step one leg back into a lunge, lifting the arms overhead in a stretch. They transition into downward dog, hold it briefly, then step forward and return to standing."
"Fail on rare motions."

Related Motion Generation Works 🚀🚀


Text2Motion: Diverse text-driven motion generation using temporal variational autoencoder.
TM2T: Learning text2motion and motion2text reciprocally through discrete token and language model.
TM2D: Learning dance generation with textual instruction.
Action2Motion: Diverse action-conditioned motion generation.
MoMask: Generative masked modeling of 3D human motions.

BibTeX

NA