*1024×576 samples and 288×512 samples are generated from the server model and the mobile model, respectively.
VAE | PSNR | A100 (ms) | iPhone (ms) | Total | Quality | Semantic | Aesthetic | Consistency | Flickering |
---|---|---|---|---|---|---|---|---|---|
4×16×16 | 33.1 | 7900 | ✗ | 80.35 | 82.05 | 73.54 | 64.45 | 26.80 | 98.59 |
4×32×32 | 30.9 | 920 | 3315 | 79.95 | 82.99 | 67.83 | 61.52 | 27.07 | 97.46 |
8×32×32 | 30.6 | 380 | 880 | 79.80 | 82.59 | 68.66 | 61.80 | 27.17 | 97.70 |
8×64×64 | 28.2 | 90 | 155 | 78.40 | 81.79 | 64.86 | 55.29 | 26.11 | 97.52 |
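To make the compression labels in the table above concrete, the sketch below computes the latent grid each VAE would produce for a clip. It assumes the label reads as temporal×height×width downsampling factors and that sizes are divided with simple ceiling division. The helper names `latent_grid` and `latent_positions` and the 16-frame clip length are illustrative assumptions, not details taken from the paper.

```python
import math


def latent_grid(frames, height, width, ratio):
    """Latent (T, H, W) grid for a compression label such as '8×32×32'.

    Assumption: the label is temporal×height×width downsampling, and
    non-divisible sizes are rounded up (ceiling division).
    """
    ft, fh, fw = (int(p) for p in ratio.replace("×", "x").split("x"))
    return (math.ceil(frames / ft), math.ceil(height / fh), math.ceil(width / fw))


def latent_positions(frames, height, width, ratio):
    """Total number of latent positions for the given clip and label."""
    t, h, w = latent_grid(frames, height, width, ratio)
    return t * h * w


if __name__ == "__main__":
    # Hypothetical 16-frame clip at the 1024×576 server resolution.
    for ratio in ["4×16×16", "4×32×32", "8×32×32", "8×64×64"]:
        grid = latent_grid(16, 576, 1024, ratio)
        print(ratio, grid, latent_positions(16, 576, 1024, ratio))
```

Under these assumptions, moving from 4×16×16 to 8×64×64 shrinks the latent grid by a factor of 32, which is in line with the sharp drop in decoding latency reported in the table.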
This section compares videos generated by our method with those of the baselines discussed in the paper.
LTX-Video-2B[1] | CogVideoX-2B[2] | Wan2.1-1.3B[3] | Ours |
---|---|---|---|
Prompt: 3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bush. | | | |
Prompt: A cat sitting at a grand piano, elegantly playing a classical piece with its paws. | | | |
Prompt: A corgi vlogging itself in tropical Maui. | | | |
Prompt: A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors. | | | |
Prompt: A pair of lovebirds preening each other's feathers. | | | |
Prompt: A skeleton wearing a flower hat and sunglasses dances in the wild at sunset. | | | |
[1] HaCohen, Yoav, et al. "LTX-Video: Realtime Video Latent Diffusion." https://arxiv.org/abs/2501.00103 (2024).
[2] Yang, Zhuoyi, et al. "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." ICLR (2025).
[3] Team Wan, et al. "Wan: Open and Advanced Large-Scale Video Generative Models." https://arxiv.org/abs/2503.20314 (2025).

*Videos are generated using the Diffusers v0.33.1 implementation.

Model | Params (B) | A100 (FPS)* | iPhone (FPS)* | VBench ↑ |
---|---|---|---|---|
Wan2.1 | 1.3 | 0.2 | ✗ | 83.33 |
CogVideoX-2B | 1.6 | 0.45 | ✗ | 80.91 |
LTX-Video | 1.8 | 6.1 | ✗ | 80.00 |
Ours (Server) | 2.0 | 6.4 | ✗ | 83.09 |
Ours (Mobile) | 0.9 | 151.3 | 12.2 | 81.45 |
*: Latency is benchmarked on an NVIDIA A100-SXM4-80GB GPU and an iPhone 16 Pro Max, per denoising step.