*1024×576 samples and 288×512 samples are generated from the server model and the mobile model, respectively.
                            
                                
| VAE | PSNR (dB) | Latency (ms) | Total | Quality | Semantic | Aesthetic | Consistency | Flickering |
|---|---|---|---|---|---|---|---|---|
| 4×16×16 | 33.1 | 7900 | 80.35 | 82.05 | 73.54 | 64.45 | 26.80 | 98.59 | 
| 4×32×32 | 30.9 | 920 | 79.95 | 82.99 | 67.83 | 61.52 | 27.07 | 97.46 | 
| 8×32×32 | 30.6 | 380 | 79.80 | 82.59 | 68.66 | 61.80 | 27.17 | 97.70 | 
| 8×64×64 | 28.2 | 90 | 78.40 | 81.79 | 64.86 | 55.29 | 26.11 | 97.52 | 
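
To make the latency trend in the table easier to read, the following minimal sketch estimates how many latent positions each VAE setting leaves for the diffusion model to process. It assumes the VAE labels denote temporal × height × width compression factors, which is an interpretation and is not stated on this page. The 48-frame clip length and the helper names `latent_grid` and `latent_positions` are hypothetical; the 1024×576 resolution is the server sample size noted above. Real VAEs may treat the first frame or padding differently, so the counts are illustrative only.

```python
from math import ceil

def latent_grid(frames, height, width, ct, ch, cw):
    """Latent (T, H, W) after dividing each axis by its compression factor.

    Assumption: plain ceiling division on every axis. Actual VAEs may
    handle the first frame or padding differently.
    """
    return ceil(frames / ct), ceil(height / ch), ceil(width / cw)

def latent_positions(frames, height, width, ct, ch, cw):
    """Total number of spatio-temporal latent positions."""
    t, h, w = latent_grid(frames, height, width, ct, ch, cw)
    return t * h * w

# Compression settings from the table, read as (temporal, height, width).
configs = {
    "4×16×16": (4, 16, 16),
    "4×32×32": (4, 32, 32),
    "8×32×32": (8, 32, 32),
    "8×64×64": (8, 64, 64),
}

# Hypothetical clip: 48 frames at 1024×576 (width × height).
FRAMES, HEIGHT, WIDTH = 48, 576, 1024

for name, factors in configs.items():
    grid = latent_grid(FRAMES, HEIGHT, WIDTH, *factors)
    count = latent_positions(FRAMES, HEIGHT, WIDTH, *factors)
    print(f"{name}: latent grid {grid}, {count} positions")
```

Under these assumptions, each step up in compression shrinks the latent grid by a factor of 2 to 4, which is consistent in direction with the falling latency and PSNR in the table.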

This section presents video comparisons between our method and the baselines discussed in the paper.

| LTX-Video-2B [1] | CogVideoX-2B [2] | Wan2.1-1.3B [3] | Ours |
|---|---|---|---|
| Prompt: 3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bush. | | | |
| Prompt: A cat sitting at a grand piano, elegantly playing a classical piece with its paws. | | | |
| Prompt: A corgi vlogging itself in tropical Maui. | | | |
| Prompt: A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors. | | | |
| Prompt: A pair of lovebirds preening each other's feathers. | | | |
| Prompt: A skeleton wearing a flower hat and sunglasses dances in the wild at sunset. | | | |

[1] HaCohen, Yoav, et al. "LTX-Video: Realtime Video Latent Diffusion." https://arxiv.org/abs/2501.00103 (2024).

[2] Yang, Zhuoyi, et al. "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." ICLR (2025).

[3] Team Wan, et al. "Wan: Open and Advanced Large-Scale Video Generative Models." https://arxiv.org/abs/2503.20314 (2025).

*Videos are generated using the Diffusers v0.33.1 implementation.

| Model | Params (B) | A100 (FPS)* | iPhone (FPS)* | VBench ↑ |
|---|---|---|---|---|
| Wan2.1 | 1.3 | 0.2 | ✗ | 83.33 | 
| CogVideoX-2B | 1.6 | 0.45 | ✗ | 80.91 | 
| LTX-Video | 1.8 | 6.1 | ✗ | 80.00 | 
| Ours (Server) | 2.0 | 6.4 | ✗ | 83.09 | 
| Ours (Mobile) | 0.9 | 151.3 | ~15 | 81.45 | 

*: Latency is benchmarked on an NVIDIA A100-SXM4-80GB GPU and an iPhone 16 Pro Max, per denoising step.