Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, streaming video generation on mobile hardware. S2DiT generates more tokens but stays efficient by mixing two efficient attention modules: LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Building on these modules, we uncover the sandwich design via a budget-aware dynamic programming search, improving both quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, these components let S2DiT achieve quality on par with state-of-the-art server video models while streaming at over 10 FPS on an iPhone.
Unlike prior mobile video diffusion systems that rely on extremely compressed latent spaces, S2DiT operates in a moderately compressed latent representation that preserves more spatial and temporal detail. This choice improves fidelity but significantly increases the number of latent tokens, making conventional DiT architectures too slow for mobile deployment.

An Efficient Sandwich Diffusion Transformer. Our approach addresses this challenge through a structured architectural pattern that alternates high-resolution and low-resolution processing stages. At a high level, Sandwich DiT interleaves two complementary attention modules: LinConv Hybrid Attention (LCHA) for high-resolution modeling that preserves detail at linear cost, and Stride Self-Attention (SSA) for low-resolution modeling that aggregates global context at reduced token count. The architectural layout and the precise allocation of these modules are determined through a budget-aware search.

2-in-1 Distillation Pipeline. In addition, distillation from a powerful teacher model plays a central role, supplying rich semantic and structural guidance that leads to better visual quality and motion consistency. On top of Sandwich DiT, we introduce a two-in-one distillation pipeline that (i) aligns the student with a strong teacher model through offline cached distillation, and (ii) incorporates Distribution-Matching Distillation (DMD) and self-forcing to support few-step auto-regressive generation.
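To make the idea of a budget-aware search concrete, the sketch below solves a small multiple-choice knapsack by dynamic programming: each layer picks one block type, and the total cost must stay within a budget. This is a generic illustration, not the search procedure used for S2DiT. The function name `search_layout`, the per-layer costs, and the quality scores are all hypothetical placeholders; in practice such numbers would have to come from measured latency and some quality proxy.

```python
from typing import Dict, List, Tuple

# One entry per layer: block name -> (integer cost, quality score).
# All values passed in are hypothetical; the source does not report them.
LayerOptions = Dict[str, Tuple[int, float]]


def search_layout(options: List[LayerOptions], budget: int) -> Tuple[float, List[str]]:
    """Choose one block per layer to maximize total score with total cost <= budget."""
    neg_inf = float("-inf")
    # best[b] = best score over the layers seen so far with total cost exactly b
    best = [neg_inf] * (budget + 1)
    best[0] = 0.0
    # picks[i][b] = (block chosen at layer i, cost before layer i) for the best path to b
    picks: List[Dict[int, Tuple[str, int]]] = []

    for layer in options:
        nxt = [neg_inf] * (budget + 1)
        picked: Dict[int, Tuple[str, int]] = {}
        for b in range(budget + 1):
            if best[b] == neg_inf:
                continue  # cost b is not reachable
            for name, (cost, score) in layer.items():
                nb = b + cost
                if nb <= budget and best[b] + score > nxt[nb]:
                    nxt[nb] = best[b] + score
                    picked[nb] = (name, b)
        best = nxt
        picks.append(picked)

    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        raise ValueError("no layout fits the budget")

    # Walk back through the recorded choices to recover the layout.
    layout: List[str] = []
    b = end
    for picked in reversed(picks):
        name, b = picked[b]
        layout.append(name)
    return best[end], layout[::-1]


if __name__ == "__main__":
    # Hypothetical 4-layer example: LCHA is assumed costlier than SSA here.
    layers = [
        {"LCHA": (3, 10), "SSA": (1, 2)},
        {"LCHA": (3, 5), "SSA": (1, 4)},
        {"LCHA": (3, 5), "SSA": (1, 4)},
        {"LCHA": (3, 10), "SSA": (1, 2)},
    ]
    print(search_layout(layers, budget=8))
```

With these made-up numbers, the budget rules out using the costlier block everywhere, so the search has to decide which layers benefit most from it. The table has one entry per reachable cost, so run time grows with the number of layers times the budget times the number of block types.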
Figure: Overview of our S2DiT.
We provide video comparisons of S2DiT variants against other models. Compared models: LTX-2B, Wan-1.3B, S2DiT-Pretrained, S2DiT-KD, S2DiT-AR.

Prompt: A bustling cityscape at sunset with skyscrapers reflecting golden light, people walking, and traffic moving swiftly.

Prompt: In a bustling restaurant kitchen, showcase the chaos of chefs preparing a gourmet feast. Utilize tight close-ups and quick cuts to highlight the sizzling pans, chopping knives, and intricate plating details, creating a visually immersive experience.

Prompt: In a grand concert hall, focus on an otter gracefully playing a piano with remarkable skill. Showcase the 4K details of water droplets elegantly splashing as its paws dance across the keys, camera smoothly gliding from one side of the piano to the other.

Prompt: In a well-appointed study, a cat sits behind a desk, 'typing' on a miniature laptop. Detailed close-ups capture the tiny paws navigating the keyboard. Camera executes a precise tilt, emphasizing the amusing sight of a cat at work.

Prompt: A dirt bike navigates through a dense forest trail. 4K close-up of the bike maneuvering through foliage, camera follows the rider's perspective, creating an immersive experience of the off-road adventure.

Prompt: A luxury car elegantly cruises through a well-lit mountain tunnel. Cinematic tracking shot of the car's sleek design, the camera gliding smoothly alongside, capturing the play of light on its polished exterior in high detail.

| Model | VBench ↑ | Mobile (video length / latency)* | Mobile streaming (FPS)* |
|---|---|---|---|
| Wan2.1-1.3B | 83.31 | ✗ | ✗ |
| LTX-2B | 80.00 | ✗ | ✗ |
| SnapGenV | 81.14 | 5s / 4s | ✗ |
| SnapGenV-DiT | 81.45 | 5s / 4s | ✗ |
| NeoDragon | 81.61 | 2s / 6.7s | ✗ |
| S2DiT | 83.26 | ✓ | ~11 FPS |

*: The latency is benchmarked on iPhone 16 Pro Max.
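The mobile columns can be read as simple throughput figures. The short sketch below converts the reported length/latency pairs into a real-time factor and turns the streaming rate into a per-frame time budget. The helper names `realtime_factor` and `frame_budget_ms` are illustrative, and the sketch assumes that a non-streaming model shows nothing until the whole clip has been generated, which the table itself does not state.

```python
from typing import Dict, Tuple

# (clip length in seconds, on-device latency in seconds), copied from the table.
REPORTED: Dict[str, Tuple[float, float]] = {
    "SnapGenV": (5.0, 4.0),
    "SnapGenV-DiT": (5.0, 4.0),
    "NeoDragon": (2.0, 6.7),
}


def realtime_factor(length_s: float, latency_s: float) -> float:
    """Seconds of video produced per second of compute (above 1 is faster than playback)."""
    return length_s / latency_s


def frame_budget_ms(fps: float) -> float:
    """Average time available per generated frame at the given rate, in milliseconds."""
    return 1000.0 / fps


if __name__ == "__main__":
    for name, (length_s, latency_s) in REPORTED.items():
        factor = realtime_factor(length_s, latency_s)
        # Assumption: the clip is only viewable once generation has finished.
        print(f"{name}: {factor:.2f}x real time, wait of {latency_s:.1f}s before playback")
    print(f"S2DiT at ~11 FPS: about {frame_budget_ms(11):.0f} ms per frame")
```

Whether roughly 11 generated frames per second keeps up with playback depends on the video frame rate, which the table does not report.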