S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

¹Snap Inc., ²Northeastern University
*Equal Contribution, Corresponding author
The 512×288 and 288×512 samples are generated by our S2DiT-KD model under the bidirectional full-step setting.
The 512×288 and 288×512 samples are generated by our S2DiT-AR model under the streaming 4-step setting.

Abstract

Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, streaming video generation on mobile hardware. S2DiT generates more tokens than prior mobile models but stays efficient through a mixture of two novel attention mechanisms: LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Building on these, we uncover the sandwich design via a budget-aware dynamic programming search, which achieves superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, these components let S2DiT achieve quality on par with state-of-the-art server video models while streaming at over 10 FPS on an iPhone.

Method

Unlike prior mobile video diffusion systems that rely on extremely compressed latent spaces, S2DiT operates in a moderately compressed latent representation that preserves more spatial and temporal detail. This choice improves fidelity but significantly increases the number of latent tokens, making conventional DiT architectures too slow for mobile deployment.
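To make the token-count trade-off concrete, the following sketch compares two latent compression settings. The clip length and the downsampling factors are hypothetical values chosen for illustration; they are not the settings used by S2DiT or by any prior system.

```python
def latent_tokens(frames: int, height: int, width: int,
                  t_down: int, s_down: int) -> int:
    """Latent token count after temporal and spatial downsampling.

    t_down and s_down are hypothetical compression factors.
    """
    return (frames // t_down) * (height // s_down) * (width // s_down)


def attention_pair_count(num_tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in token count)."""
    return num_tokens * num_tokens


# Hypothetical 80-frame clip at 288x512 under two assumed compression settings.
heavy = latent_tokens(80, 288, 512, t_down=8, s_down=32)     # extreme compression
moderate = latent_tokens(80, 288, 512, t_down=4, s_down=16)  # moderate compression

token_ratio = moderate / heavy
pair_ratio = attention_pair_count(moderate) / attention_pair_count(heavy)
print(heavy, moderate, token_ratio, pair_ratio)
```

Under these assumed factors the moderate setting yields 8 times as many tokens, and full self-attention would score 64 times as many token pairs, which is why the quadratic term dominates as compression is relaxed.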

An Efficient Sandwich Diffusion Transformer. Our approach addresses this challenge through a structured architectural pattern that alternates high-resolution and low-resolution processing stages. At a high level, Sandwich DiT interleaves two complementary attention modules: LinConv Hybrid Attention (LCHA) for high-resolution modeling that preserves detail at linear cost, and Stride Self-Attention (SSA) for low-resolution modeling that aggregates global context at reduced token count. The architectural layout and the precise allocation of these modules are determined through a budget-aware search.
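One way to picture a budget-aware search is as a multiple-choice knapsack solved with dynamic programming: each layer picks one block type, each type has a cost, and the total cost must stay within a budget. The sketch below is only an illustration under that assumption. The function `search_layout`, the per-layer scores, the integer costs, and the additive objective are all hypothetical; the actual search objective and cost model of S2DiT are not described here.

```python
from typing import Dict, List, Tuple


def search_layout(scores: List[Dict[str, float]],
                  costs: Dict[str, int],
                  budget: int) -> Tuple[float, List[str]]:
    """Pick one block type per layer, maximizing total score under a cost budget."""
    # best[c] = (total score, layout) over the layers seen so far at exact cost c
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for layer_scores in scores:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for spent, (total, layout) in best.items():
            for block, gain in layer_scores.items():
                new_spent = spent + costs[block]
                if new_spent > budget:
                    continue  # this choice would exceed the budget
                cand = (total + gain, layout + [block])
                if new_spent not in nxt or cand[0] > nxt[new_spent][0]:
                    nxt[new_spent] = cand
        best = nxt
    if not best:
        raise ValueError("no layout fits the budget")
    return max(best.values(), key=lambda item: item[0])


# Hypothetical costs (arbitrary latency units) and per-layer quality scores.
costs = {"LCHA": 3, "SSA": 1}
scores = [
    {"LCHA": 1.0, "SSA": 0.2},
    {"LCHA": 0.6, "SSA": 0.5},
    {"LCHA": 0.6, "SSA": 0.5},
    {"LCHA": 1.0, "SSA": 0.2},
]
total, layout = search_layout(scores, costs, budget=8)
print(total, layout)
```

With these toy numbers the budget allows at most two LCHA blocks, and the search places them at the first and last layers. That outcome follows from the made-up scores and should not be read as the layout found for S2DiT.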

2-in-1 Distillation Pipeline. In addition, distillation from a powerful teacher model plays a central role, supplying rich semantic and structural guidance that leads to better visual quality and motion consistency. On top of Sandwich DiT, we introduce a two-in-one distillation pipeline that (i) aligns the student with a strong teacher model through offline cached distillation, and (ii) incorporates Distribution-Matching Distillation (DMD) and self-forcing to support few-step auto-regressive generation.
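The control flow of few-step auto-regressive generation can be sketched as a loop that denoises one chunk of latent frames at a time, conditions each chunk on recently generated ones, and emits it as soon as it is ready. This is a minimal sketch of the loop structure only. The names `stream_chunks` and `toy_student`, the re-noising schedule, the chunk shape, and the context window are assumptions made for illustration; only the four denoising steps come from the streaming 4-step setting mentioned above.

```python
import numpy as np

NUM_STEPS = 4  # the streaming 4-step setting; all other values are assumed


def stream_chunks(num_chunks, chunk_shape, student, window=2, seed=0):
    """Yield latent chunks one by one, each denoised in NUM_STEPS steps."""
    rng = np.random.default_rng(seed)
    timesteps = np.linspace(1.0, 0.0, NUM_STEPS + 1)  # 1.0, 0.75, 0.5, 0.25, 0.0
    context = []  # most recent generated chunks, used as conditioning
    for _ in range(num_chunks):
        x = rng.standard_normal(chunk_shape)  # start each chunk from pure noise
        for t, t_next in zip(timesteps[:-1], timesteps[1:]):
            x0 = student(x, context, t)  # student predicts a clean chunk
            noise = rng.standard_normal(chunk_shape)
            # Re-noise the prediction to the next level (assumed linear schedule);
            # at t_next == 0 this leaves the clean prediction unchanged.
            x = (1.0 - t_next) * x0 + t_next * noise
        context = (context + [x])[-window:]  # keep a rolling context window
        yield x  # chunk is available before later chunks are generated


calls = []


def toy_student(noisy_chunk, context, t):
    """Hypothetical stand-in for the distilled student network."""
    calls.append(float(t))
    anchor = context[-1] if context else np.zeros_like(noisy_chunk)
    return (1.0 - t) * noisy_chunk + t * anchor


chunks = list(stream_chunks(3, (4, 8, 8), toy_student))
print(len(chunks), chunks[0].shape, len(calls))
```

Because the generator yields each chunk after only four student calls, playback can begin while later chunks are still being produced, which is the property a streaming setting relies on.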

Overview of our S2DiT.

Qualitative Comparisons

We compare videos generated by the S2DiT variants with those from other models.

LTX-2B

Wan-1.3B

S2DiT-Pretrained

S2DiT-KD

S2DiT-AR

Prompt: A bustling cityscape at sunset with skyscrapers reflecting golden light, people walking, and traffic moving swiftly.
Prompt: In a bustling restaurant kitchen, showcase the chaos of chefs preparing a gourmet feast. Utilize tight close-ups and quick cuts to highlight the sizzling pans, chopping knives, and intricate plating details, creating a visually immersive experience.
Prompt: In a grand concert hall, focus on an otter gracefully playing a piano with remarkable skill. Showcase the 4K details of water droplets elegantly splashing as its paws dance across the keys, camera smoothly gliding from one side of the piano to the other.
Prompt: In a well-appointed study, a cat sits behind a desk, 'typing' on a miniature laptop. Detailed close-ups capture the tiny paws navigating the keyboard. Camera executes a precise tilt, emphasizing the amusing sight of a cat at work.
Prompt: A dirt bike navigates through a dense forest trail. 4K close-up of the bike maneuvering through foliage, camera follows the rider's perspective, creating an immersive experience of the off-road adventure.
Prompt: A luxury car elegantly cruises through a well-lit mountain tunnel. Cinematic tracking shot of the car's sleek design, the camera gliding smoothly alongside, capturing the play of light on its polished exterior in high detail.

Quantitative Comparison

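| Model | VBench ↑ | Mobile (length / latency)* | Mobile Streaming (FPS)* |
|---|---|---|---|
| Wan2.1-1.3B | 83.31 | - | - |
| LTX-2B | 80.00 | - | - |
| SnapGenV | 81.14 | 5s / 4s | - |
| SnapGenV-DiT | 81.45 | 5s / 4s | - |
| NeoDragon | 81.61 | 2s / 6.7s | - |
| S2DiT | 83.26 | - | ∼11 FPS |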

*: The latency is benchmarked on iPhone 16 Pro Max.

Mobile Demo on iPhone 16 Pro Max


BibTeX