S²DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

¹Snap Inc., ²Northeastern University

*Equal contribution, †Corresponding author

Abstract

Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S²DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, streaming video generation on mobile hardware. S²DiT generates more tokens than prior mobile models but stays efficient by mixing two efficient attention modules: LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Building on these modules, we arrive at the sandwich design through a budget-aware dynamic programming search, which improves both quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, these components let S²DiT achieve quality on par with state-of-the-art server video models while streaming at over 10 FPS on an iPhone.

Method

Unlike prior mobile video diffusion systems that rely on extremely compressed latent spaces, S²DiT operates in a moderately compressed latent representation that preserves more spatial and temporal detail. This choice improves fidelity but significantly increases the number of latent tokens, making conventional DiT architectures too slow for mobile deployment.
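To make the token-count argument concrete, the following sketch counts latent tokens for one clip under two compression settings. The clip size and the downsampling factors are hypothetical values chosen only for illustration; they are not the settings used by S²DiT or by any prior system.

```python
def latent_tokens(frames: int, height: int, width: int,
                  t_down: int, s_down: int) -> int:
    """Count latent tokens after temporal and spatial downsampling."""
    lat_t = frames // t_down   # latent frames
    lat_h = height // s_down   # latent rows
    lat_w = width // s_down    # latent columns
    return lat_t * lat_h * lat_w


# Hypothetical clip and compression factors (assumptions, not from the paper).
frames, height, width = 64, 512, 512
extreme = latent_tokens(frames, height, width, t_down=8, s_down=32)
moderate = latent_tokens(frames, height, width, t_down=4, s_down=16)

token_ratio = moderate // extreme
# Full self-attention scores every token pair, so its cost grows with the
# square of the token count.
pair_ratio = (moderate ** 2) // (extreme ** 2)

print(extreme, moderate, token_ratio, pair_ratio)  # 2048 16384 8 64
```

Under these assumed numbers, halving each downsampling factor gives 8 times more tokens and 64 times more token pairs for full self-attention, which is why a conventional DiT becomes too slow at this latent resolution.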

An Efficient Sandwich Diffusion Transformer. Our approach addresses this challenge through a structured architectural pattern that alternates high-resolution and low-resolution processing stages. At a high level, Sandwich DiT interleaves two complementary attention modules: LinConv Hybrid Attention (LCHA) for high-resolution modeling that preserves detail at linear cost, and Stride Self-Attention (SSA) for low-resolution modeling that aggregates global context at reduced token count. The architectural layout and the precise allocation of these modules are determined through a budget-aware search.
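As a rough illustration of what a budget-aware search can look like, the sketch below uses a knapsack-style dynamic program to choose LCHA or SSA for each layer so that the total gain is maximal within a cost budget. This is a generic formulation under stated assumptions, not the authors' search procedure: the function `search_layout`, the integer cost units, and the per-layer gain scores are all hypothetical.

```python
from typing import List, Optional, Tuple

NAMES = ("LCHA", "SSA")
NEG_INF = float("-inf")


def search_layout(costs: List[Tuple[int, int]],
                  gains: List[Tuple[float, float]],
                  budget: int) -> Tuple[float, List[str]]:
    """Choose one module per layer to maximise total gain within the budget.

    costs[i] = (LCHA cost, SSA cost) at layer i, in integer budget units.
    gains[i] = (LCHA gain, SSA gain) at layer i, a hypothetical quality proxy.
    """
    # best[b] = best total gain over the layers seen so far using exactly b units.
    best = [NEG_INF] * (budget + 1)
    best[0] = 0.0
    history: List[List[Optional[Tuple[int, int]]]] = []
    for layer_costs, layer_gains in zip(costs, gains):
        nxt = [NEG_INF] * (budget + 1)
        back: List[Optional[Tuple[int, int]]] = [None] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == NEG_INF:
                continue  # this budget level is unreachable
            for k in range(2):
                new_used = used + layer_costs[k]
                score = best[used] + layer_gains[k]
                if new_used <= budget and score > nxt[new_used]:
                    nxt[new_used] = score
                    back[new_used] = (k, used)  # remember choice and predecessor
        best = nxt
        history.append(back)

    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == NEG_INF:
        raise ValueError("budget too small for any layout")

    # Walk the back-pointers from the last layer to the first.
    layout: List[str] = []
    used = end
    for back in reversed(history):
        k, used = back[used]
        layout.append(NAMES[k])
    layout.reverse()
    return best[end], layout


# Toy inputs (assumptions): LCHA costs 3 units, SSA costs 1 unit, and LCHA
# helps most at the first and last layers.
costs = [(3, 1)] * 4
gains = [(3.0, 1.0), (1.2, 1.0), (1.2, 1.0), (3.0, 1.0)]
total_gain, layout = search_layout(costs, gains, budget=8)
print(total_gain, layout)  # 8.0 ['LCHA', 'SSA', 'SSA', 'LCHA']
```

The sandwich-like result here follows directly from the hand-picked toy gains. In a real search the gains would have to come from measured quality, and the costs from measured latency on the target device.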

2-in-1 Distillation Pipeline. In addition, distillation from a powerful teacher model plays a central role, supplying rich semantic and structural guidance that leads to better visual quality and motion consistency. On top of Sandwich DiT, we introduce a two-in-one distillation pipeline that (i) aligns the student with a strong teacher model through offline cached distillation, and (ii) incorporates Distribution-Matching Distillation (DMD) and self-forcing to support few-step auto-regressive generation.
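Step (i) can be pictured with a small sketch, assuming that "offline cached" means the teacher is run once ahead of time and its outputs are stored for reuse during student training. The mean-squared-error objective, the function names, and the toy linear "models" are illustrative assumptions, not the authors' implementation. DMD and self-forcing from step (ii) are not covered here.

```python
from typing import Callable, Dict, List

Vector = List[float]
Model = Callable[[Vector], Vector]


def build_teacher_cache(teacher: Model,
                        dataset: Dict[str, Vector]) -> Dict[str, Vector]:
    """Run the expensive teacher once per sample and store its outputs."""
    return {key: teacher(x) for key, x in dataset.items()}


def cached_distillation_loss(student: Model,
                             dataset: Dict[str, Vector],
                             cache: Dict[str, Vector]) -> float:
    """Mean squared error between student outputs and cached teacher outputs."""
    total, count = 0.0, 0
    for key, x in dataset.items():
        pred = student(x)
        target = cache[key]  # no teacher forward pass is needed here
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += len(target)
    return total / count


# Toy stand-ins for the teacher and student (assumptions for illustration).
def teacher(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def student(x: Vector) -> Vector:
    return [1.5 * v for v in x]


dataset = {"a": [1.0, 2.0], "b": [0.0, 4.0]}
cache = build_teacher_cache(teacher, dataset)
loss = cached_distillation_loss(student, dataset, cache)
print(loss)  # 1.3125
```

Caching moves the teacher's cost out of the training loop, which matters when the teacher is far larger than the student.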

Figure: Overview of S²DiT.

Qualitative Comparisons

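We compare videos generated by three S²DiT variants (Pretrained, KD, and AR) with those from LTX-2B and Wan-1.3B. For each prompt, the videos appear in the following column order, left to right:
LTX-2B	Wan-1.3B	S²DiT-Pretrained	S²DiT-KD	S²DiT-AR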

Prompt: A bustling cityscape at sunset with skyscrapers reflecting golden light, people walking, and traffic moving swiftly.

Prompt: In a bustling restaurant kitchen, showcase the chaos of chefs preparing a gourmet feast. Utilize tight close-ups and quick cuts to highlight the sizzling pans, chopping knives, and intricate plating details, creating a visually immersive experience.

Prompt: In a grand concert hall, focus on an otter gracefully playing a piano with remarkable skill. Showcase the 4K details of water droplets elegantly splashing as its paws dance across the keys, camera smoothly gliding from one side of the piano to the other.

Prompt: In a well-appointed study, a cat sits behind a desk, 'typing' on a miniature laptop. Detailed close-ups capture the tiny paws navigating the keyboard. Camera executes a precise tilt, emphasizing the amusing sight of a cat at work.

Prompt: A dirt bike navigates through a dense forest trail. 4K close-up of the bike maneuvering through foliage, camera follows the rider's perspective, creating an immersive experience of the off-road adventure.

Prompt: A luxury car elegantly cruises through a well-lit mountain tunnel. Cinematic tracking shot of the car's sleek design, the camera gliding smoothly alongside, capturing the play of light on its polished exterior in high detail.

Quantitative Comparison