SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Snap Inc. · University of Melbourne · MBZUAI
SnapGen++ is the first diffusion transformer (DiT) model (0.4B) that can generate high-fidelity images (1024x1024) on mobile devices in just 1.8s, achieving 85.2% on DPG-Bench and 0.70 on GenEval.
SnapGen++ high-fidelity image generation demo

Abstract

Recent diffusion transformers (DiTs) have set new standards in image generation, yet they remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global–local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we introduce an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust its capacity for efficient inference across different hardware. Finally, we develop Knowledge-guided Distribution Matching Distillation (K-DMD), a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity, low-latency (e.g., 4-step) generation suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.

Demo on iPhone 16 Pro Max

Efficient DiT Architecture

Our efficient DiT consists of three stages: Down, Middle, and Up (left). The Down and Up blocks operate on high-resolution latents and use our novel Adaptive Sparse Self-Attention (ASSA) layers (right), while the Middle blocks operate on latents downsampled by a 2x2 window and use standard Self-Attention (SA) layers. Each block also contains a Cross-Attention (CA) layer for conditioning on the input text and a Feed-Forward Network (FFN). The ASSA layer consists of two parallel attention branches: (i) coarse-grained key-value compression that captures the overall structure, and (ii) fine-grained blockwise neighborhood attention that preserves local details.

SnapGen++ efficient DiT architecture diagram
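
To make the two-branch design concrete, here is a minimal PyTorch sketch of an ASSA layer. The 2x2 average-pooled key-value compression, the 8x8 non-overlapping local windows, and the per-channel sigmoid fusion gate are illustrative assumptions; the exact sparsity pattern, window scheme, and branch fusion in SnapGen++ may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASSA(nn.Module):
    # Sketch of Adaptive Sparse Self-Attention with two parallel branches:
    # (i) attention over pooled (compressed) keys/values for global structure,
    # (ii) attention within local non-overlapping windows for fine detail.
    # Assumes h, w are divisible by both `pool` and `window`.
    def __init__(self, dim, num_heads=8, pool=2, window=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.pool = pool      # spatial KV compression factor (assumed 2x2)
        self.window = window  # local window size in tokens (assumed 8x8)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(dim))  # learned per-channel fusion

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence over an h*w latent grid, N = h * w
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def heads(t, n):  # (B, n, C) -> (B, num_heads, n, head_dim)
            return t.view(B, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Branch (i): coarse-grained key-value compression via average pooling.
        def compress(t):
            t = t.transpose(1, 2).reshape(B, C, h, w)
            t = F.avg_pool2d(t, self.pool)
            return t.flatten(2).transpose(1, 2)  # (B, N / pool^2, C)

        kc, vc = compress(k), compress(v)
        coarse = F.scaled_dot_product_attention(
            heads(q, N), heads(kc, kc.shape[1]), heads(vc, vc.shape[1]))
        coarse = coarse.transpose(1, 2).reshape(B, N, C)

        # Branch (ii): fine-grained attention within non-overlapping windows.
        ws = self.window
        def windows(t):  # partition the h*w grid into ws x ws windows
            t = t.view(B, h // ws, ws, w // ws, ws, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        qw, kw, vw = windows(q), windows(k), windows(v)
        Bw = qw.shape[0]
        def wheads(t):  # (Bw, ws*ws, C) -> (Bw, num_heads, ws*ws, head_dim)
            return t.view(Bw, ws * ws, self.num_heads, self.head_dim).transpose(1, 2)

        fine = F.scaled_dot_product_attention(wheads(qw), wheads(kw), wheads(vw))
        fine = fine.transpose(1, 2).reshape(B, h // ws, w // ws, ws, ws, C)
        fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # Adaptive fusion of the global and local branches (gating is assumed).
        g = torch.sigmoid(self.gate)
        return self.proj(g * coarse + (1 - g) * fine)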

Elastic Training

We design an Elastic DiT framework that enables a single diffusion transformer to flexibly scale its capacity to the available computational resources. We identify a structural decomposition that allows parameter sharing across subnetworks of different widths: slicing the projection matrices in the attention and FFN layers along the hidden dimension samples subnetworks of varying sizes from a single supernetwork. During training, we sample subnetworks uniformly and supervise them with the supernetwork's output, in addition to applying the standard diffusion loss at all granularities. This stabilizes training and transfers knowledge from the supernetwork to the subnetworks.

SnapGen++ elastic training diagram
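
The weight-slicing idea can be illustrated with a short PyTorch sketch. The ElasticLinear layer, the width set (0.25/0.5/0.75), the equal loss weighting, and the supernet call signature are assumptions made for illustration, not the released training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Linear):
    # Projection whose weight matrix is sliced along the hidden dimension,
    # so subnetworks of different widths share one set of parameters.
    def forward(self, x, in_ratio=1.0, out_ratio=1.0):
        d_in = int(self.in_features * in_ratio)
        d_out = int(self.out_features * out_ratio)
        w = self.weight[:d_out, :d_in]               # slice rows and columns
        b = self.bias[:d_out] if self.bias is not None else None
        return F.linear(x[..., :d_in], w, b)

def elastic_step(supernet, x_t, t, text, target, widths=(0.25, 0.5, 0.75)):
    # Full supernetwork: standard diffusion loss; its prediction also serves
    # as the teacher signal for the sampled subnetwork.
    full = supernet(x_t, t, text, width=1.0)
    loss = F.mse_loss(full, target)

    # Uniformly sample one subnetwork width per training step.
    w = widths[torch.randint(len(widths), (1,)).item()]
    # (Assumes the model's final projection restores the full output
    # dimensionality regardless of the sampled width.)
    sub = supernet(x_t, t, text, width=w)

    # Diffusion loss at the sampled granularity plus in-place distillation
    # from the (detached) supernetwork output.
    loss = loss + F.mse_loss(sub, target) + F.mse_loss(sub, full.detach())
    return loss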

Multi-stage Distillation

Following the SnapGen pipeline, we first perform large-scale pretraining and knowledge distillation (KD) to substantially enhance the capacity of small student models. Once KD is done, we perform step distillation for efficient inference. To stabilize step distillation, we propose Knowledge-guided DMD (K-DMD), which extends DMD-based step distillation by incorporating a knowledge distillation objective from a few-step teacher.

SnapGen++ step distillation diagram
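
A loss-level sketch of K-DMD under our reading of the description: the standard DMD distribution-matching gradient plus a regression term onto a few-step teacher's sample. The toy noise schedule, all model call signatures, and the lambda_kd weighting are assumptions; DMD also updates the fake score model with a denoising loss on student samples, which is omitted here.

import torch
import torch.nn.functional as F

def kdmd_loss(student, few_step_teacher, real_score, fake_score,
              z, text, lambda_kd=1.0):
    # z: input noise, text: prompt embeddings. Call signatures below are
    # hypothetical placeholders.
    x = student(z, text)  # few-step student sample being improved

    # DMD term: the KL gradient w.r.t. x is approximated by the difference
    # between the fake (student-distribution) and real (teacher) scores,
    # evaluated at a re-noised version of x.
    t = torch.randint(20, 980, (x.shape[0],), device=x.device)
    a = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)  # toy schedule (assumption)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * torch.randn_like(x)
    with torch.no_grad():
        grad = fake_score(x_t, t, text) - real_score(x_t, t, text)
    # Surrogate loss whose gradient w.r.t. x equals `grad` (up to scaling).
    dmd = 0.5 * F.mse_loss(x, (x - grad).detach())

    # KD term: anchor the student to the few-step teacher's sample for the
    # same noise and prompt, which stabilizes the DMD objective.
    with torch.no_grad():
        x_kd = few_step_teacher(z, text)
    kd = F.mse_loss(x, x_kd)

    return dmd + lambda_kd * kd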

Quantitative Comparison

Comparison with existing T2I models across various benchmarks: scores are reported on DPG-Bench, GenEval, T2I-CompBench, and CLIP (COCO). Throughput/FPS (samples/s) is measured on a single 80GB A100 GPU using the largest batch size that fits for 1024x1024 images. Latency (ms) is measured on an iPhone 16 Pro Max for a single forward pass.

SnapGen++ quantitative results
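
For reference, a throughput probe in this style might look like the following sketch; the pipeline interface is an assumption, and the actual measurement protocol may differ in details such as warm-up and iteration counts.

import time
import torch

@torch.no_grad()
def measure_throughput(pipe, prompt, batch_size, n_iters=10):
    # Toy probe mirroring the protocol above: largest batch that fits on one
    # GPU, 1024x1024 outputs, reported as samples per second. `pipe` is any
    # callable text-to-image pipeline (its interface is an assumption).
    prompts = [prompt] * batch_size
    pipe(prompts)                    # warm-up run (compilation, caches)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        pipe(prompts)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed  # samples / second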

Qualitative Results

Few-Step Visualization: Comparison of images produced by our tiny (0.3B), small (0.4B), and full (1.6B) models under 28-step (w/o K-DMD) and 4-step (w/ K-DMD) settings. Numbers in the corners denote DPG / GenEval scores.

SnapGen++ few step results

More Visual Comparisons:

SnapGen++ additional visual results

BibTeX

@article{hu2026snapgenplusplus,
    title   = {SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices},
    author  = {Dongting Hu and Aarush Gupta and Magzhan Gabidolla and Arpit Sahni and Huseyin Coskun and Yanyu Li and Yerlan Idelbayev and Ahsan Mahmood and Aleksei Lebedev and Dishani Lahiri and Anujraaj Goyal and Ju Hu and Mingming Gong and Sergey Tulyakov and Anil Kag},
    journal = {arXiv:2601.08303 [cs.CV]},
    year    = {2026}
}