SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Snap Inc. · University of Melbourne · MBZUAI
SnapGen++ is the first diffusion transformer (DiT) model (0.4B) that can generate high-fidelity images (1024x1024) on mobile devices in just 1.8s, achieving 85.2% on DPG-Bench and 0.70 on GenEval.
SnapGen++ high-fidelity image generation demo

Abstract

Recent diffusion transformers (DiTs) have set new standards in image generation, yet they remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global–local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we introduce an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust its capacity for efficient inference across different hardware. Finally, we develop Knowledge-guided Distribution Matching Distillation (K-DMD), a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity, low-latency (e.g., 4-step) generation suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.

Demo on iPhone 16 Pro Max

Efficient DiT Architecture

Our efficient DiT consists of three stages: Down, Middle, and Up (left). The Down and Up blocks operate on high-resolution latents and use our novel Adaptive Sparse Self-Attention (ASSA) layers (right), while the Middle blocks operate on latents downsampled by a 2x2 window and use standard Self-Attention (SA) layers. Each block also contains a Cross-Attention (CA) layer for conditioning on the input text and a Feed-Forward Network (FFN). The ASSA layer consists of two parallel attention branches: (i) coarse-grained key-value compression that captures the overall structure, and (ii) fine-grained blockwise neighborhood attention that preserves local details.

SnapGen++ efficient DiT architecture diagram
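
To make the two-branch design concrete, here is a minimal PyTorch sketch of an ASSA layer. The 2x2 average-pooled key-value compression, the 8x8 non-overlapping local windows, and the per-channel sigmoid fusion gate are illustrative assumptions; the exact sparsity pattern, window scheme, and branch fusion in SnapGen++ may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASSA(nn.Module):
    # Sketch of Adaptive Sparse Self-Attention with two parallel branches:
    # (i) attention over pooled (compressed) keys/values for global structure,
    # (ii) attention within local non-overlapping windows for fine detail.
    # Assumes h, w are divisible by both `pool` and `window`.
    def __init__(self, dim, num_heads=8, pool=2, window=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.pool = pool      # spatial KV compression factor (assumed 2x2)
        self.window = window  # local window size in tokens (assumed 8x8)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(dim))  # learned per-channel fusion

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence over an h*w latent grid, N = h * w
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def heads(t, n):  # (B, n, C) -> (B, num_heads, n, head_dim)
            return t.view(B, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Branch (i): coarse-grained key-value compression via average pooling.
        def compress(t):
            t = t.transpose(1, 2).reshape(B, C, h, w)
            t = F.avg_pool2d(t, self.pool)
            return t.flatten(2).transpose(1, 2)  # (B, N / pool^2, C)

        kc, vc = compress(k), compress(v)
        coarse = F.scaled_dot_product_attention(
            heads(q, N), heads(kc, kc.shape[1]), heads(vc, vc.shape[1]))
        coarse = coarse.transpose(1, 2).reshape(B, N, C)

        # Branch (ii): fine-grained attention within non-overlapping windows.
        ws = self.window
        def windows(t):  # partition the h*w grid into ws x ws windows
            t = t.view(B, h // ws, ws, w // ws, ws, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        qw, kw, vw = windows(q), windows(k), windows(v)
        Bw = qw.shape[0]
        def wheads(t):  # (Bw, ws*ws, C) -> (Bw, num_heads, ws*ws, head_dim)
            return t.view(Bw, ws * ws, self.num_heads, self.head_dim).transpose(1, 2)

        fine = F.scaled_dot_product_attention(wheads(qw), wheads(kw), wheads(vw))
        fine = fine.transpose(1, 2).reshape(B, h // ws, w // ws, ws, ws, C)
        fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # Adaptive fusion of the global and local branches (gating is assumed).
        g = torch.sigmoid(self.gate)
        return self.proj(g * coarse + (1 - g) * fine)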

Elastic Training

We design an Elastic DiT framework that enables a single diffusion transformer to flexibly scale its capacity to the available computational resources. We identify a structural decomposition that allows parameter sharing across subnetworks of different widths: slicing the projection matrices in the attention and FFN layers along the hidden dimension samples subnetworks of varying sizes from a single supernetwork. During training, we sample subnetworks uniformly and supervise them with the supernetwork's output, in addition to applying the standard diffusion loss at all granularities. This stabilizes training and transfers knowledge from the supernetwork to the subnetworks.

SnapGen++ elastic training diagram
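
The weight-slicing idea can be illustrated with a short PyTorch sketch. The ElasticLinear layer, the width set (0.25/0.5/0.75), the equal loss weighting, and the supernet call signature are assumptions made for illustration, not the released training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Linear):
    # Projection whose weight matrix is sliced along the hidden dimension,
    # so subnetworks of different widths share one set of parameters.
    def forward(self, x, in_ratio=1.0, out_ratio=1.0):
        d_in = int(self.in_features * in_ratio)
        d_out = int(self.out_features * out_ratio)
        w = self.weight[:d_out, :d_in]               # slice rows and columns
        b = self.bias[:d_out] if self.bias is not None else None
        return F.linear(x[..., :d_in], w, b)

def elastic_step(supernet, x_t, t, text, target, widths=(0.25, 0.5, 0.75)):
    # Full supernetwork: standard diffusion loss; its prediction also serves
    # as the teacher signal for the sampled subnetwork.
    full = supernet(x_t, t, text, width=1.0)
    loss = F.mse_loss(full, target)

    # Uniformly sample one subnetwork width per training step.
    w = widths[torch.randint(len(widths), (1,)).item()]
    # (Assumes the model's final projection restores the full output
    # dimensionality regardless of the sampled width.)
    sub = supernet(x_t, t, text, width=w)

    # Diffusion loss at the sampled granularity plus in-place distillation
    # from the (detached) supernetwork output.
    loss = loss + F.mse_loss(sub, target) + F.mse_loss(sub, full.detach())
    return loss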

Multi-stage Distillation

Following the SnapGen pipeline, we first perform large-scale pretraining and knowledge distillation (KD) to substantially enhance the capacity of small student models. Once KD is done, we perform step distillation for efficient inference. To stabilize step distillation, we propose Knowledge-guided DMD (K-DMD), which extends DMD-based step distillation by incorporating a knowledge distillation objective from a few-step teacher.

SnapGen++ step distillation diagram
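
A loss-level sketch of K-DMD under our reading of the description: the standard DMD distribution-matching gradient plus a regression term onto a few-step teacher's sample. The toy noise schedule, all model call signatures, and the lambda_kd weighting are assumptions; DMD also updates the fake score model with a denoising loss on student samples, which is omitted here.

import torch
import torch.nn.functional as F

def kdmd_loss(student, few_step_teacher, real_score, fake_score,
              z, text, lambda_kd=1.0):
    # z: input noise, text: prompt embeddings. Call signatures below are
    # hypothetical placeholders.
    x = student(z, text)  # few-step student sample being improved

    # DMD term: the KL gradient w.r.t. x is approximated by the difference
    # between the fake (student-distribution) and real (teacher) scores,
    # evaluated at a re-noised version of x.
    t = torch.randint(20, 980, (x.shape[0],), device=x.device)
    a = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)  # toy schedule (assumption)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * torch.randn_like(x)
    with torch.no_grad():
        grad = fake_score(x_t, t, text) - real_score(x_t, t, text)
    # Surrogate loss whose gradient w.r.t. x equals `grad` (up to scaling).
    dmd = 0.5 * F.mse_loss(x, (x - grad).detach())

    # KD term: anchor the student to the few-step teacher's sample for the
    # same noise and prompt, which stabilizes the DMD objective.
    with torch.no_grad():
        x_kd = few_step_teacher(z, text)
    kd = F.mse_loss(x, x_kd)

    return dmd + lambda_kd * kd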

Quantitative Comparison

Comparison with existing T2I models across various benchmarks: scores are reported on DPG-Bench, GenEval, T2I-CompBench, and CLIP (COCO). Throughput/FPS (samples/s) is measured on a single 80GB A100 GPU using the largest batch size that fits for 1024x1024 images. Latency (ms) is measured on an iPhone 16 Pro Max for a single forward pass.

SnapGen++ quantitative results
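
For reference, a throughput probe in this style might look like the following sketch; the pipeline interface is an assumption, and the actual measurement protocol may differ in details such as warm-up and iteration counts.

import time
import torch

@torch.no_grad()
def measure_throughput(pipe, prompt, batch_size, n_iters=10):
    # Toy probe mirroring the protocol above: largest batch that fits on one
    # GPU, 1024x1024 outputs, reported as samples per second. `pipe` is any
    # callable text-to-image pipeline (its interface is an assumption).
    prompts = [prompt] * batch_size
    pipe(prompts)                    # warm-up run (compilation, caches)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        pipe(prompts)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed  # samples / second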

Qualitative Results

Few-Step Visualization: Comparison of images produced by our tiny (0.3B), small (0.4B), and full (1.6B) models under 28-step (w/o K-DMD) and 4-step (w/ K-DMD) settings. Numbers in the corners denote DPG / GenEval scores.

SnapGen++ few step results

More Visual Comparisons:

SnapGen++ additional visual results

BibTeX

@article{hu2026snapgenplusplus,
    title   = {SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices},
    author  = {Dongting Hu and Aarush Gupta and Magzhan Gabidolla and Arpit Sahni and Huseyin Coskun and Yanyu Li and Yerlan Idelbayev and Ahsan Mahmood and Aleksei Lebedev and Dishani Lahiri and Anujraaj Goyal and Ju Hu and Mingming Gong and Sergey Tulyakov and Anil Kag},
    journal = {arXiv:2601.08303 [cs.CV]},
    year    = {2026}
}