We found that DiTs waste substantial compute by allocating it uniformly across pixels, despite large variation in regional difficulty. ELIT addresses this by introducing a variable-length set of latent tokens and two lightweight cross-attention layers (Read & Write) that concentrate computation on the most important input regions, delivering up to 53% FID and 58% FDD improvements on ImageNet-1K at 512px. At inference time, the number of latent tokens becomes a user-controlled knob, providing a smooth quality–FLOPs trade-off while enabling ~33% cheaper guidance out of the box.
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across spatial tokens, wasting compute on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface: a learnable, variable-length token sequence on which standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents, prioritizing important input regions. By training with random dropping of tail latents, ELIT learns importance-ordered representations, with earlier latents capturing global structure and later ones refining details. At inference, the number of latents can be adjusted dynamically to match compute constraints. ELIT is deliberately minimal: it adds two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains; on ImageNet-1K at 512px, it improves FID by 35.3% and FDD by 39.6% on average.
Compute is concentrated where it matters — Read cross-attention learns to allocate more latent tokens to challenging image regions rather than spreading compute uniformly.
Thanks to the adaptive, non-uniform compute allocation across its latent tokens, ELIT consistently converges faster than the baseline.
A single set of weights enables a spectrum of latency-quality trade-offs by simply selecting the number of latent tokens at inference.
Only two lightweight Read/Write layers are added, while keeping everything else unchanged, making ELIT compatible with standard DiT training pipelines. Demonstrated on DiT, U-ViT, HDiT, and MM-DiT.
Standard DiTs allocate computation uniformly across all spatial tokens regardless of content difficulty. In a synthetic experiment, we show that DiTs cannot reallocate compute from trivial regions to informative ones — even when given extra tokens. ELIT solves this by learning where to spend compute.
Uniform compute in DiTs. (Blue) DiT-B/2-Synth has 4× more tokens (zero-padded) but fails to improve, matching DiT-B/2 but underperforming DiT-B/1 at similar compute. (Red) ELIT-DiT-B/2-Synth repurposes compute from zeroed regions to enhance generation, matching DiT-B/1 quality.
ELIT introduces a minimal change to DiT-like architectures: a latent interface — a variable-length token sequence — coupled with lightweight Read and Write cross-attention layers.
We instantiate a latent interface of K tokens. A lightweight Read cross-attention layer pulls information from spatial tokens into the latent interface, prioritizing harder regions; this forms a compact latent domain on which most transformer blocks operate. A Write cross-attention layer maps the latent updates back to the spatial grid. Grouped cross-attention partitions spatial tokens into G non-overlapping groups, reducing cost from O(NK) to O(NK/G). Additionally, we randomly drop tail latents during training, making the latent interface importance-ordered. At inference, the number of latents serves as a user-controlled compute knob.
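The Read/Write interface can be sketched as plain cross-attention. Below is a minimal NumPy illustration under simplifying assumptions (single head, no learned projections, no grouping); all names are illustrative rather than the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product cross-attention with tied keys/values
    # (single head, no learned projections): queries attend to keys_values.
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # (n_q, n_kv)
    return attn @ keys_values                             # (n_q, d)

rng = np.random.default_rng(0)
N, K, d = 1024, 64, 32                  # spatial tokens, latents, channels
spatial = rng.standard_normal((N, d))   # flattened spatial token grid
latents = rng.standard_normal((K, d))   # stand-in for the latent interface

# Read: the K latents pull information from the N spatial tokens.
latents = cross_attention(latents, spatial)       # (K, d)
# ...the transformer core would operate on the K latents here...
# Write: spatial tokens pull the latent updates back onto the grid.
spatial_out = cross_attention(spatial, latents)   # (N, d)
```

In this sketch the quadratic attention matrices are N×K; the grouped variant described above would run Read/Write per spatial group, which is what cuts the cost from O(NK) to O(NK/G).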
How does the Read layer decide where to focus? And what does each latent token actually encode? Below we visualize ELIT's internals, revealing that the architecture learns meaningful spatial and importance structure — without any explicit supervision beyond the standard flow-matching loss.
The Read cross-attention layer pulls information from spatial tokens into the latent interface. Visualizing the aggregated Read attention (middle row) alongside the per-patch flow-matching loss (bottom row) reveals a striking pattern: latent tokens autonomously learn to attend most strongly to the spatial regions that contribute most to the loss — object boundaries, fine textures, and semantically complex areas. Easy background regions receive minimal attention, allowing the latent core to focus its compute budget where it matters most.
Top: Input images. Middle: Aggregated Read attention — brighter regions attract more latent tokens. Bottom: Per-patch flow-matching loss — higher-loss regions (object detail, boundaries) align with higher attention, confirming that ELIT concentrates compute on the most informative content.
Thanks to tail-dropping during training, latent tokens form an importance-ordered sequence. We visualize individual token attention maps overlaid on the input. The first (most important) tokens attend broadly to global structure — capturing the background, subject, and key shapes. The last (least important) tokens attend to finer details — information that can be dropped at lower budgets with graceful quality degradation.
Left: Input image. Center: First three latent tokens attend to the most salient regions (subject, global structure). Right: Last three latent tokens attend to residual detail and background. This ordering emerges naturally from tail-dropping training and enables smooth variable-budget inference.
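Tail-dropping itself is a few lines of training-time logic. The sketch below is a hypothetical illustration: the uniform budget distribution and the `min_keep` floor are assumptions, not the paper's exact recipe:

```python
import random

def drop_tail(latents, min_keep=8):
    # Keep only the first k latents, with k resampled every training step.
    # Because any tail can be cut, earlier latents are forced to carry the
    # most important (global) information, yielding an importance ordering.
    k = random.randint(min_keep, len(latents))  # assumed: uniform budgets
    return latents[:k]

random.seed(0)
latents = list(range(64))   # stand-in for 64 latent tokens
kept = drop_tail(latents)   # a prefix of the sequence, length in [8, 64]
```

At inference, the same truncation becomes the compute knob: simply pass a fixed `k` instead of sampling one.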
Beyond improved generation quality, ELIT's latent interface unlocks a range of practical benefits — from faster training to cheaper guidance — all from a single model with no extra training objectives.
By concentrating compute on informative regions, ELIT reaches the same FDD as DiT in roughly 4× fewer training steps at 512px. The adaptive latent interface accelerates learning across both resolutions.
ELIT adds only two lightweight cross-attention layers, a minimal parameter overhead. Yet its improvements grow with model and resolution scale: FDD gains increase with larger models and higher resolutions, suggesting that ELIT is well suited to large-scale, high-resolution generation.
When targeting a specific FLOPs budget, ELIT's token-count knob provides a superior compute–quality trade-off compared to simply reducing the number of sampling steps.
Variable budgets give ELIT a built-in weaker model. Using a low-budget forward pass as the guidance term yields AutoGuidance (AG) for free. Combining this with class dropping produces Cheap CFG (CCFG), which improves quality across all metrics while cutting guidance cost by ~33%.
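The guidance arithmetic can be made concrete with a toy sketch. The exact budget and weighting ELIT uses are not restated here, so treat the values below as placeholders:

```python
def cfg(cond, uncond, w):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one with weight w.
    return uncond + w * (cond - uncond)

# Cheap CFG (sketch): the conditional term comes from a full-budget pass,
# while the guidance term comes from a low-budget, class-dropped pass.
# Because the second pass uses far fewer latents, it costs a fraction of
# the first, which is where the ~33% guidance savings come from.
cond_full = 1.5    # toy scalar: full-budget conditional output
uncond_low = 0.5   # toy scalar: low-budget unconditional output
w = 2.0
guided = cfg(cond_full, uncond_low, w)   # 0.5 + 2.0 * 1.0 = 2.5
```

With class dropping removed, the same low-budget pass alone plays the role of the "weaker model" in AutoGuidance.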
ELIT is fully orthogonal to training-free acceleration. When combined with TeaCache, which caches and reuses network outputs across similar timesteps, ELIT achieves speedups comparable to the baseline's, stacking both sources of efficiency.
We evaluate FID↓, FDD↓, and IS↑ without classifier-free guidance (–G). TF denotes TFLOPs per training iteration. Superscripts show the percentage improvement of ELIT MultiBudget (MB) relative to the baseline.
| Model | 256: FID↓ | 256: FDD↓ | 256: IS↑ | 256: TF | 512: FID↓ | 512: FDD↓ | 512: IS↑ | 512: TF |
|---|---|---|---|---|---|---|---|---|
| DiT-XL | 13.0 | 346.3 | 66.2 | 182 | 18.8 | 339.2 | 53.0 | 806 |
| ↳ ELIT | 8.2 | 200.2 | 93.0 | 188 | 11.1 | 175.6 | 80.0 | 831 |
| ↳ ELIT-MB | 7.8 −40% | 203.7 −41% | 99.0 +50% | 190 | 10.1 −46% | 164.1 −52% | 88.8 +68% | 804 |
| UViT-XL | 8.3 | 220.2 | 84.4 | 196 | 11.6 | 202.7 | 72.5 | 861 |
| ↳ ELIT | 7.5 | 203.8 | 95.2 | 202 | 8.9 | 155.3 | 85.8 | 886 |
| ↳ ELIT-MB | 7.1 −14% | 203.2 −8% | 100.3 +19% | 204 | 7.7 −34% | 135.8 −33% | 98.0 +35% | 858 |
| HDiT-XL | 12.8 | 361.6 | 68.7 | 182 | 13.0 | 260.3 | 69.4 | 776 |
| ↳ ELIT | 9.4 | 272.2 | 89.5 | 188 | 10.1 | 164.1 | 88.8 | 801 |
| ↳ ELIT-MB | 9.4 −27% | 271.8 −25% | 92.3 +34% | 191 | 9.6 −26% | 171.2 −34% | 94.7 +36% | 791 |
All metrics reported without guidance (–G). ELIT delivers consistent improvements across all architectures with gains becoming more pronounced at 512px where pixel redundancy is greater.
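The superscript percentages are simple relative changes against the baseline row; for example, for DiT-XL:

```python
def rel_change(elit_mb, baseline):
    # Percentage change of ELIT-MB relative to the baseline metric.
    return 100 * (elit_mb - baseline) / baseline

print(round(rel_change(7.8, 13.0)))    # FID at 256px: -40
print(round(rel_change(10.1, 18.8)))   # FID at 512px: -46
```

Both values reproduce the −40% and −46% superscripts in the DiT-XL rows above.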
ELIT produces images with better structure, fine details, and class fidelity compared to the DiT baseline at the same compute budget.
ELIT applied on top of Qwen-Image (20B MM-DiT) enables multi-budget inference at scale. ELIT-Qwen-Image cuts sampling FLOPs by up to 63% (~2.7× speedup) while trading speed for quality gracefully: on DPG-Bench, the average score degrades from 90.45 (full budget) to 88.02 (12.5% of tokens).
| Model | Tokens | FLOPs | Entity | Relation | Attribute | Global | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen-Image | 4096 | 1× | 90.51 | 92.21 | 91.03 | 91.70 | 91.27 |
| ELIT 100% | 4096 | 0.69× | 90.30 | 92.18 | 88.97 | 89.18 | 90.45 |
| ELIT 50% | 2048 | 0.49× | 90.15 | 89.94 | 89.05 | 89.06 | 89.81 |
| ELIT 25% | 1024 | 0.41× | 89.31 | 91.87 | 89.71 | 84.79 | 89.79 |
| ELIT 12.5% | 512 | 0.37× | 91.20 | 90.35 | 88.77 | 79.84 | 88.02 |
ELIT-Qwen-Image evaluation on DPG-Bench. CCFG guidance enables ~33% FLOPs savings vs standard CFG at comparable quality.
If you find this work useful in your research, please consider citing:
@inproceedings{hajiali2026elit,
title={One Model, Many Budgets: Elastic Latent Interfaces
for Diffusion Transformers},
author={Moayed Haji-Ali and Willi Menapace and Ivan Skorokhodov
and Dogyun Park and Anil Kag and Michael Vasilkovsky
and Sergey Tulyakov and Vicente Ordonez
and Aliaksandr Siarohin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR)},
year={2026}
}