CVPR 2026

One Model, Many Budgets:
Elastic Latent Interfaces for Diffusion Transformers

1 Rice University 2 Snap Inc.
† Work partially done during an internship at Snap Inc.
TL;DR

We find that DiTs waste substantial compute by allocating it uniformly across pixels, despite large variation in regional difficulty. ELIT addresses this by introducing a variable-length set of latent tokens and two lightweight cross-attention layers (Read and Write) that concentrate computation on the most important input regions, delivering up to 53% FID and 58% FDD improvements on ImageNet-1K at 512px. At inference time, the number of latent tokens becomes a user-controlled knob, providing a smooth quality–FLOPs trade-off while enabling ~33% cheaper guidance out of the box.

ELIT Teaser — Flexible compute allocation

Flexible compute allocation with ELIT. Starting from a vanilla DiT, we add a variable-length set of latent tokens — the latent interface — and two lightweight cross-attention layers, Read and Write. At inference, the number of latent tokens is a user-controlled knob that yields a smooth quality–FLOPs trade-off across DiT, U-ViT, HDiT, and MM-DiT backbones.

📄 Abstract

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, preventing principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents, prioritizing important input regions. By training with random dropping of tail latents, ELIT learns importance-ordered representations: earlier latents capture global structure, while later ones refine details. At inference, the number of latents can be adjusted dynamically to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K at 512px, it improves FID and FDD by an average of 35.3% and 39.6%, respectively.

🎯 Key Contributions

Adaptive Computation

Compute is concentrated where it matters — Read cross-attention learns to allocate more latent tokens to challenging image regions rather than spreading compute uniformly.

🚀

Faster
Convergence

Thanks to the non-uniform, adaptive compute allocation across its latent tokens, ELIT consistently converges faster than the baseline.

🎛️

Variable Inference Budget

A single set of weights enables a spectrum of latency-quality trade-offs by simply selecting the number of latent tokens at inference.

🔌

Drop-in Training

Only two lightweight Read/Write layers are added, while keeping everything else unchanged, making ELIT compatible with standard DiT training pipelines. Demonstrated on DiT, U-ViT, HDiT, and MM-DiT.

💡 Motivation

Standard DiTs allocate computation uniformly across all spatial tokens regardless of content difficulty. In a synthetic experiment, we show that DiTs cannot reallocate compute from trivial regions to informative ones — even when given extra tokens. ELIT solves this by learning where to spend compute.

Synthetic experiment: DiT cannot reallocate compute

Uniform compute in DiTs. (Blue) DiT-B/2-Synth has 4× more tokens (zero-padded) but fails to improve, matching DiT-B/2 but underperforming DiT-B/1 at similar compute. (Red) ELIT-DiT-B/2-Synth repurposes compute from zeroed regions to enhance generation, matching DiT-B/1 quality.

🔬 Method

ELIT introduces a minimal change to DiT-like architectures: a latent interface — a variable-length token sequence — coupled with lightweight Read and Write cross-attention layers.

ELIT Architecture Overview

ELIT Architecture

We instantiate a latent interface of K tokens. A lightweight Read cross-attention layer pulls information from spatial tokens into the latents, prioritizing harder regions; this forms a compact latent domain on which most transformer blocks operate. A Write cross-attention layer maps the latent updates back to the spatial grid. Grouped cross-attention partitions spatial tokens into G non-overlapping groups, reducing cost from O(NK) to O(NK/G). Additionally, we randomly drop tail latents during training, making the latent interface importance-ordered. At inference, the number of latents serves as a user-controlled compute knob.
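The grouped cross-attention step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the latents and spatial tokens are split into G matching contiguous groups, with query group g attending only to key/value group g, which reduces the score-matrix cost from O(NK) to O(NK/G).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_cross_attention(queries, keys, values, num_groups):
    """Grouped cross-attention sketch: query group g attends only to
    key/value group g. Contiguous grouping is an assumption made here
    for illustration; the paper's exact partitioning may differ."""
    K, d = queries.shape                      # K query tokens of width d
    N, _ = keys.shape                         # N key/value tokens
    assert K % num_groups == 0 and N % num_groups == 0
    q = queries.reshape(num_groups, K // num_groups, d)
    k = keys.reshape(num_groups, N // num_groups, d)
    v = values.reshape(num_groups, N // num_groups, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (G, K/G, N/G) scores
    out = softmax(scores) @ v                        # attend within each group
    return out.reshape(K, d)
```

With num_groups=1 this reduces to ordinary cross-attention; the same routine can serve both Read (latents query spatial tokens) and Write (spatial tokens query latents) by swapping the roles of queries and keys/values.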

🔍 Looking Inside ELIT

How does the Read layer decide where to focus? And what does each latent token actually encode? Below we visualize ELIT's internals, revealing that the architecture learns meaningful spatial and importance structure — without any explicit supervision beyond the standard flow-matching loss.

Read Attention

Attention Follows Difficulty

The Read cross-attention layer pulls information from spatial tokens into the latent interface. Visualizing the aggregated Read attention (middle row) alongside the per-patch flow-matching loss (bottom row) reveals a striking pattern: latent tokens autonomously learn to attend most strongly to the spatial regions that contribute most to the loss — object boundaries, fine textures, and semantically complex areas. Easy background regions receive minimal attention, allowing the latent core to focus its compute budget where it matters most.

ELIT Read attention maps aligned with per-patch loss

Top: Input images. Middle: Aggregated Read attention — brighter regions attract more latent tokens. Bottom: Per-patch flow-matching loss — higher-loss regions (object detail, boundaries) align with higher attention, confirming that ELIT concentrates compute on the most informative content.

Token Importance

Importance-Ordered Latent Representations

Thanks to tail-dropping during training, latent tokens form an importance-ordered sequence. We visualize individual token attention maps overlaid on the input. The first (most important) tokens attend broadly to global structure — capturing the background, subject, and key shapes. The last (least important) tokens attend to finer details — information that can be dropped at lower budgets with graceful quality degradation.

Per-token attention maps showing importance ordering

Left: Input image. Center: First three latent tokens attend to the most salient regions (subject, global structure). Right: Last three latent tokens attend to residual detail and background. This ordering emerges naturally from tail-dropping training and enables smooth variable-budget inference.
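The tail-dropping recipe behind this ordering is simple to sketch. Below is an illustrative training-time helper; the budget set and the uniform sampling over it are assumptions for the example, not the paper's exact schedule.

```python
import numpy as np

def drop_tail_latents(latents, rng, budgets=(1.0, 0.5, 0.25, 0.125)):
    """Training-time tail dropping (sketch): keep only the first
    ceil(b*K) latents for a randomly sampled budget b. Because any
    surviving prefix must suffice on its own, earlier latents are
    pushed to encode globally important content and later ones
    residual detail, yielding an importance-ordered sequence."""
    K = latents.shape[0]
    b = rng.choice(budgets)            # assumed budget set; schedule may differ
    k = max(1, int(np.ceil(b * K)))    # always keep at least one latent
    return latents[:k]
```

At inference, the same prefix-selection (with a user-chosen budget instead of a random one) is what provides the compute knob.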

🎛️ Elastic Inference Perks

Beyond improved generation quality, ELIT's latent interface unlocks a range of practical benefits — from faster training to cheaper guidance — all from a single model with no extra training objectives.

🚀 Training

Faster Convergence

By concentrating compute on informative regions, ELIT reaches the same FDD as DiT in roughly 4× fewer training steps at 512px. The adaptive latent interface accelerates learning across both resolutions.

ELIT converges ~4× faster than DiT
📈 Scaling

Scalability

ELIT adds only two lightweight cross-attention layers, a minimal parameter overhead. Yet its improvements grow with scale: FDD gains increase with larger models and higher resolutions, suggesting that ELIT is well suited to large-scale, high-resolution generation.

ELIT gains scale with model size
🎛️ Flexibility

Better Compute-Quality Tradeoff

When targeting a specific FLOPs budget, ELIT's token-count knob provides a superior compute–quality trade-off compared to simply reducing the number of sampling steps.

ELIT provides better quality-FLOPs trade-off than step reduction
🎯 Guidance

Free Autoguidance & Cheap CFG

Variable budgets give ELIT a built-in weaker model. Using a low-budget forward pass as the guidance term yields AutoGuidance (AG) for free. Combining this with class dropping produces Cheap CFG (CCFG), which improves quality across all metrics while cutting guidance cost by ~33%.

CCFG improves quality while reducing FLOPs by ~33%
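As a sketch, the CCFG combination might look as follows. Here `model(x, t, cls, budget)` is a hypothetical interface (passing `None` for the class denotes class dropping), and the guidance weight and weak-pass budget are illustrative values, not the paper's settings.

```python
import numpy as np

def cheap_cfg(model, x, t, cls, w=1.5, weak_budget=0.25):
    """Cheap CFG (CCFG) sketch: the guidance pass is both class-dropped
    and run at a reduced latent budget, so it costs a fraction of a full
    forward pass instead of doubling inference cost."""
    v_cond = model(x, t, cls, budget=1.0)           # full-budget conditional pass
    v_weak = model(x, t, None, budget=weak_budget)  # cheap weak/unconditional pass
    return v_weak + w * (v_cond - v_weak)           # standard guidance extrapolation
```

Using only the reduced budget for the weak pass, without class dropping, recovers an AutoGuidance-style setup from the same weights.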
⚡ Acceleration

TeaCache Compatible

ELIT is fully orthogonal to training-free acceleration. When combined with TeaCache — which caches and reuses network outputs across similar timesteps — ELIT attains speedups comparable to the baseline, stacking both sources of efficiency.

TeaCache provides orthogonal speedups on top of ELIT

📊 Results

Quantitative Comparison on ImageNet-1K

We evaluate FID↓, FDD↓, and IS↑ without classifier-free guidance (–G). TF denotes TFLOPs for a single training iteration. Percentages in parentheses show the improvement of ELIT MultiBudget (MB) relative to the baseline.

| Model | 256 FID↓ | 256 FDD↓ | 256 IS↑ | 256 TF | 512 FID↓ | 512 FDD↓ | 512 IS↑ | 512 TF |
|---|---|---|---|---|---|---|---|---|
| DiT-XL | 13.0 | 346.3 | 66.2 | 182 | 18.8 | 339.2 | 53.0 | 806 |
| ↳ ELIT | 8.2 | 200.2 | 93.0 | 188 | 11.1 | 175.6 | 80.0 | 831 |
| ↳ ELIT-MB | 7.8 (−40%) | 203.7 (−41%) | 99.0 (+50%) | 190 | 10.1 (−46%) | 164.1 (−52%) | 88.8 (+68%) | 804 |
| UViT-XL | 8.3 | 220.2 | 84.4 | 196 | 11.6 | 202.7 | 72.5 | 861 |
| ↳ ELIT | 7.5 | 203.8 | 95.2 | 202 | 8.9 | 155.3 | 85.8 | 886 |
| ↳ ELIT-MB | 7.1 (−14%) | 203.2 (−8%) | 100.3 (+19%) | 204 | 7.7 (−34%) | 135.8 (−33%) | 98.0 (+35%) | 858 |
| HDiT-XL | 12.8 | 361.6 | 68.7 | 182 | 13.0 | 260.3 | 69.4 | 776 |
| ↳ ELIT | 9.4 | 272.2 | 89.5 | 188 | 10.1 | 164.1 | 88.8 | 801 |
| ↳ ELIT-MB | 9.4 (−27%) | 271.8 (−25%) | 92.3 (+34%) | 191 | 9.6 (−26%) | 171.2 (−34%) | 94.7 (+36%) | 791 |

All metrics reported without guidance (–G). ELIT delivers consistent improvements across all architectures with gains becoming more pronounced at 512px where pixel redundancy is greater.

🖼️ Qualitative Results

Class-Conditional Image Generation (ImageNet-1K 512px)

ELIT produces images with better structure, fine details, and class fidelity compared to the DiT baseline at the same compute budget.

ELIT vs DiT qualitative comparison on ImageNet-1K 512px

Large-Scale Text-to-Image: ELIT-Qwen-Image (20B MM-DiT)

ELIT applied on top of Qwen-Image (20B MM-DiT) enables multi-budget inference at scale. ELIT-Qwen-Image cuts sampling FLOPs by up to 63% (~2.7× speedup) while gracefully trading off speed for quality. On DPG-Bench, average score gracefully degrades from 90.45 (full budget) to 88.02 (12.5% tokens).

ELIT-Qwen-Image qualitatives at various budgets
| Model | Tokens | FLOPs | Entity | Relation | Attribute | Global | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen-Image | 4096 | 1.00× | 90.51 | 92.21 | 91.03 | 91.70 | 91.27 |
| ELIT 100% | 4096 | 0.69× | 90.30 | 92.18 | 88.97 | 89.18 | 90.45 |
| ELIT 50% | 2048 | 0.49× | 90.15 | 89.94 | 89.05 | 89.06 | 89.81 |
| ELIT 25% | 1024 | 0.41× | 89.31 | 91.87 | 89.71 | 84.79 | 89.79 |
| ELIT 12.5% | 512 | 0.37× | 91.20 | 90.35 | 88.77 | 79.84 | 88.02 |

ELIT-Qwen-Image evaluation on DPG-Bench. CCFG guidance enables ~33% FLOPs savings vs standard CFG at comparable quality.

📝 Citation

If you find this work useful in your research, please consider citing:

@inproceedings{hajiali2026elit,
  title={One Model, Many Budgets: Elastic Latent Interfaces 
    for Diffusion Transformers},
  author={Moayed Haji-Ali and Willi Menapace and Ivan Skorokhodov 
    and Dogyun Park and Anil Kag and Michael Vasilkovsky 
    and Sergey Tulyakov and Vicente Ordonez 
    and Aliaksandr Siarohin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer 
    Vision and Pattern Recognition (CVPR)},
  year={2026}
}