We found that DiTs waste substantial compute by allocating it uniformly across pixels, despite large variation in regional difficulty. ELIT addresses this by introducing a variable-length set of latent tokens and two lightweight cross-attention layers (Read & Write) that concentrate computation on the most important input regions, delivering up to 53% FID and 58% FDD improvements on ImageNet-1K at 512px. At inference time, the number of latent tokens becomes a user-controlled knob, providing a smooth quality–FLOPs trade-off while enabling ~33% cheaper guidance out of the box.
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across spatial tokens, wasting compute on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface: a learnable, variable-length token sequence on which standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents, prioritizing important input regions. By training with random dropping of tail latents, ELIT learns importance-ordered representations, with earlier latents capturing global structure and later ones refining details. At inference, the number of latents can be adjusted dynamically to match compute constraints. ELIT is deliberately minimal: it adds two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains; on ImageNet-1K at 512px, it improves FID by 35.3% and FDD by 39.6% on average.
Compute is concentrated where it matters — Read cross-attention learns to allocate more latent tokens to challenging image regions rather than spreading compute uniformly.
Thanks to the adaptive, non-uniform compute allocation across its latent tokens, ELIT consistently converges faster than the baseline.
A single set of weights enables a spectrum of latency-quality trade-offs by simply selecting the number of latent tokens at inference.
Only two lightweight Read/Write layers are added, while keeping everything else unchanged, making ELIT compatible with standard DiT training pipelines. Demonstrated on DiT, U-ViT, HDiT, and MM-DiT.
Standard DiTs allocate computation uniformly across all spatial tokens regardless of content difficulty. In a synthetic experiment, we show that DiTs cannot reallocate compute from trivial regions to informative ones — even when given extra tokens. ELIT solves this by learning where to spend compute.
Uniform compute in DiTs. (Blue) DiT-B/2-Synth has 4× more tokens (zero-padded) but fails to improve, matching DiT-B/2 but underperforming DiT-B/1 at similar compute. (Red) ELIT-DiT-B/2-Synth repurposes compute from zeroed regions to enhance generation, matching DiT-B/1 quality.
ELIT introduces a minimal change to DiT-like architectures: a latent interface — a variable-length token sequence — coupled with lightweight Read and Write cross-attention layers.
We instantiate a latent interface of K tokens. A lightweight Read cross-attention layer pulls information from spatial tokens into the latent interface, prioritizing harder regions; this forms a compact latent domain on which most transformer blocks operate. A Write cross-attention layer maps the latent updates back to the spatial grid. Grouped cross-attention partitions spatial tokens into G non-overlapping groups, reducing cost from O(NK) to O(NK/G). Additionally, we randomly drop tail latents during training, making the latent interface importance-ordered. At inference, the number of latents serves as a user-controlled compute knob.
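The Read/Write interface can be sketched as plain cross-attention. Below is a minimal NumPy illustration under simplifying assumptions (single head, no learned projections, no grouping); all names are illustrative rather than the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product cross-attention with tied keys/values
    # (single head, no learned projections): queries attend to keys_values.
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # (n_q, n_kv)
    return attn @ keys_values                             # (n_q, d)

rng = np.random.default_rng(0)
N, K, d = 1024, 64, 32                  # spatial tokens, latents, channels
spatial = rng.standard_normal((N, d))   # flattened spatial token grid
latents = rng.standard_normal((K, d))   # stand-in for the latent interface

# Read: the K latents pull information from the N spatial tokens.
latents = cross_attention(latents, spatial)       # (K, d)
# ...the transformer core would operate on the K latents here...
# Write: spatial tokens pull the latent updates back onto the grid.
spatial_out = cross_attention(spatial, latents)   # (N, d)
```

In this sketch the quadratic attention matrices are N×K; the grouped variant described above would run Read/Write per spatial group, which is what cuts the cost from O(NK) to O(NK/G).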
How does the Read layer decide where to focus? And what does each latent token actually encode? Below we visualize ELIT's internals, revealing that the architecture learns meaningful spatial and importance structure — without any explicit supervision beyond the standard flow-matching loss.
The Read cross-attention layer pulls information from spatial tokens into the latent interface. Visualizing the aggregated Read attention (middle row) alongside the per-patch flow-matching loss (bottom row) reveals a striking pattern: latent tokens autonomously learn to attend most strongly to the spatial regions that contribute most to the loss — object boundaries, fine textures, and semantically complex areas. Easy background regions receive minimal attention, allowing the latent core to focus its compute budget where it matters most.
Top: Input images. Middle: Aggregated Read attention — brighter regions attract more latent tokens. Bottom: Per-patch flow-matching loss — higher-loss regions (object detail, boundaries) align with higher attention, confirming that ELIT concentrates compute on the most informative content.
Thanks to tail-dropping during training, latent tokens form an importance-ordered sequence. We visualize individual token attention maps overlaid on the input. The first (most important) tokens attend broadly to global structure — capturing the background, subject, and key shapes. The last (least important) tokens attend to finer details — information that can be dropped at lower budgets with graceful quality degradation.
Left: Input image. Center: First three latent tokens attend to the most salient regions (subject, global structure). Right: Last three latent tokens attend to residual detail and background. This ordering emerges naturally from tail-dropping training and enables smooth variable-budget inference.
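Tail-dropping itself is a few lines of training-time logic. The sketch below is a hypothetical illustration: the uniform budget distribution and the `min_keep` floor are assumptions, not the paper's exact recipe:

```python
import random

def drop_tail(latents, min_keep=8):
    # Keep only the first k latents, with k resampled every training step.
    # Because any tail can be cut, earlier latents are forced to carry the
    # most important (global) information, yielding an importance ordering.
    k = random.randint(min_keep, len(latents))  # assumed: uniform budgets
    return latents[:k]

random.seed(0)
latents = list(range(64))   # stand-in for 64 latent tokens
kept = drop_tail(latents)   # a prefix of the sequence, length in [8, 64]
```

At inference, the same truncation becomes the compute knob: simply pass a fixed `k` instead of sampling one.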
Beyond improved generation quality, ELIT's latent interface unlocks a range of practical benefits — from faster training to cheaper guidance — all from a single model with no extra training objectives.
By concentrating compute on informative regions, ELIT reaches the same FDD as DiT in roughly 4× fewer training steps at 512px. The adaptive latent interface accelerates learning across both resolutions.
ELIT adds only two lightweight cross-attention layers, a minimal parameter overhead. Yet its improvements grow with model and resolution scale: FDD gains increase with larger models and higher resolutions, suggesting that ELIT is well suited to large-scale, high-resolution generation.
When targeting a specific FLOPs budget, ELIT's token-count knob provides a superior compute–quality trade-off compared to simply reducing the number of sampling steps.
Variable budgets give ELIT a built-in weaker model. Using a low-budget forward pass as the guidance term yields AutoGuidance (AG) for free. Combining this with class dropping produces Cheap CFG (CCFG), which improves quality across all metrics while cutting guidance cost by ~33%.
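The guidance arithmetic can be made concrete with a toy sketch. The exact budget and weighting ELIT uses are not restated here, so treat the values below as placeholders:

```python
def cfg(cond, uncond, w):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one with weight w.
    return uncond + w * (cond - uncond)

# Cheap CFG (sketch): the conditional term comes from a full-budget pass,
# while the guidance term comes from a low-budget, class-dropped pass.
# Because the second pass uses far fewer latents, it costs a fraction of
# the first, which is where the ~33% guidance savings come from.
cond_full = 1.5    # toy scalar: full-budget conditional output
uncond_low = 0.5   # toy scalar: low-budget unconditional output
w = 2.0
guided = cfg(cond_full, uncond_low, w)   # 0.5 + 2.0 * 1.0 = 2.5
```

With class dropping removed, the same low-budget pass alone plays the role of the "weaker model" in AutoGuidance.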
ELIT is fully orthogonal to training-free acceleration. When combined with TeaCache, which caches and reuses network outputs across similar timesteps, ELIT achieves speedups comparable to the baseline's, stacking both sources of efficiency.
We evaluate FID↓, FDD↓, and IS↑ without classifier-free guidance (–G). TF denotes TFLOPs per training iteration. Superscripts show the percentage improvement of ELIT MultiBudget (MB) relative to the baseline.
| Model | 256: FID↓ | 256: FDD↓ | 256: IS↑ | 256: TF | 512: FID↓ | 512: FDD↓ | 512: IS↑ | 512: TF |
|---|---|---|---|---|---|---|---|---|
| DiT-XL | 13.0 | 346.3 | 66.2 | 182 | 18.8 | 339.2 | 53.0 | 806 |
| ↳ ELIT | 8.2 | 200.2 | 93.0 | 188 | 11.1 | 175.6 | 80.0 | 831 |
| ↳ ELIT-MB | 7.8 −40% | 203.7 −41% | 99.0 +50% | 190 | 10.1 −46% | 164.1 −52% | 88.8 +68% | 804 |
| UViT-XL | 8.3 | 220.2 | 84.4 | 196 | 11.6 | 202.7 | 72.5 | 861 |
| ↳ ELIT | 7.5 | 203.8 | 95.2 | 202 | 8.9 | 155.3 | 85.8 | 886 |
| ↳ ELIT-MB | 7.1 −14% | 203.2 −8% | 100.3 +19% | 204 | 7.7 −34% | 135.8 −33% | 98.0 +35% | 858 |
| HDiT-XL | 12.8 | 361.6 | 68.7 | 182 | 13.0 | 260.3 | 69.4 | 776 |
| ↳ ELIT | 9.4 | 272.2 | 89.5 | 188 | 10.1 | 164.1 | 88.8 | 801 |
| ↳ ELIT-MB | 9.4 −27% | 271.8 −25% | 92.3 +34% | 191 | 9.6 −26% | 171.2 −34% | 94.7 +36% | 791 |
All metrics reported without guidance (–G). ELIT delivers consistent improvements across all architectures with gains becoming more pronounced at 512px where pixel redundancy is greater.
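The superscript percentages are simple relative changes against the baseline row; for example, for DiT-XL:

```python
def rel_change(elit_mb, baseline):
    # Percentage change of ELIT-MB relative to the baseline metric.
    return 100 * (elit_mb - baseline) / baseline

print(round(rel_change(7.8, 13.0)))    # FID at 256px: -40
print(round(rel_change(10.1, 18.8)))   # FID at 512px: -46
```

Both values reproduce the −40% and −46% superscripts in the DiT-XL rows above.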
ELIT produces images with better structure, fine details, and class fidelity compared to the DiT baseline at the same compute budget.
ELIT applied on top of Qwen-Image (20B MM-DiT) enables multi-budget inference at scale. ELIT-Qwen-Image cuts sampling FLOPs by up to 63% (~2.7× speedup) while trading speed for quality gracefully: on DPG-Bench, the average score degrades from 90.45 (full budget) to 88.02 (12.5% of tokens).
| Model | Tokens | FLOPs | Entity | Relation | Attribute | Global | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen-Image | 4096 | 1× | 90.51 | 92.21 | 91.03 | 91.70 | 91.27 |
| ELIT 100% | 4096 | 0.69× | 90.30 | 92.18 | 88.97 | 89.18 | 90.45 |
| ELIT 50% | 2048 | 0.49× | 90.15 | 89.94 | 89.05 | 89.06 | 89.81 |
| ELIT 25% | 1024 | 0.41× | 89.31 | 91.87 | 89.71 | 84.79 | 89.79 |
| ELIT 12.5% | 512 | 0.37× | 91.20 | 90.35 | 88.77 | 79.84 | 88.02 |
ELIT-Qwen-Image evaluation on DPG-Bench. CCFG guidance enables ~33% FLOPs savings vs standard CFG at comparable quality.
If you find this work useful in your research, please consider citing:
@inproceedings{hajiali2026elit,
title={One Model, Many Budgets: Elastic Latent Interfaces
for Diffusion Transformers},
author={Moayed Haji-Ali and Willi Menapace and Ivan Skorokhodov
and Dogyun Park and Anil Kag and Michael Vasilkovsky
and Sergey Tulyakov and Vicente Ordonez
and Aliaksandr Siarohin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR)},
year={2026}
}