Shortcut-Rerouted Adapter Training for Text-to-Image Personalization
Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt.
In this work, we uncover a simple yet effective solution: during adapter training, provide the very shortcuts we wish to eliminate. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed at inference.
When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should not be learned.
When training models to focus on what matters, it is often most effective to explicitly route away what does not.
This principle extends beyond personalization: absorbing confounding variation through targeted pathways can inform future approaches to modular, interpretable, and more controllable generative systems.
An image, as the adage goes, is worth a thousand words; more precisely, an image encodes an entire constellation of attributes—identity, style, geometry, camera parameters, lighting, and beyond. In most cases, however, we wish to encode only specific attributes, and a thousand words are simply too many. The reconstruction loss, being agnostic to this distinction, indiscriminately incentivizes the adapter to reproduce every visual factor present in the image. As a consequence, the adapter entangles the target factor with myriad incidental ones (i.e., shortcuts).
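To see why, consider a generic single-image reconstruction objective, written here as a denoising loss (a sketch only; the paper's actual training loss is the Conditional Flow Matching objective noted under the experimental setup):

\mathcal{L}_{\text{rec}}(\phi) = \mathbb{E}_{x,\, t,\, \epsilon}\Big[\big\| \epsilon_\theta\big(x_t,\, t,\, c_{\text{text}},\, A_\phi(x)\big) - \epsilon \big\|^2\Big]

Here \epsilon_\theta is the frozen generator and A_\phi is the adapter, whose reference input is the very image x being reconstructed. Any attribute of x that the adapter passes through (pose, lighting, background, ...) lowers the loss, and nothing in the objective distinguishes the target attribute from the confounds.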
Figure 2: Common adapter training is susceptible to learning undesired shortcuts. The standard single-image reconstruction objective inadvertently encourages the adapter to pick up all the attributes of its input (e.g., pose, expression, background, dataset distribution) and leak them into the generation. While some confounding attributes, such as background, can be factored out by masking, many others cannot. This makes learning a pure "identity" adapter challenging.
Pretrain auxiliary modules (LoRA, ControlNet) to handle confounding factors like distribution shift, pose, and expression
Train the adapter with shortcuts provided—confounds are routed through auxiliary modules, so the adapter learns only the target attribute
Remove auxiliary modules at inference—the adapter now injects only the target attribute (e.g., identity), while text prompts control the confounds
Figure 3: Shortcut-Rerouting Framework. The SR-Module (e.g., ControlNet, LoRA) serves as a generic shortcut adapter that absorbs confounding factors during training. At inference time, the SR-Module is removed, restoring independent control to the text prompt alone.
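The three stages above can be written as a toy PyTorch sketch. Everything here is a stand-in (the ToyBackbone and ToyModule classes, the feature width, the additive conditioning); it illustrates the wiring of SR-Training, not the released implementation.

import torch
import torch.nn as nn

D = 64  # toy feature width

class ToyBackbone(nn.Module):
    """Stand-in for the frozen foundation generator (e.g., FLUX.1 [Dev])."""
    def __init__(self):
        super().__init__()
        self.core = nn.Linear(D, D)

    def forward(self, x_t, cond):
        return self.core(x_t + cond)

class ToyModule(nn.Module):
    """Stand-in for an adapter / LoRA / ControlNet branch. (In the real
    system, LoRA would modify the backbone's weights rather than act as
    an additive branch; the additive form keeps this sketch simple.)"""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, D)

    def forward(self, x):
        return self.proj(x)

backbone = ToyBackbone().requires_grad_(False)  # frozen foundation model
adapter = ToyModule()                           # trainable: target attribute only
sr_lora = ToyModule().requires_grad_(False)     # Stage 1: pretrained, then frozen
sr_cnet = ToyModule().requires_grad_(False)     # Stage 1: pretrained, then frozen

opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Stage 2: adapter training with the shortcuts provided. Confounds (domain
# style via sr_lora, pose/expression via sr_cnet) already reach the backbone,
# so reproducing them earns the adapter no reduction in loss.
x_t = torch.randn(8, D)     # noisy latent
text = torch.randn(8, D)    # text-prompt embedding
ref = torch.randn(8, D)     # reference-image features (identity source)
pose = torch.randn(8, D)    # pose/expression map features (a confound)
target = torch.randn(8, D)  # reconstruction / flow-matching target

cond = text + adapter(ref) + sr_lora(x_t) + sr_cnet(pose)
loss = (backbone(x_t, cond) - target).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()

# Stage 3: inference with the shortcut modules removed. The adapter injects
# only the target attribute; pose/expression revert to text-prompt control.
with torch.no_grad():
    x_T = torch.randn(8, D)
    sample = backbone(x_T, text + adapter(ref))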
Addresses: Distribution shift between foundation model and finetuning data
How: LoRA absorbs dataset-specific style, lighting, and low-level features
Result: Adapter focuses on identity, not domain characteristics
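A minimal sketch of the LoRA shortcut as a wrapper around a frozen linear layer (the rank, initialization, and enabled flag are illustrative assumptions, not the paper's configuration):

import torch
import torch.nn as nn

class SRLoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update that soaks up
    dataset-specific style and low-level statistics during SR-Training.
    Unlike ordinary LoRA finetuning, the update is dropped at inference
    rather than merged into the base weights."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # low-rank branch starts as a no-op
        self.enabled = True  # True during training, False at inference

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.up(self.down(x))
        return out

layer = SRLoRALinear(nn.Linear(64, 64))
x = torch.randn(2, 64)
y_train = layer(x)   # shortcut active: absorbs domain characteristics
layer.enabled = False
y_infer = layer(x)   # shortcut removed: base model prior restored

Note the inversion of usual LoRA practice: here the low-rank update exists precisely to be thrown away, taking the dataset's confounds with it.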
Addresses: Pose and expression leakage from input images
How: ControlNet handles pose/expression maps during training
Result: Text prompts can control pose and expression independently
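In the same toy style, the ControlNet shortcut changes only how the conditioning is assembled: the input image's pose/expression map is consumed by the control branch during training and is simply absent at inference (function and variable names below are illustrative):

import torch
import torch.nn as nn

control_branch = nn.Linear(64, 64).requires_grad_(False)  # pretrained, frozen

def training_condition(text_emb, id_tokens, pose_map):
    # Pose/expression enter via the control branch, so the identity adapter
    # has no incentive to smuggle them through its own tokens.
    return text_emb + id_tokens + control_branch(pose_map)

def inference_condition(text_emb, id_tokens):
    # No pose map, no control branch: pose and expression now follow the
    # text prompt (e.g., "snarling", "arms crossed") instead of the input.
    return text_emb + id_tokens

text_emb, id_tokens, pose_map = (torch.randn(1, 64) for _ in range(3))
c_train = training_condition(text_emb, id_tokens, pose_map)
c_infer = inference_condition(text_emb, id_tokens)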
We demonstrate the effectiveness of Shortcut-Rerouted Adapter Training on two challenging tasks: facial identity preservation and full-body personalization.
Qualitative comparison of different "face" adapters. Top: close-up portraits with varied expressions. Bottom: full-body generations. Our approach preserves the visual prior more faithfully, enabling expressive and identity-consistent personalized image generation.
Better Prior Preservation: Image layout, texture, and quality remain aligned with the foundation model's prior
Text-Guided Expression Control: Prompts like "snarling" are respected, not ignored in favor of input expression
Pose Independence: Generated images follow prompt-specified poses, not input pose
Qualitative comparison of different "body" adapters. Our approach shows much stronger identity preservation and much better adherence to the prior with enhanced image quality.
Holistic Identity: Captures body type, clothing, and limb proportions—not just face
Pose Controllability: Enables text-based reposing of subjects
Prior Consistency: Outputs align with prior image layout and appearance
SR-Training is a versatile framework supporting different combinations of shortcut modules. The LoRA shortcut mitigates quality degradation, the ControlNet shortcut preserves pose priors, and the background shortcut prevents lighting leakage. Combinations like SR-LoRA-CN-BG isolate and inject only the target identity.
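As a sketch, these ablated variants amount to toggling which shortcut modules are attached during training; the dictionary below is a hypothetical encoding of that naming scheme, not the paper's configuration format:

# Each variant attaches a different subset of shortcut modules at train time;
# all of them are removed at inference.
SR_VARIANTS = {
    "SR-LoRA":       {"lora": True,  "controlnet": False, "background": False},
    "SR-CN":         {"lora": False, "controlnet": True,  "background": False},
    "SR-LoRA-CN":    {"lora": True,  "controlnet": True,  "background": False},
    "SR-LoRA-CN-BG": {"lora": True,  "controlnet": True,  "background": True},
}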
Large-scale internal dataset of millions of human images, filtered for single subjects and high quality
Built on FLUX.1 [Dev] with a DiT backbone and a Conditional Flow Matching objective (sketched after this list)
8×A100 GPUs, AdamW optimizer, 250K iterations with batch size 32
Compared against InfU, PuLID, IP-Adapter, and InstantX models
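The flow-matching objective referenced above has, in sketch form, the standard rectified-flow formulation (the exact time sampling, weighting, and parameterization follow the FLUX.1 recipe and may differ from this sketch):

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\, x_0 \sim \mathcal{N}(0, I),\, x_1 \sim p_{\text{data}}}\Big[\big\| v_\theta(x_t,\, t,\, c) - (x_1 - x_0) \big\|^2\Big], \qquad x_t = (1 - t)\, x_0 + t\, x_1

where c collects the text, adapter, and (during training only) shortcut-module conditioning.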
@inproceedings{shortcut-rerouting2025,
title={Preventing Shortcuts in Adapter Training via Providing the Shortcuts},
author={Goyal, Anujraaj Argo and Qian, Guocheng Gordon and Coskun, Huseyin and
Gupta, Aarush and Tam, Himmy and Ostashev, Daniil and Hu, Ju and
Sagar, Dhritiman and Tulyakov, Sergey and Aberman, Kfir and
Wang, Kuan-Chieh Jackson},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}