Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Shortcut-Rerouted Adapter Training for Text-to-Image Personalization

Anujraaj Argo Goyal Guocheng Gordon Qian* Huseyin Coskun Aarush Gupta Himmy Tam Daniil Ostashev Ju Hu Dhritiman Sagar Sergey Tulyakov Kfir Aberman Kuan-Chieh Jackson Wang*
Snap Inc.
*Corresponding authors
NeurIPS 2025
Teaser showing comparison of methods

Shortcut Rerouting re-enables text control of pose and expression after adapter training. Without shortcut rerouting, the adapter overfits to the reference image and reproduces its pose and expression, ignoring the prompt. With Shortcut-Rerouted Adapter Training, the adapter disentangles identity from other factors, allowing the model to respond faithfully to prompt-specified expressions and head poses.

Abstract

Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt.

In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference.

When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should not be learned.

đź’ˇ Key Insight

When training models to focus on what matters, it is often most effective to explicitly route away what does not.

This principle extends beyond personalization: absorbing confounding variation through targeted pathways can inform future approaches to modular, interpretable, and more controllable generative systems.

Method

Motivation

An image, as the adage goes, is worth a thousand words; more precisely, an image encodes an entire constellation of attributes—identity, style, geometry, camera parameters, lighting, and beyond. In most cases, however, we wish to encode only a handful of specific attributes, and a thousand words are simply too many. The reconstruction loss, being agnostic to this distinction, indiscriminately incentivizes the adapter to reproduce all visual factors present in the image. As a consequence, the adapter entangles the target factor with myriad incidental ones (i.e., shortcuts).

Problem illustration

Figure 2: Common adapter training is susceptible to learning undesired shortcuts. The common single-image reconstruction objective used in adapter training inadvertently encourages the adapter to pick up all the attributes in the adapter input (e.g. pose, expression, background, distribution) and leak them into the generation. While some confounding attributes, like background, can be factored out using masking, many others cannot. This makes learning a pure "identity" adapter challenging.

The Solution: Shortcut Rerouting

1. SR Module Pretraining: Pretrain auxiliary modules (LoRA, ControlNet) to handle confounding factors like distribution shift, pose, and expression.

2. Shortcut-Rerouted Adapter Training: Train the adapter with shortcuts provided—confounds are routed through auxiliary modules, so the adapter learns only the target attribute.

3. Adapter Inference without the SR Module: Remove the auxiliary modules at inference—the adapter now injects only the target attribute (e.g., identity), while text prompts control the confounds.
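The three phases above can be pictured with a toy scalar model: treat the training image as the sum of an identity component and a confound, and the adapter as whatever residual the reconstruction objective asks it to explain. Everything below (the scalar values, `base`, `sr_module`) is a hypothetical illustration, not the paper's implementation.

```python
# Toy scalar model of shortcut rerouting. The "image" decomposes into an
# identity component (what the adapter should learn) and a confound
# (pose/expression/style that should be rerouted). Values are illustrative.

identity, confound = 2.0, 5.0
target = identity + confound          # the training image

def base(_):
    return 0.0                        # frozen foundation model (stand-in)

# Phase 1: pretrain the SR module so it already explains the confound.
def sr_module(active):
    return confound if active else 0.0

# Phase 2: the adapter learns whatever residual reconstruction demands.
adapter_with_sr = target - (base(None) + sr_module(True))   # identity only
adapter_no_sr = target - base(None)                         # identity + confound

# Phase 3: remove the SR module at inference.
gen_with_sr = base(None) + adapter_with_sr
gen_no_sr = base(None) + adapter_no_sr

print(gen_with_sr)   # 2.0 -> identity alone; the prompt controls the rest
print(gen_no_sr)     # 7.0 -> the confound leaks into every generation
```

With the shortcut available during training, the residual the adapter must explain no longer contains the confound, so removing the shortcut at inference leaves a clean identity pathway.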

Method diagram

Figure 3: Shortcut-Rerouting Framework. The SR-Module (e.g., ControlNet, LoRA) serves as a generic shortcut adapter that absorbs confounding factors during training. At inference time, the SR module is removed, restoring independent control from the text prompt alone.

Two Key Instantiations

🎨 SR-LoRA

Addresses: Distribution shift between foundation model and finetuning data

How: LoRA absorbs dataset-specific style, lighting, and low-level features

Result: Adapter focuses on identity, not domain characteristics

🎭 SR-ControlNet

Addresses: Pose and expression leakage from input images

How: ControlNet handles pose/expression maps during training

Result: Text prompts can control pose and expression independently
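A minimal sketch of how the two instantiations plug in during training and drop out at inference follows; the class and function names (`SRLoRA`, `SRControlNet`, `build_conditioning`) are illustrative stand-ins, not the released code.

```python
# Sketch of the train-time vs. inference-time conditioning pathways.
# All names here are hypothetical, for illustration only.

class SRLoRA:
    """Absorbs dataset-specific style and low-level distribution shift."""
    def __call__(self):
        return "finetune-set style features"

class SRControlNet:
    """Absorbs pose/expression from an auxiliary condition map."""
    def __call__(self, pose_map):
        return f"pose features from {pose_map}"

def build_conditioning(training, identity_emb, pose_map=None):
    cond = {"identity": identity_emb}   # the adapter's pathway, always active
    if training:                        # shortcuts exist only during training
        cond["style"] = SRLoRA()()
        cond["pose"] = SRControlNet()(pose_map)
    return cond

train_cond = build_conditioning(True, "id-emb", pose_map="keypoint map")
infer_cond = build_conditioning(False, "id-emb")
print(sorted(train_cond))   # ['identity', 'pose', 'style']
print(sorted(infer_cond))   # ['identity']
```

Because the style and pose pathways simply disappear at inference, the text prompt regains sole control over those factors.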

Results

We demonstrate the effectiveness of Shortcut-Rerouted Adapter Training on two challenging tasks: facial identity preservation and full-body personalization.

Face Adapters: Disentangling Identity from Pose & Expression

Face adapter qualitative results

Qualitative comparison of different "face" adapters. Top: close-up portraits with varied expressions. Bottom: full-body generations. Our approach preserves the visual prior more faithfully, enabling expressive and identity-consistent personalized image generation.

âś“

Better Prior Preservation: Image layout, texture, and quality remain aligned with the foundation model's prior

âś“

Text-Guided Expression Control: Prompts like "snarling" are respected, not ignored in favor of input expression

âś“

Pose Independence: Generated images follow prompt-specified poses, not input pose

Body Adapters: Holistic Identity with Pose Control

Body adapter qualitative results

Qualitative comparison of different "body" adapters. Our approach shows substantially stronger identity preservation, closer adherence to the prior, and enhanced image quality.

âś“

Holistic Identity: Captures body type, clothing, and limb proportions—not just face

âś“

Pose Controllability: Enables text-based reposing of subjects

âś“

Prior Consistency: Outputs align with prior image layout and appearance

Modular Combinations: Mix & Match Shortcuts

Ablation results

SR-Training is a versatile framework supporting different combinations of shortcut modules. The LoRA shortcut mitigates quality degradation, the ControlNet shortcut preserves pose priors, and the background shortcut prevents lighting leakage. Combinations like SR-LoRA-CN-BG isolate and inject only the target identity.
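The mix-and-match behavior can be pictured as a small configuration object whose shortcut list is simply emptied at inference; `SRConfig` and the module name strings below are hypothetical, not the paper's training code.

```python
from dataclasses import dataclass, field

# Hypothetical config sketch for composing shortcut modules. The ablation's
# SR-LoRA-CN-BG combination corresponds to all three shortcuts being active.

@dataclass
class SRConfig:
    shortcuts: list = field(default_factory=list)   # active SR modules

    def for_inference(self):
        # Every SR module is removed at inference; only the adapter remains.
        return SRConfig(shortcuts=[])

train_cfg = SRConfig(shortcuts=["sr_lora", "sr_controlnet", "sr_background"])
infer_cfg = train_cfg.for_inference()
print(train_cfg.shortcuts)   # ['sr_lora', 'sr_controlnet', 'sr_background']
print(infer_cfg.shortcuts)   # []
```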

Implementation Details

📊 Dataset

Large-scale internal dataset of millions of high-quality human images, filtered for single subjects and quality

🏗️ Architecture

Built on FLUX.1 [Dev] with DiT backbone and Conditional Flow Matching objective

⚙️ Training

8Ă—A100 GPUs, AdamW optimizer, 250K iterations with batch size 32

🎯 Baselines

Compared against InfU, PuLID, IP-Adapter, and InstantX models

Citation

@inproceedings{shortcut-rerouting2025,
  title={Preventing Shortcuts in Adapter Training via Providing the Shortcuts},
  author={Goyal, Anujraaj Argo and Qian, Guocheng Gordon and Coskun, Huseyin and
          Gupta, Aarush and Tam, Himmy and Ostashev, Daniil and Hu, Ju and
          Sagar, Dhritiman and Tulyakov, Sergey and Aberman, Kfir and
          Wang, Kuan-Chieh Jackson},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}