Shortcut-Rerouted Adapter Training for Text-to-Image Personalization
Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt.
In this work, we uncover a simple yet effective solution: during adapter training, provide the very shortcuts we wish to eliminate. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed at inference.
When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should not be learned.
When training models to focus on what matters, it is often most effective to explicitly route away what does not.
This principle extends beyond personalization: absorbing confounding variation through targeted pathways can inform future approaches to modular, interpretable, and more controllable generative systems.
An image, as the adage goes, is worth a thousand words; more precisely, an image encodes an entire constellation of attributes—identity, style, geometry, camera parameters, lighting, and beyond. In most cases, however, we wish to encode only specific attributes, and a thousand words are simply too many. The reconstruction loss, being agnostic to this distinction, indiscriminately incentivizes the adapter to reproduce every visual factor present in the image. As a consequence, the adapter entangles the target factor with myriad incidental ones (i.e., shortcuts).
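To see why, consider a generic single-image reconstruction objective, written here as a denoising loss (a sketch only; the paper's actual training loss is the Conditional Flow Matching objective noted under the experimental setup):

\mathcal{L}_{\text{rec}}(\phi) = \mathbb{E}_{x,\, t,\, \epsilon}\Big[\big\| \epsilon_\theta\big(x_t,\, t,\, c_{\text{text}},\, A_\phi(x)\big) - \epsilon \big\|^2\Big]

Here \epsilon_\theta is the frozen generator and A_\phi is the adapter, whose reference input is the very image x being reconstructed. Any attribute of x that the adapter passes through (pose, lighting, background, ...) lowers the loss, and nothing in the objective distinguishes the target attribute from the confounds.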
Figure 2: Common adapter training is susceptible to learning undesired shortcuts. The standard single-image reconstruction objective inadvertently encourages the adapter to pick up all the attributes of its input (e.g., pose, expression, background, dataset distribution) and leak them into the generation. While some confounding attributes, such as background, can be factored out by masking, many others cannot. This makes learning a pure "identity" adapter challenging.
Pretrain auxiliary modules (LoRA, ControlNet) to handle confounding factors like distribution shift, pose, and expression
Train the adapter with shortcuts provided—confounds are routed through auxiliary modules, so the adapter learns only the target attribute
Remove auxiliary modules at inference—the adapter now injects only the target attribute (e.g., identity), while text prompts control the confounds
Figure 3: Shortcut-Rerouting Framework. The SR-Module (e.g., ControlNet, LoRA) serves as a generic shortcut adapter that absorbs confounding factors during training. At inference time, the SR-Module is removed, restoring independent control to the text prompt alone.
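The three stages above can be written as a toy PyTorch sketch. Everything here is a stand-in (the ToyBackbone and ToyModule classes, the feature width, the additive conditioning); it illustrates the wiring of SR-Training, not the released implementation.

import torch
import torch.nn as nn

D = 64  # toy feature width

class ToyBackbone(nn.Module):
    """Stand-in for the frozen foundation generator (e.g., FLUX.1 [Dev])."""
    def __init__(self):
        super().__init__()
        self.core = nn.Linear(D, D)

    def forward(self, x_t, cond):
        return self.core(x_t + cond)

class ToyModule(nn.Module):
    """Stand-in for an adapter / LoRA / ControlNet branch. (In the real
    system, LoRA would modify the backbone's weights rather than act as
    an additive branch; the additive form keeps this sketch simple.)"""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, D)

    def forward(self, x):
        return self.proj(x)

backbone = ToyBackbone().requires_grad_(False)  # frozen foundation model
adapter = ToyModule()                           # trainable: target attribute only
sr_lora = ToyModule().requires_grad_(False)     # Stage 1: pretrained, then frozen
sr_cnet = ToyModule().requires_grad_(False)     # Stage 1: pretrained, then frozen

opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Stage 2: adapter training with the shortcuts provided. Confounds (domain
# style via sr_lora, pose/expression via sr_cnet) already reach the backbone,
# so reproducing them earns the adapter no reduction in loss.
x_t = torch.randn(8, D)     # noisy latent
text = torch.randn(8, D)    # text-prompt embedding
ref = torch.randn(8, D)     # reference-image features (identity source)
pose = torch.randn(8, D)    # pose/expression map features (a confound)
target = torch.randn(8, D)  # reconstruction / flow-matching target

cond = text + adapter(ref) + sr_lora(x_t) + sr_cnet(pose)
loss = (backbone(x_t, cond) - target).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()

# Stage 3: inference with the shortcut modules removed. The adapter injects
# only the target attribute; pose/expression revert to text-prompt control.
with torch.no_grad():
    x_T = torch.randn(8, D)
    sample = backbone(x_T, text + adapter(ref))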
Addresses: Distribution shift between foundation model and finetuning data
How: LoRA absorbs dataset-specific style, lighting, and low-level features
Result: Adapter focuses on identity, not domain characteristics
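A minimal sketch of the LoRA shortcut as a wrapper around a frozen linear layer (the rank, initialization, and enabled flag are illustrative assumptions, not the paper's configuration):

import torch
import torch.nn as nn

class SRLoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update that soaks up
    dataset-specific style and low-level statistics during SR-Training.
    Unlike ordinary LoRA finetuning, the update is dropped at inference
    rather than merged into the base weights."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # low-rank branch starts as a no-op
        self.enabled = True  # True during training, False at inference

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.up(self.down(x))
        return out

layer = SRLoRALinear(nn.Linear(64, 64))
x = torch.randn(2, 64)
y_train = layer(x)   # shortcut active: absorbs domain characteristics
layer.enabled = False
y_infer = layer(x)   # shortcut removed: base model prior restored

Note the inversion of usual LoRA practice: here the low-rank update exists precisely to be thrown away, taking the dataset's confounds with it.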
Addresses: Pose and expression leakage from input images
How: ControlNet handles pose/expression maps during training
Result: Text prompts can control pose and expression independently
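In the same toy style, the ControlNet shortcut changes only how the conditioning is assembled: the input image's pose/expression map is consumed by the control branch during training and is simply absent at inference (function and variable names below are illustrative):

import torch
import torch.nn as nn

control_branch = nn.Linear(64, 64).requires_grad_(False)  # pretrained, frozen

def training_condition(text_emb, id_tokens, pose_map):
    # Pose/expression enter via the control branch, so the identity adapter
    # has no incentive to smuggle them through its own tokens.
    return text_emb + id_tokens + control_branch(pose_map)

def inference_condition(text_emb, id_tokens):
    # No pose map, no control branch: pose and expression now follow the
    # text prompt (e.g., "snarling", "arms crossed") instead of the input.
    return text_emb + id_tokens

text_emb, id_tokens, pose_map = (torch.randn(1, 64) for _ in range(3))
c_train = training_condition(text_emb, id_tokens, pose_map)
c_infer = inference_condition(text_emb, id_tokens)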
We demonstrate the effectiveness of Shortcut-Rerouted Adapter Training on two challenging tasks: facial identity preservation and full-body personalization.
Qualitative comparison of different "face" adapters. Top: close-up portraits with varied expressions. Bottom: full-body generations. Our approach preserves the visual prior more faithfully, enabling expressive and identity-consistent personalized image generation.
Better Prior Preservation: Image layout, texture, and quality remain aligned with the foundation model's prior
Text-Guided Expression Control: Prompts like "snarling" are respected, not ignored in favor of input expression
Pose Independence: Generated images follow prompt-specified poses, not input pose
Qualitative comparison of different "body" adapters. Our approach shows much stronger identity preservation and much better adherence to the prior with enhanced image quality.
Holistic Identity: Captures body type, clothing, and limb proportions—not just face
Pose Controllability: Enables text-based reposing of subjects
Prior Consistency: Outputs align with prior image layout and appearance
SR-Training is a versatile framework supporting different combinations of shortcut modules. The LoRA shortcut mitigates quality degradation, the ControlNet shortcut preserves pose priors, and the background shortcut prevents lighting leakage. Combinations like SR-LoRA-CN-BG isolate and inject only the target identity.
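As a sketch, these ablated variants amount to toggling which shortcut modules are attached during training; the dictionary below is a hypothetical encoding of that naming scheme, not the paper's configuration format:

# Each variant attaches a different subset of shortcut modules at train time;
# all of them are removed at inference.
SR_VARIANTS = {
    "SR-LoRA":       {"lora": True,  "controlnet": False, "background": False},
    "SR-CN":         {"lora": False, "controlnet": True,  "background": False},
    "SR-LoRA-CN":    {"lora": True,  "controlnet": True,  "background": False},
    "SR-LoRA-CN-BG": {"lora": True,  "controlnet": True,  "background": True},
}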
Large-scale internal dataset of millions of human images, filtered for single subjects and high quality
Built on FLUX.1 [Dev] with a DiT backbone and a Conditional Flow Matching objective (sketched after this list)
8×A100 GPUs, AdamW optimizer, 250K iterations with batch size 32
Compared against InfU, PuLID, IP-Adapter, and InstantX models
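The flow-matching objective referenced above has, in sketch form, the standard rectified-flow formulation (the exact time sampling, weighting, and parameterization follow the FLUX.1 recipe and may differ from this sketch):

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\, x_0 \sim \mathcal{N}(0, I),\, x_1 \sim p_{\text{data}}}\Big[\big\| v_\theta(x_t,\, t,\, c) - (x_1 - x_0) \big\|^2\Big], \qquad x_t = (1 - t)\, x_0 + t\, x_1

where c collects the text, adapter, and (during training only) shortcut-module conditioning.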
@inproceedings{shortcut-rerouting2025,
title={Preventing Shortcuts in Adapter Training via Providing the Shortcuts},
author={Goyal, Anujraaj Argo and Qian, Guocheng Gordon and Coskun, Huseyin and
Gupta, Aarush and Tam, Himmy and Ostashev, Daniil and Hu, Ju and
Sagar, Dhritiman and Tulyakov, Sergey and Aberman, Kfir and
Wang, Kuan-Chieh Jackson},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}