MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

TL;DR - We introduce a new architecture for personalizing generative models that disentangles the generation of the given subjects from the context generated by the model's prior.

Abstract

We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model's prior by fixing its attention layers in the prior branch, while minimally intervening in the generation process with the personalized branch that learns to embed subjects in the layout and context generated by the prior branch. A novel routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects with compositions and interactions as diverse as those generated by the original model. Crucially, MoA enhances the distinction between the model's pre-existing capability and the newly augmented personalized intervention, thereby offering a more disentangled subject-context control that was previously unattainable.

Drag to traverse the initial random noise, which changes the context consistently across different subject pairs. (GitHub Pages can be slow to load the images; slide a few times to see the smooth transition.)


 

Mixture-of-Attention

Our key observation is that existing personalization methods often need to trade off between "prior preservation" for better prompt consistency and "personalization finetuning" for identity fidelity. We augment the attention layers with an architecture inspired by Mixture-of-Experts (MoE), where a router is introduced to distribute tasks among different experts. In our case, we keep a "prior expert" frozen during finetuning to preserve the prior, and finetune a personalized expert.

Fig2.
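To make the mechanism concrete, here is a minimal PyTorch sketch of an MoA layer under our own naming: it assumes an attention module that takes hidden states and optional conditioning, and keeps the prior branch frozen while a per-pixel router blends the two branches. This is an illustrative sketch, not the released implementation.

```python
# Minimal sketch of a Mixture-of-Attention (MoA) layer.
# Class, parameter, and argument names are hypothetical; the attention module's
# signature (hidden_states, encoder_hidden_states) is an assumption.
import copy
import torch
import torch.nn as nn


class MoALayer(nn.Module):
    def __init__(self, prior_attn: nn.Module, hidden_dim: int):
        super().__init__()
        # Frozen "prior expert": the original pretrained attention layer.
        self.prior_attn = prior_attn
        for p in self.prior_attn.parameters():
            p.requires_grad_(False)

        # Trainable "personalized expert", initialized as a copy of the prior.
        self.personal_attn = copy.deepcopy(prior_attn)
        for p in self.personal_attn.parameters():
            p.requires_grad_(True)

        # Router: predicts a per-pixel soft assignment over the two branches.
        self.router = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, hidden_states, encoder_hidden_states=None):
        # hidden_states: (batch, num_pixels, hidden_dim)
        out_prior = self.prior_attn(hidden_states, encoder_hidden_states)
        out_personal = self.personal_attn(hidden_states, encoder_hidden_states)

        # Soft routing weights per pixel: (batch, num_pixels, 2).
        weights = self.router(hidden_states).softmax(dim=-1)
        w_prior, w_personal = weights[..., :1], weights[..., 1:]

        # Blend the two branches pixel-wise; also return the routing weights
        # so they can be visualized or regularized during finetuning.
        return w_prior * out_prior + w_personal * out_personal, weights
```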


MoA is applied to every attention layer of a pretrained U-Net, and finetuned on a small dataset (FFHQ). In the paper, we also discuss how the router is trained, and propose a regularization term that encourages the personalization branch to minimally affect the overall image.

Fig2.
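One simple way to realize such a regularizer is to penalize the routing weight assigned to the personalized branch, so that pixels default to the frozen prior branch unless the other losses pull them back. The sketch below assumes the router outputs from the layer sketch above; the loss combination in the comment is illustrative, not the paper's exact objective.

```python
# Hypothetical sketch of the router regularization: penalize the routing
# weight assigned to the personalized branch so it intervenes minimally.
import torch


def router_regularization(routing_weights: torch.Tensor) -> torch.Tensor:
    """routing_weights: (batch, num_pixels, 2) softmax outputs from a MoA router.

    Channel 1 is assumed to be the personalized branch. Pushing its weight toward 0
    keeps most pixels on the frozen prior branch; the diffusion/identity losses pull
    subject pixels back to the personalized branch where needed.
    """
    w_personal = routing_weights[..., 1]
    return w_personal.mean()


# During finetuning (illustrative, assumed names):
# loss = diffusion_loss + lambda_reg * sum(
#     router_regularization(w) for w in all_router_weights
# )
```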

When we visualize the router predictions of MoA, we can see that the routers assign the background pixels to the prior branch, and most of the foreground pixels to the personalization branch. This behavior explains why MoA achieves disentangled subject-context control. The detailed behavior of the router differs across layers and timesteps, which allows the personalization branch to focus on different regions within the subject (e.g., the face, the body, and so on) at different timesteps.

Fig3.

As a result, MoA enables personalized generation while leveraging the full generative power of the prior model, and can seamlessly inject different subjects.

Fig4.

Applications

Real-Image Subject Swap

The combination of MoA and DDIM Inversion enables replacing the subject in a real image.


Fig7.
Fig8.
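For intuition, here is a minimal sketch of the swap workflow under common assumptions: the real image is first inverted back to its initial noise with deterministic DDIM inversion, and that noise is then re-denoised with the MoA model so the personalized branch injects the new subject. The `denoiser` callable and helper names are placeholders, not the paper's code; only the DDIM inversion update itself is standard.

```python
# Sketch of DDIM inversion for real-image subject swap (hypothetical helpers).
import torch


@torch.no_grad()
def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step: map x_t to the next (noisier) latent.

    x_t:            current latent
    eps:            noise predicted by the denoiser at timestep t
    alpha_bar_t:    cumulative alpha at timestep t
    alpha_bar_next: cumulative alpha at the next (noisier) timestep
    """
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    return alpha_bar_next.sqrt() * x0_pred + (1 - alpha_bar_next).sqrt() * eps


@torch.no_grad()
def invert_image(latent, denoiser, alphas_cumprod, timesteps):
    """Run DDIM inversion from a clean latent back to the initial noise.

    timesteps should go from low to high noise (e.g., 0 ... T); `denoiser` is an
    assumed callable returning the noise prediction at a given timestep.
    """
    x = latent
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = denoiser(x, t)
        x = ddim_inversion_step(x, eps, alphas_cumprod[t], alphas_cumprod[t_next])
    return x  # initial noise; sampling again with the MoA model swaps the subject
```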


Controllable Generation

The combination of MoA and ControlNet enables personalized generation with pose control.


Fig6.
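A hedged sketch of how such a combination might be wired up with the diffusers ControlNet pipeline is shown below: the ControlNet calls are standard diffusers usage, while the MoA installation step is a hypothetical helper rather than an existing API, and the file path is a placeholder.

```python
# Sketch: ControlNet provides the pose signal while MoA layers inside the U-Net
# handle subject injection. The MoA step below is hypothetical, not a library API.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Hypothetical: swap the U-Net's attention layers for the finetuned MoA layers
# (prior branch frozen, personalized branch conditioned on the subject image).
# install_moa_layers(pipe.unet, moa_checkpoint, subject_image)

pose_image = Image.open("pose.png")  # placeholder: an OpenPose skeleton image
result = pipe("a person reading in a cafe", image=pose_image).images[0]
```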


Subject Swap (Face and Body)

MoA is able to swap subjects with very different body shapes. Notice that in the Yokozuna results (top), he blocks the background completely, while in the images with the DALL-E 3 generated man (bottom), we can see through the gap between his arm and body.


Fig5.


Subject Morph

MoA is able to easily morph between subjects by interpolating the image embeddings.


Fig9.
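As a rough illustration, morphing amounts to blending the two subjects' image embeddings before conditioning the personalized branch. The generation call in the comment is a hypothetical placeholder; only the interpolation itself is shown.

```python
# Minimal sketch of subject morphing by interpolating two subjects' image embeddings
# (assumed to be the embeddings the personalized branch is conditioned on).
import torch


def interpolate_subject_embeddings(
    emb_a: torch.Tensor, emb_b: torch.Tensor, alpha: float
) -> torch.Tensor:
    """Linear interpolation between two subject image embeddings; alpha in [0, 1]."""
    return (1.0 - alpha) * emb_a + alpha * emb_b


# Sweeping alpha from 0 to 1, with the prompt and initial noise fixed, morphs the
# generated subject from A to B while the context stays unchanged (hypothetical call):
# for alpha in torch.linspace(0, 1, 8):
#     emb = interpolate_subject_embeddings(emb_a, emb_b, float(alpha))
#     image = generate_with_moa(prompt, subject_embedding=emb, latents=fixed_noise)
```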