ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control. It uses attribute-specific tokenization and multi-attribute cross-reference training to enable state-of-the-art personalized generation with fine-grained and disentangled attribute control.
We introduce ComposeMe, which employs attribute-specific tokenization to represent identity, hair, and garment across multiple subjects. Attribute-specific image prompts are encoded separately, and the resulting embeddings are merged and injected into a pre-trained diffusion model.
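The encode-then-merge flow above can be sketched in a few lines. This is an illustrative toy, not the released implementation: the function names (`encode_attribute`, `merge_attribute_tokens`), the token count, and the embedding width are all hypothetical, and the random projection stands in for a learned per-attribute encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

TOKENS_PER_ATTR = 4   # hypothetical number of tokens per attribute
EMBED_DIM = 8         # hypothetical embedding width

def encode_attribute(image: np.ndarray) -> np.ndarray:
    """Stand-in for a per-attribute image encoder: maps an image crop
    to a fixed-length sequence of token embeddings."""
    flat = image.reshape(-1)
    # Random projection as a placeholder; the real encoder is a learned network.
    proj = rng.standard_normal((flat.size, TOKENS_PER_ATTR * EMBED_DIM))
    return (flat @ proj).reshape(TOKENS_PER_ATTR, EMBED_DIM)

def merge_attribute_tokens(attribute_images: dict) -> np.ndarray:
    """Encode each attribute-specific image prompt separately, then
    concatenate the token sequences for injection into the diffusion model."""
    tokens = [encode_attribute(img) for img in attribute_images.values()]
    return np.concatenate(tokens, axis=0)

# One subject described by three attribute-specific crops.
subject = {
    "face": rng.standard_normal((16, 16, 3)),
    "hair": rng.standard_normal((16, 16, 3)),
    "garment": rng.standard_normal((16, 16, 3)),
}
merged = merge_attribute_tokens(subject)
print(merged.shape)  # (12, 8): 3 attributes x 4 tokens, each 8-dim
```

In the actual model the merged token sequence would condition a pre-trained diffusion backbone, e.g. via cross-attention; that injection step is omitted here.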
Our method uses a two-stage training procedure:
Conventional single-reference adapter training treats an identity with multiple attributes as a single, indivisible object, cropped directly from the target image. This approach often leads to undesirable copy-paste artifacts when generating new content. ComposeMe uses this conventional training strategy as the first stage to learn the appearance of each attribute.
Our Multi-Attribute Cross-Reference Training is attribute-aware: it decomposes each identity into distinct visual attributes (e.g., face, hairstyle, clothing), sourcing each from a different input image and predicting a separate image as the target. This approach curates pairs of misaligned inputs and targets, enabling the generation of naturally aligned, coherent outputs even from misaligned attribute inputs at inference.
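The pair-curation step above can be sketched as follows. This is a minimal illustration of the sampling logic only, assuming several images of the same subject are available; the function name `build_cross_reference_pair` and the image identifiers are hypothetical.

```python
import random

def build_cross_reference_pair(image_ids,
                               attributes=("face", "hair", "garment"),
                               seed=None):
    """Sketch of cross-reference pair curation: for one subject with
    several images, sample a target image and, for each attribute, an
    input image that differs from the target (when more than one image
    exists), so the attribute inputs are misaligned with the target
    the model must learn to predict."""
    rng = random.Random(seed)
    target = rng.choice(image_ids)
    inputs = {}
    for attr in attributes:
        candidates = [i for i in image_ids if i != target] or image_ids
        inputs[attr] = rng.choice(candidates)
    return inputs, target

# Hypothetical subject with three photos.
inputs, target = build_cross_reference_pair(["img_a", "img_b", "img_c"], seed=0)
print(inputs, target)
```

Training on such deliberately misaligned (inputs, target) tuples is what discourages the copy-paste behavior of conventional single-reference training.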
Ablation Study on Multi-Attribute Cross-Reference Training. Cross-reference training for the face and hair enables effective control over expression and head pose, while cross-reference training for clothing mitigates pose leakage originating from clothing region. When applied across all parts, multi-attribute cross-reference training allows ComposeMe to achieve high-fidelity generation from misaligned attribute-specific visual prompts.
The authors would like to acknowledge Ke Ma and Huseyin Coskun for infrastructure support; Ruihang Zhang, Or Patashnik, and Daniel Cohen-Or for their feedback on the paper; Maya Goldenberg for help with the gallery video; the anonymous reviewers for their constructive comments; and other members of the Snap Creative Vision team for their valuable feedback and discussions throughout the project.
@article{qian2025composeme,
  title={ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation},
  author={Qian, Guocheng and Ostashev, Daniil and Nemchinov, Egor and Assouline, Avihay and Tulyakov, Sergey and Wang, Kuan-Chieh and Aberman, Kfir},
  journal={arXiv preprint arXiv:2502.xxxxx},
  year={2025}
}