
ComposeMe:
Attribute-Specific Image Prompts for Controllable Human Image Generation

Guocheng Qian Daniil Ostashev Egor Nemchinov Avihay Assouline Sergey Tulyakov Kuan-Chieh Wang Kfir Aberman
Snap Inc., USA
SIGGRAPH Asia 2025
ComposeMe Teaser

TL;DR

ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control. It uses attribute-specific tokenization and multi-attribute cross-reference training to achieve state-of-the-art personalized generation with fine-grained, disentangled attribute control.

ComposeMe Results

ComposeMe Pipeline

We introduce ComposeMe, which employs attribute-specific tokenization to represent identity, hair, and garment across multiple subjects. Attribute-specific image prompts are encoded separately, and the resulting embeddings are merged and injected into a pre-trained diffusion model.
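The pipeline above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual architecture or API: the encoder, token counts, and embedding width are all assumptions, and the learned attribute encoders are replaced by a toy projection. The point is only the data flow: each attribute image is tokenized by its own encoder, and the per-attribute token sequences are concatenated into one visual prompt that a diffusion model would consume via cross-attention.

```python
import numpy as np

TOKEN_DIM = 64  # assumed embedding width (illustrative)

def encode_attribute(image: np.ndarray, num_tokens: int) -> np.ndarray:
    """Stand-in for a learned attribute-specific encoder.

    A real encoder would be a vision network; here we just repeat pooled
    pixel values into a fixed-size token grid for illustration.
    """
    pooled = image.reshape(-1)[:TOKEN_DIM]
    return np.tile(pooled, (num_tokens, 1))  # (num_tokens, TOKEN_DIM)

def compose_prompt(face: np.ndarray, hair: np.ndarray,
                   garment: np.ndarray) -> np.ndarray:
    """Encode each attribute image separately, then merge the token
    sequences into a single attribute-specific visual prompt."""
    tokens = [
        encode_attribute(face, num_tokens=4),
        encode_attribute(hair, num_tokens=4),
        encode_attribute(garment, num_tokens=8),
    ]
    return np.concatenate(tokens, axis=0)

rng = np.random.default_rng(0)
face, hair, garment = (rng.random((8, 8)) for _ in range(3))  # toy inputs
prompt_tokens = compose_prompt(face, hair, garment)
print(prompt_tokens.shape)  # (16, 64): merged tokens for injection
```

In the actual model these merged tokens would condition a pre-trained diffusion model; keeping each attribute's tokens separate until the merge is what makes the control disentangled.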

ComposeMe Training Strategy

Our method is trained in two stages:

1

Single-Reference Copy-Pasting Training

Conventional single-reference adapter training treats an identity with multiple attributes as a single, indivisible object, cropped directly from the target image. This approach often leads to undesirable copy-paste artifacts when generating new content. ComposeMe uses this conventional training strategy as the first stage to learn the appearance of each attribute.

2

Multi-Attribute Cross-Reference Training

Our multi-attribute cross-reference training is attribute-aware: it decomposes each identity into distinct visual attributes (e.g., face, hairstyle, clothing), sources each attribute from a different input image, and predicts a separate image as the target. This curates pairs of misaligned inputs and targets, enabling the model to generate naturally aligned, coherent outputs even from misaligned attribute inputs at inference time.
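The data-curation step described above can be sketched as follows. This is a minimal, hypothetical sketch (field names and dataset layout are assumptions, not the paper's actual pipeline): for one subject with several photos, each attribute reference is deliberately drawn from an image other than the target, so the supervision pairs are misaligned by construction.

```python
import random

def make_cross_reference_sample(subject_images: list, seed: int = 0) -> dict:
    """Assemble one cross-reference training sample for a single subject.

    subject_images: ids of different photos of the same person. Each
    attribute reference is sourced from an image other than the target,
    so inputs and target are misaligned by construction.
    """
    rng = random.Random(seed)
    target = rng.choice(subject_images)
    # Prefer non-target images as attribute sources; fall back to the
    # full set only if the subject has a single photo.
    others = [im for im in subject_images if im != target] or subject_images
    return {
        "face_ref": rng.choice(others),
        "hair_ref": rng.choice(others),
        "garment_ref": rng.choice(others),
        "target": target,
    }

sample = make_cross_reference_sample(["img_a", "img_b", "img_c"])
```

Because the references and the target come from different photos, the model cannot succeed by copy-pasting; it must recompose the attributes coherently, which is exactly the behavior wanted at inference.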

Experiments

Ablation Study

Ablation study on multi-attribute cross-reference training. Cross-reference training for the face and hair enables effective control over expression and head pose, while cross-reference training for clothing mitigates pose leakage originating from the clothing region. When applied across all parts, multi-attribute cross-reference training allows ComposeMe to achieve high-fidelity generation from misaligned attribute-specific visual prompts.

Ablation Study

Acknowledgements

The authors would like to acknowledge Ke Ma and Huseyin Coskun for infrastructure support; Ruihang Zhang, Or Patashnik, and Daniel Cohen-Or for their feedback on the paper; Maya Goldenberg for help with the gallery video; the anonymous reviewers for their constructive comments; and other members of the Snap Creative Vision team for their valuable feedback and discussions throughout the project.

BibTeX Citation

@article{qian2025composeme,
  title={ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation},
  author={Qian, Guocheng and Ostashev, Daniil and Nemchinov, Egor and Assouline, Avihay and Tulyakov, Sergey and Wang, Kuan-Chieh and Aberman, Kfir},
  journal={arXiv preprint arXiv:2502.xxxxx},
  year={2025}
}