ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes (such as identity, hair, and garment) across multiple subjects, while also supporting text-based control. It uses attribute-specific tokenization and multi-attribute cross-reference training to achieve state-of-the-art personalized generation with fine-grained and disentangled attribute control.
Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation.
A large-scale multi-image-per-identity dataset, containing 6M identities and more than 30M samples
A novel Multi-Attribute Cross-Reference Training strategy, which intentionally curates misalignment across references to enforce disentanglement
A new Attribute-Specific Image Prompts Adapter that allows flexible, attribute-wise conditioning, enabling composition of identity, hairstyle, and garments across multiple subjects
When reference images come from different, misaligned sources, standard personalization tends to produce copy-paste artifacts: expressions, head orientation, and poses leak from the reference images into the generated output rather than being controlled by the user prompt. We tackle this head-on.
We introduce ComposeMe, which employs attribute-specific tokenization to represent identity, hair, and garment across multiple subjects. Attribute-specific image prompts are tokenized separately, and the resulting embeddings are merged and injected into a pre-trained diffusion model.
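The attribute-specific tokenization can be pictured as one small tokenizer per attribute whose token sets are concatenated into a single conditioning sequence. Below is a minimal PyTorch sketch of that idea; the module names, the query-attention pooling, and the encoder output shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttributeTokenizer(nn.Module):
    """Maps reference images of one attribute (face, hair, or garment)
    to a fixed number of conditioning tokens."""
    def __init__(self, image_encoder: nn.Module, enc_dim: int,
                 token_dim: int, num_tokens: int):
        super().__init__()
        self.image_encoder = image_encoder              # e.g. a frozen vision backbone
        self.proj = nn.Linear(enc_dim, token_dim)
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, ref_images: torch.Tensor) -> torch.Tensor:
        # ref_images: (B, N_refs, 3, H, W); we assume the encoder returns
        # a patch-feature sequence of shape (B * N_refs, L, enc_dim).
        b, n = ref_images.shape[:2]
        feats = self.image_encoder(ref_images.flatten(0, 1))   # (B*N, L, enc_dim)
        feats = self.proj(feats)                                # (B*N, L, token_dim)
        feats = feats.reshape(b, -1, feats.shape[-1])           # (B, N*L, token_dim)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        tokens, _ = self.attn(queries, feats, feats)            # (B, num_tokens, token_dim)
        return tokens

def merge_attribute_tokens(tokenizers: dict[str, AttributeTokenizer],
                           refs: dict[str, torch.Tensor]) -> torch.Tensor:
    """Tokenize each attribute separately, then concatenate the token sets so
    they can be injected into the diffusion model's cross-attention layers
    alongside the text tokens."""
    return torch.cat([tokenizers[name](refs[name]) for name in sorted(refs)], dim=1)
```

How tokens from several subjects are grouped and routed inside the adapter is a design detail of the paper that this sketch does not attempt to reproduce.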
Our method is trained in two stages:
Conventional single-reference adapter training treats an identity with multiple attributes as a single, indivisible object, cropped directly from the target image. This approach often leads to undesirable copy-paste artifacts when generating new content. ComposeMe uses this conventional training strategy as the first stage to learn the appearance of each attribute.
Our Multi-Attribute Cross-Reference Training is attribute-aware: it decomposes each identity into distinct visual attributes (e.g., face, hairstyle, clothing), sourcing each from a different input image and predicting a separate image as the target. This approach curates pairs of misaligned inputs and target images, enabling the generation of naturally aligned, coherent outputs even from misaligned attribute inputs at inference.
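The key data-side ingredient is that each attribute reference is deliberately drawn from a different image of the same person than the training target. The following is a minimal sketch of such pair curation; the `extract_crop` helper (e.g. via face detection and human parsing) and the attribute names are hypothetical stand-ins, not the authors' pipeline.

```python
import random

ATTRIBUTES = ("face", "hair", "garment")

def sample_cross_reference_pair(identity_images, extract_crop):
    """identity_images: images of the same person in diverse poses/expressions.
    extract_crop(image, attribute): returns the crop of that attribute.
    Each attribute crop comes from a different, non-target image, so the inputs
    are deliberately misaligned while the target supervises a coherent output."""
    assert len(identity_images) >= len(ATTRIBUTES) + 1, "need several images per identity"
    target, *pool = random.sample(identity_images, len(ATTRIBUTES) + 1)
    refs = {attr: extract_crop(img, attr) for attr, img in zip(ATTRIBUTES, pool)}
    return refs, target
```

By contrast, a stage-1 pair would take every attribute crop directly from the target image itself, which is exactly the setting in which pose and expression tend to leak at inference.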
Ablation Study on Multi-Attribute Cross-Reference Training. Cross-reference training for the face and hair enables effective control over expression and head pose, while cross-reference training for clothing mitigates pose leakage originating from the clothing region. When applied across all parts, multi-attribute cross-reference training allows ComposeMe to achieve high-fidelity generation from misaligned attribute-specific visual prompts.
CVPR 2025
A novel facial representation designed specifically for generative tasks, encoding holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation.
Read Paper →
arXiv 2025
An interactive framework for personalized, multi-subject text-to-image generation that introduces a layered canvas representation and locking mechanism for precise spatial control.
Read Paper →

The authors would like to acknowledge Ke Ma and Huseyin Coskun for infrastructure support; Ruihang Zhang, Or Patashnik and Daniel Cohen-Or for their feedback on the paper; Maya Goldenberg for help with the gallery video; the anonymous reviewers for their constructive comments; and other members of the Snap Creative Vision team for their valuable feedback and discussions throughout the project.
@article{qian2025composeme,
title={ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation},
author={Qian, Guocheng Gordon and Ostashev, Daniil and Nemchinov, Egor and Assouline, Avihay and Tulyakov, Sergey and Wang, Kuan-Chieh Jackson and Aberman, Kfir},
journal={arXiv preprint arXiv:2509.18092},
year={2025}
}