TL;DR: LayerComposer enables Photoshop-like control for multi-human text-to-image generation, allowing users to compose group photos by placing and resizing subjects on a layered canvas with interactive, scalable, and high-fidelity personalization.
Despite their impressive visual fidelity, existing personalized generation methods lack interactive control over spatial composition and scale poorly to multiple humans. To address these limitations, we present LayerComposer, an interactive and scalable framework for multi-human personalized generation. Inspired by professional image-editing software, LayerComposer provides intuitive reference-based human injection, where users can place and resize multiple subjects directly on a layered digital canvas to guide personalized generation. The core of our approach is the layered canvas, a novel representation in which each subject occupies a distinct layer, enabling interactive and occlusion-free composition. We further introduce a transparent latent pruning mechanism that improves scalability by decoupling computation cost from the number of subjects, and a layerwise cross-reference training strategy that mitigates copy-paste artifacts. Extensive experiments demonstrate that LayerComposer achieves superior spatial control, coherent composition, and identity preservation compared to state-of-the-art methods for multi-human personalized image generation.
We propose an interactive personalization paradigm that allows users to intuitively compose multi-human scenes by placing and resizing subjects in their full-body, portrait, or cropped-head forms.
We introduce the layered canvas, a novel representation that resolves occlusion issues and improves scalability through our transparent latent pruning strategy.
We present layerwise cross-reference training, which effectively disentangles redundant visual information (e.g., poses) and mitigates copy-paste artifacts.
We develop LayerComposer, a simple framework that enables clean multi-human reference injection and layout control without auxiliary inputs or modules. Extensive experiments demonstrate that LayerComposer achieves state-of-the-art compositional control and visual fidelity across multi-human personalization benchmarks.
Previous methods offer limited interactivity and scale poorly to multiple subjects. They rely on passive embedding injection, allow only text control, and suffer from linear growth of memory and computational cost as the number of subjects increases.
In contrast, our approach introduces an interactive personalization paradigm that enables intuitive, layer-based control over spatial composition and subject preservation.
The layered canvas is represented by a set of RGBA layers L = {l₁, ⋯, lₙ}. Each RGBA layer lᵢ encodes one subject (a human segment or the background): the RGB channels provide the visual reference, and the alpha channel defines the spatial mask of valid regions. This design ensures that LayerComposer faithfully follows the user's composition: multiple subjects can be placed intuitively, their spatial arrangement specified by simple dragging, and their subject-specific visual attributes preserved while the generated image remains globally coherent.
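The sketch below illustrates one way such a layered canvas could be represented in code: each subject sits on its own otherwise-transparent RGBA layer, and the alpha channel yields the valid-region mask. The class and method names here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a layered canvas, assuming plain NumPy RGBA arrays.
# Class/method names are illustrative, not the authors' implementation.
import numpy as np

class LayeredCanvas:
    """A stack of RGBA layers, one per subject (or background)."""

    def __init__(self, height: int, width: int):
        self.height, self.width = height, width
        self.layers = []  # list of (H, W, 4) uint8 arrays

    def place_subject(self, subject_rgba: np.ndarray, top: int, left: int):
        """Put one subject crop onto its own otherwise-transparent layer.

        Assumes the crop already fits within the canvas bounds.
        """
        layer = np.zeros((self.height, self.width, 4), dtype=np.uint8)
        h, w = subject_rgba.shape[:2]
        layer[top:top + h, left:left + w] = subject_rgba  # RGB reference + alpha mask
        self.layers.append(layer)
        return layer

    def alpha_masks(self):
        """Binary masks of valid (non-transparent) regions, one per layer."""
        return [(layer[..., 3] > 0) for layer in self.layers]
```

Resizing a subject before calling `place_subject` corresponds to the user dragging and scaling that subject on the canvas; the alpha masks are what the pruning step described next operates on.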
LayerComposer builds on a pretrained diffusion transformer and conditions on both the text prompt and the layered canvas. Each layer is encoded with a VAE, and layerwise positional embeddings are added: each token receives an index [jᵢ, x, y], where jᵢ distinguishes layers and (x, y) encodes spatial coordinates. Transparent latent pruning then retains only the latent tokens at valid spatial locations (non-zero alpha) and discards transparent regions. As a result, computation cost scales with the occupied content area rather than the number of subjects, yielding nearly constant memory and enabling scalable multi-human generation.
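A hedged PyTorch sketch of the pruning step is shown below. The tensor shapes and the helper name are assumptions for illustration; the latents would come from the VAE encoder and the alpha masks from the layered canvas, downsampled to the latent resolution.

```python
# Sketch of transparent latent pruning (assumed shapes, not the authors' exact code).
import torch

def prune_transparent_latents(layer_latents, alpha_masks):
    """Keep only latent tokens at spatial locations with non-zero alpha.

    layer_latents: list of (C, h, w) latent grids, one per layer.
    alpha_masks:   list of (h, w) boolean masks at latent resolution.
    Returns concatenated tokens (N, C) and their [layer, x, y] indices (N, 3).
    """
    tokens, positions = [], []
    for layer_idx, (lat, mask) in enumerate(zip(layer_latents, alpha_masks)):
        ys, xs = torch.nonzero(mask, as_tuple=True)   # valid spatial locations
        tokens.append(lat[:, ys, xs].T)               # (N_i, C) tokens kept
        pos = torch.stack([torch.full_like(xs, layer_idx), xs, ys], dim=-1)
        positions.append(pos)                         # layerwise positional index [j, x, y]
    return torch.cat(tokens), torch.cat(positions)
```

Because transparent tokens are dropped before they reach the transformer, the sequence length depends only on how much of the canvas is actually occupied, which is what keeps memory roughly constant as more subjects are added.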
During training, LayerComposer employs a layerwise cross-reference data sampling strategy. We curate a multi-image-per-scene dataset where each group of identities contains multiple images. For each training sample, an image is randomly selected as the generation target, while the layered canvas (humans and background) is sampled from other images within the same scene. Each sampled human is resized and positioned according to the target bounding boxes. This cross-reference strategy introduces intentional mismatches between inputs and targets, encouraging the model to disentangle redundant factors (pose, illumination) across layers and mitigating copy-paste artifacts.
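The following sketch outlines how such cross-reference sampling could look, assuming a hypothetical scene record with per-subject segments and target bounding boxes; the data structure and names are illustrative only, not the authors' pipeline.

```python
# Sketch of layerwise cross-reference sampling over a multi-image-per-scene dataset.
# The record structure ('image', 'background', 'segments', 'boxes') is assumed.
import random

def sample_cross_reference(scene):
    """Sample one training example with intentionally mismatched references.

    scene: list of records for the same group of identities, each a dict with
           'image', 'background', 'segments' (subject_id -> RGBA crop), and
           'boxes' (subject_id -> bounding box of that subject in the image).
    """
    target = random.choice(scene)                          # image to reconstruct
    others = [r for r in scene if r is not target] or [target]

    # Background reference comes from a different image of the same scene.
    background_layer = random.choice(others)["background"]

    # Each subject's reference crop is drawn from another image and paired with
    # the bounding box it should occupy in the target composition.
    subject_layers = []
    for subject_id, target_box in target["boxes"].items():
        source = random.choice(others)
        subject_layers.append((source["segments"][subject_id], target_box))

    return background_layer, subject_layers, target["image"]
```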
To demonstrate the effectiveness of layerwise cross-reference training, we compare it against conventional single-reference training. In conventional training, the layered canvas is constructed by cropping human segments directly from the target image, resulting in pixel-level correspondence that leads to copy-paste artifacts. Our layerwise cross-reference approach samples each layer from different images within the same scene, encouraging the model to disentangle redundant factors like pose and illumination, producing more natural and coherent results.
Without the layered canvas, the model is trained with a single collage image as the conditioning input, shown as "Inputs" in the figure. As seen in the "w/o layered canvas" column, occlusion in the collage discards information; for example, the ball on the left woman's Christmas hat disappears. By contrast, our layered canvas explicitly handles occlusion and prevents such artifacts.
By adjusting the position of each subject within its own layer of the layered canvas, LayerComposer enables intuitive spatial layout control.
CVPR 2025
A novel facial representation designed specifically for generative tasks, encoding holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation.
Read Paper →
SIGGRAPH Asia 2025
A human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
Read Paper →

The authors would like to acknowledge Or Patashnik and Daniel Cohen-Or for their feedback on the paper; Maya Goldenberg for the demo video; the anonymous reviewers for their constructive comments; and other members of the Snap Creative Vision team for their valuable feedback and discussions throughout the project.
@article{qian2025layercomposer,
author = {Guocheng Gordon Qian and Ruihang Zhang and Tsai-Shien Chen and Yusuf Dalva and Anujraaj Goyal and Willi Menapace and Ivan Skorokhodov and Daniil Ostashev and Meng Dong and Arpit Sahni and Ju Hu and Sergey Tulyakov and Kuan-Chieh Jackson Wang},
title = {LayerComposer: Multi-Human Personalized Generation via Layered Canvas},
journal = {arXiv},
year = {2025},
}