TL;DR: LayerComposer enables Photoshop-like control for multi-human text-to-image generation, allowing users to compose group photos by placing and resizing subjects on a layered canvas with interactive, scalable, and high-fidelity personalization.
Despite their impressive visual fidelity, existing personalized generation methods lack interactive control over spatial composition and scale poorly to multiple humans. To address these limitations, we present LayerComposer, an interactive and scalable framework for multi-human personalized generation. Inspired by professional image-editing software, LayerComposer provides intuitive reference-based human injection, where users can place and resize multiple subjects directly on a layered digital canvas to guide personalized generation. The core of our approach is the layered canvas, a novel representation in which each subject occupies a distinct layer, enabling interactive and occlusion-free composition. We further introduce a transparent latent pruning mechanism that improves scalability by decoupling computation cost from the number of subjects, and a layerwise cross-reference training strategy that mitigates copy-paste artifacts. Extensive experiments demonstrate that LayerComposer achieves superior spatial control, coherent composition, and identity preservation compared to state-of-the-art methods for multi-human personalized image generation.
We propose an interactive personalization paradigm that allows users to intuitively compose multi-human scenes by placing and resizing subjects in their full-body, portrait, or cropped-head forms.
We introduce the layered canvas, a novel representation that resolves occlusion issues and improves scalability through our transparent latent pruning strategy.
We present layerwise cross-reference training, which effectively disentangles redundant visual information (e.g., poses) and mitigates copy-paste artifacts.
We develop LayerComposer, a simple framework that enables clean multi-human reference injection and layout control without auxiliary inputs or modules. Extensive experiments demonstrate that LayerComposer achieves state-of-the-art compositional control and visual fidelity across multi-human personalization benchmarks.
Previous methods offer limited interactivity and scale poorly to multiple subjects. They rely on passive embedding injection, allow only text control, and suffer from linear growth of memory and computational cost as the number of subjects increases.
In contrast, our approach introduces an interactive personalization paradigm that enables intuitive, layer-based control over spatial composition and subject preservation.
The layered canvas is represented by a set of RGBA layers L = {l₁, ⋯, lₙ}. Each RGBA layer lᵢ encodes one subject (a human segment or the background): the RGB channels provide the visual reference, and the alpha channel defines the spatial mask of valid regions. This design ensures that LayerComposer faithfully follows the user's composition: multiple subjects can be placed intuitively, their spatial arrangement specified by simple dragging, and their subject-specific visual attributes preserved while the generated image remains globally coherent.
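The sketch below illustrates one way such a layered canvas could be represented in code: each subject sits on its own otherwise-transparent RGBA layer, and the alpha channel yields the valid-region mask. The class and method names here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a layered canvas, assuming plain NumPy RGBA arrays.
# Class/method names are illustrative, not the authors' implementation.
import numpy as np

class LayeredCanvas:
    """A stack of RGBA layers, one per subject (or background)."""

    def __init__(self, height: int, width: int):
        self.height, self.width = height, width
        self.layers = []  # list of (H, W, 4) uint8 arrays

    def place_subject(self, subject_rgba: np.ndarray, top: int, left: int):
        """Put one subject crop onto its own otherwise-transparent layer.

        Assumes the crop already fits within the canvas bounds.
        """
        layer = np.zeros((self.height, self.width, 4), dtype=np.uint8)
        h, w = subject_rgba.shape[:2]
        layer[top:top + h, left:left + w] = subject_rgba  # RGB reference + alpha mask
        self.layers.append(layer)
        return layer

    def alpha_masks(self):
        """Binary masks of valid (non-transparent) regions, one per layer."""
        return [(layer[..., 3] > 0) for layer in self.layers]
```

Resizing a subject before calling `place_subject` corresponds to the user dragging and scaling that subject on the canvas; the alpha masks are what the pruning step described next operates on.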
LayerComposer builds on a pretrained diffusion transformer and conditions on both the text prompt and the layered canvas. Each layer is encoded with a VAE, and layerwise positional embeddings are added: each token receives an index [jᵢ, x, y], where jᵢ distinguishes layers and (x, y) encodes spatial coordinates. Transparent latent pruning then retains only the latent tokens at valid spatial locations (non-zero alpha) and discards transparent regions. As a result, computation cost scales with the occupied content area rather than the number of subjects, yielding nearly constant memory and enabling scalable multi-human generation.
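A hedged PyTorch sketch of the pruning step is shown below. The tensor shapes and the helper name are assumptions for illustration; the latents would come from the VAE encoder and the alpha masks from the layered canvas, downsampled to the latent resolution.

```python
# Sketch of transparent latent pruning (assumed shapes, not the authors' exact code).
import torch

def prune_transparent_latents(layer_latents, alpha_masks):
    """Keep only latent tokens at spatial locations with non-zero alpha.

    layer_latents: list of (C, h, w) latent grids, one per layer.
    alpha_masks:   list of (h, w) boolean masks at latent resolution.
    Returns concatenated tokens (N, C) and their [layer, x, y] indices (N, 3).
    """
    tokens, positions = [], []
    for layer_idx, (lat, mask) in enumerate(zip(layer_latents, alpha_masks)):
        ys, xs = torch.nonzero(mask, as_tuple=True)   # valid spatial locations
        tokens.append(lat[:, ys, xs].T)               # (N_i, C) tokens kept
        pos = torch.stack([torch.full_like(xs, layer_idx), xs, ys], dim=-1)
        positions.append(pos)                         # layerwise positional index [j, x, y]
    return torch.cat(tokens), torch.cat(positions)
```

Because transparent tokens are dropped before they reach the transformer, the sequence length depends only on how much of the canvas is actually occupied, which is what keeps memory roughly constant as more subjects are added.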
During training, LayerComposer employs a layerwise cross-reference data sampling strategy. We curate a multi-image-per-scene dataset where each group of identities contains multiple images. For each training sample, an image is randomly selected as the generation target, while the layered canvas (humans and background) is sampled from other images within the same scene. Each sampled human is resized and positioned according to the target bounding boxes. This cross-reference strategy introduces intentional mismatches between inputs and targets, encouraging the model to disentangle redundant factors (pose, illumination) across layers and mitigating copy-paste artifacts.
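The following sketch outlines how such cross-reference sampling could look, assuming a hypothetical scene record with per-subject segments and target bounding boxes; the data structure and names are illustrative only, not the authors' pipeline.

```python
# Sketch of layerwise cross-reference sampling over a multi-image-per-scene dataset.
# The record structure ('image', 'background', 'segments', 'boxes') is assumed.
import random

def sample_cross_reference(scene):
    """Sample one training example with intentionally mismatched references.

    scene: list of records for the same group of identities, each a dict with
           'image', 'background', 'segments' (subject_id -> RGBA crop), and
           'boxes' (subject_id -> bounding box of that subject in the image).
    """
    target = random.choice(scene)                          # image to reconstruct
    others = [r for r in scene if r is not target] or [target]

    # Background reference comes from a different image of the same scene.
    background_layer = random.choice(others)["background"]

    # Each subject's reference crop is drawn from another image and paired with
    # the bounding box it should occupy in the target composition.
    subject_layers = []
    for subject_id, target_box in target["boxes"].items():
        source = random.choice(others)
        subject_layers.append((source["segments"][subject_id], target_box))

    return background_layer, subject_layers, target["image"]
```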
To demonstrate the effectiveness of layerwise cross-reference training, we compare it against conventional single-reference training. In conventional training, the layered canvas is constructed by cropping human segments directly from the target image, resulting in pixel-level correspondence that leads to copy-paste artifacts. Our layerwise cross-reference approach samples each layer from different images within the same scene, encouraging the model to disentangle redundant factors like pose and illumination, producing more natural and coherent results.
Without the layered canvas, the model is trained with a single collage image as the conditioning input, shown as "Inputs" in the figure. As seen in the "w/o layered canvas" column, occlusion in the collage discards information; for example, the ball on the left woman's Christmas hat disappears. By contrast, our layered canvas explicitly handles occlusion and prevents such artifacts.
By adjusting the position of each subject within its own layer of the layered canvas, LayerComposer enables intuitive spatial layout control.
CVPR 2025
A novel facial representation designed specifically for generative tasks, encoding holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation.
Read Paper →
SIGGRAPH Asia 2025
A human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
Read Paper →

The authors would like to acknowledge Or Patashnik and Daniel Cohen-Or for their feedback on the paper; Maya Goldenberg for the demo video; the anonymous reviewers for their constructive comments; and other members of the Snap Creative Vision team for their valuable feedback and discussions throughout the project.
@article{qian2025layercomposer,
author = {Guocheng Gordon Qian and Ruihang Zhang and Tsai-Shien Chen and Yusuf Dalva and Anujraaj Goyal and Willi Menapace and Ivan Skorokhodov and Daniil Ostashev and Meng Dong and Arpit Sahni and Ju Hu and Sergey Tulyakov and Kuan-Chieh Jackson Wang},
title = {LayerComposer: Multi-Human Personalized Generation via Layered Canvas},
journal = {arXiv},
year = {2025},
}