LayerComposer:
Interactive Personalized T2I via Spatially-Aware Layered Canvas

Snap Inc., USA
arXiv 2025

TL;DR: LayerComposer enables Photoshop-like control for multi-subject text-to-image generation, allowing users to compose scenes by placing, resizing, and locking elements in a layered canvas with high fidelity.

Abstract

Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in human-centric personalized image generation.

Contributions:

1. Interactive Personalization Paradigm

We propose an interactive personalization paradigm for T2I generation, empowering users to act as active directors by directly placing, resizing, and locking subjects on a canvas.

2. Layered Canvas Representation

We introduce the layered canvas, a novel layered input representation that addresses the scalability bottleneck through transparent latent pruning and handles occlusion through its layered design.

3. Locking Mechanism

We provide a new locking mechanism such that locked subjects are preserved with only the necessary lighting adjustments, while unlocked subjects are flexibly injected into the scene with variations guided by the text prompt.

4. State-of-the-Art Performance

Through comprehensive evaluations, we demonstrate that LayerComposer achieves superior compositional control and identity fidelity compared to state-of-the-art personalization methods.

Motivation

Previous methods offer limited interactivity and scale poorly to multiple subjects: they rely on passive embedding injection, allow control only through text, and incur memory and computational costs that grow linearly with the number of subjects.

Limitations of Existing Personalization Methods

In contrast, our approach introduces an interactive personalization paradigm that enables intuitive, layer-based control over spatial composition and subject preservation.

Methodology

Layered Canvas

The layered canvas is represented by a set of RGBA layers L = {l₁, ..., lₙ} and binary locking flags B = {b₁, ..., bₙ}. Each RGBA layer lᵢ encodes one subject: the RGB channels provide the visual reference, and the alpha channel defines the spatial mask of valid regions.
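
As a concrete illustration, the following minimal Python sketch shows one way this representation could be expressed in code; the Layer and LayeredCanvas names are hypothetical, not from a released implementation.

from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class Layer:
    """One RGBA layer l_i: RGB is the visual reference, alpha the valid region."""
    rgba: np.ndarray       # (H, W, 4) array with values in [0, 1]
    locked: bool = False   # binary locking flag b_i

    @property
    def alpha_mask(self) -> np.ndarray:
        """Binary spatial mask of valid (non-transparent) pixels."""
        return self.rgba[..., 3] > 0.0

@dataclass
class LayeredCanvas:
    """The set L = {l_1, ..., l_n} with locking flags B = {b_1, ..., b_n}."""
    layers: List[Layer] = field(default_factory=list)

    def add_subject(self, rgba: np.ndarray, locked: bool = False) -> None:
        self.layers.append(Layer(rgba=rgba, locked=locked))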

We propose locking-aware data sampling, which assigns locked layers directly from the target image and unlocked layers from other images of the same scene. This design compels the model to preserve locked content with maximal fidelity while allowing variation in the unlocked layers.
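
A hedged sketch of how such sampling might look in training code follows; the 50% locking probability, the scene dictionary layout, and segment_subject are illustrative assumptions, not the paper's pipeline.

import random

def segment_subject(image, subject_id):
    """Placeholder for a segmentation step returning the RGBA crop of
    `subject_id` from `image`; stubbed here for illustration."""
    raise NotImplementedError

def sample_training_layers(scene):
    """Build the conditioning layers for one training example.

    scene: dict with 'target' (the image the model must reconstruct),
    'alternates' (other images of the same scene), and 'subjects' (ids).
    """
    layers = []
    for subject_id in scene["subjects"]:
        locked = random.random() < 0.5  # randomly lock a subset of layers
        if locked:
            # Locked layer: cut the subject directly out of the target image,
            # so verbatim preservation is the correct training signal.
            source = scene["target"]
        else:
            # Unlocked layer: take the subject from a different image of the
            # same scene, forcing the model to adapt it to the new context.
            source = random.choice(scene["alternates"])
        layers.append((segment_subject(source, subject_id), locked))
    return layers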

Layered Canvas Data Sampling

LayerComposer Pipeline

LayerComposer conditions a diffusion model on both text prompts and the layered canvas. Each layer is encoded with a VAE, and positional embeddings are then added: all locked layers share the index triplet [0, x, y], while each unlocked layer receives a unique triplet [j, x, y]. Transparent latent pruning retains only the non-transparent regions, keeping generation scalable as the number of layers grows.
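
The sketch below illustrates this conditioning path under simplifying assumptions: vae.encode is assumed to return a plain (1, C, h, w) latent tensor (real VAE APIs differ), and the application of the positional embeddings to the tokens is omitted.

import torch
import torch.nn.functional as F

def build_condition_tokens(layers, vae):
    """layers: list of (rgba, locked), rgba a (4, H, W) tensor in [0, 1]."""
    tokens, positions = [], []
    next_unlocked = 1                 # unique index j for unlocked layers
    for rgba, locked in layers:
        rgb, alpha = rgba[:3], rgba[3:]
        lat = vae.encode(rgb.unsqueeze(0))  # assumed: returns (1, C, h, w)
        _, _, h, w = lat.shape
        # Downsample the alpha mask to the latent resolution.
        mask = F.interpolate(alpha.unsqueeze(0), size=(h, w), mode="nearest")[0, 0] > 0
        # Positional triplets [j, x, y]: j = 0 for every locked layer,
        # a distinct j >= 1 for each unlocked layer.
        j = 0 if locked else next_unlocked
        if not locked:
            next_unlocked += 1
        ys, xs = mask.nonzero(as_tuple=True)
        # Transparent latent pruning: keep only tokens where alpha > 0.
        tokens.append(lat[0][:, ys, xs].transpose(0, 1))  # (num_kept, C)
        positions.append(torch.stack([torch.full_like(xs, j), xs, ys], dim=-1))
    return torch.cat(tokens), torch.cat(positions)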

Experiments

Ablation Study

LayerComposer Ablation Study
1. Locking Mechanism

To demonstrate the effect, we progressively lock each input layer. A locked layer preserves the subject's pose, while the model applies only outpainting and subtle lighting changes. Note that this differs from masked inference, where masked regions are not updated at all. Unlocked layers, in contrast, are flexibly adjusted to fit the locked layers and the broader context.
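
To make the contrast concrete, the following sketch juxtaposes the two behaviors; the stand-in model and the latent-pasting step are illustrative, not the paper's implementation.

import torch

def model(z, cond=None):
    """Stand-in denoiser used only to make the contrast runnable."""
    return z - 0.1 * torch.randn_like(z)

def masked_step(z, z_ref, mask):
    # Masked inference (baseline): after each denoising step the masked
    # region is hard-overwritten, so it can never be relit or adjusted.
    z = model(z)
    return z * (1 - mask) + z_ref * mask

def locked_step(z, cond_tokens):
    # Locking (as described above): locked content enters only as condition
    # tokens; nothing overwrites the latents, so outpainting and subtle
    # lighting changes around the locked subject remain possible.
    return model(z, cond=cond_tokens)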

2. Layered Canvas

Without the layered canvas, the model is trained on a single collage image as the conditioning input, shown as "Inputs" in the figure. As seen in the "w/o layered canvas" column, occlusion in the collage causes information loss: for example, the ball on the left woman's Christmas hat disappears. By contrast, our layered canvas explicitly handles occlusion and prevents such artifacts.

Layer Reordering for Spatial Control

By repositioning each subject within its own layer and reordering layers in the layered canvas, LayerComposer enables intuitive control over spatial layout and occlusion.
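
For example, placing a subject can be thought of as pasting its RGBA crop onto an otherwise transparent layer, as in this hypothetical place helper.

import numpy as np

def place(subject: np.ndarray, canvas_hw: tuple, x: int, y: int) -> np.ndarray:
    """Paste an RGBA `subject` crop at (x, y) on a blank, fully transparent
    layer of size `canvas_hw`; moving (x, y) moves the subject in the scene."""
    H, W = canvas_hw
    layer = np.zeros((H, W, 4), dtype=subject.dtype)
    h, w = subject.shape[:2]
    layer[y:y + h, x:x + w] = subject[:H - y, :W - x]
    return layer

# Example: move a 256x256 subject from the left to the right of a 512x512 scene.
subject = np.ones((256, 256, 4), dtype=np.float32)
left = place(subject, (512, 512), x=0, y=128)
right = place(subject, (512, 512), x=256, y=128)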

Acknowledgements

The authors would like to acknowledge Or Patashnik and Daniel Cohen-Or for their feedback on the paper; Maya Goldenberg for the demo video; the anonymous reviewers for their constructive comments; and other members of the Snap Creative Vision team for their valuable feedback and discussions throughout the project.

BibTeX Citation

@article{qian2025layercomposer,
  author    = {Guocheng Gordon Qian and Ruihang Zhang and Tsai-Shien Chen and Yusuf Dalva and Anujraaj Goyal and Willi Menapace and Ivan Skorokhodov and Daniil Ostashev and Meng Dong and Arpit Sahni and Ju Hu and Sergey Tulyakov and Kuan-Chieh Jackson Wang},
  title     = {LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas},
  journal   = {arXiv},
  year      = {2025},
}