Visual Composer

Object-level Visual Prompts for Compositional Image Generation

Gaurav Parmar 1,2   Or Patashnik 2,3   Kuan-Chieh Wang 2   Daniil Ostashev 2   Srinivasa Narasimhan 1   Jun-Yan Zhu 1,2   Daniel Cohen-Or 2,3   Kfir Aberman 2

1 Carnegie Mellon University    2 Snap Research    3 Tel Aviv University


Introducing VisualComposer

Given a set of visual prompts that consist of both foreground and background elements, such as a boat, a person, and a background river, we train VisualComposer, a diffusion model that generates realistic compositions of these visual prompts. Like text prompts, these visual prompts enable semantically coherent compositions across a variety of styles and scenes without requiring a predefined layout.

Abstract

We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.

Mixing Keys and Values

Image Prompt Adapters capture visual information from images to guide the generation process. The feature extractor's bottleneck size (left sub-figure) determines the level of detail in the extracted Key-Value (KV) features. Using only coarse KVs (left) sacrifices identity preservation, while using only fine-grained KVs (middle) limits scene variation. In contrast, combining mixed-granularity KVs (right) achieves diverse scene compositions without compromising identity preservation.
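This mixing can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch version of a KV-mixed cross-attention layer (not the released implementation): keys are projected from tokens produced by a small-bottleneck (coarse) encoder, while values are projected from tokens produced by a large-bottleneck (fine-grained) encoder, so layout and appearance enter the attention through different paths. All module and tensor names are placeholders.

import torch
import torch.nn.functional as F

def kv_mixed_cross_attention(hidden_states, coarse_tokens, fine_tokens,
                             to_q, to_k, to_v, num_heads):
    # hidden_states: (B, N, C) U-Net features
    # coarse_tokens, fine_tokens: (B, M, C) aligned token sequences (one value per key)
    B, N, C = hidden_states.shape
    head_dim = C // num_heads

    q = to_q(hidden_states)   # queries from the diffusion features
    k = to_k(coarse_tokens)   # keys from the coarse (layout) encoder
    v = to_v(fine_tokens)     # values from the fine-grained (appearance) encoder

    def split(x):             # (B, tokens, C) -> (B, heads, tokens, head_dim)
        return x.view(B, -1, num_heads, head_dim).transpose(1, 2)

    q, k, v = split(q), split(k), split(v)

    # the attention pattern depends only on the coarse keys; appearance
    # information enters the output solely through the values
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(B, N, C)

Because each key selects the value stored at the same index, the coarse and fine token sequences must be aligned one-to-one; in this sketch both encoders are assumed to emit the same number of tokens per prompt.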

How Does It Work?

Our method begins by encoding each input visual prompt through two separate branches: an appearance branch (top row, shown in orange) that uses a Fine-Grained encoder followed by an Appearance adapter to produce per-prompt appearance tokens, and a layout branch (bottom row, shown in blue) that uses a Coarse encoder followed by a Layout adapter to produce per-prompt layout tokens. Once extracted, the appearance and layout tokens are injected into the U-Net through Object-Centric KV-Mixed Cross-Attention layers. The layout tokens serve as keys and determine the spatial influence of each visual prompt in the final image, as visualized by the per-object attention masks. The appearance tokens serve as values, applied only after the attention masks are computed, and therefore influence only the appearance and identity of each object.
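The following hedged sketch illustrates the object-centric variant described above, assuming each visual prompt contributes its own layout tokens (used as keys) and appearance tokens (used as values), and that the per-object attention masks are the attention weights grouped by prompt. Shapes and names are assumptions for illustration, not the released code.

import torch

def object_centric_kv_mixed_attention(q, layout_tokens, appearance_tokens):
    # q:                 (B, heads, N, d) queries from U-Net features
    # layout_tokens:     list of (B, heads, M_i, d), one entry per visual prompt (keys)
    # appearance_tokens: list of (B, heads, M_i, d), one entry per visual prompt (values)
    k = torch.cat(layout_tokens, dim=2)      # layout tokens act as keys
    v = torch.cat(appearance_tokens, dim=2)  # appearance tokens act as values

    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    out = attn @ v                           # appearance tokens shape only the output values

    # per-object attention masks: sum the attention weights over each prompt's key slots
    masks, start = [], 0
    for t in layout_tokens:
        end = start + t.shape[2]
        masks.append(attn[..., start:end].sum(dim=-1))   # (B, heads, N)
        start = end
    return out, masks

Grouping the attention weights by prompt in this way yields the per-object masks visualized above: each mask shows where in the image a given visual prompt exerts spatial influence.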

Results

  •   For each row, the input visual prompts are shown on the left.
  •   The images on the right show the compositional images generated by Visual Composer.

Acknowledgements

This research was performed while Gaurav Parmar was interning at Snap. We would like to thank Sheng-Yu Wang, Nupur Kumari, Maxwell Jones, and Kangle Deng for their fruitful discussions and valuable input, which helped improve this work. We also thank Ruihan Gao and Ava Pun for their feedback on the manuscript.

BibTeX


@misc{parmar2025viscomposer,
      title={Object-level Visual Prompts for Compositional Image Generation},
      author={Parmar, Gaurav and Patashnik, Or and Wang, Kuan-Chieh and Ostashev, Daniil
        and Narasimhan, Srinivasa and Cohen-Or, Daniel and Zhu, Jun-Yan and Aberman, Kfir},
      year={2025},
      eprint={2501.01424},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01424}
}

Website template created by Yuval Alaluf. Please see the original website here.