Nested Attention: Semantic-aware Attention Values for Concept Personalization

Tel-Aviv University · Snap Research

TL;DR: We introduce Nested Attention, a new attention mechanism that produces localized attention values for concept personalization. Our encoder-based method achieves an effective balance between identity preservation and prompt alignment.

Abstract

Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model’s prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.

How does it work?

We replace the value of the personalized token s* with the result of an attention operation between the query of the existing cross-attention layer and the nested keys and values produced by the encoder, resulting in a query-dependent value.


Our personalization method is encoder-based. The input image is passed through an encoder that produces multiple tokens representing it. These tokens are projected to form the keys and values of the nested attention layers. Each nested attention layer outputs a new set of per-query values, Vq[s*], which replace the cross-attention values of the subject token s*. One nested attention layer is added to each cross-attention layer of the model.
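The core operation can be sketched in a few lines. Below is a minimal, single-head NumPy illustration (no batching, no learned projections) of how a per-query value for s* is computed from the encoder's nested keys and values; the function names and shapes are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nested_attention_values(q, k_nested, v_nested):
    """Compute query-dependent values Vq[s*] for the subject token.

    q:        (n_queries, d) queries of the host cross-attention layer,
              one per spatial location in the generated image
    k_nested: (n_tokens, d)  keys projected from the encoder tokens
    v_nested: (n_tokens, d)  values projected from the encoder tokens

    Returns (n_queries, d): one value vector per query, so each image
    region attends to its own mix of subject features.
    """
    d = q.shape[-1]
    attn = softmax(q @ k_nested.T / np.sqrt(d), axis=-1)  # (n_queries, n_tokens)
    return attn @ v_nested                                # (n_queries, d)

# Example: 4 spatial queries, 6 encoder tokens, feature dim 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
vq_star = nested_attention_values(q, k, v)  # shape (4, 8)
```

In the host cross-attention layer, the single value vector normally assigned to s* is then replaced row-by-row with `vq_star`, i.e. each query sees its own value for the subject token while all other prompt tokens keep their standard values.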

What do nested attention layers learn?

Analyzing the query-dependent values Vq[s*] from a nested attention layer: for three queries in the generated image (purple, orange, and blue points), we first show their attention maps within the nested attention layer (graphs), where each point corresponds to a token produced by the encoder. In each graph, one or two tokens dominate the attention. To analyze the information encoded in the most dominant token, we show the Q-Former attention map of its corresponding learned query. These maps reveal the semantic alignment between the probed query and the source of the values assigned to it.
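The probing step above amounts to finding, for each spatial query, which encoder tokens carry the most nested-attention weight. A small NumPy sketch (the function name and `top_k` parameter are illustrative, not from the paper):

```python
import numpy as np

def dominant_encoder_tokens(attn, top_k=2):
    """For each spatial query, return indices of the encoder tokens that
    receive the most nested-attention weight.

    attn: (n_queries, n_tokens) softmax-normalized nested attention map.
    Returns (n_queries, top_k) token indices, strongest first.
    """
    # Sort ascending, reverse to descending, keep the top_k columns.
    return np.argsort(attn, axis=-1)[:, ::-1][:, :top_k]

# Example: 2 queries over 3 encoder tokens.
attn = np.array([[0.1, 0.7, 0.2],
                 [0.6, 0.1, 0.3]])
print(dominant_encoder_tokens(attn, top_k=1))  # [[1], [0]]
```

The indices returned this way select the learned queries whose Q-Former attention maps are visualized, linking each image region back to the subject features it draws from.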

Results

BibTeX


      @misc{patashnik2025nested,
        title={Nested Attention: Semantic-aware Attention Values for Concept Personalization},
        author={Or Patashnik and Rinon Gal and Daniil Ostashev and Sergey Tulyakov and Kfir Aberman and Daniel Cohen-Or},
        year={2025},
        eprint={2501.01407},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }