Nested Attention: Semantic-aware Attention Values for Concept Personalization

Tel-Aviv University · Snap Research

TL;DR: We introduce Nested Attention, a new attention mechanism that produces localized attention values for concept personalization. Our encoder-based method achieves an effective balance between identity preservation and prompt alignment.

Abstract

Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model’s prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.

How does it work?

We replace the value of the personalized token s* with the result of an attention operation between the query of the existing cross-attention layer and the nested keys and values produced by the encoder, resulting in a query-dependent value.


Our personalization method is encoder-based. The input image is passed through an encoder that produces multiple tokens representing it. These tokens are projected to form the keys and values of the nested attention layers. Each nested attention layer outputs a new set of per-query values, Vq[s*], which replace the cross-attention values of the subject token s*. One nested attention layer is added to each cross-attention layer of the model.
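The core operation can be sketched in a few lines. Below is a minimal, single-head NumPy illustration (no batching, no learned projections) of how a per-query value for s* is computed from the encoder's nested keys and values; the function names and shapes are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nested_attention_values(q, k_nested, v_nested):
    """Compute query-dependent values Vq[s*] for the subject token.

    q:        (n_queries, d) queries of the host cross-attention layer,
              one per spatial location in the generated image
    k_nested: (n_tokens, d)  keys projected from the encoder tokens
    v_nested: (n_tokens, d)  values projected from the encoder tokens

    Returns (n_queries, d): one value vector per query, so each image
    region attends to its own mix of subject features.
    """
    d = q.shape[-1]
    attn = softmax(q @ k_nested.T / np.sqrt(d), axis=-1)  # (n_queries, n_tokens)
    return attn @ v_nested                                # (n_queries, d)

# Example: 4 spatial queries, 6 encoder tokens, feature dim 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
vq_star = nested_attention_values(q, k, v)  # shape (4, 8)
```

In the host cross-attention layer, the single value vector normally assigned to s* is then replaced row-by-row with `vq_star`, i.e. each query sees its own value for the subject token while all other prompt tokens keep their standard values.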

What do nested attention layers learn?

Analyzing the query-dependent values Vq[s*] from a nested attention layer: for three queries in the generated image (purple, orange, and blue points), we first show their attention maps within the nested attention layer (graphs), where each point corresponds to a token produced by the encoder. In each graph, one or two tokens dominate the attention. To analyze the information encoded in the most dominant token, we show the Q-Former attention map of its corresponding learned query. These maps reveal the semantic alignment between the probed query and the source of the values assigned to it.
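The probing step above amounts to finding, for each spatial query, which encoder tokens carry the most nested-attention weight. A small NumPy sketch (the function name and `top_k` parameter are illustrative, not from the paper):

```python
import numpy as np

def dominant_encoder_tokens(attn, top_k=2):
    """For each spatial query, return indices of the encoder tokens that
    receive the most nested-attention weight.

    attn: (n_queries, n_tokens) softmax-normalized nested attention map.
    Returns (n_queries, top_k) token indices, strongest first.
    """
    # Sort ascending, reverse to descending, keep the top_k columns.
    return np.argsort(attn, axis=-1)[:, ::-1][:, :top_k]

# Example: 2 queries over 3 encoder tokens.
attn = np.array([[0.1, 0.7, 0.2],
                 [0.6, 0.1, 0.3]])
print(dominant_encoder_tokens(attn, top_k=1))  # [[1], [0]]
```

The indices returned this way select the learned queries whose Q-Former attention maps are visualized, linking each image region back to the subject features it draws from.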

Results

BibTeX


      @misc{patashnik2025nested,
        title={Nested Attention: Semantic-aware Attention Values for Concept Personalization},
        author={Or Patashnik and Rinon Gal and Daniil Ostashev and Sergey Tulyakov and Kfir Aberman and Daniel Cohen-Or},
        year={2025},
        eprint={2501.01407},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }