LooseRoPE

Content-aware Attention Manipulation for Semantic Harmonization

1Tel Aviv University, 2Snap Research, 3Cornell University


TL;DR: We introduce LooseRoPE, a training-free method for semantic harmonization — seamlessly blending cropped-and-pasted objects into new scenes while preserving their identity and achieving visual coherence. No text prompts needed.


Abstract

Recent diffusion-based image editing methods commonly rely on text or high-level instructions to guide the generation process, offering intuitive but coarse control. In contrast, we focus on explicit, prompt-free editing, where the user directly specifies the modification by cropping and pasting an object or sub-object into a chosen location within an image. This operation affords precise spatial and visual control, yet it introduces a fundamental challenge: preserving the identity of the pasted object while harmonizing it with its new context. We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE, a saliency-guided modulation of rotary positional encoding (RoPE) that loosens the positional constraints to continuously control the attention field of view. By relaxing RoPE in this manner, our method smoothly steers the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object, enabling a balanced trade-off between identity retention and contextual blending. Our approach provides a flexible and intuitive framework for image editing, achieving seamless compositional results without textual descriptions or complex user input.


Sample Results



[Qualitative comparisons: Input Image | Flux Kontext | LooseRoPE (Ours)]


Sample results demonstrating LooseRoPE's ability to seamlessly blend cropped objects into new scenes. Our method preserves the identity of pasted objects while achieving natural integration with the surrounding context.


How does it work?


The Challenge: Neglect vs. Suppression

While editing models such as Flux Kontext demonstrate powerful general-purpose editing capabilities, when applied to semantic harmonization — blending a cropped-and-pasted object into a new scene — they often exhibit one of two failure modes:


  • Neglect: The pasted region is barely modified, leaving visible seams and unnatural boundaries. The model fails to blend the inserted object with its new context.
  • Suppression: The pasted object disappears entirely, as the model's generative prior overrides its appearance and identity.

We find that these failures are caused by attention to the input image being either too localized (causing neglect) or too semantic/diffuse (causing suppression).


Failure Modes

Failure modes of Flux Kontext. Left: Neglect — the pasted object (traffic cone, swan) remains unblended with visible artifacts. Right: Suppression — the pasted object (turtle pattern on face, parrot) is completely removed by the model.



Our Solution: Saliency-Guided RoPE Modulation

We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE — a saliency-guided modulation of Rotary Positional Encoding (RoPE) that acts as a continuous controller of the attention field of view.


Our key insight is that effective blending requires an adaptive balance:

  • 🔥 Semantically important regions (e.g., faces, distinctive features) should attend locally to preserve their identity
  • 💧 Less salient regions (e.g., backgrounds, uniform textures) should attend broadly to achieve visual coherence

By "loosening" the positional constraints of RoPE based on a saliency map, we smoothly steer the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object.


Method Overview

LooseRoPE overview. We estimate a saliency map for the pasted region, then use it to modulate RoPE during inference. High-saliency queries attend locally to preserve identity, while low-saliency queries attend broadly for seamless blending.


Our method consists of two key components:


📍 Saliency Estimation: We use a pre-trained instance detection network to extract feature activations that highlight semantically meaningful regions (e.g., facial features, object-defining details) while assigning low values to redundant regions.
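
To make this concrete, the sketch below derives a saliency map from feature activations of a pre-trained backbone. The paper does not name the detection network, so a torchvision ResNet-50 stands in here as an assumption; the channel-wise activation norm at an intermediate layer serves as the saliency signal. It is a minimal illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights).eval()
# Tap an intermediate layer whose activations highlight object-defining detail.
extractor = create_feature_extractor(backbone, return_nodes={"layer3": "feat"})
preprocess = weights.transforms()

@torch.no_grad()
def saliency_map(crop_rgb, out_hw):
    """crop_rgb: PIL image of the pasted region; out_hw: (H, W) of the token grid."""
    x = preprocess(crop_rgb).unsqueeze(0)             # (1, 3, h, w)
    feat = extractor(x)["feat"]                       # (1, C, h', w')
    sal = feat.norm(dim=1, keepdim=True)              # channel-wise L2 norm per location
    sal = F.interpolate(sal, size=out_hw, mode="bilinear", align_corners=False)
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
    return sal.squeeze()                              # (H, W) in [0, 1]
```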


🔧 Content-Aware Attention Manipulation: We modulate the attention weights between output queries and input keys according to saliency. This is done through (1) RoPE-based manipulation that controls the effective spatial range of attention, and (2) a crop attention factor that scales attention weights to prevent suppression of salient regions.
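
As an illustration of the RoPE manipulation, here is a minimal single-head sketch over flattened 1D token positions (the actual model uses multi-axis RoPE over image coordinates, and the exact modulation in the paper may differ). The relative rotation between a query and a key is shrunk by a per-query factor derived from saliency, so high-saliency queries keep a tight positional focus while low-saliency queries become nearly position-blind; `lam_min` and `crop_scale` are illustrative knobs, not values from the paper.

```python
import torch

def rope_freqs(dim, base=10000.0):
    """Standard RoPE inverse frequencies for a head of size `dim`."""
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def loosened_attention(q, k, v, positions, saliency,
                       crop_keys=None, lam_min=0.2, crop_scale=0.0):
    """
    q, k, v: (N, dim) single-head tokens; positions: (N,); saliency: (N,) in [0, 1].
    RoPE scores depend on the relative rotation (pos_q - pos_k) * freqs; we shrink
    that rotation per query, so saliency -> 1 keeps attention local while
    saliency -> 0 loosens it toward a position-blind (broad) field of view.
    """
    N, dim = q.shape
    freqs = rope_freqs(dim)                                        # (dim/2,)
    qc = torch.view_as_complex(q.float().reshape(N, -1, 2))        # (N, dim/2)
    kc = torch.view_as_complex(k.float().reshape(N, -1, 2))
    lam = lam_min + (1.0 - lam_min) * saliency                     # per-query loosening factor
    rel = positions.float()[:, None] - positions.float()[None, :]  # (N, N) relative positions
    angle = lam[:, None, None] * rel[..., None] * freqs            # (N, N, dim/2)
    rot = torch.polar(torch.ones_like(angle), angle)               # e^{i * angle}
    scores = (qc[:, None, :] * kc[None, :, :].conj() * rot).real.sum(-1) / dim**0.5
    if crop_keys is not None:
        # Crop attention factor (illustrative): a positive bias on the logits
        # multiplicatively scales the unnormalized attention weights toward the
        # pasted-crop keys, keeping salient content from being suppressed.
        scores = scores + crop_scale * saliency[:, None] * crop_keys.float()[None, :]
    return scores.softmax(dim=-1) @ v
```

This pairwise formulation is written for readability; a practical version would fold the per-query loosening into the rotation applied to queries and keys rather than materializing an (N, N, dim/2) tensor.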


🤖 VLM-Based Parameter Steering: We optionally leverage a vision-language model to automatically detect signs of neglect or suppression early in the diffusion process and adaptively adjust parameters for optimal results.
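
The feedback loop itself is not spelled out on this page, so the following is a hedged sketch: `run_edit`, `decode_preview`, and `vlm_classify` are hypothetical stand-ins for a partial denoising run, a latent-to-image preview, and a VLM that labels the preview as neglect, suppression, or acceptable. The update rules are illustrative only.

```python
def steer_parameters(run_edit, decode_preview, vlm_classify,
                     lam_min=0.2, crop_scale=1.0, preview_step=8, max_rounds=3):
    """Hypothetical VLM feedback loop: inspect an early-step preview, classify the
    failure mode, and nudge the LooseRoPE parameters before the full run."""
    for _ in range(max_rounds):
        latents = run_edit(lam_min=lam_min, crop_scale=crop_scale,
                           stop_at_step=preview_step)
        verdict = vlm_classify(decode_preview(latents))
        if verdict == "neglect":          # pasted region barely changed: loosen further
            lam_min = max(0.0, lam_min - 0.1)
        elif verdict == "suppression":    # pasted object vanishing: tighten focus, boost crop
            lam_min = min(1.0, lam_min + 0.1)
            crop_scale *= 1.5
        else:                             # looks acceptable: stop adjusting
            break
    return run_edit(lam_min=lam_min, crop_scale=crop_scale)
```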


BibTeX

@misc{..,
}

Acknowledgements

We thank Omer Dahary and Jackson Wang for their helpful discussions and feedback.