TL;DR: We introduce LooseRoPE,
a training-free method for semantic harmonization — seamlessly blending
cropped-and-pasted objects into new scenes while preserving their identity
and achieving visual coherence. No text prompts needed.
Recent diffusion-based image editing methods commonly rely on text or high-level instructions to guide the generation process, offering intuitive but coarse control. In contrast, we focus on explicit, prompt-free editing, where the user directly specifies the modification by cropping and pasting an object or sub-object into a chosen location within an image. This operation affords precise spatial and visual control, yet it introduces a fundamental challenge: preserving the identity of the pasted object while harmonizing it with its new context. We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE, a saliency-guided modulation of Rotary Positional Encoding (RoPE) that loosens the positional constraints to continuously control the attention field of view. By relaxing RoPE in this manner, our method smoothly steers the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object, enabling a balanced trade-off between identity retention and contextual blending. Our approach provides a flexible and intuitive framework for image editing, achieving seamless compositional results without textual descriptions or complex user input.
Sample results demonstrating LooseRoPE's ability to seamlessly blend cropped objects into new scenes. Our method preserves the identity of pasted objects while achieving natural integration with the surrounding context.
Editing models such as Flux Kontext have demonstrated powerful editing capabilities, yet when tasked with our semantic harmonization task (blending a cropped-and-pasted object into a new scene), they often suffer from one of two failure modes: neglect, where the pasted object is left unblended in its new context, and suppression, where the pasted object is removed entirely.
We find that these failures are caused by attention to the input image being either too localized (causing neglect) or too semantic and diffuse (causing suppression).
Failure modes of Flux Kontext. Left: Neglect — the pasted object (traffic cone, swan) remains unblended with visible artifacts. Right: Suppression — the pasted object (turtle pattern on face, parrot) is completely removed by the model.
We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE — a saliency-guided modulation of Rotary Positional Encoding (RoPE) that acts as a continuous controller of the attention field of view.
Our key insight is that effective blending requires an adaptive balance: by "loosening" the positional constraints of RoPE according to a saliency map, we smoothly steer the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object.
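The loosening can be sketched as a per-token rescaling of the RoPE rotation angles. This is an illustrative reading, not the paper's exact implementation: the function names, the NumPy layout, and the modulation schedule `saliency ** alpha` are assumptions. A saliency of 1 recovers standard RoPE, while a saliency of 0 zeroes the rotation angles, removing positional information and giving that token an unrestricted field of view:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles with frequencies base^(-2i/dim)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (seq, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def loose_rope(x, positions, saliency, alpha=1.0):
    """Saliency-modulated RoPE (sketch): low saliency shrinks the
    rotation angles toward zero, widening the attention field of view;
    high saliency keeps full positional locality. The schedule
    `saliency ** alpha` is an illustrative assumption."""
    scale = saliency ** alpha                                 # (seq,) in [0, 1]
    angles = rope_angles(positions, x.shape[-1]) * scale[:, None]
    return apply_rope(x, angles)
```

Because each channel pair still undergoes a pure rotation, the modulation changes only the sensitivity to relative position, never the token magnitudes.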
LooseRoPE overview. We estimate a saliency map for the pasted region, then use it to modulate RoPE during inference. High-saliency queries attend locally to preserve identity, while low-saliency queries attend broadly for seamless blending.
Our method consists of two key components:
📍 Saliency Estimation: We use a pre-trained instance detection network to extract feature activations that highlight semantically meaningful regions (e.g., facial features, object-defining details) while assigning low values to redundant regions.
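One minimal way to turn backbone features into such a map is to reduce the channel dimension by its L2 norm and min-max normalize; the specific detection network and this reduction are assumptions here, not the paper's stated recipe:

```python
import numpy as np

def saliency_from_features(feats, eps=1e-6):
    """Collapse an (H, W, C) feature map into a [0, 1] saliency map.

    `feats` is assumed to come from a pretrained instance-detection
    backbone (hypothetical interface); the channel-wise L2 norm
    highlights semantically active regions, and min-max scaling maps
    the result into [0, 1].
    """
    mag = np.linalg.norm(feats, axis=-1)          # (H, W) activation strength
    mag = mag - mag.min()
    return mag / (mag.max() - mag.min() + eps)    # min-max normalize
```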
🔧 Content-Aware Attention Manipulation: We modulate the attention weights between output queries and input keys according to saliency. This is done through (1) RoPE-based manipulation that controls the effective spatial range of attention, and (2) a crop attention factor that scales attention weights to prevent suppression of salient regions.
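The crop attention factor of component (2) can be sketched as a logit offset: adding log(boost) to the logits of keys inside the pasted crop multiplies their unnormalized attention weight by `boost` before normalization, so salient queries keep attending to the crop instead of suppressing it. The factor `gamma` and the linear boost schedule are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def crop_scaled_attention(q, k, v, crop_mask, saliency, gamma=1.5):
    """Attention with a saliency-weighted crop attention factor (sketch).

    crop_mask: (n_keys,) bool, True for keys inside the pasted region.
    saliency:  (n_queries,) in [0, 1]; salient queries get a stronger boost.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # (n_q, n_k)
    boost = 1.0 + (gamma - 1.0) * saliency[:, None]     # per-query factor >= 1
    # adding log(boost) scales the pre-softmax weight of crop keys by boost
    logits = logits + np.where(crop_mask[None, :], np.log(boost), 0.0)
    return softmax(logits) @ v
```

Setting `gamma=1.0` makes the boost vanish and recovers plain scaled dot-product attention.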
🤖 VLM-Based Parameter Steering: We optionally leverage a vision-language model to automatically detect signs of neglect or suppression early in the diffusion process and adaptively adjust parameters for optimal results.
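The steering loop might look like the following sketch. The VLM probe is abstracted as a callable (a hypothetical interface, since the actual query and decode step are not specified here), and the adjustment rules and factors are illustrative, not the paper's values:

```python
def steer_parameters(detect_failure, init_params, n_steps=50, check_step=10):
    """Adaptive parameter steering during denoising (sketch).

    `detect_failure` stands in for the VLM probe on an intermediate
    decode: it receives the step index and current parameters and
    returns "neglect", "suppression", or None. The adjustment factors
    below are illustrative assumptions.
    """
    params = dict(init_params)
    for step in range(n_steps):
        # ... one denoising step would run here ...
        if step == check_step:
            failure = detect_failure(step, params)
            if failure == "neglect":
                params["alpha"] *= 2.0    # loosen RoPE further: widen attention
            elif failure == "suppression":
                params["gamma"] *= 1.5    # raise the crop attention factor
    return params
```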
@misc{..,
}
We thank Omer Dahary and Jackson Wang for their helpful discussions and feedback.