Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

Snap Research
IISc Bangalore
Tel Aviv University

TLDR

We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of continuous control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner.

[Figure: "Transform the dress as if it is made of shiny gold material" — edited results shown as the edit strength varies from 0.0 to 1.0]

Abstract

Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is paired with the edit instruction to enable explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. To train our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength in instruction-driven editing, from subtle to strong, across diverse operations such as stylization and attribute, material, background, and shape changes, without requiring attribute-specific training.

Strength-Controlled Image Editing

[Figures: edited results shown across edit strengths from 0.0 to 1.0 for each instruction]
"Turn on the lamp with bright blue light"
"Reduce the size of the object"
"Make them laugh"

Motivation

We extend an instruction-driven image editing model to accept an additional scalar input s, representing the extent of the edit, along with the instruction. We formulate our approach as a simple supervised training problem: we first generate a synthetic dataset of image-edit-instruction-strength quadruplets using existing generative models, and then fine-tune a state-of-the-art instruction-driven image editing model to obtain fine-grained strength control.

Dataset Construction

Since obtaining a real dataset of image edits paired with strength values is challenging, we generate a synthetic dataset using generative models to obtain this training signal. Our dataset construction pipeline consists of three main steps:
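
Concretely, each training sample can be viewed as a single record pairing a source image, its edited version, the instruction, and a scalar strength. The sketch below illustrates such a quadruplet; the field names are our own and not taken from the paper.

from dataclasses import dataclass
from PIL import Image

@dataclass
class EditSample:
    # One image-edit-instruction-strength quadruplet (hypothetical field names).
    source: Image.Image     # original image
    edited: Image.Image     # result of applying the edit at the given strength
    instruction: str        # natural-language edit instruction
    strength: float         # scalar in [0, 1]; 0 = no change, 1 = full edit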

Step a): Generating image editing pairs

We start with a dataset of diverse source images and generate edit instructions for them using a Vision Language Model (VLM). We then apply these instructions to the source images with Flux Kontext, creating our initial image-instruction pairs.

[Figure: Image Collection & Instruction Generation]
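
A minimal sketch of this first step is shown below, assuming two caller-supplied helpers: one wrapping the VLM that proposes an instruction for an image, and one wrapping the Flux Kontext editor that applies it at full strength. Neither helper name comes from the paper.

def build_edit_pairs(source_images, propose_instruction, apply_edit):
    # propose_instruction(image) -> str              : VLM wrapper (hypothetical)
    # apply_edit(image, instruction) -> edited image : Flux Kontext wrapper (hypothetical)
    pairs = []
    for src in source_images:
        instruction = propose_instruction(src)   # ask the VLM for a plausible edit
        edited = apply_edit(src, instruction)    # apply the full-strength edit
        pairs.append((src, instruction, edited))
    return pairs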

Step b): Generating edits with intermediate strength using image morphing

In the second step, we generate interpolations between each source-edited image pair using a diffusion-based image morphing model, creating smooth transitions across different strength levels.

[Figure: Diffusion-Based Image Morphing]
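
The sketch below shows how each source-edited pair could be expanded into a strength-annotated sequence. The morph(src, edited, s) callable stands in for the diffusion-based morphing model, and the evenly spaced strengths are an illustrative choice rather than the paper's exact sampling scheme.

import numpy as np

def expand_with_strengths(src, edited, instruction, morph, num_steps=5):
    # morph(src, edited, s) -> intermediate frame at strength s (placeholder for the morphing model)
    samples = []
    for s in np.linspace(0.0, 1.0, num_steps):
        if s == 0.0:
            frame = src                    # strength 0: unchanged source
        elif s == 1.0:
            frame = edited                 # strength 1: fully realized edit
        else:
            frame = morph(src, edited, s)  # intermediate edit strength
        samples.append({"source": src, "edited": frame,
                        "instruction": instruction, "strength": float(s)})
    return samples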

Step c): Filtering poor-quality samples

Diffusion-based image morphing can produce inconsistent transitions, artifacts, or identity-preservation failures. To address these issues, we apply extensive filtering to remove poor-quality samples. In the end, we obtain 60K high-quality image editing sequences spanning diverse editing categories.

[Figure: Quality Filtering & Refinement]
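
The page does not spell out the exact filtering criteria, so the checks in the sketch below (edit agreement that grows with strength, and low-strength frames staying close to the source) are illustrative assumptions rather than the paper's actual filters.

def keep_sequence(samples, text_image_score, image_similarity, identity_threshold=0.8):
    # text_image_score(image, instruction) -> float : e.g. a CLIP-style score (assumed)
    # image_similarity(a, b) -> float in [0, 1]     : e.g. a perceptual similarity (assumed)
    scores = [text_image_score(s["edited"], s["instruction"]) for s in samples]
    # Agreement with the instruction should (roughly) increase with strength.
    monotone = all(b >= a - 1e-3 for a, b in zip(scores, scores[1:]))
    # Early frames should still resemble the source image.
    preserved = image_similarity(samples[0]["source"], samples[1]["edited"]) >= identity_threshold
    return monotone and preserved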

Method

Kontinuous Kontext extends a state-of-the-art image editing model, Flux Kontext, to accept an additional scalar input representing edit strength, enabling explicit control over the extent of the edit while maintaining high-quality results. In our experiments, we found that the modulation space of the Flux Kontext model is highly semantic and allows for control over edit strength. We design a lightweight projector network that maps the scalar strength value to adjustments of the modulation parameters of the text tokens. The projector additionally takes the pooled CLIP embedding of the edit instruction as input, making the modulation adaptive to the edit type. To make the model more expressive and learn from our rich dataset, we train a LoRA on all attention layers jointly with the projector network using a standard diffusion denoising loss.

[Figure: Method Architecture]
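
A sketch of how such a projector could look is given below, assuming it maps the scalar strength and the pooled CLIP embedding of the instruction to additive scale and shift deltas for the text-token modulation parameters. The dimensions, layer sizes, and exact injection point are assumptions on our part, and the LoRA on the attention layers (trained jointly with the projector) is omitted.

import torch
import torch.nn as nn

class StrengthProjector(nn.Module):
    # Maps (edit strength, pooled CLIP text embedding) to adjustments of the
    # text-token modulation coefficients. Sizes below are illustrative.
    def __init__(self, clip_dim=768, mod_dim=3072, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 2 * mod_dim),  # scale and shift deltas
        )

    def forward(self, strength, pooled_text):
        # strength: (B, 1) scalar in [0, 1]; pooled_text: (B, clip_dim)
        delta = self.net(torch.cat([pooled_text, strength], dim=-1))
        d_scale, d_shift = delta.chunk(2, dim=-1)
        return d_scale, d_shift

In this sketch, the predicted deltas would be added to the modulation scale and shift applied to the text stream, conditioning the edit on the requested strength.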

Results

[Figures: edited results shown across edit strengths from 0.0 to 1.0 for each instruction]
"Transform his jacket into a blue fluffy fur jacket"
"Grow vegetation on the walls of the buildings on both sides"
"Transform the scene into an autumn season with dense leaves falling and on the ground"
"Transform the panda into a husky dog"
"Reimagine the scene as if it is captured in daytime with heavy sunlight"
"Transform the glasses into aviator sunglasses"
"Transform the scene into a Van Gogh style painting"
"Transform the scene into a winter season with heavy snowfall"