EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai1,2*, Chaoyang Wang2, Guocheng Gordon Qian2, Willi Menapace2, Sergey Tulyakov2, Bernard Ghanem1, Peter Wonka2,1, Ashkan Mirzaei2*†

1 KAUST, 2 Snap Inc.

* Core Contributors, † Project Lead

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video + text, video + mask + text, video + mask + reference + text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems.

State-of-the-Art Video Editing
State-of-the-Art Image Editing
Controllable Video Editing
Scalable Data Generation Engine

Spotlight video showcasing EasyV2V's diverse video editing capabilities across multiple edit types.

(a) Overview of our video editing architecture. The frozen video VAE encodes control signals. Mask tokens are added to the input video tokens and concatenated with the noisy latent tokens and optional reference image. The DiT is trained using LoRA. (b) Input formats used for different dataset types during training. When object masks are available, the spatiotemporal mask input equals the input mask. Otherwise, the mask indicates transitions from the unedited to the edited video.


Image Editing

Input and output videos for the instruction "Remove the yacht in the image."

Comparisons

Side-by-side comparisons against Lucy Edit, Señorita-2M, and InsViE. Each comparison shows the input video, the baseline's result, and our result for the instruction "Alter the light from the window to emit a soft golden hue, simulating late afternoon sunlight."

Quantitative Results

VLM Evaluation

Quantitative comparison of our method against baselines.

User Study

Win/Tie/Loss ratios for our method against each baseline across three criteria: Instruction Alignment (how well edits follow the text prompt), Preservation of Unedited Region (temporal consistency in unchanged areas), and Video Quality (overall visual fidelity).

Introduction

What makes a good instruction-based video editor? We argue that three components govern performance: data, architecture, and control. This paper analyzes the design space of these components and distills a recipe that works well in practice. The result is a lightweight model that reaches state-of-the-art quality while accepting flexible inputs.

Training-free video editors adapt pretrained generators but are fragile and slow. Training-based approaches improve stability, yet many target narrow tasks such as ControlNet-style conditioning, video inpainting, or reenactment. General instruction-based video editors handle a wider range of edits, yet still lag behind image-based counterparts in visual fidelity and control. We set out to narrow this gap. The figure below motivates our design philosophy: modern video models already know how to transform videos. To unlock this emerging capability with minimal adaptation, we conduct a comprehensive investigation into data curation, architectural design, and instruction control.

A pretrained text-to-video model can mimic common editing effects without finetuning. This suggests that much of the "how" of video editing already lives inside modern backbones.

Data. We see three data strategies. (A) One generalist model renders all edits and is used to self-train an editor. This essentially requires a single teacher model that already solves the problem at high quality. (B) Design and train new experts for specific edit types, then synthesize pairs at scale. This yields higher per-task fidelity and better coverage of hard skills, but training and maintaining many specialists is expensive and slows iteration and adaptation to future base models. We propose a new strategy: (C) Select existing experts and compose them. We focus on experts with a fast inverse (e.g., edge↔video, depth↔video) and compose more complex experts from them. This makes supervision easy to obtain, and experts are readily available as off-the-shelf models. It keeps costs low and diversity high by building on strong off-the-shelf modules; the drawback is heterogeneous artifacts across experts, which we mitigate through filtering and by favoring experts with reliable inverses. Concretely, we combine off-the-shelf video/image experts for stylization, local edits, insertion/removal, and human animation. We also convert image edit pairs into supervision via two routes: single-frame training and pseudo video-to-video (V2V) pairs created by applying the same smooth camera transforms to the source and edited images. Lifting image-to-image (I2I) data increases scale and instruction variety; its limitation is weak motion supervision, though the shared camera trajectories reintroduce some temporal structure without changing semantics. We further leverage video continuation: we derive V2V pairs from densely captioned text-to-video (T2V) datasets by sampling input clips outside each captioned interval and target clips within it, while converting the caption into an instruction using an LLM. Continuation data teaches action and transition edits that are scarce in typical V2V corpora, at the cost of more careful caption-to-instruction normalization. We collect and create an extensive set of V2V pairs, the most comprehensive among published works, and conduct a detailed study of how editing capability emerges from each training source.
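To make the I2I-lifting route concrete, the sketch below builds a pseudo V2V pair by warping a source image and its edited counterpart with one shared, smooth affine camera trajectory. The function name, trajectory shape, and parameter values are illustrative assumptions, not our released pipeline.

```python
# Minimal sketch: lift an image edit pair (src_img, edit_img) into a pseudo
# video-to-video pair by applying the same smooth affine camera motion to both.
import numpy as np
import cv2

def lift_i2i_to_pseudo_v2v(src_img, edit_img, num_frames=49,
                           max_zoom=0.1, max_shift=0.05):
    h, w = src_img.shape[:2]
    src_clip, edit_clip = [], []
    for t in range(num_frames):
        s = t / max(num_frames - 1, 1)                 # normalized time in [0, 1]
        zoom = 1.0 + max_zoom * s                      # slow zoom-in
        dx, dy = max_shift * w * s, max_shift * h * s  # slow diagonal pan
        # One affine matrix shared by source and edited frame:
        # identical motion, different semantics.
        M = cv2.getRotationMatrix2D((w / 2, h / 2), 0.0, zoom)
        M[:, 2] += (dx, dy)
        src_clip.append(cv2.warpAffine(src_img, M, (w, h),
                                       borderMode=cv2.BORDER_REFLECT))
        edit_clip.append(cv2.warpAffine(edit_img, M, (w, h),
                                        borderMode=cv2.BORDER_REFLECT))
    return np.stack(src_clip), np.stack(edit_clip)     # two (T, H, W, 3) clips
```

Because both clips follow the exact same camera path, the only difference between them is the edit itself, which is what the editor is trained to reproduce.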

Architecture. Using a pretrained video backbone, we study two strategies to inject the source video: channel concatenation and sequence concatenation. Channel concatenation is faster in practice because it uses fewer tokens, but sequence concatenation consistently yields higher edit quality. The trade-off is efficiency versus separation: channel concatenation keeps the context short but entangles source and target signals, while sequence concatenation costs more tokens but preserves clean roles for each stream, improving instruction following and local detail. Our final design adds small, zero-initialized patch-embedding routes for the source video and an edit mask, the latter used for spatiotemporal control of the edit extent. It reuses the frozen video VAE for all modalities and fine-tunes only with LoRA. Full finetuning can help when massive, heterogeneous data are available, but it risks catastrophic forgetting and is costly to scale. With fewer than 10 million paired videos, LoRA transfers faster, reduces overfitting, and preserves pretrained knowledge while supporting later backbone swaps. Its main downside is a slight loss in potential quality when data and compute are effectively unlimited. We inject the mask by token addition and concatenate tokens from the source video and the optional reference image along the sequence. This approach preserves pretraining benefits, keeps token budgets tight (by not introducing new tokens for the mask), and makes the model easily portable to future backbones. Addition for masks is simple and fast; while it carries less capacity than a dedicated mask-token stream, we find it sufficient for precise region and schedule control without context bloat. We use an optional reference frame during training and testing to leverage strong image editors when available. References boost specificity and style adherence when present; randomly dropping the reference during training keeps the model robust when references are absent or noisy.
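A minimal sketch of this conditioning path is shown below. The module names (vae, patch_embed, src_patch_embed, mask_patch_embed) and shapes are assumptions chosen to illustrate the sequence-concatenation and mask-addition scheme, not the exact released code.

```python
# Minimal sketch of the conditioning scheme: the frozen video VAE encodes each
# conditioning signal, mask tokens are added to the source-video tokens, and all
# streams are concatenated along the sequence dimension before entering the DiT.
import torch

def build_dit_input(noisy_latents, src_video, mask_video, ref_image,
                    vae, patch_embed, src_patch_embed, mask_patch_embed):
    # The video VAE stays frozen, so no gradients flow through encoding.
    with torch.no_grad():
        src_lat = vae.encode(src_video)      # (B, C, T', H', W')
        mask_lat = vae.encode(mask_video)    # same shape as src_lat
        ref_lat = vae.encode(ref_image) if ref_image is not None else None
        # ref_image is treated as a one-frame video in this sketch.

    # Zero-initialized patch-embedding routes for the new conditioning streams,
    # so training starts from the unchanged pretrained T2V behavior.
    src_tokens = src_patch_embed(src_lat) + mask_patch_embed(mask_lat)  # mask added, no extra tokens
    noisy_tokens = patch_embed(noisy_latents)

    streams = [noisy_tokens, src_tokens]
    if ref_lat is not None:
        streams.append(patch_embed(ref_lat))  # optional reference image tokens

    # Sequence concatenation: each stream keeps a clean role in the context.
    return torch.cat(streams, dim=1)          # (B, N_total, D)
```

Only the LoRA adapters on the DiT and the small patch-embedding routes would be trained in this setup; the backbone and VAE remain frozen.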

Flexible control. Prior work explores control via skeletons, segmentation, depth, and masks. A key signal is still missing: when the edit happens. Users often want an edit to appear gradually (e.g., “set the house on fire starting at 1.5s, then let the flames grow gradually”). We unify spatial and temporal control with a single mask video. Pixels mark where to edit; frames mark when to edit and how the effect evolves. Alternatives include keyframe prompts or token schedules, which are flexible but harder to author and to align with motion. A single mask video is direct, differentiable, and composes well with text and optional references. The cost is the need for a mask sequence, which we keep lightweight and editable.
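As an illustration of the where/when semantics, the sketch below builds a mask video from a spatial region mask and a temporal schedule for the “set the house on fire starting at 1.5s” example. The shapes, frame rate, and ramp length are assumptions for illustration.

```python
# Minimal sketch of a spatiotemporal mask video: pixels say where to edit,
# frames say when the edit starts and how strongly it applies over time.
import torch

def make_spatiotemporal_mask(region_mask, num_frames=81, fps=16,
                             start_s=1.5, ramp_s=2.0):
    # region_mask: (H, W) binary spatial mask of the area to edit (e.g., the house).
    t = torch.arange(num_frames) / fps                    # time of each frame in seconds
    # 0 before the edit starts, ramps linearly to 1 over ramp_s seconds, then stays at 1.
    schedule = ((t - start_s) / ramp_s).clamp(0.0, 1.0)   # (T,)
    return schedule[:, None, None] * region_mask[None]    # (T, H, W) mask video
```

The same mechanism covers purely spatial control (a constant schedule of 1) and purely temporal control (a full-frame spatial mask), so one input format serves both use cases.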

Combining these design choices results in a unified editor, EasyV2V, which supports flexible input combinations, including video + text, video + mask + text, and video + mask + reference + text. Despite its simplicity, our recipe consistently improves edit quality, motion, and instruction following over recently published methods. In summary, our contributions are: