EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai1,2*, Chaoyang Wang2, Guocheng Gordon Qian2, Willi Menapace2, Sergey Tulyakov2, Bernard Ghanem1, Peter Wonka2,1, Ashkan Mirzaei2*†

1 KAUST, 2 Snap Inc.

* Core Contributors, † Project Lead

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video + text, video + mask + text, video + mask + reference + text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems.

State-of-the-Art Video Editing
State-of-the-Art Image Editing
Controllable Video Editing
Scalable Data Generation Engine

Spotlight video showcasing EasyV2V's diverse video editing capabilities across multiple edit types.

(a) Overview of our video editing architecture. The frozen video VAE encodes control signals. Mask tokens are added to the input video tokens and concatenated with the noisy latent tokens and optional reference image. The DiT is trained using LoRA. (b) Input formats used for different dataset types during training. When object masks are available, the spatiotemporal mask input equals the input mask. Otherwise, the mask indicates transitions from the unedited to the edited video.


Image Editing

Input and output videos for the instruction "Remove the yacht in the image."

Comparisons

Side-by-side comparisons against Lucy Edit, Señorita-2M, and InsViE. Each comparison shows the input video, the baseline's result, and our result for the instruction "Alter the light from the window to emit a soft golden hue, simulating late afternoon sunlight."

Quantitative Results

VLM Evaluation

Quantitative comparison of our method against baselines.

User Study

Win/Tie/Loss ratios for our method against each baseline across three criteria: Instruction Alignment (how well edits follow the text prompt), Preservation of Unedited Region (temporal consistency in unchanged areas), and Video Quality (overall visual fidelity).

Introduction

What makes a good instruction-based video editor? We argue that three components govern performance: data, architecture, and control. This paper analyzes the design space of these components and distills a recipe that works well in practice. The result is a lightweight model that reaches state-of-the-art quality while accepting flexible inputs.

Training-free video editors adapt pretrained generators but are fragile and slow. Training-based approaches improve stability, yet many target narrow tasks such as ControlNet-style conditioning, video inpainting, or reenactment. General instruction-based video editors handle a wider range of edits, yet still lag behind image-based counterparts in visual fidelity and control. We set out to narrow this gap. The figure below motivates our design philosophy: modern video models already know how to transform videos. To unlock this emerging capability with minimal adaptation, we conduct a comprehensive investigation into data curation, architectural design, and instruction control.

A pretrained text-to-video model can mimic common editing effects without finetuning. This suggests that much of the "how" of video editing already lives inside modern backbones.

Data. We see three data strategies. (A) One generalist model renders all edits and is used to self-train an editor. This essentially requires a single teacher model that already solves the problem at high quality. (B) Design and train new experts for specific edit types, then synthesize pairs at scale. This yields higher per-task fidelity and better coverage of hard skills, but training and maintaining many specialists is expensive and slows iteration and adaptation to future base models. We propose a new strategy: (C) Select existing experts and compose them. We focus on experts with a fast inverse (e.g., edge↔video, depth↔video) and compose more complex experts from them. This makes supervision easy to obtain, and experts are readily available as off-the-shelf models. It keeps costs low and diversity high by building on strong off-the-shelf modules; the drawback is heterogeneous artifacts across experts, which we mitigate through filtering and by favoring experts with reliable inverses. Concretely, we combine off-the-shelf video/image experts for stylization, local edits, insertion/removal, and human animation. We also convert image edit pairs into supervision via two routes: single-frame training and pseudo video-to-video (V2V) pairs created by applying the same smooth camera transforms to the source and edited images. Lifting image-to-image (I2I) data increases scale and instruction variety; its limitation is weak motion supervision, though the shared camera trajectories reintroduce some temporal structure without changing semantics. We further leverage video continuation: we derive V2V pairs from densely captioned text-to-video (T2V) datasets by sampling input clips outside each captioned interval and target clips within it, while converting the caption into an instruction using an LLM. Continuation data teaches action and transition edits that are scarce in typical V2V corpora, at the cost of more careful caption-to-instruction normalization. We collect and create an extensive set of V2V pairs, the most comprehensive among published works, and conduct a detailed study of how editing capability emerges from each training source.
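To make the I2I-lifting route concrete, the sketch below builds a pseudo V2V pair by warping a source image and its edited counterpart with one shared, smooth affine camera trajectory. The function name, trajectory shape, and parameter values are illustrative assumptions, not our released pipeline.

```python
# Minimal sketch: lift an image edit pair (src_img, edit_img) into a pseudo
# video-to-video pair by applying the same smooth affine camera motion to both.
import numpy as np
import cv2

def lift_i2i_to_pseudo_v2v(src_img, edit_img, num_frames=49,
                           max_zoom=0.1, max_shift=0.05):
    h, w = src_img.shape[:2]
    src_clip, edit_clip = [], []
    for t in range(num_frames):
        s = t / max(num_frames - 1, 1)                 # normalized time in [0, 1]
        zoom = 1.0 + max_zoom * s                      # slow zoom-in
        dx, dy = max_shift * w * s, max_shift * h * s  # slow diagonal pan
        # One affine matrix shared by source and edited frame:
        # identical motion, different semantics.
        M = cv2.getRotationMatrix2D((w / 2, h / 2), 0.0, zoom)
        M[:, 2] += (dx, dy)
        src_clip.append(cv2.warpAffine(src_img, M, (w, h),
                                       borderMode=cv2.BORDER_REFLECT))
        edit_clip.append(cv2.warpAffine(edit_img, M, (w, h),
                                        borderMode=cv2.BORDER_REFLECT))
    return np.stack(src_clip), np.stack(edit_clip)     # two (T, H, W, 3) clips
```

Because both clips follow the exact same camera path, the only difference between them is the edit itself, which is what the editor is trained to reproduce.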

Architecture. Using a pretrained video backbone, we study two strategies to inject the source video: channel concatenation and sequence concatenation. Channel concatenation is faster in practice because it uses fewer tokens, but sequence concatenation consistently yields higher edit quality. The trade-off is efficiency versus separation: channel concatenation keeps the context short but entangles source and target signals, while sequence concatenation costs more tokens but preserves clean roles for each stream, improving instruction following and local detail. Our final design adds small, zero-initialized patch-embedding routes for the source video and an edit mask, the latter used for spatiotemporal control of the edit extent. It reuses the frozen video VAE for all modalities and fine-tunes only with LoRA. Full finetuning can help when massive, heterogeneous data are available, but it risks catastrophic forgetting and is costly to scale. With fewer than 10 million paired videos, LoRA transfers faster, reduces overfitting, and preserves pretrained knowledge while supporting later backbone swaps. Its main downside is a slight loss in potential quality when data and compute are effectively unlimited. We inject the mask by token addition and concatenate tokens from the source video and the optional reference image along the sequence. This approach preserves pretraining benefits, keeps token budgets tight (by not introducing new tokens for the mask), and makes the model easily portable to future backbones. Addition for masks is simple and fast; while it carries less capacity than a dedicated mask-token stream, we find it sufficient for precise region and schedule control without context bloat. We use an optional reference frame during training and testing to leverage strong image editors when available. References boost specificity and style adherence when present; randomly dropping the reference during training keeps the model robust when references are absent or noisy.
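A minimal sketch of this conditioning path is shown below. The module names (vae, patch_embed, src_patch_embed, mask_patch_embed) and shapes are assumptions chosen to illustrate the sequence-concatenation and mask-addition scheme, not the exact released code.

```python
# Minimal sketch of the conditioning scheme: the frozen video VAE encodes each
# conditioning signal, mask tokens are added to the source-video tokens, and all
# streams are concatenated along the sequence dimension before entering the DiT.
import torch

def build_dit_input(noisy_latents, src_video, mask_video, ref_image,
                    vae, patch_embed, src_patch_embed, mask_patch_embed):
    # The video VAE stays frozen, so no gradients flow through encoding.
    with torch.no_grad():
        src_lat = vae.encode(src_video)      # (B, C, T', H', W')
        mask_lat = vae.encode(mask_video)    # same shape as src_lat
        ref_lat = vae.encode(ref_image) if ref_image is not None else None
        # ref_image is treated as a one-frame video in this sketch.

    # Zero-initialized patch-embedding routes for the new conditioning streams,
    # so training starts from the unchanged pretrained T2V behavior.
    src_tokens = src_patch_embed(src_lat) + mask_patch_embed(mask_lat)  # mask added, no extra tokens
    noisy_tokens = patch_embed(noisy_latents)

    streams = [noisy_tokens, src_tokens]
    if ref_lat is not None:
        streams.append(patch_embed(ref_lat))  # optional reference image tokens

    # Sequence concatenation: each stream keeps a clean role in the context.
    return torch.cat(streams, dim=1)          # (B, N_total, D)
```

Only the LoRA adapters on the DiT and the small patch-embedding routes would be trained in this setup; the backbone and VAE remain frozen.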

Flexible control. Prior work explores control via skeletons, segmentation, depth, and masks. A key signal is still missing: when the edit happens. Users often want an edit to appear gradually (e.g., “set the house on fire starting at 1.5s, then let the flames grow gradually”). We unify spatial and temporal control with a single mask video. Pixels mark where to edit; frames mark when to edit and how the effect evolves. Alternatives include keyframe prompts or token schedules, which are flexible but harder to author and to align with motion. A single mask video is direct, differentiable, and composes well with text and optional references. The cost is the need for a mask sequence, which we keep lightweight and editable.
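As an illustration of the where/when semantics, the sketch below builds a mask video from a spatial region mask and a temporal schedule for the “set the house on fire starting at 1.5s” example. The shapes, frame rate, and ramp length are assumptions for illustration.

```python
# Minimal sketch of a spatiotemporal mask video: pixels say where to edit,
# frames say when the edit starts and how strongly it applies over time.
import torch

def make_spatiotemporal_mask(region_mask, num_frames=81, fps=16,
                             start_s=1.5, ramp_s=2.0):
    # region_mask: (H, W) binary spatial mask of the area to edit (e.g., the house).
    t = torch.arange(num_frames) / fps                    # time of each frame in seconds
    # 0 before the edit starts, ramps linearly to 1 over ramp_s seconds, then stays at 1.
    schedule = ((t - start_s) / ramp_s).clamp(0.0, 1.0)   # (T,)
    return schedule[:, None, None] * region_mask[None]    # (T, H, W) mask video
```

The same mechanism covers purely spatial control (a constant schedule of 1) and purely temporal control (a full-frame spatial mask), so one input format serves both use cases.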

Combining these design choices results in a unified editor, EasyV2V, which supports flexible input combinations, including video + text, video + mask + text, and video + mask + reference + text. Despite its simplicity, our recipe consistently improves edit quality, motion, and instruction following over recently published methods. In summary, our contributions are: