While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization.
We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold.
On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model.
For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images.
Overall, EasyV2V works with flexible inputs, e.g., video + text, video + mask + text, video + mask + reference + text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems.
Spotlight video showcasing EasyV2V's diverse video editing capabilities across multiple edit types.
(a) Overview of our video editing architecture. The frozen video VAE encodes control signals. Mask tokens are added to the input video tokens and concatenated with the noisy latent tokens and optional reference image. The DiT is trained using LoRA.
(b) Input formats used for different dataset types during training. When object masks are available, the spatiotemporal mask input equals the input mask. Otherwise, the mask indicates transitions from the unedited to the edited video.
⚠️ All videos are compressed for faster loading! ⚠️
Results Gallery
Input
Output
change the cloudy sky to a vibrant sunset with golden hour lighting.
Input
Output
Apply an overgrown effect, as if nature is reclaiming everything.
Input
Output
transform the video into a vibrant oil painting style.
Input
Output
Transform the video into a pastoral oil painting with thick impasto brushwork, warm golden light, and a dreamy haze that blurs the edges of the sheep.
Input
Output
Render the hands in a glowing, ethereal bioluminescent style where each finger pulses with soft light, and the leopard print fabric glows with inner energy like a sacred pattern.
Input
Output
Reimagine the entire video in a hyper-saturated, cinematic anime style with glowing highlights on the golf ball and exaggerated motion lines to emphasize the swing's energy.
Input
Output
Render the entire video in a whimsical, storybook illustration style with thick outlines, pastel colors, and hand-drawn textures, turning the workshop into a magical crafting realm where labels come alive.
Input
Output
Render the entire video in a minimalist, monochrome graphic novel style with bold black outlines, white backgrounds, and speech bubbles that emerge from the characters’ gestures.
Input
Output
Render the video in a gothic horror aesthetic with candlelight flickers, shadowy figures in the background, and a sense of creeping unease beneath the professional surface.
Input
Output
Recreate the scene in a hand-drawn anime style with bold outlines, exaggerated expressions, and vibrant color blocking, evoking a Japanese TV drama from the 2000s.
Input
Output
Reimagine the scene in a glowing, bioluminescent ink painting style where the sky pulses with inner light, the rock face glows faintly, and the hikers’ outlines shimmer like starlight.
Input
Output
Reimagine the video as a high-contrast, hand-drawn cel-shaded animation with exaggerated motion lines and expressive character stylization, evoking a retro 1980s music video aesthetic.
Input
Output
Render the subject as a classical sculpture carved from a single block of pristine white marble.
Input
Output
change the scene to a bright, sunny autumn day.
Input
Output
Make it Children's Book Illustration.
Input
Output
Cast it as a gothic-style girl with long black hair, glowing red eyes, and a dark, metallic outfit adorned with chains and buckles, rendered in a highly detailed, anime-inspired artistic style.
Input
Output
Cast it as a young woman with long, wavy blonde hair styled in loose waves, wearing subtle makeup with defined eyebrows and eyeliner, captured in a photorealistic black-and-white portrait style.
Input
Output
Alter the light from the window to emit a soft golden hue, simulating late afternoon sunlight.
Input
Output
Transform it into a glamorous woman adorned with vibrant green and gold makeup, featuring dripping paint effects and sparkling glitter, rendered in a highly detailed, photorealistic style.
Input
Output
Dress it up as a golden-armored angelic knight with long blonde hair, glowing blue eyes, and intricate gold armor adorned with red gemstones, rendered in a photorealistic fantasy style.
Input
Output
Transform it into a knight clad in dark, weathered armor with a deep purple hue, featuring intricate detailing and a hooded helmet, rendered in a photorealistic style with dramatic lighting.
Input
Output
Dress it up as a shark-like humanoid wearing a blue hoodie with pink accents, using a Japanese cartoon style with exaggerated features and bold outlines.
Input
Output
Transform the patterned prayer mat into a solid ivory mat with a subtle geometric border.
Input
Output
Replace the white protective suit with a reflective lab coat that shimmers under the lights.
Input
Output
Add a small, floating drone with a camera lens that hovers above the fields, scanning the terrain in slow motion.
Input
Output
Add a faint animated highlight around the open laptop screen, simulating a focus glow.
Input
Output
Change the glowing orb in the woman’s hand to a pulsating crystal sphere with shifting internal colors.
Input
Reference
Output
Transform the scene into a surreal, infinite highway that stretches into a starry void, where the trees are shadows of galaxies, and the road leads to a portal in the sky.
Input
Reference
Output
Turn a sheep into a cow
Input
Mask
Output
remove the lamb in the mask
Input
Mask
Output
remove the blue parrot in the mask
Input
Mask
Output
change the woman to a robot
Input
Mask
Output
remove the baby crocodile in the mask
Input
Mask
Output
remove the cup in the mask
Input
Mask
Output
change rose color to pink rose
Input
Output
The lamp smoothly turns on lighting up the room.
Input
Output
The scar on the man's face smoothly fades away.
Input
Output
As the sun comes up, the leaves of the tree turn yellow and glow in the sun.
Input
Output
Turn the grainy video back into a normal video
Input
Output
Turn the canny edge map into a video with the following description: A young Asian man sits at a desk in a brightly lit room, wearing a white t-shirt and a blue and gray plaid flannel shirt. Books and a teal-colored folder are on the desk, and a coat rack stands against the white wall in the background. The man stretches his arms out and laughs as the camera zooms out gradually.
Input
Output
Turn the colored video back into a normal video
Input
Output
Turn the colored video back into a normal video
Input
Output
Turn the blurred video back into a normal video
Input
Output
Turn the blurred video back into a normal video
Input
Output
Turn the optical flow map into a video with the following description: A young man with short red hair stands in front of a solid orange background. He wears a white t-shirt, a red and black flame-patterned jacket with white stripes on the sleeves, and bright pink sweatpants. He holds a ping pong paddle in his left hand. The man bounces a ping pong ball on the paddle several times and smiles as the camera zooms in gradually.
Input
Output
Pixelate the video with a pixel length of 16
Input
Output
Turn the pixelated video back into a normal video
Input
Output
Turn the DWpose pose map into a video with the following description: A young woman with fair skin and blonde hair pulled back in a low bun stands against a bright green background, wearing a black V-neck dress. She raises her index finger repeatedly as the camera zooms in gradually.
Input
Output
Turn the DWpose pose map into a video with the following description: A blurry outdoor scene shows a man wearing a blue tank top, black shorts, and sunglasses running towards the camera on a dirt path. The background includes trees and foliage. The camera remains static as the man runs towards it repeatedly.
Results Gallery - Continuation
Input
Output
Make them smile, shake hands, and kiss.
Input
Output
make the man look surprised
Input
Output
let the woman lift the towel
Input
Output
make the man high-five the woman
Input
Output
make the woman stop playing the piano
Input
Output
make the man give an okay gesture
Input
Output
make the woman look at the camera and smile
Input
Output
make the woman silently lower her head down
Input
Output
make the man put the electric kettle back on the floor
Input
Output
Render the man and van in a retro 1970s psychedelic style, with swirling patterns, vibrant color gradients, and a halo effect around the van’s windows.
Input
Output
Apply an overgrown effect, as if nature is reclaiming everything.
Input
Output
Reimagine the entire scene in a cyberpunk neon noir aesthetic, with glowing blue and magenta highlights on the knife and fish, and the background lit by flickering holographic food labels.
Input
Output
Forge a believable, real-world video sequence.
Input
Output
add a glowing orange lantern hanging in the background
Input
Output
add a small brown mushroom growing on the moss-covered log
Input
Output
add a flock of white birds flying across the top center of the frame.
Input
Output
add a steaming pie on top of the towel, centered on the tray.
Input
Output
add a modern sculpture made of polished steel in the center of the room, reflecting the sunlight.
Input
Output
add a potted green plant in the center of the floor, near the window frames.
Input
Output
add a golden retriever running slightly ahead and to the right of the woman.
Input
Output
add a flock of seagulls flying near the sun in the background.
Input
Output
add a colorful toucan perched above the sloth.
Input
Output
add a small, green bone toy to the left of the rubber duck.
Input
Output
add a single lily pad with a pink flower on the left side of the frame.
Input
Output
add a scattering of desert wildflowers in the foreground near the fox.
Input
Output
add a distant hot air balloon with rainbow stripes in the center background.
Input
Output
change the screen text to read “DEEP WORK”
Input
Output
add the text "DIY Project" at the top center in bold white letters.
Input
Output
add the text "PATIENCE" at the top center in a bold, sans-serif font.
Input
Output
add the text "REDWOOD RUN" at the top center in a bold, white font.
Input
Output
add the text “quack?” at the top center in bold white letters.
Input
Output
add the text "FREEDOM" at the top center in bold white letters.
Input
Output
add the text "cozy vibes" at the top center in a bold, white font.
Input
Output
add the text "Swiss Alps" at the top center in a bold, white font.
Input
Output
change the woman's business suit into glossy leather
Input
Output
change the cheetah's fur into shiny gold
Input
Output
change the hippopotamus's dark grey skin into smooth, wet stone
Input
Output
change the leather strap into shiny gold
Input
Output
change the peacock's feathers into shimmering gold
Input
Output
change the cliff walls to shimmering ice.
Input
Output
change the ice wall into rough stone.
Input
Output
change the wooden fence into smooth stone.
Input
Output
remove the flock of birds flying in the lower right corner.
Input
Output
remove the markers.
Input
Output
Remove the slow-moving, translucent jellyfish-like creature drifting near the surface.
Input
Output
Remove the soft glow of lavender mist rising subtly from the client’s forehead.
Input
Output
Remove the glowing blue LED ring around the circular rooftop structure.
Input
Output
Remove the single seashell with a faint inner glow lying near the footprints in the sand.
Input
Output
Remove the floating digital clock in the top-right corner showing '14:37'.
Input
Output
Remove the small, steaming teacup placed neatly beside the open books.
Input
Output
Remove the faint golden glow emanating from the bell tower's spire.
Input
Output
remove the birds flying near the top of the stupas.
Input
Output
remove the whale-like creature swimming on the left side of the window.
Input
Output
remove the trees visible through the doorway at the end of the corridor.
Input
Output
remove the subway train on the left side of the frame.
Input
Output
remove the parakeet from the windowsill.
Input
Output
remove the water stream from the pressure washer.
Input
Output
remove the "READ MORE BOOKS" poster on the bookshelf.
Input
Output
remove the walking persons in the alleyway.
Input
Output
remove the patch of green algae on the sloth’s shoulder.
Input
Output
remove the bear in the center of the frame
Input
Output
remove the full moon in the sky
Input
Output
Change the water color to deep navy blue and add a light fog to the mountains.
Input
Output
change the background to a sunset-lit living room and add a golden glow on the window frame
Input
Output
change the background to a savanna at golden hour and add a small river in the foreground
Input
Output
change the dragonfly's color to glowing pink and add a small butterfly floating in the foreground
Input
Output
change the background to a sunrise sky and add a small dog running in the foreground
Input
Output
Change the water to a glowing blue, add fireflies around the vegetation, and increase the sunlight intensity.
Input
Output
Change the field to a lavender field and add a rainbow arc in the background.
Input
Output
Change the forest to autumn colors and add falling leaves in the background.
Results Gallery - Continuation 2
Input
Output
change the peacock’s feathers to shades of gold.
Input
Output
change the pruning shears to gold.
Input
Output
change the rug to rich emerald green
Input
Output
change the snow-capped mountains to warm orange
Input
Output
change the jaguar's eyes to glowing amber
Input
Output
change the koala's fur to soft white
Input
Output
change the light beam to a vibrant green.
Input
Output
change the woman's shirt to lavender.
Input
Output
Render all elements with photorealistic accuracy.
Input
Output
Convert the visual information to a realistic representation.
Input
Output
Fabricate a realistic video based on this input.
Input
Output
Eliminate the synthetic appearance of the video.
Input
Output
Visualize this as a real-life recording and create it.
Input
Output
Frame this sequence within a realistic context.
Input
Output
Forge a believable, real-world video sequence.
Input
Output
replace the puff of breath with a small, floating heart.
Input
Output
replace the “FOCUS MODE” text on the computer screen with a graph.
Input
Output
replace the fly with a butterfly
Input
Output
replace the dragonfly with a red butterfly
Input
Output
replace the rabbit with a white fox
Input
Output
replace the green trash bin with a red recycling bin
Input
Output
replace the two figures with a pair of grazing sheep.
Input
Output
replace the smoke plumes with colorful hot air balloons.
Input
Output
replace the robotic dog with a fluffy golden retriever.
Input
Output
replace the city skyline with a tropical beach.
Input
Output
replace the oak tree with a giant hot air balloon.
Input
Output
replace the ornate clock with a large digital billboard.
Input
Output
replace the seagulls with colorful hot air balloons.
Input
Output
Turn an old man into a wizard with a long robe and staff
Input
Output
Turn a circus ringmaster with a red coat and top hat into a clown
Input
Output
Turn a rabbit into a cheetah
Input
Output
Turn an old woman into a young man
Input
Output
Turn a robin into a kestrel
Input
Output
Turn a wolf into a tiger
Input
Output
Turn a pigeon into a parrot
Input
Output
Turn a rabbit into a sheep
Input
Output
Turn a wolf into a reindeer
Input
Output
Turn an old woman into a skeleton
Input
Output
Colorize it
Input
Output
Colorize it
Input
Output
Colorize it
Image Editing
Input
Output
Remove the yacht in the image.
Input
Output
Transfer the image into a traditional ukiyo-e woodblock-print style.
Input
Output
Add a group of dolphins swimming in the water near the center of the image.
Input
Output
Add a bicycle leaning against the pier in the foreground, positioned near the left side of the dock.
Input
Output
Add a cup of coffee on the table in the foreground.
Input
Output
Add a small stone gazebo near the center of the image.
Input
Output
Add a person wearing a red winter coat and black snow pants walking across the snowy field near the center of the image.
Input
Output
Change the background from a clear blue sky with bare branches to a sunset sky over a lush forest.
Input
Output
Change the green foliage background in the picture to a coastal beach setting.
Input
Output
Remove the human standing prominently in the foreground.
Input
Output
Remove the human standing in the foreground wearing a blue suit, who is positioned in front of the construction site.
Input
Output
Remove the teddy bear holding the heart in the foreground.
Input
Output
Replace the human in the image with a giant pumpkin.
Input
Output
Replace the animal in the image with a cat.
Input
Output
Replace the yacht in the image with a hot air balloon floating just above the ocean surface.
Input
Output
Replace the cathedral in the image with a giant stack of colorful macarons in the same setting and lighting.
Input
Output
Transfer the image into a colourful ceramic mosaic-tile style.
Input
Output
Transfer the image into a clean graphite pencil-sketch style.
Input
Output
Transfer the image into an 8-bit pixel-art video-game style.
Input
Output
Transfer the image into a neon-soaked cyberpunk poster style.
Input
Output
Transfer the image into a clean graphite pencil-sketch style.
Input
Output
Transfer the image into a loose, flowing watercolor-wash style.
Input
Output
Raise the person's left arm.
Input
Output
Raise the person's right arm.
Input
Output
Change the color of the dress to a soft pink.
Input
Output
Change the color of the vehicle to red.
Input
Output
Change the color of the vehicle to red.
Input
Output
Change the building's wall color to light blue.
Input
Output
Change the color of the telephone booth to blue.
Input
Output
Change the wooden bunk bed to blue.
Input
Output
Remove the person on the left side of the image, and adjust the brightness of the background to make it appear lighter.
Comparisons
Input
Lucy Edit
Ours
Alter the light from the window to emit a soft golden hue, simulating late afternoon sunlight.
Input
Lucy Edit
Ours
Remove the slow-moving, translucent jellyfish-like creature drifting near the surface.
Input
Lucy Edit
Ours
Colorize it
Input
Lucy Edit
Ours
Render the man and van in a retro 1970s psychedelic style, with swirling patterns, vibrant color gradients, and a halo effect around the van’s windows.
Input
Lucy Edit
Ours
Add a faint animated highlight around the open laptop screen, simulating a focus glow.
Input
Lucy Edit
Ours
Turn a wolf into a tiger
Input
Lucy Edit
Ours
Turn a pigeon into a parrot
Input
Lucy Edit
Ours
Turn the pixelated video back into a normal video
Input
Lucy Edit
Ours
Reimagine the entire scene in a cyberpunk neon noir aesthetic, with glowing blue and magenta highlights on the knife and fish, and the background lit by flickering holographic food labels.
Input
Lucy Edit
Ours
Remove the soft glow of lavender mist rising subtly from the client’s forehead.
Input
Lucy Edit
Ours
Transform the scene into a vibrant, digital art style with hyper-saturated colors, glowing halos around the figures, and animated light trails from the boombox.
Input
Lucy Edit
Ours
Cast it as a young woman with long, wavy blonde hair styled in loose waves, wearing subtle makeup with defined eyebrows and eyeliner, captured in a photorealistic black-and-white portrait style.
Input
Señorita-2M
Ours
Alter the light from the window to emit a soft golden hue, simulating late afternoon sunlight.
Input
Señorita-2M
Ours
Remove the slow-moving, translucent jellyfish-like creature drifting near the surface.
Input
Señorita-2M
Ours
Colorize it
Input
Señorita-2M
Ours
Render the man and van in a retro 1970s psychedelic style, with swirling patterns, vibrant color gradients, and a halo effect around the van’s windows.
Input
Señorita-2M
Ours
Add a faint animated highlight around the open laptop screen, simulating a focus glow.
Input
Señorita-2M
Ours
Turn a wolf into a tiger
Input
Señorita-2M
Ours
Turn the pixelated video back into a normal video
Input
Señorita-2M
Ours
Reimagine the entire scene in a cyberpunk neon noir aesthetic, with glowing blue and magenta highlights on the knife and fish, and the background lit by flickering holographic food labels.
Input
Señorita-2M
Ours
Remove the soft glow of lavender mist rising subtly from the client’s forehead.
Input
Señorita-2M
Ours
Transform the scene into a vibrant, digital art style with hyper-saturated colors, glowing halos around the figures, and animated light trails from the boombox.
Input
Señorita-2M
Ours
Cast it as a young woman with long, wavy blonde hair styled in loose waves, wearing subtle makeup with defined eyebrows and eyeliner, captured in a photorealistic black-and-white portrait style.
Input
InsViE
Ours
Alter the light from the window to emit a soft golden hue, simulating late afternoon sunlight.
Input
InsViE
Ours
Remove the slow-moving, translucent jellyfish-like creature drifting near the surface.
Input
InsViE
Ours
Colorize it
Input
InsViE
Ours
Render the man and van in a retro 1970s psychedelic style, with swirling patterns, vibrant color gradients, and a halo effect around the van’s windows.
Input
InsViE
Ours
Turn a wolf into a tiger
Input
InsViE
Ours
Reimagine the entire scene in a cyberpunk neon noir aesthetic, with glowing blue and magenta highlights on the knife and fish, and the background lit by flickering holographic food labels.
Input
InsViE
Ours
Remove the soft glow of lavender mist rising subtly from the client’s forehead.
Input
InsViE
Ours
Transform the scene into a vibrant, digital art style with hyper-saturated colors, glowing halos around the figures, and animated light trails from the boombox.
Input
InsViE
Ours
Cast it as a young woman with long, wavy blonde hair styled in loose waves, wearing subtle makeup with defined eyebrows and eyeliner, captured in a photorealistic black-and-white portrait style.
Quantitative Results
VLM Evaluation
Quantitative comparison of our method against baselines.
User Study
Win/Tie/Loss ratios for our method against each baseline across three criteria: Instruction Alignment (how well edits follow the text prompt), Preservation of Unedited Region (temporal consistency in unchanged areas), and Video Quality (overall visual fidelity).
Introduction
What makes a good instruction-based video editor?
We argue that three components govern performance: data, architecture, and control.
This paper analyzes the design space of these components and distills a recipe that works well in practice.
The result is a lightweight model that reaches state-of-the-art quality while accepting flexible inputs.
Training-free video editors adapt pretrained generators but are fragile and slow.
Training-based approaches improve stability, yet many target narrow tasks such as ControlNet-style conditioning, video inpainting, or reenactment.
General instruction-based video editors handle a wider range of edits, yet still lag behind image-based counterparts in visual fidelity and control. We set out to narrow this gap.
The figure below motivates our design philosophy: modern video models already know how to transform videos.
To unlock this emergent capability with minimal adaptation, we conduct a comprehensive investigation into data curation, architectural design, and instruction control.
A pretrained text-to-video model can mimic common editing effects without finetuning. This suggests that much of the "how" of video editing already lives inside modern backbones.
Data.
We see three data strategies.
(A) One generalist model renders all edits and is used to self-train an editor. This essentially requires a single teacher model that already solves the problem at high quality.
(B) New experts are designed and trained for specific edit types, then used to synthesize pairs at scale. This yields higher per-task fidelity and better coverage of hard skills, but training and maintaining many specialists is expensive and slows iteration and adaptation to future base models.
We propose a new strategy: (C) Select existing experts and compose them.
We focus on experts with a fast inverse (e.g., edge↔video, depth↔video) and compose more complex experts from them.
This makes supervision easy to obtain, because such experts are readily available as off-the-shelf models.
Building on strong existing modules keeps costs low and diversity high; the drawback is heterogeneous artifacts across experts, which we mitigate through filtering and by favoring experts with reliable inverses.
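As a minimal sketch of why a fast inverse makes supervision cheap, consider pixelation as a stand-in expert. The degraded clip is computed directly from a real clip, so one real video yields a training pair in each direction. The function names, the dictionary layout, and the grayscale list-of-lists video format are illustrative assumptions, not the authors' pipeline.

```python
from typing import Dict, List

Frame = List[List[float]]  # H x W grayscale frame (assumed toy format)
Video = List[Frame]        # T frames


def pixelate_frame(frame: Frame, block: int) -> Frame:
    """Replace every block x block tile with the mean of its pixels."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            ys = range(y0, min(y0 + block, h))
            xs = range(x0, min(x0 + block, w))
            mean = sum(frame[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def make_pairs(video: Video, block: int) -> List[Dict]:
    """Build a forward and an inverse training pair from one real clip."""
    degraded = [pixelate_frame(f, block) for f in video]
    return [
        {"instruction": f"Pixelate the video with a pixel length of {block}",
         "source": video, "target": degraded},
        {"instruction": "Turn the pixelated video back into a normal video",
         "source": degraded, "target": video},
    ]
```

The same pattern would apply to any expert whose one direction is cheap to compute (edge maps, depth maps, blur), with the real clip always serving as the clean side of the pair.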
Concretely, we combine off-the-shelf video/image experts for stylization, local edits, insertion/removal, and human animation. We also convert image edit pairs into supervision via two routes: single-frame training and pseudo video-to-video (V2V) pairs created by applying the same smooth camera transforms to source and edited images.
Lifting image-to-image (I2I) data increases scale and instruction variety. Its limitation is weak motion supervision; the shared camera trajectories partly compensate by reintroducing temporal structure without changing semantics.
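The pseudo-pair idea can be sketched as follows: sample one smooth affine trajectory and apply it identically to the source image and the edited image, so the two resulting clips differ only by the edit. This is a toy version under stated assumptions: a linear zoom-and-pan trajectory, nearest-neighbour sampling with border clamping, and hypothetical function names and default parameters.

```python
from typing import List, Tuple

Image = List[List[float]]  # H x W grayscale image (assumed toy format)


def affine_trajectory(num_frames: int, max_shift: float,
                      max_zoom: float) -> List[Tuple[float, float, float]]:
    """Per-frame (scale, dx, dy), linearly interpolated from identity."""
    params = []
    for t in range(num_frames):
        a = t / max(num_frames - 1, 1)
        params.append((1.0 + a * max_zoom, a * max_shift, 0.0))
    return params


def warp(img: Image, scale: float, dx: float, dy: float) -> Image:
    """Inverse-map each output pixel through a zoom about the centre plus a shift."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            sx = (x - cx - dx) / scale + cx
            sy = (y - cy - dy) / scale + cy
            ix = min(max(int(round(sx)), 0), w - 1)  # clamp to the border
            iy = min(max(int(round(sy)), 0), h - 1)
            row.append(img[iy][ix])
        out.append(row)
    return out


def lift_pair(src: Image, edited: Image, num_frames: int = 8,
              max_shift: float = 2.0, max_zoom: float = 0.2):
    """Turn one image edit pair into a pseudo video pair with shared motion."""
    traj = affine_trajectory(num_frames, max_shift, max_zoom)
    src_video = [warp(src, *p) for p in traj]
    tgt_video = [warp(edited, *p) for p in traj]
    return src_video, tgt_video
```

Because both clips share the same per-frame transform, every pixel correspondence between source and target is preserved across time; only the camera appears to move.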
We further leverage video continuation: we derive V2V pairs from densely captioned text-to-video (T2V) datasets by sampling input clips outside each captioned interval and target clips within it, while converting the caption into an instruction using an LLM.
Continuation data teaches action and transition edits that are scarce in typical V2V corpora, at the cost of more careful caption-to-instruction normalization.
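A rough sketch of the clip sampling for continuation pairs, assuming the input clip is taken immediately before the captioned interval and the target clip from its start. The text only states that inputs come from outside the interval and targets from within it, so this placement, the half-open frame ranges, and the function name are assumptions. The caption-to-instruction rewriting by an LLM is not shown.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # half-open frame range [start, end)


def continuation_pair(num_frames: int, event_start: int, event_end: int,
                      clip_len: int) -> Optional[Tuple[Range, Range]]:
    """Return (input_range, target_range), or None if the clips do not fit."""
    fits_before = event_start >= clip_len
    fits_inside = event_end - event_start >= clip_len
    if not (fits_before and fits_inside) or event_end > num_frames:
        return None
    input_range = (event_start - clip_len, event_start)   # outside the interval
    target_range = (event_start, event_start + clip_len)  # inside the interval
    return input_range, target_range
```

For a 100-frame video with an action captioned over frames 40 to 70 and 16-frame clips, this returns `((24, 40), (40, 56))`: the model sees the moments just before the action and is supervised to produce the action itself.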
We collect and create an extensive set of V2V pairs, the most comprehensive among published works, and conduct a detailed study of how editing capability emerges from each training source.
Architecture.
Using a pretrained video backbone, we study two strategies to inject the source video: channel concatenation and sequence concatenation.
Channel concatenation is faster in practice because it uses fewer tokens, but sequence concatenation consistently yields higher edit quality.
The trade-off is efficiency versus separation: channel concatenation keeps context short but entangles source and target signals, while sequence concatenation costs more tokens but preserves clean roles for each stream, improving instruction following and local detail.
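The token-budget side of this trade-off can be made concrete with a small calculation. The latent grid size and patch size below are hypothetical values chosen for illustration; the source does not state them. Counting the reference image as one extra latent frame is likewise an assumption.

```python
def latent_tokens(frames: int, height: int, width: int, patch: int = 2) -> int:
    """Number of tokens after spatial patchification of a latent grid."""
    return frames * (height // patch) * (width // patch)


def sequence_lengths(frames: int, height: int, width: int, patch: int = 2,
                     with_reference: bool = False) -> dict:
    """Compare transformer sequence lengths for the two injection strategies."""
    n = latent_tokens(frames, height, width, patch)
    ref = latent_tokens(1, height, width, patch) if with_reference else 0
    return {
        # Source latents are stacked on the channel axis: no extra tokens.
        "channel_concat": n,
        # Source (and optional reference) tokens are appended to the sequence.
        "sequence_concat": 2 * n + ref,
    }


lengths = sequence_lengths(frames=21, height=60, width=104)
```

With these assumed sizes, channel concatenation processes 32760 tokens and sequence concatenation 65520. Since self-attention cost grows roughly quadratically with sequence length, doubling the tokens is noticeably more expensive, which is the efficiency cost paid for keeping source and target streams separate.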
Our final design adds small, zero-initialized patch-embedding routes for the source video and an edit mask, used for spatiotemporal control of the edit extent. It reuses the frozen video VAE for all modalities and fine-tunes only with LoRA.
Full finetuning can help when massive, heterogeneous data are available, but it risks catastrophic forgetting and is costly to scale.
With fewer than 10 million paired videos, LoRA transfers faster, reduces overfitting, and preserves pretrained knowledge while supporting later backbone swaps. Its main downside is a slight loss in potential quality when unlimited data and compute are available.
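For readers unfamiliar with LoRA, here is a minimal, dependency-free sketch of a LoRA-adapted linear layer. The frozen weight stays untouched, and only two small low-rank matrices would be trained. The class name, rank, scaling, and initialization scale are illustrative assumptions, not the configuration used in the paper.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with W frozen and B zero-initialized."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]  # zero-init: no change at start
        self.scale = alpha / rank

    def __call__(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self) -> int:
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output exactly, which is one reason LoRA tends to preserve pretrained knowledge early in training.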
We inject the mask by token addition and concatenate tokens from the source and optional reference image along the sequence.
This approach preserves pretraining benefits, keeps token budgets tight (by not introducing new tokens for the mask), and makes the model easily portable to future backbones.
Addition for masks is simple and fast; while it carries less capacity than a dedicated mask-token stream, we find it sufficient for precise region and schedule control without context bloat.
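The difference between adding mask tokens and concatenating them can be sketched in a few lines. This toy version assumes linear patch embeddings without bias, a particular concatenation order, and hypothetical names; it only illustrates that addition leaves the sequence length unchanged and that a zero-initialized mask route has no effect at the start of training.

```python
from typing import List, Optional

Tokens = List[List[float]]  # one vector per token
Matrix = List[List[float]]  # (token_dim x patch_dim) projection


def embed(patches: Tokens, weight: Matrix) -> Tokens:
    """Project each flattened patch to a token with a linear map."""
    return [[sum(w * p for w, p in zip(row, patch)) for row in weight]
            for patch in patches]


def build_sequence(noisy_tokens: Tokens, source_patches: Tokens,
                   mask_patches: Tokens, w_src: Matrix, w_mask: Matrix,
                   ref_tokens: Optional[Tokens] = None) -> Tokens:
    src = embed(source_patches, w_src)
    msk = embed(mask_patches, w_mask)
    # Mask is injected by addition: same number of source tokens as before.
    src_with_mask = [[s + m for s, m in zip(st, mt)] for st, mt in zip(src, msk)]
    # Source (and optional reference) tokens are appended along the sequence.
    seq = noisy_tokens + src_with_mask
    if ref_tokens:
        seq = seq + ref_tokens
    return seq
```

A dedicated mask-token stream would instead add one more block of tokens to `seq`, growing the context by the size of the video for every forward pass.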
We use an optional reference frame during training and testing to leverage strong image editors when available.
References boost specificity and style adherence when present; randomly dropping the reference during training keeps the model robust when references are absent or noisy.
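Random reference dropping can be sketched as a small data-loading step. The drop probability, the sample dictionary layout, and the function name are assumptions for illustration; the source does not specify them.

```python
import random
from typing import Dict


def maybe_drop_reference(sample: Dict, drop_prob: float, rng: random.Random) -> Dict:
    """Return a shallow copy of the sample, removing the reference with probability drop_prob."""
    out = dict(sample)
    if out.get("reference") is not None and rng.random() < drop_prob:
        out["reference"] = None  # train the model to work without a reference
    return out


rng = random.Random(0)
sample = {"video": "clip.mp4", "instruction": "Turn a sheep into a cow",
          "reference": "edited_first_frame.png"}
batch = [maybe_drop_reference(sample, drop_prob=0.5, rng=rng) for _ in range(4)]
```

Seeing both variants of the same sample during training is what lets a single model serve the video + text and video + reference + text settings at test time.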
Flexible control.
Prior work explores control by skeletons, segmentation, depth, and masks.
A key signal is still missing: when the edit happens.
Users often want an edit to appear gradually (e.g., “set the house on fire starting at 1.5s, then let the flames grow gradually”).
We unify spatial and temporal control with a single mask video.
Pixels mark where to edit; frames mark when to edit, and how the effect evolves.
Alternatives include keyframe prompts or token schedules, which are flexible but harder to author and to align with motion.
A single mask video is direct, differentiable, and composes well with text and optional references.
The cost is requiring a mask sequence, which we keep lightweight and editable.
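A minimal sketch of how such a mask video might be authored: a box selects where the edit applies, and a per-frame strength selects when it starts and how fast it grows. The soft linear ramp, the box parameterization, and the function name are assumptions; the source does not specify whether mask values are binary or continuous.

```python
from typing import List, Tuple

MaskVideo = List[List[List[float]]]  # T x H x W, values in [0, 1]


def mask_video(num_frames: int, height: int, width: int,
               box: Tuple[int, int, int, int],
               start_frame: int, ramp_frames: int) -> MaskVideo:
    """Zero before start_frame, then ramp linearly to 1 inside the box."""
    y0, y1, x0, x1 = box  # half-open pixel ranges [y0, y1) x [x0, x1)
    video = []
    for t in range(num_frames):
        if t < start_frame:
            strength = 0.0
        elif ramp_frames <= 0:
            strength = 1.0
        else:
            strength = min(1.0, (t - start_frame + 1) / ramp_frames)
        frame = [[strength if (y0 <= y < y1 and x0 <= x < x1) else 0.0
                  for x in range(width)] for y in range(height)]
        video.append(frame)
    return video
```

For example, at an assumed 16 frames per second, an edit that should begin at 1.5 s would use `start_frame=24`, and a larger `ramp_frames` would make the effect grow more gradually.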
Combining these design choices results in a unified editor, EasyV2V, which supports flexible input combinations, including video + text, video + mask + text, and video + mask + reference + text.
Despite its simplicity, our recipe consistently improves edit quality, motion, and instruction following over recently published methods.
In summary, our contributions are:
A clarified design space for instruction-based video editing and a consistent strategy across data, architecture, and control that achieves state-of-the-art results.
A reusable data engine built from composable experts with trivial inverses, lifted high-quality image edits, and video continuation, with per-expert/per-source ablations.
A lightweight architecture that minimally modifies a pretrained video backbone: zero-init patch-embeddings for source and mask, frozen VAE reuse, and LoRA finetuning, plus optional reference frames.
Temporal control as a first-class signal, unified with spatial control via a single mask video that schedules when edits start and how they evolve.