Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning
In contrast to the latest scalar-reward VideoAlign [2] and our Vanilla-DRF baseline, which relies on simple yes/no text-video alignment feedback from a VLM, Diffusion-DRF leverages free, rich, structured feedback that yields more semantically coherent and spatially localized gradients, accurately highlighting the regions where the generated video violates the input prompt.
(Left) A player, dressed in a white and blue uniform with number 66.
(Right) The video shows two men engaged in a conversation on a street.
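The contrast above can be illustrated with a deliberately simplified sketch (not the paper's code; the scores, threshold, and gradient weights below are hypothetical): a scalar reward averages over the whole video, so its gradient is uniform across regions, whereas a structured, per-region reward concentrates the gradient on the region that violates the prompt.

```python
# Toy illustration (hypothetical values, not the paper's implementation).
# A "video" is reduced to per-region alignment scores in [0, 1];
# region 2 violates the prompt.
scores = [0.9, 0.8, 0.2, 0.85]
N = len(scores)

# Scalar reward (VideoAlign-style): one number for the whole video.
# d(reward)/d(score_i) = 1/N for every region -- the signal is uniform
# and cannot localize the violating region.
scalar_reward = sum(scores) / N
scalar_grad = [1.0 / N] * N

# Structured reward (Diffusion-DRF-style, heavily simplified): each
# region is scored separately, so the gradient is large only on
# low-scoring (prompt-violating) regions.
structured_grad = [1.0 if s < 0.5 else 0.1 for s in scores]

worst = max(range(N), key=lambda i: structured_grad[i])
print("scalar grad:    ", scalar_grad)
print("structured grad:", structured_grad, f"(peaks at region {worst})")
```

The uniform `scalar_grad` spreads the fine-tuning signal evenly, while `structured_grad` singles out region 2, mirroring how spatially localized feedback pinpoints where generation diverges from the prompt.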
Qualitative comparisons
We provide video comparisons between our method and other methods. Each video is annotated with a color: Red indicates drawbacks in text-video alignment or physical plausibility, while Green indicates good performance. We also highlight key phrases in the prompt in Blue to help the reader spot the differences.
Single prompt fine-tuning
We provide visual results of the single-prompt fine-tuning described in the paper.
More results
In this section, we provide additional comparisons between the base model and ours, as well as further results from our model.
In this subsection, we present side-by-side comparisons between the base model and ours.
Cite our work
@misc{wang2026diffusiondrfdifferentiablerewardflow,
title={Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning},
author={Yifan Wang and Yanyu Li and Gordon Guocheng Qian and Sergey Tulyakov and Yun Fu and Anil Kag},
year={2026},
eprint={2601.04153},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.04153},
}