Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning
In contrast to the latest scalar-reward VideoAlign [2] and our Vanilla-DRF baseline, which relies on simple yes/no text-video alignment feedback from a VLM, Diffusion-DRF leverages free, rich, structured feedback that yields more semantically coherent and spatially localized gradients, accurately highlighting the regions where the generated video violates the input prompt.
(Left) A player, dressed in a white and blue uniform with number 66.
(Right) The video shows two men engaged in a conversation on a street.
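The contrast above can be illustrated with a deliberately simplified sketch (not the paper's code; the scores, threshold, and gradient weights below are hypothetical): a scalar reward averages over the whole video, so its gradient is uniform across regions, whereas a structured, per-region reward concentrates the gradient on the region that violates the prompt.

```python
# Toy illustration (hypothetical values, not the paper's implementation).
# A "video" is reduced to per-region alignment scores in [0, 1];
# region 2 violates the prompt.
scores = [0.9, 0.8, 0.2, 0.85]
N = len(scores)

# Scalar reward (VideoAlign-style): one number for the whole video.
# d(reward)/d(score_i) = 1/N for every region -- the signal is uniform
# and cannot localize the violating region.
scalar_reward = sum(scores) / N
scalar_grad = [1.0 / N] * N

# Structured reward (Diffusion-DRF-style, heavily simplified): each
# region is scored separately, so the gradient is large only on
# low-scoring (prompt-violating) regions.
structured_grad = [1.0 if s < 0.5 else 0.1 for s in scores]

worst = max(range(N), key=lambda i: structured_grad[i])
print("scalar grad:    ", scalar_grad)
print("structured grad:", structured_grad, f"(peaks at region {worst})")
```

The uniform `scalar_grad` spreads the fine-tuning signal evenly, while `structured_grad` singles out region 2, mirroring how spatially localized feedback pinpoints where generation diverges from the prompt.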
Qualitative comparisons
We provide video comparisons between our method and other methods. Each video is annotated with a color: Red indicates drawbacks in text-video alignment or physical plausibility, while Green indicates good performance. We also highlight key phrases in the prompt in Blue to help the reader spot the differences.
Single prompt fine-tuning
We provide visual results of the single-prompt fine-tuning described in the paper.
More results
In this section, we provide additional comparisons between the base model and ours, as well as further results from our model.
In this subsection, we present side-by-side comparisons between the base model and ours.
Cite our work
@misc{wang2026diffusiondrfdifferentiablerewardflow,
title={Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning},
author={Yifan Wang and Yanyu Li and Gordon Guocheng Qian and Sergey Tulyakov and Yun Fu and Anil Kag},
year={2026},
eprint={2601.04153},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.04153},
}