DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

arXiv 2025

Ziyi Wu (1,2,3), Anil Kag (1), Ivan Skorokhodov (1), Willi Menapace (1), Ashkan Mirzaei (1),
Igor Gilitschenski (2,3,*), Sergey Tulyakov (1,*), Aliaksandr Siarohin (1,*)
(1) Snap Research   (2) University of Toronto   (3) Vector Institute
(*) Equal supervision


TL;DR:

DenseDPO is a post-training method tailored to video diffusion models. Compared to vanilla DPO, it improves both the paired data construction and the granularity of the preference labels, producing videos with better visual quality and motion strength while using only one-third of the data.


Caption | Pre-trained Model | VanillaDPO | DenseDPO (Ours)

A woman doing push-up exercise.

In a studio, a popping dancer creates precise isolation movements.

A panda breakdancing in a neon-lit urban alley.


Method


(a) VanillaDPO compares videos generated from independent random noise and assigns only a single binary preference per pair, which biases annotators toward artifact-free slow-motion videos.
(b) DenseDPO generates structurally similar videos from partially noised real videos and labels dense, segment-level preferences (e.g., one label per 1s subclip).
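
The two ideas in (b) can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a variance-preserving (DDPM-style) forward process for the partial noising, a Diffusion-DPO-style objective [1] applied independently to each segment, and tie segments that are simply skipped. All function names, the `alpha_bar` parameter, and the label encoding (+1, -1, 0) are hypothetical choices made for illustration.

```python
import numpy as np


def partially_noise(video, alpha_bar, rng):
    """SDEdit-style partial noising of a real video, shape (frames, H, W, C).

    Assumes a variance-preserving forward process; alpha_bar in [0, 1]
    controls how much of the real video's structure survives.
    """
    noise = rng.standard_normal(video.shape)
    return np.sqrt(alpha_bar) * video + np.sqrt(1.0 - alpha_bar) * noise


def segment_means(per_frame_err, frames_per_segment):
    """Average per-frame denoising errors into per-segment errors."""
    per_frame_err = np.asarray(per_frame_err, dtype=float)
    n_seg = len(per_frame_err) // frames_per_segment
    trimmed = per_frame_err[: n_seg * frames_per_segment]  # drop leftover frames
    return trimmed.reshape(n_seg, frames_per_segment).mean(axis=1)


def dense_dpo_loss(err_pol_a, err_ref_a, err_pol_b, err_ref_b, labels, beta=1.0):
    """Segment-level DPO-style loss (illustrative assumption, not the paper's code).

    err_*  : per-segment denoising errors of the policy / reference model
             on video A and video B, each of shape (segments,).
    labels : +1 if A is preferred in that segment, -1 if B is, 0 for a tie.
    """
    labels = np.asarray(labels, dtype=float)
    diff_a = np.asarray(err_pol_a, dtype=float) - np.asarray(err_ref_a, dtype=float)
    diff_b = np.asarray(err_pol_b, dtype=float) - np.asarray(err_ref_b, dtype=float)
    # The policy should lower its error (relative to the reference) on the
    # preferred segment and raise it on the rejected one.
    logits = -beta * labels * (diff_a - diff_b)
    per_segment = np.logaddexp(0.0, -logits)  # stable -log(sigmoid(logits))
    mask = labels != 0  # ties contribute nothing
    if not mask.any():
        return 0.0
    return float(per_segment[mask].mean())
```

Calling `partially_noise` twice on the same real video with different noise draws gives two starting points that share the video's coarse structure, which is what makes per-segment comparison meaningful. In this sketch a vanilla single-label setup corresponds to using one segment that spans the whole video.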


Qualitative Results


We show text prompts and the videos generated by the pre-trained model, the VanillaDPO-aligned model, and our DenseDPO-aligned model. For more results and comparisons with SFT and StructuralDPO, see here.


Caption | Pre-trained Model | VanillaDPO | DenseDPO (Ours)

A young woman dances in the night bustle against the backdrop of a glowing fanfare.

A young adult male doing a handstand on the beach.

A weightlifter performs a deadlift with perfect form in a concrete garage gym.

A man exercising with battle ropes at a gym.

A woman dancing in a gym. The woman is spinning around repeatedly.

A monkey performs a jump on a skateboard at the skate park, landing smoothly.

A giraffe stepping gingerly along a tightrope above a city plaza, drawing gasps from the crowd below.

Fingers press into a shimmering slime ball.

Close-up of a sushi chef slicing sashimi with deliberate, smooth movements.

Water poured into a glass.

A goat balancing on a large circus ball.

A bear wobbling slightly as it rides a bicycle down a forest trail, its paws gripping the seat for balance.

A raccoon rollerblading in a skate park, performing small jumps off the ramps.





BibTeX


If you find our work useful, please consider citing our paper:

@article{wu2025densedpo,
  title={{DenseDPO}: Fine-Grained Temporal Preference Optimization for Video Diffusion Models},
  author={Wu, Ziyi and Kag, Anil and Skorokhodov, Ivan and Menapace, Willi and Mirzaei, Ashkan and Gilitschenski, Igor and Tulyakov, Sergey and Siarohin, Aliaksandr},
  journal={arXiv},
  year={2025}
}

