Hierarchical Patch Diffusion Models for High-Resolution Video Generation
CVPR 2024
Abstract
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than of whole inputs. This makes training very efficient and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion: an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and an Inception Score of 87.68 for class-conditional video generation on UCF-101.
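To make the deep context fusion idea from the abstract more concrete, below is a minimal PyTorch sketch of what one fusion step could look like: the coarse (low-scale) patch's features are cropped to the region covered by the fine (high-scale) patch, upsampled, and injected into the fine patch's features. The module name `DeepContextFusion`, the `crop_box` bookkeeping, and the 1x1-convolution fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepContextFusion(nn.Module):
    """Sketch of a deep-context-fusion style step: coarse-patch features are
    cropped to the fine patch's region, upsampled to the fine resolution, and
    fused into the fine-patch features."""

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical fusion layer: 1x1 conv over concatenated fine + context features.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, fine_feat: torch.Tensor, coarse_feat: torch.Tensor, crop_box):
        # crop_box = (top, left, height, width): where the fine patch sits inside
        # the coarse feature map (an assumed bookkeeping format for illustration).
        t, l, h, w = crop_box
        context = coarse_feat[:, :, t:t + h, l:l + w]            # overlapping context region
        context = F.interpolate(context, size=fine_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([fine_feat, context], dim=1))  # inject context


# Toy usage: a 64x64 coarse feature map provides context for a 32x32 fine patch
# that covers its top-left 16x16 region (shapes are purely illustrative).
dcf = DeepContextFusion(channels=8)
coarse = torch.randn(1, 8, 64, 64)
fine = torch.randn(1, 8, 32, 32)
out = dcf(fine, coarse, crop_box=(0, 0, 16, 16))
print(out.shape)  # torch.Size([1, 8, 32, 32])
```

In the same illustrative spirit, adaptive computation would correspond to routing more blocks and channels to the coarse levels of the patch hierarchy and fewer to the fine ones, so that most compute is spent on coarse image details.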
Existing video diffusion paradigms
Architecture overview
Quantitative results
| Method | FVD ↓ | Inception Score ↑ |
|---|---|---|
| MoCoGAN-HD | 700 | 33.95 |
| TATS | 635 | 57.63 |
| VIDM | 294.7 | - |
| PVDM | 343.6 | 74.4 |
| Make-A-Video | 81.25 | 82.55 |
| HPDM-S | 344.5 | 73.73 |
| HPDM-M | 143.1 | 84.29 |
| HPDM-L | 66.32 | 87.68 |