Hierarchical Patch Diffusion Models for High-Resolution Video Generation

CVPR 2024

Ivan Skorokhodov¹,², Willi Menapace¹,³, Aliaksandr Siarohin¹, Sergey Tulyakov¹

Abstract

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 $256^{2}$, surpassing recent methods by more than 100%. We also show that it can be rapidly fine-tuned from a base $36 \times 64$ low-resolution generator for high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture that is trained on such high resolutions entirely end-to-end.


Existing video diffusion paradigms

Comparing existing diffusion paradigms: Latent Diffusion Model (LDM) (upper left), Cascaded Diffusion Model (CDM) (bottom left), and Patch Diffusion Model (this work) during training (upper right) and inference (bottom right). In our work, we develop hierarchical patch diffusion, which never operates on full-resolution inputs, but instead optimizes the lower stages of the hierarchy to produce spatially aligned context information for the later pyramid levels to enforce global consistency between patches.

Architecture overview

Architecture overview of HPDM for a 3-level pyramid. The model is trained to denoise all the patches jointly. During training, we use only a single patch from each pyramid level and restrict information propagation to the coarse-to-fine direction. This allows one to synthesize the whole image (or video) at a given resolution patch-by-patch using tiled inference.
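The patch pyramid described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes a square frame in normalized [0, 1] coordinates, a scale factor of 2 between levels, a fixed patch size across levels, child patches that lie fully inside their parent, and non-overlapping tiles at inference. The function names `sample_nested_patches` and `tiles_per_level` are hypothetical.

```python
import random


def sample_nested_patches(num_levels, scale=2, rng=None):
    """Sample one patch box per pyramid level, coarse to fine.

    Each box is (y0, x0, size) in normalized [0, 1] frame coordinates.
    Assumption: every finer patch lies inside the patch of the previous
    level, so context from the coarser patch spatially covers it.
    """
    rng = rng or random.Random()
    boxes = [(0.0, 0.0, 1.0)]  # level 0 covers the whole frame at low resolution
    for _ in range(1, num_levels):
        parent_y, parent_x, parent_size = boxes[-1]
        size = parent_size / scale
        # Random offset that keeps the child box inside its parent.
        y0 = parent_y + rng.uniform(0.0, parent_size - size)
        x0 = parent_x + rng.uniform(0.0, parent_size - size)
        boxes.append((y0, x0, size))
    return boxes


def tiles_per_level(num_levels, scale=2):
    """Count the non-overlapping tiles needed to cover the frame per level.

    Assumption: the patch size is fixed, so level k needs scale**k tiles
    along each spatial axis.
    """
    return [(scale ** level) ** 2 for level in range(num_levels)]


if __name__ == "__main__":
    for level, box in enumerate(sample_nested_patches(3, rng=random.Random(0))):
        print(level, box)
    print(tiles_per_level(3))
```

Under these assumptions, training touches one patch per level per sample, while tiled inference visits every tile, so the number of denoised patches grows with the level even though no single forward pass sees the full-resolution input.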

Quantitative results

Method	FVD ↓	Inception Score ↑
MoCoGAN-HD	700	33.95
TATS	635	57.63
VIDM	294.7	-
PVDM	343.6	74.4
Make-A-Video	81.25	82.55
HPDM-S	344.5	73.73
HPDM-M	143.1	84.29
HPDM-L	66.32	87.68
