Hierarchical Patch Diffusion Models for High-Resolution Video Generation
CVPR 2024



Abstract
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes training very efficient and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101.
Existing video diffusion paradigms

Comparing existing diffusion paradigms: Latent Diffusion Model (LDM) (upper left), Cascaded Diffusion Model (CDM) (bottom left), and Patch Diffusion Model (this work) during training (upper right) and inference (bottom right). In our work, we develop hierarchical patch diffusion, which never operates on full-resolution inputs, but instead optimizes the lower stages of the hierarchy to produce spatially aligned context information for the later pyramid levels to enforce global consistency between patches.
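To make "spatially aligned context information" more concrete, here is a minimal sketch of how a coarse-level feature map could be cropped to the region covered by a finer patch. The box convention (normalized `(y0, x0, y1, x1)` coordinates), the function name `aligned_crop`, and the nested-list feature map are illustrative assumptions, not the paper's implementation; a real model would likely use a differentiable resampling operation and then upsample the crop to the fine patch's feature resolution.

```python
import math


def aligned_crop(coarse_feats, coarse_box, fine_box):
    """Return the part of a coarse feature map that lies under a finer patch.

    coarse_feats: 2D list (H x W) of features computed for the coarse patch.
    coarse_box, fine_box: (y0, x0, y1, x1) in normalized [0, 1] frame coordinates.
    fine_box is assumed to lie inside coarse_box (hypothetical convention).
    """
    h, w = len(coarse_feats), len(coarse_feats[0])
    cy0, cx0, cy1, cx1 = coarse_box
    fy0, fx0, fy1, fx1 = fine_box
    if not (cy0 <= fy0 < fy1 <= cy1 and cx0 <= fx0 < fx1 <= cx1):
        raise ValueError("fine_box must lie inside coarse_box")

    eps = 1e-9  # guards against floating-point error at exact cell boundaries

    def to_index_range(lo, hi, origin, extent, size):
        # Map a normalized interval onto integer cells of the coarse grid.
        start = math.floor((lo - origin) / extent * size + eps)
        stop = math.ceil((hi - origin) / extent * size - eps)
        return start, max(stop, start + 1)

    r0, r1 = to_index_range(fy0, fy1, cy0, cy1 - cy0, h)
    c0, c1 = to_index_range(fx0, fx1, cx0, cx1 - cx0, w)
    return [row[c0:c1] for row in coarse_feats[r0:r1]]


if __name__ == "__main__":
    # An 8x8 coarse grid covering the whole frame; the fine patch covers
    # the vertical middle half and the right half of the frame.
    grid = [[r * 8 + c for c in range(8)] for r in range(8)]
    crop = aligned_crop(grid, (0.0, 0.0, 1.0, 1.0), (0.25, 0.5, 0.75, 1.0))
    print(len(crop), len(crop[0]))
```

The point of the sketch is only the coordinate bookkeeping: the context handed to a finer level is the slice of coarse features that sits under that level's patch, so the two stay spatially aligned.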
Architecture overview

Architecture overview of HPDM for a 3-level pyramid. The model is trained to denoise all the patches jointly. During training, we use only a single patch from each pyramid level and restrict information propagation to a coarse-to-fine manner. This allows one to synthesize the whole image (or video) at a given resolution patch-by-patch using tiled inference.
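The training and inference regimes described above can be sketched in terms of patch coordinates alone. The following is a hedged illustration, assuming that each pyramid level halves the patch side in normalized frame coordinates and that inference uses non-overlapping tiles; both choices, and the names `sample_nested_boxes` and `tile_boxes`, are assumptions made for this sketch rather than details confirmed by the text.

```python
import random


def sample_nested_boxes(num_levels=3, rng=random):
    """Sample one patch box per pyramid level, each nested inside the previous.

    Boxes are (y0, x0, y1, x1) in normalized [0, 1] frame coordinates.
    Level 0 covers the whole frame; level l has side 1 / 2**l (assumed).
    """
    boxes = [(0.0, 0.0, 1.0, 1.0)]
    for level in range(1, num_levels):
        py0, px0, py1, px1 = boxes[-1]
        side = 1.0 / 2 ** level
        # Pick the top-left corner so the new patch stays inside its parent.
        y0 = rng.uniform(py0, py1 - side)
        x0 = rng.uniform(px0, px1 - side)
        boxes.append((y0, x0, y0 + side, x0 + side))
    return boxes


def tile_boxes(level):
    """Enumerate non-overlapping tiles that cover the frame at a given level."""
    n = 2 ** level
    side = 1.0 / n
    return [
        (i * side, j * side, (i + 1) * side, (j + 1) * side)
        for i in range(n)
        for j in range(n)
    ]


if __name__ == "__main__":
    # Training: one patch per level of a 3-level pyramid.
    for level, box in enumerate(sample_nested_boxes(3, rng=random.Random(0))):
        print("level", level, box)
    # Inference: every tile of the finest level is visited in turn.
    print(len(tile_boxes(2)), "tiles at the finest level")
```

Under these assumptions, a training step touches three patches of equal pixel resolution instead of the full-resolution frame, while inference visits all 16 finest-level tiles, each conditioned on the coarser patches that contain it.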
Quantitative results
Method | FVD↓ | Inception Score↑ |
---|---|---|
MoCoGAN-HD | 700 | 33.95 |
TATS | 635 | 57.63 |
VIDM | 294.7 | - |
PVDM | 343.6 | 74.4 |
Make-A-Video | 81.25 | 82.55 |
HPDM-S | 344.5 | 73.73 |
HPDM-M | 143.1 | 84.29 |
HPDM-L | 66.32 | 87.68 |
Note: please use the latest version of Chrome/Chromium or Safari to watch the videos (alternatively, you can download a video and watch it offline). Some of the videos may be displayed incorrectly in other web browsers (e.g., Firefox).
Video generation results on UCF-101
HPDM 64x256x256 (ours; random samples)
PVDM 128x256x256 (provided samples)
PVDM 16x256x256 (provided samples)
DIGAN 128x128x128 (provided samples)
StyleGAN-V 128x256x256 (provided samples)
Text-to-video generation results (our prompts)
HPDM-T2V (ours) --- "A robot planting a tree."
HPDM-T2V (ours) --- "A high-definition video of a pack of wolves hunting in a snowy forest, natural behavior, dynamic angles."
HPDM-T2V (ours) --- "A detailed animation of an ancient Egyptian city, with the Nile river and pyramids, 4K, historically accurate."
HPDM-T2V (ours) --- "A 4K time-lapse of a blooming rose, showing each stage of the flower opening."
Text-to-video generation results (comparison)
A confused grizzly bear in calculus class.
Make-A-Video
HPDM-T2V (ours)
Humans building a highway on mars, highly detailed.
Make-A-Video
HPDM-T2V (ours)
Sailboat sailing on a sunny day in a mountain lake, highly detailed.
Make-A-Video
HPDM-T2V (ours)
A panda bear driving a car.
Imagen-Video
HPDM-T2V (ours)
A panda eating bamboo on a rock.
Imagen-Video
HPDM-T2V (ours)
A shark swimming in clear Caribbean ocean.
Imagen-Video
HPDM-T2V (ours)
A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset.
Imagen-Video
HPDM-T2V (ours)
A teddy bear skating in Times Square.
Imagen-Video
HPDM-T2V (ours)
A cute rabbit is eating grass, wildlife photography, photograph, high quality, wildlife, f 1.8, soft focus, 8k, award - winning photograph.
PYoCo
HPDM-T2V (ours)
A very happy fuzzy panda dressed as a chef eating pizza in the New York street food truck.
PYoCo
HPDM-T2V (ours)
The supernova explosion of a white dwarf in the universe, photo realistic, 8k, cinematic lighting, hd, atmospheric, hyperdetailed, photography, glow effect.
PYoCo
HPDM-T2V (ours)
An epic tornado attacking above a glowing city at night, the tornado is made of smoke, highly detailed.
PYoCo
HPDM-T2V (ours)
Glass sphere filled with swirling multicolored liquid, cinematic lighting.
PYoCo
HPDM-T2V (ours)
A high quality 3D render of hyperrealist, super strong, multicolor stripped, and fluffy bear with wings, highly detailed, sharp focus.
PYoCo
HPDM-T2V (ours)