
Hierarchical Patch Diffusion Models for High-Resolution Video Generation

CVPR 2024

Ivan Skorokhodov (1,2), Willi Menapace (1,3), Aliaksandr Siarohin (1), Sergey Tulyakov (1)

Abstract

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 256², surpassing recent methods by more than 100%. Then, we show that it can be rapidly fine-tuned from a base 36×64 low-resolution generator for high-resolution 64×288×512 text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture that is trained on such high resolutions entirely end-to-end.

Existing video diffusion paradigms

Comparing existing diffusion paradigms: Latent Diffusion Model (LDM) (upper left), Cascaded Diffusion Model (CDM) (bottom left), and Patch Diffusion Model (this work) during training (upper right) and inference (bottom right). In our work, we develop hierarchical patch diffusion, which never operates on full-resolution inputs, but instead optimizes the lower stages of the hierarchy to produce spatially aligned context information for the later pyramid levels to enforce global consistency between patches.
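To make "spatially aligned context" concrete, here is a minimal sketch of how a coarse-level feature map could be cropped and resampled so that each cell lines up with a cell of a finer patch. It is an illustration under stated assumptions, not the authors' implementation: the function name `aligned_context`, the normalized `(y0, x0, y1, x1)` box convention, and the nearest-neighbour resampling are all hypothetical choices made for this example.

```python
import math

def aligned_context(coarse, coarse_box, fine_box, out_hw):
    """Resample a coarse feature grid onto the grid of a finer patch.

    coarse:     2D list (H x W) of features for the coarse patch.
    coarse_box: (y0, x0, y1, x1) region the coarse patch covers, in
                normalized frame coordinates (assumed convention).
    fine_box:   region the finer patch covers, same convention; it is
                expected to lie inside coarse_box.
    out_hw:     (h, w) grid size of the finer patch.
    """
    H, W = len(coarse), len(coarse[0])
    cy0, cx0, cy1, cx1 = coarse_box
    fy0, fx0, fy1, fx1 = fine_box
    h, w = out_hw

    def nearest(i, n_out, f0, f1, c0, c1, n_in):
        # Centre of the i-th fine cell in frame coordinates.
        centre = f0 + (i + 0.5) * (f1 - f0) / n_out
        # Position of that centre relative to the coarse patch, in [0, 1).
        rel = (centre - c0) / (c1 - c0)
        # Index of the coarse cell containing it, clamped to the grid.
        return min(max(math.floor(rel * n_in), 0), n_in - 1)

    rows = [nearest(i, h, fy0, fy1, cy0, cy1, H) for i in range(h)]
    cols = [nearest(j, w, fx0, fx1, cx0, cx1, W) for j in range(w)]
    return [[coarse[r][c] for c in cols] for r in rows]


# A 4x4 coarse grid covering the whole frame; the fine patch is the
# bottom-right quadrant, also on a 4x4 grid.
coarse = [[4 * r + c for c in range(4)] for r in range(4)]
context = aligned_context(coarse, (0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 1.0, 1.0), (4, 4))
```

In this usage example each coarse cell of the bottom-right quadrant is repeated 2×2 times, so every fine cell receives the coarse feature that covers the same location in the frame.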


Architecture overview

Architecture overview of HPDM for a 3-level pyramid. The model is trained to denoise all the patches jointly. During training, we use only a single patch from each pyramid level and restrict information propagation to a coarse-to-fine direction. This allows one to synthesize the whole image (or video) at a given resolution patch-by-patch using tiled inference.
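The training and inference regimes described in this caption can be sketched as follows. This is a simplified, spatial-only illustration with assumptions of its own: the helper names `sample_patch_hierarchy` and `tile_grid` are hypothetical, each level is assumed to halve the patch side length, each sampled patch is assumed to lie inside its parent, and inference tiles are assumed not to overlap. The actual model also handles the temporal axis and may choose patch locations differently.

```python
import random

def sample_patch_hierarchy(num_levels, rng):
    """Sample one patch per pyramid level for a training step.

    Boxes are (y0, x0, y1, x1) in normalized frame coordinates.
    Level 0 covers the whole frame; every further level halves the side
    length and is placed at random inside the previous box (assumed rule).
    """
    boxes = [(0.0, 0.0, 1.0, 1.0)]
    for _ in range(1, num_levels):
        y0, x0, y1, x1 = boxes[-1]
        h, w = (y1 - y0) / 2, (x1 - x0) / 2
        # Random offset that keeps the child inside its parent.
        cy = y0 + rng.uniform(0.0, (y1 - y0) - h)
        cx = x0 + rng.uniform(0.0, (x1 - x0) - w)
        boxes.append((cy, cx, cy + h, cx + w))
    return boxes

def tile_grid(level):
    """Enumerate the non-overlapping tiles that cover the frame at a level.

    With side lengths halving per level, level k has 2**k tiles per side.
    """
    n = 2 ** level
    return [(i / n, j / n, (i + 1) / n, (j + 1) / n)
            for i in range(n) for j in range(n)]


# Training: one patch per level of a 3-level pyramid.
train_boxes = sample_patch_hierarchy(3, random.Random(0))
# Inference: every tile of the finest level is generated in turn.
finest_tiles = tile_grid(2)
```

Under these assumptions a training step touches 3 patches, while tiled inference at the finest level of a 3-level pyramid visits 16 tiles, each conditioned on the coarser patches that contain it.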


Quantitative results

Method          FVD ↓    Inception Score ↑
MoCoGAN-HD      700      33.95
TATS            635      57.63
VIDM            294.7    -
PVDM            343.6    74.4
Make-A-Video    81.25    82.55
HPDM-S          344.5    73.73
HPDM-M          143.1    84.29
HPDM-L          66.32    87.68


Note: please use the latest version of Chrome/Chromium or Safari to watch the videos (alternatively, you can download a video and watch it offline). Some of the videos may be displayed incorrectly in other web browsers (e.g., Firefox).


Video generation results on UCF-101

HPDM 64×256×256 (ours; random samples)

PVDM 128×256×256 (provided samples)

PVDM 16×256×256 (provided samples)

DIGAN 128×128×128 (provided samples)

StyleGAN-V 128×256×256 (provided samples)


Text-to-video generation results (our prompts)

HPDM-T2V (ours) --- "A robot planting a tree."

HPDM-T2V (ours) --- "A high-definition video of a pack of wolves hunting in a snowy forest, natural behavior, dynamic angles."

HPDM-T2V (ours) --- "A detailed animation of an ancient Egyptian city, with the Nile river and pyramids, 4K, historically accurate."

HPDM-T2V (ours) --- "A 4K time-lapse of a blooming rose, showing each stage of the flower opening."


Text-to-video generation results (comparison)

A confused grizzly bear in calculus class.

Make-A-Video

HPDM-T2V (ours)

Humans building a highway on mars, highly detailed.

Make-A-Video

HPDM-T2V (ours)

Sailboat sailing on a sunny day in a mountain lake, highly detailed.

Make-A-Video

HPDM-T2V (ours)

A panda bear driving a car.

Imagen-Video

HPDM-T2V (ours)

A panda eating bamboo on a rock.

Imagen-Video

HPDM-T2V (ours)

A shark swimming in clear Carribean ocean.

Imagen-Video

HPDM-T2V (ours)

A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset.

Imagen-Video

HPDM-T2V (ours)

A teddy bear skating in Times Square.

Imagen-Video

HPDM-T2V (ours)

A cute rabbit is eating grass, wildlife photography, photograph, high quality, wildlife, f 1.8, soft focus, 8k, award - winning photograph.

PYoCo

HPDM-T2V (ours)

A very happy fuzzy panda dressed as a chef eating pizza in the New York street food truck.

PYoCo

HPDM-T2V (ours)

The supernova explosion of a white dwarf in the universe, photo realistic, 8k, cinematic lighting, hd, atmospheric, hyperdetailed, photography, glow effect.

PYoCo

HPDM-T2V (ours)

An epic tornado attacking above a glowing city at night, the tornado is made of smoke, highly detailed.

PYoCo

HPDM-T2V (ours)

Glass sphere filled with swirling multicolored liquid, cinematic lighting.

PYoCo

HPDM-T2V (ours)

A high quality 3D render of hyperrealist, super strong, multicolor stripped, and fluffy bear with wings, highly detailed, sharp focus.

PYoCo

HPDM-T2V (ours)
