Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

CVPR 2022

¹Snap Inc., ²Rutgers University

Generated Videos on Multimodal VoxCeleb.

Abstract

Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate much longer sequences than the one used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.
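To make the sampling procedure concrete, below is a minimal, self-contained PyTorch sketch of generic mask-predict sampling over a codebook of discrete video tokens, conditioned on a fused multimodal embedding. The toy transformer, all sizes, and the linear re-masking schedule are illustrative assumptions, not the authors' implementation (the paper proposes its own improved mask-prediction algorithm).

```python
# Minimal sketch: bidirectional transformer + mask-predict sampling of
# discrete video tokens. Every module size and the schedule are assumptions.
import torch
import torch.nn as nn

VOCAB = 1024          # size of the (hypothetical) video-token codebook
MASK_ID = VOCAB       # extra index reserved for the [MASK] token
SEQ_LEN = 16 * 16     # e.g. a 16x16 token grid for one clip (assumption)
COND_DIM = 512        # dimension of the fused text/image condition embedding

class ToyBidirectionalTransformer(nn.Module):
    """Predicts a distribution over the codebook at every token position,
    conditioned on a multimodal embedding (stand-in for the real model)."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB + 1, COND_DIM)  # +1 for [MASK]
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, COND_DIM))
        layer = nn.TransformerEncoderLayer(COND_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(COND_DIM, VOCAB)

    def forward(self, tokens, cond):
        x = self.tok_emb(tokens) + self.pos_emb + cond.unsqueeze(1)
        return self.head(self.encoder(x))           # (B, SEQ_LEN, VOCAB)

@torch.no_grad()
def mask_predict(model, cond, steps=8):
    """Iteratively unmask tokens, re-masking the least confident ones."""
    B = cond.size(0)
    tokens = torch.full((B, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        # Linearly decay how many positions stay masked (an assumption).
        n_mask = int(SEQ_LEN * (1.0 - (step + 1) / steps))
        tokens = pred.clone()
        if n_mask > 0:
            lowest = conf.topk(n_mask, largest=False).indices
            tokens.scatter_(1, lowest, MASK_ID)
    return tokens                                    # discrete video tokens

if __name__ == "__main__":
    model = ToyBidirectionalTransformer()
    cond = torch.randn(2, COND_DIM)                  # fused text+image condition
    video_tokens = mask_predict(model, cond)
    print(video_tokens.shape)                        # torch.Size([2, 256])
```

The sampled token grid would then be decoded back to frames by the quantized video autoencoder; that decoder is omitted here.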

Multimodal VoxCeleb Dataset

Text-to-video generation

Example videos generated by MMVID on the Multimodal VoxCeleb dataset for text-to-video generation. We show three synthesized videos for each input multimodal condition.

Independent multimodal video generation

Example videos generated by MMVID on the Multimodal VoxCeleb dataset for independent multimodal video generation. The input control signals are text and a segmentation mask. We show two synthesized videos for each input multimodal condition.

Samples generated by MMVID conditioned on text and an artistic drawing.

Samples generated by MMVID conditioned on text and a partially observed image.

Dependent multimodal video generation

Example videos generated by MMVID on the Multimodal VoxCeleb dataset for dependent multimodal video generation. The input control signals are text, an image, and a segmentation mask. We show two synthesized videos for each input multimodal condition.

Samples generated by MMVID conditioned on text, an artistic drawing, and a segmentation mask.

Samples generated by MMVID conditioned on text, an image (used for appearance), and a video (used for motion guidance).

Textual Augmentation

Example videos generated by models trained with (w/ RoBERTa) and without (w/o RoBERTa) the RoBERTa language embedding used as text augmentation. Models are trained on the Multimodal VoxCeleb dataset for text-to-video generation. We show three synthesized videos for each input text condition.
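For reference, the sketch below shows one way to obtain such a RoBERTa-based sentence embedding as an auxiliary text feature. Using the Hugging Face transformers library, the roberta-base checkpoint, and the projection layer are assumptions for illustration, not the training code used here.

```python
# Sketch: a RoBERTa sentence embedding as an extra (augmenting) text feature.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def text_embedding(prompt: str) -> torch.Tensor:
    """Return a single sentence-level embedding for the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    hidden = roberta(**inputs).last_hidden_state     # (1, T, 768)
    return hidden[:, 0]                              # <s> token as summary

# Hypothetical projection into the generator's conditioning space,
# used alongside the raw text tokens.
project = torch.nn.Linear(768, 512)

emb = project(text_embedding("She has blond hair and is smiling."))
print(emb.shape)  # torch.Size([1, 512])
```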



Moving Shapes Dataset

Text-to-video generation

Samples generated by our approach on the Moving Shapes dataset for text-to-video generation. We show three synthesized videos for each input text condition.

Independent multimodal video generation

Samples generated by our approach on the Moving Shapes dataset for independent multimodal generation. The input control signals are text and a partially observed image (with the center masked out, shown in white). We show two synthesized videos for each input multimodal condition.

Dependent multimodal video generation

Samples generated by our approach on the Moving Shapes dataset for dependent multimodal generation. The input control signals are text and images. We show one synthesized video for each input multimodal condition.



iPER Dataset

Long sequence generation

Example videos generated by our approach on the iPER dataset for long sequence generation. The extrapolation process is repeated 100 times for each sequence, resulting in a 107-frame video. The textual input also controls the speed: "slow" indicates that the performed motion is slow, while "fast" indicates that it is fast. We show one synthesized video for each input text condition. The first video following the text input corresponds to the "slow" condition, the second to "normal", and the last to "fast".
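The sketch below illustrates the kind of sliding-window extrapolation loop described above: each step reuses the last few generated frame tokens as the condition and samples the next ones. The clip length, context length, and the sample_clip stand-in are assumptions, not the released implementation.

```python
# Sketch: generate a sequence longer than the training length by repeated
# extrapolation. Shapes and the random stand-in sampler are assumptions.
import torch

FRAMES_PER_CLIP = 8        # frames the model sees at once (assumption)
CONTEXT_FRAMES = 4         # trailing frames reused as the condition
TOKENS_PER_FRAME = 64      # e.g. an 8x8 token grid per frame (assumption)
VOCAB = 1024

def sample_clip(context_tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for mask-predict sampling of one clip: the first
    CONTEXT_FRAMES rows are fixed to the context, the rest are sampled."""
    new_frames = torch.randint(
        VOCAB, (FRAMES_PER_CLIP - CONTEXT_FRAMES, TOKENS_PER_FRAME))
    return torch.cat([context_tokens, new_frames], dim=0)

def extrapolate(num_repeats: int) -> torch.Tensor:
    # Start from one normally generated clip.
    video = sample_clip(torch.randint(VOCAB, (CONTEXT_FRAMES, TOKENS_PER_FRAME)))
    for _ in range(num_repeats):
        context = video[-CONTEXT_FRAMES:]            # last frames as condition
        clip = sample_clip(context)
        video = torch.cat([video, clip[CONTEXT_FRAMES:]], dim=0)
    return video                                     # (num_frames, TOKENS_PER_FRAME)

print(extrapolate(num_repeats=10).shape)             # torch.Size([48, 64])
```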

Temporal Interpolation

Example videos of our approach for video interpolation on the iPER dataset.



Supplemental Materials

More supplemental videos can be found at this webpage.

BibTeX

@article{han2022show,
  title={Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning},
  author={Han, Ligong and Ren, Jian and Lee, Hsin-Ying and Barbieri, Francesco and Olszewski, Kyle and Minaee, Shervin and Metaxas, Dimitris and Tulyakov, Sergey},
  journal={arXiv preprint arXiv:2203.02573},
  year={2022}
}