VIMI: Grounding Video Generation through Multi-modal Instruction

Yuwei Fang1 Willi Menapace1 Aliaksandr Siarohin1 Tsai-Shien Chen1,2,*
Kuan-Chien Wang1 Ivan Skorokhodov1 Graham Neubig3 Sergey Tulyakov1

Snap Inc.1 UC Merced2 Carnegie Mellon University3
*Work performed while interning at Snap Inc.

Paper Code

Abstract

Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, which leaves these models without visual grounding and restricts their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts, and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. In the second stage, we fine-tune the model from the first stage on three video generation tasks, incorporating multimodal instructions. This process further refines the model's ability to handle diverse inputs and tasks, ensuring seamless integration of multimodal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, producing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous visually grounded video generation methods, VIMI synthesizes consistent and temporally coherent videos with large motion while retaining semantic control. Lastly, VIMI also achieves state-of-the-art text-to-video generation results on the UCF101 benchmark.

Our Framework

Retrieve-Augmented Pretraining for Videos

We first construct a large-scale dataset by employing retrieval methods to pair multimodal in-context examples with the given text prompts. We then present a multimodal conditional video generation framework for pretraining on these augmented datasets, as sketched below.
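The following is a minimal sketch (not the authors' code) of this retrieval-based pairing step: each text prompt is matched against a bank of image embeddings from a shared text-image encoder, and the top-scoring images become its in-context examples. The embeddings, retrieval bank, and identifiers below are random placeholders so the snippet runs standalone.

```python
# Minimal sketch: pairing each text prompt with retrieved in-context images
# via embedding similarity. The embeddings are assumed to come from a shared
# text-image encoder (e.g., CLIP-style); here they are random placeholders.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical retrieval bank: N candidate images, each with an embedding.
bank_image_ids = [f"img_{i:05d}" for i in range(1000)]
bank_embeddings = normalize(rng.normal(size=(1000, 512)))

def build_multimodal_prompt(prompt_text, prompt_embedding, k=3):
    """Return the text prompt paired with its top-k retrieved images."""
    scores = bank_embeddings @ normalize(prompt_embedding[None, :]).T  # cosine scores
    top_k = np.argsort(-scores[:, 0])[:k]
    return {
        "text": prompt_text,
        "in_context_images": [bank_image_ids[i] for i in top_k],
    }

# Example: augment one caption from a (hypothetical) video-caption dataset.
caption = "a golden retriever running on the beach at sunset"
caption_embedding = rng.normal(size=512)  # placeholder for a real text embedding
print(build_multimodal_prompt(caption, caption_embedding))
```

In practice the prompt and bank embeddings would come from the same pretrained encoder so that cosine similarity is meaningful across modalities.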

Multimodal Instruction Tuning for Videos

We then propose multimodal instruction tuning for video generation, grounding the model on customized inputs specified by different multimodal instructions across three tasks: subject-driven video generation, video prediction, and text-to-video generation. A schematic of the instruction format is sketched below.
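Below is a schematic sketch of how training examples for these three tasks could be expressed as multimodal instructions; the templates, field names, and <image>/<video> placeholder tokens are illustrative assumptions rather than the paper's actual prompt format.

```python
# Schematic sketch (illustrative only): formatting training examples for the
# three fine-tuning tasks as multimodal instructions. Templates and field
# names are assumptions, not the paper's actual prompt format.
from typing import Dict, List, Optional

def make_instruction(task: str,
                     text: str,
                     subject_images: Optional[List[str]] = None,
                     context_frames: Optional[List[str]] = None) -> Dict:
    if task == "subject_driven":
        # Ground generation on one or more reference images of the subject(s).
        instruction = "Generate a video of " + " and ".join(
            "<image>" for _ in subject_images) + f" {text}"
        media = list(subject_images)
    elif task == "video_prediction":
        # Continue a clip from the provided context frames.
        instruction = f"<video> Predict the next frames. {text}"
        media = list(context_frames)
    elif task == "text_to_video":
        # Plain text-to-video: no visual conditioning.
        instruction = f"Generate a video: {text}"
        media = []
    else:
        raise ValueError(f"unknown task: {task}")
    return {"task": task, "instruction": instruction, "media": media}

# Example usage with hypothetical file paths.
print(make_instruction("subject_driven",
                       "playing guitar on a rooftop",
                       subject_images=["dog.png", "guitar.png"]))
print(make_instruction("video_prediction",
                       "the car keeps driving down the road",
                       context_frames=["frame_00.png", "frame_01.png"]))
print(make_instruction("text_to_video", "a rocket launching at dawn"))
```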

Single-Subject Video Generation

Multi-Subject Video Generation

Video Prediction

BibTeX

@article{fang2024vimi,
  title={VIMI: Grounding Video Generation through Multi-modal Instruction},
  author={Fang, Yuwei and Menapace, Willi and Siarohin, Aliaksandr and Chen, Tsai-Shien and Wang, Kuan-Chien and Skorokhodov, Ivan and Neubig, Graham and Tulyakov, Sergey},
  journal={arXiv preprint arXiv:2407.06304},
  year={2024}
}