⚗️ Video Alchemist

Multi-subject Open-set Personalization in Video Generation


Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov,
Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov



Read research paper
View more comparisons

Video Alchemist personalizes video generation

Given a text prompt and a set of reference images that depict the entity words in the prompt, Video Alchemist generates a video conditioned on both the text and the reference images.

"A man pets a dog on sea ice."

*Click the images to switch among different reference images. Use the left and right buttons to switch between the conditional subjects: person, dog, or background.


Video Alchemist supports multi-subject open-set personalization

Video Alchemist offers built-in multi-subject, open-set personalization for both foreground objects and the background, eliminating the need for test-time optimization.


Video Alchemist is the state-of-the-art personalization model

Compared with existing personalization models, Video Alchemist generates videos with the best text alignment, the highest subject fidelity, and the largest video dynamics.

[Video comparison, subject personalization: ELITE · VideoBooth · DreamVideo · Video Alchemist · Ground Truth]

[Video comparison, face personalization: IP-Adapter · PhotoMaker · Magic-Me · Video Alchemist · Ground Truth]

View more comparisons

Results are based on the evaluation on our proposed benchmark, MSRVTT-Personalization. See the following section or the research paper for more details.


How do we make Video Alchemist work?

To achieve multi-subject, open-set personalization, we design a dataset construction pipeline that retrieves the entity words in each caption and prepares the corresponding subject and background images. To mitigate the "copy-and-paste" effect, we collect subject images from multiple video frames and introduce training-time image augmentations, sketched below.
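To illustrate the augmentation idea, here is a minimal torchvision sketch. The specific transforms and their parameters are assumptions for illustration rather than the exact recipe from the paper; the intent is to break the pixel-level correspondence between the reference crop and the target frame, so the model learns subject identity instead of copying pixels.

```python
from torchvision import transforms

# A sketch of training-time augmentations applied to each subject's
# reference image. The transform choices and parameters here are
# illustrative assumptions, not the exact recipe from the paper.
subject_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
])

# During training, each subject's reference image (a PIL.Image) is
# augmented independently before being encoded as conditioning.
def prepare_reference(image):
    return subject_augment(image)
```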

Video Alchemist is built on new Diffusion Transformer modules with an additional cross-attention layer for personalization conditioning. To achieve multi-subject conditioning, we introduce subject-level fusion, which binds the word description of each subject to its image representations.
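As a concrete illustration of subject-level fusion, below is a minimal PyTorch sketch: each subject's word-token embeddings are concatenated with its image-token embeddings, the per-subject sequences are concatenated across subjects, and the result serves as the key/value sequence of the personalization cross-attention. Module names and tensor shapes are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    """Sketch of a personalization cross-attention layer in a DiT block.

    For each subject, its word tokens (embeddings of its entity word in
    the prompt) are concatenated with its image tokens (embeddings of its
    reference images); the per-subject sequences are then concatenated
    across subjects to form the key/value sequence.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, subjects):
        # video_tokens: (B, N, D) latent video tokens from the DiT block
        # subjects: list of (word_tokens, image_tokens), each (B, *, D)
        fused = torch.cat(
            [torch.cat([w, i], dim=1) for w, i in subjects], dim=1
        )  # (B, total subject tokens, D)
        out, _ = self.attn(query=video_tokens, key=fused, value=fused)
        return video_tokens + out  # residual connection

# Toy usage with random tensors (shapes are illustrative):
B, N, D = 2, 64, 128
layer = SubjectCrossAttention(dim=D)
video = torch.randn(B, N, D)
subjects = [(torch.randn(B, 3, D), torch.randn(B, 16, D)),   # subject 1
            (torch.randn(B, 2, D), torch.randn(B, 16, D))]   # subject 2
out = layer(video, subjects)  # (B, N, D)
```

Fusing each subject's word tokens with its own image tokens is what lets the model associate, say, "dog" in the prompt with the dog reference images rather than with another subject.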

Read research paper

MSRVTT-Personalization: a new benchmark for personalization

To evaluate Video Alchemist, we introduce MSRVTT-Personalization, a benchmark that aims at accurate subject-fidelity assessment and supports various conditioning modes, including conditioning on face crops, on single or multiple arbitrary subjects, and on the combination of foreground objects and background. A test sample from MSRVTT-Personalization is shown below.

[Test sample: Ground Truth Video · Personalization Annotations · Evaluation Metrics]
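To make the subject-fidelity idea concrete, here is a minimal sketch of one plausible scoring scheme: crop each annotated subject from the generated frames, embed the crops and the reference image with DINOv2, and average their cosine similarities. The preprocessing and helper names are illustrative assumptions; the official metric definitions live in the benchmark code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a subject-fidelity score: cosine similarity between
# DINOv2 embeddings of the reference image and subject crops taken from
# the generated frames. Details here are illustrative assumptions.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def crop_subject(frame, box):
    # frame: (3, H, W) float tensor; box: (x0, y0, x1, y1) pixel coordinates
    x0, y0, x1, y1 = box
    crop = frame[:, y0:y1, x0:x1]
    # Resize to 224x224 (a multiple of the ViT-S/14 patch size).
    return F.interpolate(crop.unsqueeze(0), size=(224, 224),
                         mode="bilinear", align_corners=False).squeeze(0)

@torch.no_grad()
def embed(img):
    # img: (3, 224, 224) tensor, assumed already ImageNet-normalized.
    return F.normalize(encoder(img.unsqueeze(0)), dim=-1)  # (1, 384)

@torch.no_grad()
def subject_fidelity(reference, frames, boxes):
    # reference: (3, 224, 224); frames: list of (3, H, W); boxes: one per frame
    ref = embed(reference)
    sims = [F.cosine_similarity(ref, embed(crop_subject(f, b))).item()
            for f, b in zip(frames, boxes)]
    return sum(sims) / len(sims)  # average over frames
```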

Benchmark dataset and code

Ablation study

Video Alchemist achieves better subject fidelity when using DINOv2 as the image encoder instead of CLIP [1], correctly binds each reference image to its corresponding entity word through the use of word tokens [2], and mitigates the copy-and-paste effect while synthesizing text-aligned videos thanks to the proposed image augmentations [3].

[Ablation videos: Use CLIP [1] · No word token [2] · No augmentation [3] · Video Alchemist]


Acknowledgement

We thank Ziyi Wu, Moayed Haji Ali, and Alper Canberk for their helpful discussions, and also extend our gratitude to Snap Inc. for providing the computational resources and fostering a conducive research environment. 🤗 🙏 👻