⚗️ Video Alchemist

Multi-subject Open-set Personalization in Video Generation


Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov,
Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov



Read research paper
View more comparisons

Video Alchemist personalizes video generation

Given a text prompt and a set of reference images that depict the entity words in the prompt, Video Alchemist generates a video conditioned on both the text and the reference images.

"A man pets a dog on sea ice."

*Click the images to switch among different reference images. Use the left and right buttons to switch between the conditional subjects: person, dog, or background.


Video Alchemist supports multi-subject open-set personalization

Video Alchemist offers built-in multi-subject, open-set personalization for both foreground objects and the background, eliminating the need for test-time optimization.


Video Alchemist is the state-of-the-art personalization model

Compared with existing personalization models, Video Alchemist generates videos with the best text alignment, the highest subject fidelity, and the largest video dynamics.

[Video comparison, subject personalization: ELITE · VideoBooth · DreamVideo · Video Alchemist · Ground Truth]

[Video comparison, face personalization: IP-Adapter · PhotoMaker · Magic-Me · Video Alchemist · Ground Truth]

View more comparisons

Results are based on the evaluation on our proposed benchmark, MSRVTT-Personalization. See the following section or the research paper for more details.


How do we make Video Alchemist work?

To achieve multi-subject, open-set personalization, we design a dataset construction pipeline that retrieves the entity words in each caption and prepares the corresponding subject and background images. To mitigate the "copy-and-paste" effect, we collect subject images from multiple video frames and introduce training-time image augmentations, sketched below.
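To illustrate the augmentation idea, here is a minimal torchvision sketch. The specific transforms and their parameters are assumptions for illustration rather than the exact recipe from the paper; the intent is to break the pixel-level correspondence between the reference crop and the target frame, so the model learns subject identity instead of copying pixels.

```python
from torchvision import transforms

# A sketch of training-time augmentations applied to each subject's
# reference image. The transform choices and parameters here are
# illustrative assumptions, not the exact recipe from the paper.
subject_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
])

# During training, each subject's reference image (a PIL.Image) is
# augmented independently before being encoded as conditioning.
def prepare_reference(image):
    return subject_augment(image)
```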

Video Alchemist is built on new Diffusion Transformer modules with an additional cross-attention layer for personalization conditioning. To achieve multi-subject conditioning, we introduce subject-level fusion, which binds the word description of each subject to its image representations.
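As a concrete illustration of subject-level fusion, below is a minimal PyTorch sketch: each subject's word-token embeddings are concatenated with its image-token embeddings, the per-subject sequences are concatenated across subjects, and the result serves as the key/value sequence of the personalization cross-attention. Module names and tensor shapes are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    """Sketch of a personalization cross-attention layer in a DiT block.

    For each subject, its word tokens (embeddings of its entity word in
    the prompt) are concatenated with its image tokens (embeddings of its
    reference images); the per-subject sequences are then concatenated
    across subjects to form the key/value sequence.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, subjects):
        # video_tokens: (B, N, D) latent video tokens from the DiT block
        # subjects: list of (word_tokens, image_tokens), each (B, *, D)
        fused = torch.cat(
            [torch.cat([w, i], dim=1) for w, i in subjects], dim=1
        )  # (B, total subject tokens, D)
        out, _ = self.attn(query=video_tokens, key=fused, value=fused)
        return video_tokens + out  # residual connection

# Toy usage with random tensors (shapes are illustrative):
B, N, D = 2, 64, 128
layer = SubjectCrossAttention(dim=D)
video = torch.randn(B, N, D)
subjects = [(torch.randn(B, 3, D), torch.randn(B, 16, D)),   # subject 1
            (torch.randn(B, 2, D), torch.randn(B, 16, D))]   # subject 2
out = layer(video, subjects)  # (B, N, D)
```

Fusing each subject's word tokens with its own image tokens is what lets the model associate, say, "dog" in the prompt with the dog reference images rather than with another subject.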

Read research paper

MSRVTT-Personalization: a new benchmark for personalization

To evaluate Video Alchemist, we introduce MSRVTT-Personalization, a benchmark that aims at accurate subject-fidelity assessment and supports various conditioning modes, including conditioning on face crops, on single or multiple arbitrary subjects, and on the combination of foreground objects and background. A test sample from MSRVTT-Personalization is shown below.

[Test sample: Ground Truth Video · Personalization Annotations · Evaluation Metrics]
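To make the subject-fidelity idea concrete, here is a minimal sketch of one plausible scoring scheme: crop each annotated subject from the generated frames, embed the crops and the reference image with DINOv2, and average their cosine similarities. The preprocessing and helper names are illustrative assumptions; the official metric definitions live in the benchmark code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a subject-fidelity score: cosine similarity between
# DINOv2 embeddings of the reference image and subject crops taken from
# the generated frames. Details here are illustrative assumptions.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def crop_subject(frame, box):
    # frame: (3, H, W) float tensor; box: (x0, y0, x1, y1) pixel coordinates
    x0, y0, x1, y1 = box
    crop = frame[:, y0:y1, x0:x1]
    # Resize to 224x224 (a multiple of the ViT-S/14 patch size).
    return F.interpolate(crop.unsqueeze(0), size=(224, 224),
                         mode="bilinear", align_corners=False).squeeze(0)

@torch.no_grad()
def embed(img):
    # img: (3, 224, 224) tensor, assumed already ImageNet-normalized.
    return F.normalize(encoder(img.unsqueeze(0)), dim=-1)  # (1, 384)

@torch.no_grad()
def subject_fidelity(reference, frames, boxes):
    # reference: (3, 224, 224); frames: list of (3, H, W); boxes: one per frame
    ref = embed(reference)
    sims = [F.cosine_similarity(ref, embed(crop_subject(f, b))).item()
            for f, b in zip(frames, boxes)]
    return sum(sims) / len(sims)  # average over frames
```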

Benchmark dataset and code

Ablation study

Video Alchemist achieves better subject fidelity when using DINOv2 as the image encoder instead of CLIP [1], correctly binds each reference image to its corresponding entity word through the use of word tokens [2], and mitigates the copy-and-paste effect while synthesizing text-aligned videos thanks to the proposed image augmentations [3].

[Ablation videos: Use CLIP [1] · No word token [2] · No augmentation [3] · Video Alchemist]


Acknowledgement

We thank Ziyi Wu, Moayed Haji Ali, and Alper Canberk for their helpful discussions, and also extend our gratitude to Snap Inc. for providing the computational resources and fostering a conducive research environment. 🤗 🙏 👻