⚗️ Video Alchemist
Multi-subject Open-set Personalization in Video Generation
Video Alchemist personalizes video generation
Given a text prompt and a set of reference images that depict the entity words in the prompt, Video Alchemist generates a video conditioned on both the text and the reference images.
Video Alchemist supports multi-subject open-set personalization
Video Alchemist has built-in multi-subject, open-set personalization for both foreground subjects and backgrounds, eliminating the need for test-time optimization.
Video Alchemist is the state-of-the-art personalization model
Compared to existing personalization models, Video Alchemist generates videos with the best text alignment, the highest subject fidelity, and the largest motion dynamics†.
ELITE
VideoBooth
DreamVideo
Video Alchemist
Ground Truth
IP-Adapter
PhotoMaker
Magic-Me
Video Alchemist
Ground Truth
†Based on evaluation on our proposed benchmark, MSRVTT-Personalization. See the following section or the research paper for more details.
How do we make Video Alchemist work?
To achieve multi-subject, open-set personalization, we design a dataset construction pipeline that retrieves entity words from video captions and prepares the corresponding subject and background images. To mitigate the "copy-and-paste" effect, we collect subject images from multiple frames and introduce training-time image augmentations.
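As a rough illustration of such training-time augmentations (the exact transforms used in the paper may differ), the sketch below applies a random horizontal flip, a random crop with resize, and brightness jitter to a subject image, so the model cannot simply copy reference pixels into the output:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_subject(img):
    """Illustrative subject-image augmentations (hypothetical parameters):
    random horizontal flip, random 80% crop resized back via
    nearest-neighbor indexing, and brightness jitter."""
    h, w, _ = img.shape
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # horizontal flip
    ch, cw = int(h * 0.8), int(w * 0.8)         # crop size (80% of image)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    crop = img[y:y + ch, x:x + cw]
    ys = np.arange(h) * ch // h                 # nearest-neighbor resize
    xs = np.arange(w) * cw // w
    img = crop[ys][:, xs]
    img = np.clip(img * rng.uniform(0.8, 1.2), 0, 255)  # brightness jitter
    return img.astype(np.uint8)
```

Because the crops and jitter differ across training iterations, the model must learn the subject's identity rather than its exact pixel layout.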
Video Alchemist is built on new Diffusion Transformer modules with an additional cross-attention layer for personalization conditioning. To achieve multi-subject conditioning, we introduce subject-level fusion, which binds the word description of each subject to its image representations.
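The subject-level fusion can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (each subject's word embedding is added to its image tokens, and all subjects are flattened into one conditioning sequence attended by the video tokens), not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subject_level_fusion(word_tokens, image_tokens):
    """Bind each subject's word embedding to its image tokens, then
    flatten all subjects into one conditioning sequence.
    word_tokens: list of (d,) arrays, one per subject
    image_tokens: list of (n_i, d) arrays, one per subject
    """
    fused = [imgs + w[None, :] for w, imgs in zip(word_tokens, image_tokens)]
    return np.concatenate(fused, axis=0)        # (sum n_i, d)

def cross_attention(video_tokens, cond_tokens, Wq, Wk, Wv):
    """Single-head cross-attention from video tokens to the fused
    subject-conditioning sequence (illustrative weights)."""
    q = video_tokens @ Wq
    k = cond_tokens @ Wk
    v = cond_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v
```

Tying each image token to its subject's word embedding lets a single cross-attention layer route "the dog" tokens in the prompt toward the dog's reference images rather than another subject's.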
MSRVTT-Personalization: a new benchmark of personalization
To evaluate Video Alchemist, we introduce MSRVTT-Personalization, a benchmark that aims at accurate subject-fidelity assessment and supports various conditioning modes, including conditioning on face crops, on single or multiple arbitrary subjects, and on the combination of foreground objects and background. A test sample from MSRVTT-Personalization is shown below.
Ground Truth Video
Personalization Annotations
Evaluation Metrics
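As an illustrative example of how a subject-fidelity metric can work (the benchmark's exact metric definitions are given in the paper; this sketch assumes embedding-based matching with placeholder features), one can compare subject crops from the ground-truth annotations and the generated video in a shared embedding space:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def subject_similarity(gt_embs, gen_embs):
    """Hypothetical subject-fidelity score: for each ground-truth subject
    crop embedding, take the best match over generated-frame crop
    embeddings, then average over subjects."""
    return float(np.mean(
        [max(cosine_sim(g, e) for e in gen_embs) for g in gt_embs]
    ))
```

In practice the embeddings would come from a pretrained image encoder applied to detected subject crops; here they are treated as given vectors.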
Ablation study
With DINOv2 as the image encoder, Video Alchemist achieves better subject fidelity than with CLIP [1];
with word tokens, it correctly binds each reference image to its corresponding entity word [2];
and with the proposed image augmentations, it mitigates the copy-and-paste effect and synthesizes text-aligned videos [3].
Use CLIP [1]
No word token [2]
No augmentation [3]
Video Alchemist
Acknowledgement
We thank Ziyi Wu, Moayed Haji Ali, and Alper Canberk for their helpful discussions, and also extend our gratitude to Snap Inc. for providing the computational resources and fostering a conducive research environment. 🤗 🙏 👻