AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Rice University
Snap Research

Baseline comparison on VGGSounds with Text Guidance

Task description: Given an input audio clip and a text prompt, AV-Link generates temporally-aligned videos. Input: audio + video text description -> video.

Baselines: We compare AV-Link with TempoToken. We observe that videos generated by TempoToken are poorly aligned in time with the input audio.

Dataset: We selected samples from VGGSounds that exhibit clear temporal actions. Our method generates 36x64 videos at 6 fps. We use an upsampler to increase the resolution of the generated videos to 144x256.
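As a quick sanity check on the numbers above, the following minimal sketch computes the per-axis upsampling factor and confirms that the aspect ratio is preserved. It only does arithmetic on the stated resolutions and frame rate and says nothing about how the upsampler itself works. The helper names and the 10-second clip length in the usage line are assumptions made for illustration, not values from this page.

```python
from fractions import Fraction

# Values stated in the dataset description above.
BASE_H, BASE_W = 36, 64    # generated resolution (height x width)
UP_H, UP_W = 144, 256      # resolution after upsampling
FPS = 6                    # generated frame rate

def scale_factors(base, target):
    """Return exact (height, width) scale factors between two resolutions."""
    return Fraction(target[0], base[0]), Fraction(target[1], base[1])

def frame_count(duration_s, fps=FPS):
    """Frames needed to cover duration_s seconds (hypothetical helper)."""
    return round(duration_s * fps)

sh, sw = scale_factors((BASE_H, BASE_W), (UP_H, UP_W))
same_aspect = Fraction(BASE_W, BASE_H) == Fraction(UP_W, UP_H)

print(sh, sw)           # 4 4
print(same_aspect)      # True
print(frame_count(10))  # 60 (the 10 s duration is an assumed example value)
```

Both axes scale by the same integer factor of 4, so the upsampled frames keep the 16:9 aspect ratio of the generated ones.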

Additional Qualitative Results

We include additional videos generated by our model from the input audio and the video text prompt.