AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Rice University
Snap Research

Baseline Comparison on VGGSound

Task description: Based on input audio alone, AV-Link generates temporally-aligned videos. Input: audio -> video.

Baselines: We compare AV-Link with TempoToken. We observe that TempoToken lacks temporal alignment and occasionally generates videos that are not semantically consistent with the input audio.

Dataset: We selected samples from VGGSound that exhibit clear temporal actions. Our method generates 36x64 videos at 6 fps. We use an upsampler to increase the generated video resolution to 144x256.
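
The resolutions above imply a 4x increase along both height and width (36 to 144, 64 to 256). As a rough illustration of the tensor shapes involved, the sketch below uses nearest-neighbor pixel repetition in NumPy as a stand-in; this is an assumption for illustration only and is not the upsampler used for these results. The function name `upsample_nearest` and the 2-second clip length are hypothetical.

```python
import numpy as np

def upsample_nearest(video: np.ndarray, factor: int) -> np.ndarray:
    """Repeat every pixel `factor` times along height and width.

    `video` has shape (frames, height, width, channels).
    Nearest-neighbor repetition is a placeholder, not the page's upsampler.
    """
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

fps = 6        # frame rate stated above
seconds = 2    # hypothetical clip length, chosen only for this example
low_res = np.zeros((fps * seconds, 36, 64, 3), dtype=np.uint8)

high_res = upsample_nearest(low_res, factor=4)
print(low_res.shape, "->", high_res.shape)
# (12, 36, 64, 3) -> (12, 144, 256, 3)
```

The frame count is unchanged: only the spatial resolution grows, so the clip still plays at 6 fps.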

Additional Qualitative Results

We provide additional generated videos from our model based solely on the audio input.