AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Snap Research


Baseline Comparison on VGGSound

Task description: Given only an input audio clip, AV-Link generates a temporally aligned video. Input: audio -> video.

Baselines: We compare AV-Link with TempoToken. We observe that TempoToken's videos are not temporally aligned with the input audio and are occasionally not semantically consistent with it.

Dataset: We selected samples from VGGSound that exhibit clear temporal actions. Our method generates 36x64 videos at 6 fps. We use an upsampler to increase the generated video resolution to 144x256.
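
For a sense of the numbers above, going from 36x64 to 144x256 is a 4x increase along both height and width. The sketch below only illustrates that shape arithmetic. It uses plain nearest-neighbor pixel repetition in NumPy as a stand-in, which is an assumption for illustration and not the upsampler used for these results. The function name `upsample_nearest` and the 2-second clip length are hypothetical.

```python
import numpy as np


def upsample_nearest(video: np.ndarray, factor: int) -> np.ndarray:
    """Repeat each pixel `factor` times along height and width.

    `video` has shape (frames, height, width, channels).
    This is a simple stand-in, not the upsampler used on this page.
    """
    return video.repeat(factor, axis=1).repeat(factor, axis=2)


# Hypothetical clip: 2 seconds at 6 fps, 36x64 pixels, RGB.
fps = 6
low_res = np.zeros((2 * fps, 36, 64, 3), dtype=np.uint8)

# 36 * 4 = 144 and 64 * 4 = 256, so a factor of 4 matches the stated sizes.
high_res = upsample_nearest(low_res, factor=4)
print(high_res.shape)  # (12, 144, 256, 3)
```

The frame count is unchanged by this step; only the spatial resolution grows.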

[Video comparison grid: three side-by-side pairs, each showing Ours next to TempoToken]

Additional Qualitative Results

We provide additional generated videos from our model based solely on the audio input.