AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Rice University
Snap Research

In-The-Wild Baseline Comparison on Video-to-Audio without Text Prompts

Task Description: To evaluate our method in real-world scenarios, we compare it against baselines on videos recorded in our lab. Input: video → Output: audio.

Baselines: We compare our method with three state-of-the-art approaches: FoleyCrafter, Diff-Foley, and Frieren. All three struggle with in-the-wild videos: the audio they generate is temporally misaligned with the video and low in quality. All videos are trimmed to 5 s for consistency.

Dataset: We recorded various videos featuring sounds with distinct temporal characteristics.