AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Moayed Haji-Ali,  Willi Menapace,  Aliaksandr Siarohin,  Ivan Skorokhodov,  Alper Canberk,  Kwot Sin Lee,  Vicente Ordonez,  Sergey Tulyakov 

Rice University · Snap Research

Baseline Comparison on Video-to-Audio Meta Movie Gen Benchmark without Text Prompts

Task Description: We evaluate our method against baselines on the newly released Movie Gen Benchmark, which contains AI-generated videos; for each baseline, we replace the original audio with audio generated by that method. The Movie Gen license can be found here. Input: video → audio.

Baselines: We compare our method with state-of-the-art approaches: FoleyCrafter, Diff-Foley, and Frieren. We also include the videos released by Movie Gen, which were generated with text prompts. We observe that all baselines lack precise temporal alignment. All videos are downsampled and cropped to 5s for consistency.
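For reproducibility, the downsample-and-crop step above can be sketched as a small helper that assembles an ffmpeg command. The 5-second duration comes from the evaluation setup; the output height, file names, and exact ffmpeg flags are assumptions for illustration, not the authors' actual preprocessing script.

```python
# Hypothetical preprocessing sketch: build an ffmpeg command that
# downsamples a video and keeps only its first few seconds.
# The 5 s duration matches the evaluation setup; the target height
# and flag choices are illustrative assumptions.
def trim_command(src: str, dst: str, duration_s: int = 5, height: int = 360) -> list[str]:
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-t", str(duration_s),        # keep only the first `duration_s` seconds
        "-vf", f"scale=-2:{height}",  # downsample; -2 preserves aspect ratio
        dst,
    ]

cmd = trim_command("input.mp4", "clip_5s.mp4")
# Pass `cmd` to subprocess.run(cmd, check=True) to execute it.
```

Building the argument list (rather than a shell string) avoids quoting issues when file names contain spaces.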

Dataset: The Movie Gen Benchmark, recently released by Meta, features AI-generated videos. From the 527 released videos, we select those that display distinct temporal actions.

Baseline Comparison on Video-to-Audio Meta Movie Gen Benchmark with Text Prompts

Task Description: We evaluate our method against baselines on the newly released Movie Gen Benchmark, which contains AI-generated videos; for each baseline, we replace the original audio with audio generated by that method. The Movie Gen license can be found here. Input: video + audio text prompt → audio.

Baselines: We compare our method with state-of-the-art approaches: Movie Gen A2V, FoleyCrafter, Diff-Foley, and Seeing and Hearing. We observe that all baselines lack precise temporal alignment. All videos are downsampled and cropped to 5s for consistency.

Dataset: The Movie Gen Benchmark, recently released by Meta, features AI-generated videos. From the 527 released videos, we select those that display distinct temporal actions.