AV-Link

In-The-Wild Baseline Comparison on Video-to-Audio without Text Prompts

Task Description: To evaluate our method in real-world scenarios, we compare it against baselines on videos recorded in our lab. Input: video; output: audio. No text prompt is provided.

Baselines: We compare our method with three state-of-the-art approaches: FoleyCrafter, Diff-Foley, and Frieren. All baselines struggle with in-the-wild videos: their audio is poorly aligned in time with the video and low in quality. All videos are trimmed to 5 s for consistency.
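As an illustration of the 5 s trimming step, here is a minimal sketch that builds an ffmpeg command for cutting a clip. The page does not say which tool was used, so the use of ffmpeg, the helper names `build_trim_command` and `trim_clip`, and the choice to start every clip at 0 s are all assumptions made for this example.

```python
import subprocess

CLIP_SECONDS = 5.0  # clip length stated in the text


def build_trim_command(src, dst, duration=CLIP_SECONDS, start=0.0):
    """Return an ffmpeg argument list that keeps `duration` seconds from `start`.

    Hypothetical helper: the start offset of 0 s is an assumption.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-ss", f"{start:.3f}",     # seek to the start offset in the input
        "-i", str(src),            # input video
        "-t", f"{duration:.3f}",   # keep this many seconds
        str(dst),                  # output path; streams are re-encoded
    ]


def trim_clip(src, dst, duration=CLIP_SECONDS):
    """Run ffmpeg to write the trimmed clip (requires ffmpeg on PATH)."""
    subprocess.run(build_trim_command(src, dst, duration), check=True)
```

Re-encoding, rather than stream copying, lets the cut land on the requested timestamps instead of the nearest keyframe, which matters when the clips are used to judge temporal alignment.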

Dataset: We recorded 25 videos featuring sounds with distinct temporal characteristics.

Video ID	Ground Truth	Ours	FoleyCrafter	Diff-Foley	Frieren
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation