Task Description: To evaluate our method in real-world scenarios, we compare it against baselines on videos recorded in our lab. The task is video-to-audio generation: each model takes a video as input and generates audio for it.
Baselines: We compare our method with three state-of-the-art approaches: FoleyCrafter, Diff-Foley, and Frieren. We observe that all baselines struggle with in-the-wild videos, showing poor temporal alignment and producing low-quality audio. All videos are trimmed to 5 s for consistency.
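The 5 s trimming step can be sketched as follows. This is a minimal illustration and not the pipeline used for these recordings: it assumes ffmpeg is installed and that the clip is taken from the start of each recording, which the text does not specify. The helper names `build_trim_command` and `trim_clip` are hypothetical.

```python
import subprocess
from pathlib import Path

CLIP_SECONDS = 5.0  # clip length stated in the text


def build_trim_command(src: Path, dst: Path, seconds: float = CLIP_SECONDS) -> list[str]:
    """Return an ffmpeg argument list that keeps the first `seconds` of `src`.

    Assumption: the clip starts at the beginning of the recording.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(src),          # input recording
        "-t", f"{seconds:g}",    # limit the output duration
        str(dst),
    ]


def trim_clip(src: Path, dst: Path, seconds: float = CLIP_SECONDS) -> None:
    """Run ffmpeg to write the trimmed clip; raises if ffmpeg fails."""
    subprocess.run(build_trim_command(src, dst, seconds), check=True)
```

Calling `trim_clip(Path("raw/01.mp4"), Path("clips/01.mp4"))` would write a 5 s clip, where both paths are placeholders. Re-encoding rather than stream-copying is chosen here so the cut does not depend on keyframe positions.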
Dataset: We recorded various videos featuring sounds with distinct temporal characteristics.
Video ID | Ground truth | Ours | FoleyCrafter | Diff-Foley | Frieren |
---|---|---|---|---|---|
1 | |||||
2 | |||||
3 | |||||
4 | |||||
5 | |||||
6 | |||||
7 | |||||
8 | |||||
9 | |||||
10 | |||||
11 | |||||
12 | |||||
13 | |||||
14 | |||||
15 | |||||
16 | |||||
17 | |||||
18 | |||||
19 | |||||
20 | |||||
21 | |||||
22 | |||||
23 | |||||
24 | |||||
25 |