Snap Research
Task Description: To evaluate our method in real-world scenarios, we compare it against baselines on videos recorded in our lab. Input: video. Output: audio.
Baselines: We compare our method with three state-of-the-art approaches: FoleyCrafter, Diff-Foley, and Frieren. All baselines struggle with in-the-wild videos: their outputs are poorly aligned in time with the video and low in audio quality. All videos are cropped to 5 s for consistency.
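The 5 s cropping step can be sketched as below. This is only an illustration under stated assumptions: the page does not say which tool was used or which 5 s window was kept, so the use of `ffmpeg`, the default start offset of 0, the codec choices, and the helper names `build_trim_command` and `trim_clip` are all hypothetical.

```python
import subprocess

CLIP_SECONDS = 5.0  # clip length stated in the text


def build_trim_command(src, dst, start_s=0.0, duration_s=CLIP_SECONDS):
    """Return an ffmpeg argument list that keeps `duration_s` seconds of `src`.

    Assumption: the clip starts at `start_s` = 0; the source page does not
    say which window of each recording was kept.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if start_s < 0:
        raise ValueError("start_s must be non-negative")
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-ss", f"{start_s:.3f}",    # seek to the start of the window
        "-i", str(src),
        "-t", f"{duration_s:.3f}",  # limit the output duration
        "-c:v", "libx264",          # re-encode rather than stream-copy, so the
        "-c:a", "aac",              # cut does not depend on keyframe positions
        str(dst),
    ]


def trim_clip(src, dst, **kwargs):
    """Run the trim command; requires an ffmpeg binary on PATH."""
    subprocess.run(build_trim_command(src, dst, **kwargs), check=True)
```

Calling `trim_clip("raw/01.mp4", "clips/01.mp4")` (hypothetical paths) would write a 5 s clip. Applying the same call to every recording keeps clip lengths identical across the ground truth and all methods.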
Dataset: We recorded various videos featuring sounds with distinct temporal characteristics.
| Video ID | Ground Truth | Ours | FoleyCrafter | Diff-Foley | Frieren |
|---|---|---|---|---|---|
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| 13 | |||||
| 14 | |||||
| 15 | |||||
| 16 | |||||
| 17 | |||||
| 18 | |||||
| 19 | |||||
| 20 | |||||
| 21 | |||||
| 22 | |||||
| 23 | |||||
| 24 | |||||
| 25 | |||||