AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Rice University
Snap Research

Control generated audio with an input text prompt!

Because many different sounds can plausibly accompany the same video, we demonstrate our model's ability to control the generated sound using input text prompts. For this demonstration, we selected videos from the Movie Gen Benchmarks that feature temporal actions.