AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Rice University
Snap Research

Bring Silent Videos to Life!

πŸŽ‰ Introducing AV-Link, our new research framework that generates synchronized audio for your silent videos. Watch your videos come alive with sound! 🎢✨




Add Action to Your Audio!

🎬 Transform sound into sight! Watch your audio clip spring to life with a perfectly synchronized video. Audio meets action, seamlessly! 🎢✨




Control the Soundtrack of Your Story!

πŸ“βœ¨ Turn words into sound! Use text prompts to craft the perfect audio for your silent video. Whether it is immersive background sounds, foley effects, or music synced to your scene, the power is in your hands. Unleash your creativity and let your video be alive! 🎢πŸŽ₯


Example prompt: "Upbeat music playing"


Learn more about AV-Link!


We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that relies on feature extractors pretrained for other tasks to produce the conditioning signal, AV-Link directly leverages features obtained from the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate that our method achieves synchronized, high-quality audiovisual content, showcasing its potential for applications in immersive media generation.
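To make the idea concrete, here is a minimal PyTorch sketch of what such a fusion block could look like. Everything in it is an illustrative assumption rather than the paper's implementation: the class name, feature widths, and the use of plain joint self-attention over the concatenated token sequence (the actual Fusion Block additionally enforces temporal alignment between audio and video tokens).

```python
# Illustrative sketch only; all names and dimensions are assumptions, not
# AV-Link's released code. Video and audio features are projected into a
# shared width, mixed with joint self-attention over the concatenated token
# sequence (so information flows in both directions), then routed back.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    def __init__(self, video_dim: int, audio_dim: int, dim: int = 512, heads: int = 8):
        super().__init__()
        # Project each modality into a shared feature width (assumed design).
        self.to_shared_v = nn.Linear(video_dim, dim)
        self.to_shared_a = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Project fused features back to each backbone's width.
        self.to_video = nn.Linear(dim, video_dim)
        self.to_audio = nn.Linear(dim, audio_dim)

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor):
        # video_feats: (B, Tv, video_dim), audio_feats: (B, Ta, audio_dim).
        v = self.to_shared_v(video_feats)
        a = self.to_shared_a(audio_feats)
        # Joint self-attention over both token sets enables bidirectional
        # exchange: video tokens attend to audio tokens and vice versa.
        x = torch.cat([v, a], dim=1)
        fused, _ = self.attn(x, x, x)
        fused = self.norm(fused + x)  # residual connection
        v_out, a_out = fused[:, : v.shape[1]], fused[:, v.shape[1] :]
        return self.to_video(v_out), self.to_audio(a_out)


if __name__ == "__main__":
    block = FusionBlock(video_dim=1024, audio_dim=768)
    v = torch.randn(2, 16, 1024)  # 16 video tokens
    a = torch.randn(2, 32, 768)   # 32 audio tokens
    dv, da = block(v, a)
    print(dv.shape, da.shape)  # (2, 16, 1024) and (2, 32, 768)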



Compared to current Video-to-Audio and Audio-to-Video methods, AV-Link provides a unified framework for both tasks. Rather than relying on feature extractors pretrained for other tasks (e.g., CLIP, CLAP), we directly leverage the activations of pretrained frozen Flow Matching models through a Fusion Block that achieves precise temporal alignment between modalities. Our approach offers competitive semantic alignment and improved temporal alignment in a self-contained framework for both modalities.
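As a companion sketch, the snippet below shows one generic way to read intermediate activations out of a frozen network using PyTorch forward hooks. The toy backbone and the hooked layers are stand-ins chosen for illustration; they do not reflect AV-Link's actual frozen Flow Matching backbones.

```python
# Illustrative only: tapping intermediate activations from a frozen network
# with forward hooks. The nn.Sequential backbone is a placeholder, not the
# frozen video/audio models used by AV-Link.
import torch
import torch.nn as nn

backbone = nn.Sequential(  # stand-in for a frozen pretrained backbone
    nn.Linear(64, 128), nn.GELU(),
    nn.Linear(128, 128), nn.GELU(),
    nn.Linear(128, 64),
)
backbone.requires_grad_(False).eval()  # keep the backbone frozen

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register hooks on the layers whose features should condition the other
# modality (layer indices here are an arbitrary choice for this sketch).
for idx in (1, 3):
    backbone[idx].register_forward_hook(save_activation(f"layer_{idx}"))

with torch.no_grad():
    _ = backbone(torch.randn(2, 64))

for name, feat in activations.items():
    print(name, feat.shape)  # e.g. layer_1 torch.Size([2, 128])
```

In a conditioning pipeline of this kind, the saved activations would then be handed to a fusion module such as the block sketched above, rather than to a separately pretrained feature extractor.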

Explore AV-Link Capabilities!

Explore our model's capabilities in both Audio-to-Video and Video-to-Audio generation tasks. Use the links below to navigate to different pages showcasing our results.

Video-to-Audio Generation

Audio-to-Video Generation


Paper

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, and Sergey Tulyakov


Bibtex:

@misc{avlink,
      title={AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation},
      author={Moayed Haji-Ali and Willi Menapace and Aliaksandr Siarohin and Ivan Skorokhodov and Alper Canberk and Kwot Sin Lee and Vicente Ordonez and Sergey Tulyakov},
      year={2024},
      eprint={2412.15191},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.15191},
}