AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

Snap Research

Control generated audio with input text prompts!

Since the same video can plausibly be paired with many different audio tracks, we demonstrate our model's ability to control the generated sound using input text prompts. For this, we selected videos from the Movie Gen Benchmarks that feature temporal actions.

Without Prompt

Birds chirping in the background

Upbeat music playing

Humming sound

Without Prompt

People cheering

Music playing in the background

Thunder sound

Without Prompt

People cheering

Person breathing heavily in the background

People clapping

Without Prompt

Birds chirping in the background

People cheering

Music playing

Without Prompt

Wind blows

Person breathing heavily in the background

In a crowded place

Without Prompt

Person breathing heavily in the background

Wind blows

Birds chirping in the background

Without Prompt

Birds chirping in the background

Relaxing guitar music playing

Wind blows