Taming Data and Transformers for Audio Generation


Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, Vicente Ordonez

Rice University, Snap Research

Comparison of AutoCap with state-of-the-art audio captioning methods:

We evaluate our captioning approach by comparing it with ENCLAP and CoNeTTE using the AudioCaps test split. To facilitate a clearer analysis, we highlight sections of the caption that the baseline methods either miss or describe inaccurately.

Input | Groundtruth Caption | AutoCap | ENCLAP | CoNeTTE
[audio] | A man talking as ocean waves trickle and splash while wind blows into a microphone | A man speaks as wind blows and water splashes | A man is speaking and wind is blowing | A man is speaking and wind is blowing
[audio] | An adult male speaks, birds chirp in the background, and many insects are buzzing | Birds chirp in the distance, followed by a man speaking nearby, after which insects buzz nearby | Birds are chirping and a man speaks | A man speaking with birds chirping in the background.
[audio] | A telephone dialing tone followed by a plastic switch flipping on and off | A telephone dialing followed by a series of plastic clicking then plastic clanking before plastic thumps on a surface | A telephone dialing followed by a series of electronic beeps | A telephone ringing followed by a beep.
[audio] | A running train and then a train whistle | A train moves getting closer and a horn is triggered | A train running on railroad tracks followed by a train horn blowing as wind blows into a microphone | A train horn blows and a steam whistle is blowing
[audio] | A female speaking with some rustling followed by another female speaking | Dishes are being moved and a woman laughs and speaks | A woman speaking followed by clanking | A woman is speaking and a child is laughing.
[audio] | A child is speaking followed by a door moving | A child speaks followed by a loud crash and a scream | A young girl speaks followed by a loud bang | A woman speaking followed by a door opening and closing.
[audio] | Water splashing as a baby is laughing and birds chirp in the background | A baby laughs and splashes, and an adult female speaks | A baby laughs and splashes in water | A baby is laughing and people are talking.
[audio] | Leaves rustling in the wind with dogs barking and birds chirping | Birds chirp in the distance, and then a dog barks nearby | Birds chirp and a dog barks | A dog is barking and a person is walking.
[audio] | Tapping followed by water spraying and more tapping | Some light rustling followed by a clank then water pouring | A faucet is turned on and runs | A toilet is flushed and water is running.

Comparison of GenAu with state-of-the-art text-to-audio methods:

We compare our method against Make-an-audio 1&2, AudioLDM 1&2, Tango 1&2, and Stable Audio. Please scroll right to access the rest of the methods.

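Input | Ours | Make-an-audio | Make-an-audio-2 | AudioLDM | AudioLDM2 | Tango | Tango2 | Stable Audio
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
A small child and woman speak with splashing water | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Horses growl and clop hooves. | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
A woman speaks with chirping frogs and distant music playing | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
A vehicle driving by while splashing water as a stream of water trickles and flows followed by a thunder roaring in the distance while wind blows into a microphone | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
A man speaks with a high frequency hum with some banging and clanking | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]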