Taming Data and Transformers for Audio Generation

Moayed Haji-Ali,  Willi Menapace,  Aliaksandr Siarohin,  Guha Balakrishnan,  Sergey Tulyakov,  Vicente Ordonez 

Rice University · Snap Research
Audio samples generated by our audio generator GenAu from the following prompts:
A man speaks followed by a toilet flush
A chainsaw cutting as wood is cracking
A crowd murmurs as a siren blares and then stops at a distance
A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding
A mid-size motor vehicle engine accelerates and is accompanied by hissing and spinning tires, then it decelerates and an adult male begins to speak
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone
A cat meows and hisses
Fireworks pop and explode
A small child and woman speak with splashing water
Horses growl and clop hooves
A gunshot firing in the distance followed by steam hissing and fire crackling
A woman speaks with chirping frogs and distant music playing
Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone
A dog barking as a man is talking while wind blows into a microphone as birds chirp in the distance
A vehicle driving by while splashing water as a stream of water trickles and flows followed by a thunder roaring in the distance while wind blows into a microphone
Audio clips with captions generated by our audio captioner AutoCap:
A man speaks as wind blows and water splashes
A train moves getting closer and a horn is triggered
Dishes are being moved and a woman laughs and speaks
A child speaks followed by a loud crash and a scream
A baby laughs and splashes, and an adult female speaks
Birds chirp in the distance, and then a dog barks nearby
Some light rustling followed by a clank then water pouring
Birds chirp in the distance, followed by a man speaking nearby, after which insects buzz nearby
A telephone dialing followed by a series of plastic clicking then plastic clanking before plastic thumps on a surface


Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches a CIDEr score of 83.2, marking a 3.2% improvement over the best available captioning model, at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions and forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Moreover, since AutoCap is fully automatic, new audio samples can continuously be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.

Our Models

AutoCap Model Diagram

AutoCap: We employ frozen CLAP and HTSAT audio encoders to produce the audio representation. We then compact this representation into 4× fewer tokens using a Q-Former module. This improves the efficiency of the captioning model and aligns the audio representation with the language representation of a pretrained BART encoder-decoder model, which aggregates these tokens along with tokens extracted from useful metadata to produce the output caption.
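The Q-Former compaction step can be sketched as cross-attention from a small set of learnable query tokens to the full sequence of audio tokens. The following is a minimal NumPy sketch; the token counts, embedding width, and single-head attention are illustrative assumptions, not the actual AutoCap implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    """Scaled dot-product attention: (Tq, d), (Tk, d), (Tk, d) -> (Tq, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d = 64  # hypothetical embedding width
# stand-in for the frozen CLAP/HTSAT audio token sequence
audio_tokens = rng.standard_normal((256, d))
# learnable queries: 4x fewer tokens than the audio sequence
queries = rng.standard_normal((256 // 4, d))

# queries attend over all audio tokens, compacting 256 tokens into 64
compact = attention(queries, audio_tokens, audio_tokens)
print(compact.shape)  # (64, 64): 4x fewer tokens, same width
```

In the full model, the compacted tokens would then be fed to the BART encoder alongside the metadata tokens.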

GenAu Model Diagram

GenAu: We use a frozen audio 1D-VAE to produce a sequence of latents from a Mel-Spectrogram representation. Following the FIT architecture, these latents are patchified and divided into groups, which are processed by local attention layers. The read and write operations are implemented as cross-attention layers that transfer information between the input latents and learnable latent tokens. Finally, global attention layers process the latent tokens with attention spanning all groups, enabling global communication.
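The local / read / global / write information flow above can be sketched with plain attention over grouped token arrays. This is a minimal NumPy sketch of the data flow only; the group sizes, widths, and single-head attention are illustrative assumptions, not the actual FIT-based GenAu layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    """Scaled dot-product attention: (Tq, d), (Tk, d), (Tk, d) -> (Tq, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d, n_groups, group_len, n_latents = 32, 4, 16, 8
# stand-in for patchified VAE latents, already split into groups
patches = rng.standard_normal((n_groups, group_len, d))
# learnable latent tokens, one small set per group
latents = rng.standard_normal((n_groups, n_latents, d))

# local attention: patch tokens attend only within their own group
patches = np.stack([attention(g, g, g) for g in patches])
# "read": latent tokens gather information from their group's patches
latents = np.stack([attention(l, p, p) for l, p in zip(latents, patches)])
# global attention: spans all latent tokens across every group
flat = latents.reshape(-1, d)
flat = attention(flat, flat, flat)
latents = flat.reshape(n_groups, n_latents, d)
# "write": patch tokens read the globally mixed information back
patches = np.stack([attention(p, l, l) for p, l in zip(patches, latents)])
print(patches.shape)  # (4, 16, 32)
```

The design keeps expensive global attention on the small set of latent tokens, while the much longer patch sequence only ever attends locally.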



Taming Data and Transformers for Audio Generation (514 KB)

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, and Vicente Ordonez