Click anywhere on the page, then hover over (or click) the waveforms to listen to the generated sound through our audio generator GenAu.
A man speaks followed by a toilet flush
A chainsaw cutting as wood is cracking
A crowd murmurs as a siren blares and then stops at a distance
A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding
A mid-size motor vehicle engine accelerates and is accompanied by hissing and spinning tires, then it decelerates and an adult male begins to speak
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone
A cat meows and hisses
Fireworks pop and explode
A small child and woman speak with splashing water
Horses growl and clop hooves
A gunshot firing in the distance followed by steam hissing and fire crackling
A woman speaks with chirping frogs and distant music playing
Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone
A dog barking as a man is talking while wind blows into a microphone as birds chirp in the distance
A vehicle driving by while splashing water as a stream of water trickles and flows followed by a thunder roaring in the distance while wind blows into a microphone
Click anywhere on the page, then hover over (or click) the waveforms to listen to audio and examine the captions generated through our audio captioner AutoCap.
A man speaks as wind blows and water splashes
A train moves getting closer and a horn is triggered
Dishes are being moved and a woman laughs and speaks
A child speaks followed by a loud crash and a scream
A baby laughs and splashes, and an adult female speaks
Birds chirp in the distance, and then a dog barks nearby
Some light rustling followed by a clank then water pouring
Birds chirp in the distance, followed by a man speaking nearby, after which insects buzz nearby
A telephone dialing followed by a series of plastic clicking then plastic clanking before plastic thumps on a surface
Abstract
Generating ambient sounds and effects is a challenging problem due to data scarcity
and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two
new models. First, we propose AutoCap, a high-quality and efficient automatic
audio captioning model. We show that by leveraging metadata available with the
audio modality, we can substantially improve the quality of captions. AutoCap
reaches a CIDEr score of 83.2, marking a 3.2% improvement over the best available
captioning model while being four times faster at inference. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality
captions, forming the largest available audio-text dataset. Second, we propose
GenAu, a scalable transformer-based audio generation architecture that we scale
up to 1.25B parameters and train on our new dataset. Compared to state-of-the-art audio generators, GenAu obtains improvements of 15.7%
in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating substantially
improved quality of generated audio over previous works. This shows that
the quality of data is often as important as its quantity. Moreover, since AutoCap is
fully automatic, new audio samples can be added to the training dataset, unlocking
the training of even larger generative models for audio synthesis.
Our Models
AutoCap: We employ frozen CLAP and HTSAT audio encoders to produce the
audio representation. We then compact this representation into 4x fewer tokens using a
Q-Former module. This improves the efficiency
of the captioning model and aligns the audio representation with the language
representation of a pretrained BART encoder-decoder model, which aggregates these tokens along
with tokens extracted from useful metadata to produce the output caption.
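As an illustration, the Q-Former-style compaction step can be sketched as a single cross-attention operation in which a small set of learnable query tokens summarizes a longer audio token sequence. This is a minimal NumPy sketch; the shapes, weight initialization, and function names are illustrative assumptions, not the actual AutoCap implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_compact(audio_tokens, queries, Wq, Wk, Wv):
    # Cross-attention: learnable queries attend to the audio tokens,
    # producing one output token per query (fewer tokens than the input).
    Q = queries @ Wq
    K = audio_tokens @ Wk
    V = audio_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V

rng = np.random.default_rng(0)
d = 64
audio = rng.normal(size=(256, d))    # e.g. 256 tokens from the audio encoders
queries = rng.normal(size=(64, d))   # 4x fewer learnable query tokens
W = lambda: rng.normal(size=(d, d)) / np.sqrt(d)

out = qformer_compact(audio, queries, W(), W(), W())
print(out.shape)  # (64, 64): 256 audio tokens compacted into 64 tokens
```

The compacted tokens would then be fed, together with metadata tokens, to the BART encoder-decoder; that part is omitted here.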
GenAu: We use a frozen audio 1D-VAE to produce a sequence of latents from a
Mel-Spectrogram representation. Following the FIT architecture, these latents are patchified
and divided into groups that are processed by local attention
layers. Read and write
operations are implemented as cross-attention layers that transfer information between input
latents and learnable latent tokens.
Finally, global attention layers process the latent tokens with
attention spanning all groups of latent tokens, enabling global communication.
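The information flow above (local attention within groups, a "read" into latent tokens, global attention across all latent tokens, and a "write" back to the input) can be sketched as follows. This is a minimal single-head NumPy sketch with made-up shapes; the real GenAu uses learned multi-head attention, normalization, and feed-forward layers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Plain single-head scaled dot-product attention, batched over groups.
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1]))
    return w @ v

rng = np.random.default_rng(0)
d, n_groups, group_len, n_latents = 32, 8, 16, 4

# Patchified input latents, split into groups: (groups, tokens, dim)
x = rng.normal(size=(n_groups, group_len, d))
# Learnable latent tokens per group: (groups, latents, dim)
z = rng.normal(size=(n_groups, n_latents, d))

x = x + attend(x, x, x)          # local attention within each group
z = z + attend(z, x, x)          # "read": latents gather info from their group
zg = z.reshape(1, n_groups * n_latents, d)
zg = zg + attend(zg, zg, zg)     # global attention across all groups' latents
z = zg.reshape(n_groups, n_latents, d)
x = x + attend(x, z, z)          # "write": latents update the input tokens

print(x.shape)  # (8, 16, 32): same shape as the input latents
```

Because global attention only runs over the small set of latent tokens, its cost grows with the number of latents rather than the full sequence length, which is what makes this design scale to long latent sequences.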
Improvements
Paper
Taming Data and Transformers for Audio Generation (514 KB)