We compare our captioning approach, AutoCap, with ENCLAP and CoNeTTE on the AudioCaps test split. To make the comparison easier to follow, we highlight the parts of each caption that the baseline methods either miss or describe inaccurately.
Input | Groundtruth Caption | AutoCap | ENCLAP | CoNeTTE |
---|---|---|---|---|
(audio) | A man talking as ocean waves trickle and splash while wind blows into a microphone | A man speaks as wind blows and water splashes | A man is speaking and wind is blowing | A man is speaking and wind is blowing |
(audio) | An adult male speaks, birds chirp in the background, and many insects are buzzing | Birds chirp in the distance, followed by a man speaking nearby, after which insects buzz nearby | Birds are chirping and a man speaks | A man speaking with birds chirping in the background. |
(audio) | A telephone dialing tone followed by a plastic switch flipping on and off | A telephone dialing followed by a series of plastic clicking then plastic clanking before plastic thumps on a surface | A telephone dialing followed by a series of electronic beeps | A telephone ringing followed by a beep. |
(audio) | A running train and then a train whistle | A train moves getting closer and a horn is triggered | A train running on railroad tracks followed by a train horn blowing as wind blows into a microphone | A train horn blows and a steam whistle is blowing |
(audio) | A female speaking with some rustling followed by another female speaking | Dishes are being moved and a woman laughs and speaks | A woman speaking followed by clanking | A woman is speaking and a child is laughing. |
(audio) | A child is speaking followed by a door moving | A child speaks followed by a loud crash and a scream | A young girl speaks followed by a loud bang | A woman speaking followed by a door opening and closing. |
(audio) | Water splashing as a baby is laughing and birds chirp in the background | A baby laughs and splashes, and an adult female speaks | A baby laughs and splashes in water | A baby is laughing and people are talking. |
(audio) | Leaves rustling in the wind with dogs barking and birds chirping | Birds chirp in the distance, and then a dog barks nearby | Birds chirp and a dog barks | A dog is barking and a person is walking. |
(audio) | Tapping followed by water spraying and more tapping | Some light rustling followed by a clank then water pouring | A faucet is turned on and runs | A toilet is flushed and water is running. |
We compare our method against Make-an-audio 1&2, AudioLDM 1&2, Tango 1&2, and Stable Audio. Please scroll right to access the rest of the methods.
Input | Ours | Make-an-audio | Make-an-audio-2 | AudioLDM | AudioLDM2 | Tango | Tango2 | Stable Audio |
---|---|---|---|---|---|---|---|---|
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone | ||||||||
A small child and woman speak with splashing water | ||||||||
Horses growl and clop hooves. | ||||||||
A woman speaks with chirping frogs and distant music playing | ||||||||
A vehicle driving by while splashing water as a stream of water trickles and flows followed by a thunder roaring in the distance while wind blows into a microphone | ||||||||
Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone | ||||||||
A man speaks with a high frequency hum with some banging and clanking |