AutoReCap: We propose an efficient and scalable pipeline for collecting audio datasets. With it, we compile 57M ambient audio clips into AutoReCap-XL, the largest available audio-text dataset, at 90 times the scale of existing ones. Our data collection approach leverages existing automatic video transcription to identify segments with ambient sounds. We then caption the identified segments with our proposed captioning method, AutoCap, and exclude speech and music clips based on a keyword search.
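The keyword-based exclusion step can be illustrated with a minimal sketch. This is not the released pipeline: the keyword list, the function names `is_ambient` and `filter_clips`, and the clip record layout are all hypothetical, and the sketch assumes the search runs over the generated captions.

```python
import re

# Hypothetical keyword list; the actual keywords used for AutoReCap are not given here.
EXCLUDE_KEYWORDS = (
    "speech", "speaks", "speaking", "talks", "talking",
    "music", "song", "sings", "singing",
)

# Whole-word, case-insensitive match against any excluded keyword.
_EXCLUDE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in EXCLUDE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_ambient(caption: str) -> bool:
    """Return True if the caption mentions none of the excluded keywords."""
    return _EXCLUDE_PATTERN.search(caption) is None


def filter_clips(clips):
    """Keep only clips whose caption passes the keyword check.

    Each clip is assumed to be a dict with a "caption" key (hypothetical layout).
    """
    return [clip for clip in clips if is_ambient(clip["caption"])]


if __name__ == "__main__":
    clips = [
        {"id": "a", "caption": "A dog barks"},
        {"id": "b", "caption": "A man is speaking over music"},
        {"id": "c", "caption": "Fireworks are going off"},
    ]
    for clip in filter_clips(clips):
        print(clip["id"], clip["caption"])
```

Whole-word matching is a deliberate choice in this sketch: it avoids dropping a caption just because a keyword appears inside a longer, unrelated word.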
Please refer to the GitHub page for instructions on downloading the dataset!
Dataset Samples
Click anywhere on the page, then hover over (or click) a video to listen to samples from our proposed dataset, AutoReCap.
A loud bang
Clicking and rustling
A crowd of people chanting
A crowd of people chanting and cheering
Fireworks are going off
Rain falls onto a hard surface
A person breathes heavily
Paper is being crumpled
A dog barks
A gun is fired and a man yells
A motorcycle engine is running
A motorcycle engine revving
A beep followed by a beep
A vehicle engine is idling and then revving up
A power tool drilling
Some objects are crumpled
Multiple dogs howling
Waves crash against a shoreline
Traffic passes by in the distance
A large motor vehicle engine idles and then accelerates
Birds chirp in the distance, followed by a small motor running
A power tool motor is running
Fireworks are going off
A crowd of people are screaming and cheering, and the wind is blowing