Taming Data and Transformers for Audio Generation

AutoReCapXL

47M audio-text pairs with variable length and minimum 0.1 CLAP similarity

AutoReCapXL-MQ

20.7M audio-text pairs with variable length and minimum 0.4 CLAP similarity

AutoReCapXL-MQ-L

14.7M audio-text pairs with clips of at least 5 seconds and minimum 0.4 CLAP similarity

AutoReCapXL-HQ

10.7M audio-text pairs with variable length and minimum 0.5 CLAP similarity
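
Going by the descriptions above, the subsets differ only in their minimum CLAP similarity and minimum clip length, so each one can be read as a filter over a shared pool of audio-text pairs. The following is a minimal sketch of that reading, not the released tooling: the `Pair` record, its field names, and the `select` helper are hypothetical, and it assumes every pair comes with a precomputed CLAP similarity and a duration in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one audio-text pair (field names are assumptions)."""
    caption: str
    clap_similarity: float
    duration_s: float


# Thresholds copied from the subset descriptions:
# (minimum CLAP similarity, minimum clip length in seconds).
SUBSETS = {
    "AutoReCapXL": (0.1, 0.0),
    "AutoReCapXL-MQ": (0.4, 0.0),
    "AutoReCapXL-MQ-L": (0.4, 5.0),
    "AutoReCapXL-HQ": (0.5, 0.0),
}


def select(pairs, subset):
    """Return the pairs that satisfy both thresholds of the named subset."""
    min_sim, min_dur = SUBSETS[subset]
    return [
        p for p in pairs
        if p.clap_similarity >= min_sim and p.duration_s >= min_dur
    ]


# Toy pool with made-up values, only to exercise the filter.
pool = [
    Pair("rain on a roof", 0.55, 8.0),
    Pair("door slams shut", 0.45, 2.0),
    Pair("wind in trees", 0.42, 6.5),
    Pair("distant traffic", 0.15, 10.0),
    Pair("unclear noise", 0.05, 4.0),
]

counts = {name: len(select(pool, name)) for name in SUBSETS}
```

Under this reading, raising the similarity threshold trades dataset size for tighter audio-caption agreement, which matches the shrinking pair counts listed for each subset.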

Code for Downloading Dataset

AutoReCap Collection Pipeline

AutoReCap: We propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset, at 90 times the scale of existing ones. Our data collection approach leverages existing automatic video transcriptions to identify segments with ambient sounds. We then use our proposed captioning method, AutoCap, to caption the identified segments, and we exclude speech and music clips based on a keyword search.
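
The final exclusion step can be pictured as a keyword match over the generated captions. The sketch below is illustrative only: the keyword list and the `is_ambient` helper are assumptions, since the actual keywords used in the pipeline are not listed here.

```python
import re

# Hypothetical keyword list; the keywords used by the actual pipeline are not given here.
EXCLUDE_KEYWORDS = ("speech", "speaking", "talking", "music", "singing", "song")

# Match any keyword as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in EXCLUDE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_ambient(caption: str) -> bool:
    """Return True when the caption mentions none of the excluded keywords."""
    return _PATTERN.search(caption) is None


captions = [
    "Rain falls on a metal roof",
    "A man is speaking over traffic noise",
    "Soft piano music plays",
    "Birds chirp in a forest",
]

# Keep only captions that do not describe speech or music.
ambient = [c for c in captions if is_ambient(c)]
```

Whole-word matching avoids dropping captions that merely contain a keyword as a substring, at the cost of missing inflected forms that are not in the list.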

Please refer to the GitHub page for instructions on downloading the dataset!