🐼 Panda-70M

A Large-Scale Dataset with 70M High-Quality Video-Caption Pairs

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao,
Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

A blue off-road truck is driving on a sand dune and jumping into the air.

There are ants tunneling under a thick carpet of moss.

A person is holding a long haired dachshund in their arms.

There is a river flowing through a forest and the water is flowing downstream.

A group of basketball players are practicing their shots on the court.

A rocket launches into space on the launch pad.

Someone is frying dough balls in a pan with oil.

A person is kneading dough and putting jam on it.

A person is driving a boat on a river with rocks and waterfalls.

A woman is playing golf at an outdoor driving range.

It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.

The waves are crashing on the beach and the water is foamy.

A rhino and a lion are fighting in the dirt.

A blue toyota tacoma truck is parked in a parking lot surrounded by trees.

A person is making a pie crust on a table.

A large pile of lava blocking a road.

We will remove video samples from our dataset / GitHub / project webpage upon request. To make a request, please contact tsaishienchen at gmail dot com.

Download Panda-70M

Training [full] (2.01 GB) 70.7M samples / 167 khrs duration / ~36 TB
Training [10M] (381 MB) 10.5M samples / 37.0 khrs duration / ~8.0 TB
Training [2M] (86.5 MB) 2.4M samples / 7.56 khrs duration / ~1.6 TB
Validation (803 KB) 6000 samples / 18.5 hrs duration / ~4.0 GB
Testing (803 KB) 6000 samples / 18.5 hrs duration / ~4.0 GB
Code for Dataset Downloading

The video samples are collected from a publicly available dataset.
Users must comply with the related license when using these video samples.
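As a quick sanity check on the split statistics listed above, the average clip length of each split can be derived from its sample count and total duration. The snippet below is only an illustrative calculation using the rounded numbers from this page; the dictionary and function names are made up for the example and are not part of the released code.

```python
# Rounded statistics copied from the download list: (number of samples, total hours).
SPLITS = {
    "Training [full]": (70.7e6, 167e3),
    "Training [10M]": (10.5e6, 37.0e3),
    "Training [2M]": (2.4e6, 7.56e3),
    "Validation": (6000, 18.5),
    "Testing": (6000, 18.5),
}


def mean_clip_seconds(num_samples: float, total_hours: float) -> float:
    """Average clip duration in seconds for one split."""
    return total_hours * 3600.0 / num_samples


for name, (num_samples, total_hours) in SPLITS.items():
    print(f"{name}: {mean_clip_seconds(num_samples, total_hours):.1f} s per clip")
```

With these rounded figures, the full training split works out to roughly 8.5 seconds per clip, and the validation and testing splits to roughly 11.1 seconds per clip.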

Collection Pipeline of Panda-70M

We first collect 3.8M long videos from the HD-VILA-100M dataset and split them into 70.8M semantically coherent clips (blue). Next, we use several teacher models with different multimodal inputs to generate multiple candidate captions for each video clip (green). Lastly, we finetune a fine-grained retrieval model to select the caption that best describes the video clip as its annotation (yellow).
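The last step of the pipeline above, choosing one caption among several teacher outputs, can be pictured as an argmax over a clip-to-caption matching score. The sketch below is a toy illustration only: `toy_score` is a hypothetical word-overlap stand-in for the finetuned retrieval model, and the tag set, teacher names, and function names are all assumptions made for this example rather than the authors' implementation.

```python
from typing import Callable, Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {w.strip(".,").lower() for w in text.split() if w.strip(".,")}


def toy_score(clip_tags: Set[str], caption: str) -> float:
    """Hypothetical stand-in for the retrieval model: Jaccard word overlap."""
    words = tokenize(caption)
    union = words | clip_tags
    return len(words & clip_tags) / len(union) if union else 0.0


def select_caption(
    clip_tags: Set[str],
    candidates: Dict[str, str],
    score: Callable[[Set[str], str], float] = toy_score,
) -> str:
    """Return the candidate caption with the highest matching score."""
    best_teacher = max(candidates, key=lambda t: score(clip_tags, candidates[t]))
    return candidates[best_teacher]


# Assumed clip representation: a few content words describing the clip.
clip_tags = {"rocket", "launches", "launch", "pad", "space"}

# Candidate captions, one per (hypothetical) teacher model.
candidates = {
    "teacher_a": "A rocket launches into space on the launch pad.",
    "teacher_b": "A person is making a pie crust on a table.",
    "teacher_c": "The waves are crashing on the beach.",
}

selected = select_caption(clip_tags, candidates)
print(selected)
```

In the actual pipeline the score comes from a learned video-text retrieval model rather than word overlap; the `score` argument is left pluggable here to reflect that.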

Demo of Long Video Annotation

We demonstrate our splitting and captioning algorithm on long videos. The generated captions are shown as subtitles.



We show the value of Panda-70M on three downstream tasks by comparing models trained on an existing dataset with models trained on the proposed dataset. For a fair comparison, we use the same model architecture, the same training configuration, and the same amount of training data in all comparisons. For more details:

Read research paper


We sincerely thank everyone who contributed to the meaningful discussions, and we extend our gratitude to Snap Inc. for providing the computational resources and fostering a conducive research environment. 🤗 🙏 👻

Copyright © Snap Inc. 2024. This dataset is made available by Snap Inc. for informational purposes only. No license, whether implied or otherwise, is granted in or to such dataset (including any rights to copy, modify, publish, distribute and/or commercialize such dataset), unless you have entered into a separate agreement for such rights. Such dataset is provided as-is, without warranty of any kind, express or implied, including any warranties of merchantability, title, fitness for a particular purpose, non-infringement, or that such dataset is free of defects, errors or viruses. In no event will Snap Inc. be liable for any damages or losses of any kind arising from the dataset or your use thereof.