🐼 Panda-70M

A Large-Scale Dataset with 70M High-Quality Video-Caption Pairs


Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao,
Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

CVPR 2024






A blue off-road truck is driving on a sand dune and jumping into the air.

There are ants tunneling under a thick carpet of moss.

A person is holding a long haired dachshund in their arms.

There is a river flowing through a forest and the water is flowing downstream.

A group of basketball players are practicing their shots on the court.

A rocket launches into space on the launch pad.

Someone is frying dough balls in a pan with oil.

A person is kneading dough and putting jam on it.

A person is driving a boat on a river with rocks and waterfalls.

A woman is playing golf at an outdoor driving range.

It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.

The waves are crashing on the beach and the water is foamy.

A rhino and a lion are fighting in the dirt.

A blue toyota tacoma truck is parked in a parking lot surrounded by trees.

A person is making a pie crust on a table.

A large pile of lava blocking a road.

We will remove video samples from our dataset, GitHub repository, project webpage, and technical presentations upon request. To make a request, please contact tsaishienchen at gmail dot com.

Download Panda-70M

Training [full] (2.73 GB) 70.7M samples / 167 khrs duration / ~36 TB
Training [10M] (504 MB) 10.5M samples / 37.0 khrs duration / ~8.0 TB
Training [2M] (118 MB) 2.4M samples / 7.56 khrs duration / ~1.6 TB
Validation (1.2 MB) 6000 samples / 18.5 hrs duration / ~4.0 GB
Testing (1.2 MB) 6000 samples / 18.5 hrs duration / ~4.0 GB
Code for Dataset Downloading
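As a rough sanity check on the split sizes listed above, the average clip length of each split can be estimated by dividing total duration by sample count. The sketch below uses only the figures from the list; the `SPLITS` table and the `mean_clip_seconds` helper are illustrative names, not part of the released code, and the results are approximate because the listed figures are rounded.

```python
# Rounded figures copied from the download list: (number of samples, total hours).
SPLITS = {
    "Training [full]": (70.7e6, 167e3),
    "Training [10M]": (10.5e6, 37.0e3),
    "Training [2M]": (2.4e6, 7.56e3),
    "Validation": (6000, 18.5),
    "Testing": (6000, 18.5),
}


def mean_clip_seconds(num_samples: float, total_hours: float) -> float:
    """Average clip duration in seconds (hypothetical helper for illustration)."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return total_hours * 3600.0 / num_samples


if __name__ == "__main__":
    for name, (samples, hours) in SPLITS.items():
        print(f"{name}: ~{mean_clip_seconds(samples, hours):.1f} s per clip")
```

Under these rounded numbers, the full training split averages roughly 8.5 seconds per clip, while the smaller splits average about 11 to 13 seconds.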

🔥 Updates (Oct 2024)
To enhance the training of video generation models, we introduce two additional annotations:
Desirability Filtering and Shot Boundary Detection. Check here for more details.

The video samples are collected from a publicly available dataset.
Users must comply with the applicable license when using these video samples.



Collection Pipeline of Panda-70M

We first collect 3.8M long videos from the HD-VILA-100M dataset and split them into 70.8M semantically coherent clips (blue). Next, we use several teacher models with different multimodal inputs to generate multiple candidate captions for each video clip (green). Lastly, we fine-tune a fine-grained retrieval model to select the caption that best describes each video clip as its annotation (yellow).
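The final selection step can be illustrated with a minimal sketch. It assumes each clip and each candidate caption has already been mapped to an embedding vector, and it ranks candidates by plain cosine similarity. This is a simplified stand-in for the fine-tuned fine-grained retrieval model described above, not the authors' implementation; `cosine_similarity`, `select_best_caption`, and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    video_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return the (caption, score) pair whose embedding is closest to the video.

    `candidates` holds one (caption text, caption embedding) pair per teacher
    model. The similarity score here is an assumption for illustration only.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    scored = [
        (caption, cosine_similarity(video_embedding, embedding))
        for caption, embedding in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely to exercise the function.
    video = [1.0, 0.0]
    teacher_captions = [
        ("A truck is parked.", [0.0, 1.0]),
        ("A blue truck jumps over a sand dune.", [2.0, 0.1]),
        ("A person is cooking.", [-1.0, 0.0]),
    ]
    caption, score = select_best_caption(video, teacher_captions)
    print(caption, round(score, 3))
```

In this toy setup the second caption is selected because its embedding points in nearly the same direction as the video embedding. A real retrieval model would produce the embeddings and the matching score itself.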


Demo of Long Video Annotation

We demonstrate our splitting and captioning algorithm on long videos. The results are shown in the subtitles:


Statistics


Performance

We show the value of Panda-70M on three downstream tasks. We compare models trained on existing datasets with models trained on the proposed dataset. For a fair comparison, we use the same model architecture, the same training configuration, and the same amount of training data in all comparisons. For more details:

Read research paper

Acknowledgement

We sincerely thank everyone who contributed to the meaningful discussions, and we extend our gratitude to Snap Inc. for providing the computational resources and fostering a conducive research environment. 🤗 🙏 👻

Copyright © 2024 Snap Inc. All rights reserved. This dataset and code is made available by Snap Inc. for non-commercial, research purposes only. Non-commercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Research purposes mean solely for study, instruction, or non-commercial research, testing or validation. No commercial license, whether implied or otherwise, is granted in or to this dataset and code, unless you have entered into a separate agreement with Snap Inc. for such rights. This dataset and code is provided as-is, without warranty of any kind, express or implied, including any warranties of merchantability, title, fitness for a particular purpose, non-infringement, or that the code is free of defects, errors or viruses. In no event will Snap Inc. be liable for any damages or losses of any kind arising from this dataset and code or your use thereof. Any redistribution of this dataset and code must retain or reproduce the above copyright notice, conditions and disclaimer.