A blue off-road truck is driving on a sand dune and jumping into the air.

There are ants tunneling under a thick carpet of moss.

A person is holding a long haired dachshund in their arms.

There is a river flowing through a forest and the water is flowing downstream.

A group of basketball players are practicing their shots on the court.

A rocket launches into space on the launch pad.

Someone is frying dough balls in a pan with oil.

A person is kneading dough and putting jam on it.

A person is driving a boat on a river with rocks and waterfalls.

A woman is playing golf at an outdoor driving range.

It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.

The waves are crashing on the beach and the water is foamy.

A rhino and a lion are fighting in the dirt.

A blue toyota tacoma truck is parked in a parking lot surrounded by trees.

A person is making a pie crust on a table.

A large pile of lava blocking a road.

We will remove video samples from our dataset, GitHub repository, project webpage, and technical presentations upon request. To make a request, please contact tsaishienchen at gmail dot com.

Download Panda-70M

Training [full] (2.73 GB)	70.7M samples / 167 khrs duration / ~36 TB
Training [10M] (504 MB)	10.5M samples / 37.0 khrs duration / ~8.0 TB
Training [2M] (118 MB)	2.4M samples / 7.56 khrs duration / ~1.6 TB
Validation (1.2 MB)	6000 samples / 18.5 hrs duration / ~4.0 GB
Testing (1.2 MB)	6000 samples / 18.5 hrs duration / ~4.0 GB
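
As a rough sanity check on the figures above, the average clip length of each split can be derived by dividing total duration by sample count. The sketch below hardcodes the numbers from the table; the names SPLITS and average_clip_seconds are illustrative and are not part of the Panda-70M tooling.

```python
# Figures copied from the download table:
# split name -> (sample count, total duration in hours).
SPLITS = {
    "training_full": (70.7e6, 167e3),
    "training_10m": (10.5e6, 37.0e3),
    "training_2m": (2.4e6, 7.56e3),
    "validation": (6000, 18.5),
    "testing": (6000, 18.5),
}


def average_clip_seconds(samples, hours):
    """Return the mean clip duration in seconds for one split."""
    return hours * 3600.0 / samples


if __name__ == "__main__":
    for name, (samples, hours) in SPLITS.items():
        seconds = average_clip_seconds(samples, hours)
        print(f"{name}: {seconds:.1f} s per clip")
```

With these figures, the full training split averages roughly 8.5 seconds per clip, and the validation and testing splits about 11.1 seconds per clip.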

Code for Dataset Downloading

🔥 Updates (Oct 2024)
To enhance the training of video generation models, we introduce two additional annotations:
Desirability Filtering and Shot Boundary Detection. Check here for more details.

The video samples are collected from a publicly available dataset.
Users must follow the related license to use these video samples.

Collection Pipeline of Panda-70M

We first collect 3.8M long videos from the HD-VILA-100M dataset and split them into 70.8M semantically coherent clips (blue). Next, we use several teacher models with different multimodal inputs to generate multiple captions for each video clip (green). Lastly, we fine-tune a fine-grained retrieval model to select the caption that best describes the video clip as its annotation (yellow).
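
The final selection step can be illustrated with a minimal sketch. It assumes that the clip and each candidate caption have already been embedded into a shared vector space, and it uses cosine similarity as the matching score. Both are assumptions made for illustration: the actual retrieval model and its scoring are not described here, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(clip_embedding, candidates):
    """Pick the caption whose embedding is most similar to the clip embedding.

    candidates is a list of (caption_text, caption_embedding) pairs,
    one pair per teacher model (hypothetical data layout).
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    best_text, _ = max(
        candidates,
        key=lambda pair: cosine_similarity(clip_embedding, pair[1]),
    )
    return best_text


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, invented purely for demonstration.
    clip = [0.9, 0.1, 0.0]
    captions = [
        ("A person is making a pie crust on a table.", [0.0, 1.0, 0.0]),
        ("A rocket launches into space on the launch pad.", [1.0, 0.0, 0.0]),
    ]
    print(select_best_caption(clip, captions))
```

Keeping the scoring function separate from the selection logic means a learned video-text matching score could be swapped in for cosine similarity without changing the rest of the sketch.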

Demo of Long Video Annotation

We demo our splitting and captioning algorithm on long videos (scroll to view more). The results are shown in the subtitles:

🐼 Panda-70M

A Large-Scale Dataset with 70M High-Quality Video-Caption Pairs
