| Split (download size) | # Samples | Video Duration | Storage Space |
|---|---|---|---|
| Training [full] (2.73 GB) | 70.7M | 167 khrs | ~36 TB |
| Training [10M] (504 MB) | 10.5M | 37.0 khrs | ~8.0 TB |
| Training [2M] (118 MB) | 2.4M | 7.56 khrs | ~1.6 TB |
| Validation (1.2 MB) | 6000 | 18.5 hrs | ~4.0 GB |
| Testing (1.2 MB) | 6000 | 18.5 hrs | ~4.0 GB |
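
As a rough sanity check on the split sizes, the sample counts and durations listed above imply an average clip length of about 8.5 seconds for the full training split and about 11 to 13 seconds for the other splits. The sketch below only redoes that arithmetic from the rounded figures in the table, so the results are approximate; the names `SPLITS` and `mean_clip_seconds` are illustrative and not part of the released code.

```python
# Rounded figures copied from the split table: (number of samples, total hours).
SPLITS = {
    "Training [full]": (70.7e6, 167e3),
    "Training [10M]": (10.5e6, 37.0e3),
    "Training [2M]": (2.4e6, 7.56e3),
    "Validation": (6000, 18.5),
    "Testing": (6000, 18.5),
}


def mean_clip_seconds(samples: float, hours: float) -> float:
    """Average clip length in seconds, given a sample count and total hours."""
    return hours * 3600.0 / samples


for name, (samples, hours) in SPLITS.items():
    print(f"{name}: ~{mean_clip_seconds(samples, hours):.1f} s per clip")
```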
🔥 Updates (Oct 2024)
To enhance the training of video generation models, we introduce two additional annotations:
Desirability Filtering and Shot Boundary Detection. Check here for more details.
The video samples are collected from a publicly available dataset.
Users must follow the related license to use these video samples.
We first collect 3.8M long videos from the HD-VILA-100M dataset and split them into 70.8M semantically coherent clips (blue). Next, we use several teacher models with different multimodal inputs to generate multiple candidate captions for each video clip (green). Lastly, we finetune a fine-grained retrieval model to select the caption that best describes the video clip as its annotation (yellow).
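
The final selection step can be pictured as an argmax over candidate captions. The sketch below is a toy illustration only, not the released pipeline: it assumes the retrieval model can be reduced to a video embedding plus a text-embedding function, and scores each candidate by cosine similarity. The names `select_caption`, `toy_embed_text`, and `TOY_TEXT_EMBEDDINGS`, along with the two-dimensional vectors, are hypothetical stand-ins.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_caption(
    video_emb: Sequence[float],
    captions: List[str],
    embed_text: Callable[[str], Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate caption with the highest video-text score."""
    if not captions:
        raise ValueError("at least one candidate caption is required")
    scores = [cosine(video_emb, embed_text(c)) for c in captions]
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores[best]


# Hypothetical embeddings standing in for a real retrieval model's outputs.
TOY_TEXT_EMBEDDINGS = {
    "a dog runs on grass": [0.9, 0.1],
    "a person talks to the camera": [0.1, 0.9],
}


def toy_embed_text(caption: str) -> Sequence[float]:
    return TOY_TEXT_EMBEDDINGS[caption]


video_embedding = [1.0, 0.0]  # hypothetical embedding of one video clip
caption, score = select_caption(
    video_embedding, list(TOY_TEXT_EMBEDDINGS), toy_embed_text
)
print(caption)
```

In this toy setup the first caption is selected because its embedding points in nearly the same direction as the clip embedding; a real retrieval model would supply both embeddings, or the score directly.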
We demo our splitting and captioning algorithm on long videos (scroll to view more). The results are shown in the subtitles:
We show the value of Panda-70M on three downstream tasks. We compare models trained on existing datasets with models trained on the proposed dataset. For a fair comparison, we use the same model architecture, the same training configuration, and the same amount of training data in every comparison. For more details:
We sincerely thank everyone who contributed to the meaningful discussions, and we also extend our gratitude to Snap Inc. for providing the computational resources and fostering a conducive research environment. 🤗 🙏 👻
Copyright © 2024 Snap Inc. All rights reserved. This dataset and code is made available by Snap Inc. for non-commercial, research purposes only. Non-commercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Research purposes mean solely for study, instruction, or non-commercial research, testing or validation. No commercial license, whether implied or otherwise, is granted in or to this dataset and code, unless you have entered into a separate agreement with Snap Inc. for such rights. This dataset and code is provided as-is, without warranty of any kind, express or implied, including any warranties of merchantability, title, fitness for a particular purpose, non-infringement, or that the code is free of defects, errors or viruses. In no event will Snap Inc. be liable for any damages or losses of any kind arising from this dataset and code or your use thereof. Any redistribution of this dataset and code must retain or reproduce the above copyright notice, conditions and disclaimer.