---
license: apache-2.0
pipeline_tag: image-text-to-text
---

**TinyLLaVA-Video**
[![arXiv](https://img.shields.io/badge/Arxiv-2501.15513-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.15513) [![Github](https://img.shields.io/badge/Github-Github-blue.svg)](https://github.com/ZhangXJ199/TinyLLaVA-Video)

For training data, we combine partial data from two datasets: [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) and [Valley](https://github.com/RupertLuo/Valley).

| Stage    | Source                    | #Samples |
| -------- | :-----------------------: | :------: |
| Pretrain | LLaVA-Video-178K + Valley | 397k     |
| Finetune | LLaVA-Video-178K          | 491k     |

#### Pretrain Data

We use four subsets of [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K): ``0_30_s_academic_v0_1``, ``30_60_s_academic_v0_1``, ``0_30_s_youtube_v0_1``, and ``30_60_s_youtube_v0_1``, supplemented with the filtered [Video-LLaVA](https://huggingface.co/datasets/LanguageBind/Video-LLaVA). We provide the cleaned annotation files; the video data can be downloaded from [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) and [Video-LLaVA](https://huggingface.co/datasets/LanguageBind/Video-LLaVA).

#### Finetune Data

We use four subsets of [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K): ``0_30_s_academic_v0_1``, ``30_60_s_academic_v0_1``, ``0_30_s_youtube_v0_1``, and ``30_60_s_youtube_v0_1``. We provide the cleaned annotation files; the video data can be downloaded from [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K).

#### Organize Data

Organize the video files and annotation files as follows in ``path/to/your/dataset`` (a quick layout-check sketch is included at the end of this card):

```Shell
dataset
├── academic_source
├── liwei_youtube_videos
├── valley
├── text_files
│   ├── cleaned_video_caption.json
│   ├── cleaned_video_openqa.json
```

**Note: If there is any infringement, please contact us for removal.**
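
Before launching training, it can help to sanity-check the layout above. The following is a minimal sketch, not part of the official codebase: it assumes the directory and file names shown in the tree, uses a placeholder dataset root, and assumes each cleaned annotation file is a JSON list of samples (LLaVA-style); adjust as needed.

```python
import json
from pathlib import Path

# Placeholder root; replace with your own "path/to/your/dataset".
DATASET_ROOT = Path("path/to/your/dataset")

EXPECTED_DIRS = ["academic_source", "liwei_youtube_videos", "valley"]
ANNOTATION_FILES = [
    "text_files/cleaned_video_caption.json",  # pretrain captions
    "text_files/cleaned_video_openqa.json",   # finetune open-ended QA
]


def check_layout(root: Path) -> None:
    """Verify the directory layout described in 'Organize Data' (hypothetical helper)."""
    for name in EXPECTED_DIRS:
        status = "ok" if (root / name).is_dir() else "MISSING"
        print(f"[{status}] {name}/")

    for rel in ANNOTATION_FILES:
        path = root / rel
        if not path.is_file():
            print(f"[MISSING] {rel}")
            continue
        # Assumption: the cleaned annotation file is a JSON list of samples.
        with path.open("r", encoding="utf-8") as f:
            samples = json.load(f)
        print(f"[ok] {rel}: {len(samples)} entries")


if __name__ == "__main__":
    check_layout(DATASET_ROOT)
```

Running the script prints one status line per expected directory and annotation file, which makes it easy to spot a misplaced folder before pretraining or finetuning starts.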