# Prepare Data
By default, all training data used by this repository is looked up under the directory `DATA_PATH`. Please specify or change your data location before running the code:

```bash
export DATA_PATH='/mnt/lustre/share_data/zhujinguo/open_source_dataset'
```
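The training code reads this location from the environment; a minimal illustration of the lookup (not the repository's actual code, and `imagenet_root` is just an example):

```python
import os

# The fallback here is only for illustration; the repository expects
# DATA_PATH to be exported as shown above.
DATA_PATH = os.environ.get("DATA_PATH", "./open_source_dataset")
imagenet_root = os.path.join(DATA_PATH, "imagenet")
```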
To make it easy for you to run the training code, we provide a toy dataset, which is a small subset of our pretraining data. Download it and unzip it into `DATA_PATH`. With this subset, you can train Uni-Perceiver with the config file configs/BERT_L12_H192_experiments/4tasks_training_small_datasets.yaml. Please refer to pretraining.md for training usage.
For tasks with fixed candidate target sets, such as image / video classification (where the target sets are the category labels) and masked language modeling (where the target set is the vocabulary), you also need to prepare the target-set files. Please refer to the jupyter notebook tools/generate_target_sets.ipynb for details; a sketch of the idea follows.
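For intuition, a target-set file is just every candidate target run through the CLIP BPE tokenizer (which wraps each string with `<|startoftext|>`/`<|endoftext|>`) and pickled. A minimal sketch under that assumption; the exact on-disk format produced by the notebook may differ:

```python
import pickle

import clip  # pip install git+https://github.com/openai/CLIP.git

# Toy target set: for ImageNet this would be the 1000 class names,
# for masked language modeling the whole vocabulary.
class_names = ["goldfish", "tench", "tiger shark"]

# clip.tokenize() returns the token ids, including <|endoftext|>.
tokens = {name: clip.tokenize(name) for name in class_names}

# "toy_..." is a hypothetical file name for this example.
with open("toy_class_name_CLIP_with_endoftext.pkl", "wb") as f:
    pickle.dump(tokens, f)
```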
For the complete datasets used to train our models, please download them according to the instructions below:
## Different datasets
Todo List:
- Imagenet-22k and Imagenet-1k
- books&wiki
- MSCOCO Caption
- YFCC
- CC12M
- CC3M
- Visual Genome
- SBU
- Kinetics-400 & Kinetics-700
- Moments in Time
- Flickr30k
- MSVD
- MSR-VTT
- GLUE
- VQA
## Imagenet-1k
Please download the images of the ImageNet dataset from the official website Imagenet.
We provide the annotation files (train.txt, val.txt and test.txt) in meta.
a) Tokenize the ImageNet class names to generate "imagenet_class_name_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated file we provide here.
Organize them as follows:
```
DATA_PATH/
└── imagenet/
    ├── imagenet_class_name_CLIP_with_endoftext.pkl
    ├── meta
    │   ├── test.txt
    │   ├── train.txt
    │   └── val.txt
    ├── test
    │   ├── ILSVRC2012_test_00000001.JPEG
    │   ├── ILSVRC2012_test_00000002.JPEG
    │   ├── ILSVRC2012_test_00000003.JPEG
    │   ├── ILSVRC2012_test_00000004.JPEG
    │   └── ...
    ├── train
    │   ├── n01440764
    │   │   ├── n01440764_10026.JPEG
    │   │   ├── n01440764_10027.JPEG
    │   │   ├── n01440764_10029.JPEG
    │   │   └── ...
    │   ├── n01443537
    │   │   └── ...
    │   ├── n01484850
    │   │   └── ...
    │   └── ...
    └── val
        ├── ILSVRC2012_val_00000001.JPEG
        ├── ILSVRC2012_val_00000002.JPEG
        ├── ILSVRC2012_val_00000003.JPEG
        └── ...
```
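The meta files are plain-text lists, one image per line; assuming each line is a `relative_path label` pair (an assumption about the provided files, not verified), they can be read as:

```python
# Hypothetical reader for meta/train.txt, assuming "path label" per line.
with open("meta/train.txt") as f:
    samples = [(path, int(label)) for path, label in
               (line.split() for line in f)]
```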
## Imagenet-22k
Please refer to the Imagenet-1k section above; the layout is the same.
The meta file is provided here.
The Imagenet-22k class name file for tokenization in generate_target_sets.ipynb is provided here, or you can directly use the CLIP-tokenized Imagenet-22k class name file provided here.
## Books&wiki
Please download the files wiki.doc and bc1g.doc, then concatenate them into one file:

```bash
cat wiki.doc bc1g.doc > bookswiki.doc
```
a) Tokenize the vocabulary to generate "vocabulary_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated file we provide here.
Then put these files in `DATA_PATH`:
```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── bert_pretrain_data/
    └── bookswiki/
        └── bookswiki.doc
```
You can also download the plain-text datasets from huggingface.co/datasets/wikipedia and huggingface.co/datasets/bookcorpus.
## MSCOCO
Please download the images of COCO 2014 from MSCOCO.
Download the preprocessed COCO captions from Karpathy's homepage: link, and extract "dataset_coco.json" from the zip file.
a) You can run coco_preprocess.py to split dataset_coco.json into train, val and test parts (see the sketch after option b):
- go into the /data/preprocess folder and open coco_preprocess.py;
- fill the 'original_json' variable with the path of the downloaded dataset_coco.json file;
- fill 'savepath' with the path where you want to save the split json files;
- run coco_preprocess.py.
b) Or you can directly use the generated json files we provide here.
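For reference, the split in option a) boils down to routing each image by the `split` field of the Karpathy annotations; a minimal sketch with placeholder paths (coco_preprocess.py remains the authoritative version):

```python
import json

original_json = "/path/to/dataset_coco.json"  # the downloaded Karpathy file
savepath = "/path/to/new_annotations"         # where the split files go

data = json.load(open(original_json))
splits = {"train": [], "val": [], "test": []}
for img in data["images"]:
    # Karpathy's "restval" images are conventionally folded into train.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img)

for name, imgs in splits.items():
    with open(f"{savepath}/captions_{name}.json", "w") as f:
        json.dump({"images": imgs}, f)
```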
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── mscoco_dataset/
    ├── new_annotations
    │   ├── captions_test5k.json
    │   ├── captions_train113k.json
    │   ├── captions_val5k.json
    │   └── dataset_coco.json
    └── coco_origin
        ├── train2014
        │   ├── COCO_train2014_000000000009.jpg
        │   ├── COCO_train2014_000000000025.jpg
        │   ├── COCO_train2014_000000000030.jpg
        │   └── ...
        └── val2014
            ├── COCO_val2014_000000000042.jpg
            ├── COCO_val2014_000000000073.jpg
            ├── COCO_val2014_000000000074.jpg
            └── ...
```
## Visual Genome
Please download the images and region descriptions of Visual Genome from VG.
a) You can run region_descriptions.ipynb to preprocess the downloaded "region_descriptions.json" file (see the sketch after option b):
- go into the /data/preprocess folder and open region_descriptions.ipynb;
- fill in the path of the downloaded 'region_descriptions.json' and the path where you want to save the processed file;
- run region_descriptions.ipynb.
b) Or you can directly use the generated json file we provide here.
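The preprocessing in option a) essentially flattens every region description into a caption-style record; a rough sketch assuming the standard VG region_descriptions.json layout (the notebook additionally filters regions, presumably where the 128filter name comes from):

```python
import json

entries = json.load(open("/path/to/region_descriptions.json"))

# Each entry holds one image id and a list of regions with text phrases.
captions = [
    {"image_id": entry["id"], "caption": region["phrase"]}
    for entry in entries
    for region in entry["regions"]
]
json.dump(captions, open("/path/to/vg_captions.json", "w"))
```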
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── visual_genome/
    ├── annotations
    │   ├── region_descriptions.json
    │   └── vg_captions_128filter.json
    └── images
        ├── VG_100K
        │   ├── 2.jpg
        │   ├── 3.jpg
        │   ├── 4.jpg
        │   └── ...
        └── VG_100K_2
            ├── 1.jpg
            ├── 51.jpg
            ├── 52.jpg
            └── ...
```
## Flickr30k
Please download the images of Flickr30k according to the instructions of Flickr30k.
Download flickr_jsons, which provides the annotations of the Flickr30k images.
a) You can run process_flickr_caption_json.py to preprocess the json files:
- go into the /data/preprocess folder and open process_flickr_caption_json.py;
- fill in the paths of the downloaded json files and the path where you want to save the processed json files;
- run process_flickr_caption_json.py.
b) Or you can directly use the generated json files (captions_test.json, captions_train.json and captions_val.json) we provide here.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
├── flickr30k/
│   ├── captions_test.json
│   ├── captions_train.json
│   └── captions_val.json
└── flickr30k_images
    └── flickr30k_images
        └── flickr30k_images
            ├── 36979.jpg
            ├── 65567.jpg
            └── ...
```
## SBU
Fill in the paths of the downloaded SBU url and caption files in sbu_download_list.py and run it to generate the download list.
Run the script sbu_download.sh to download the SBU images.
a) You can run make_sbu_json.py to get the annotation file.
b) Or you can directly download the generated json file sbucaption.json we provide.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── sbucaption/
    ├── annotations
    │   └── sbucaptions.json
    └── images
        ├── 4385058960_b0f291553e.jpg
        ├── 5148648301_1174ef59bc.jpg
        └── ...
```
## CC3M
Please download "Train_GCC-training.tsv" and "Validation_GCC-1.1.0-Validation.tsv" from here
Filling the path of "Train_GCC-training.tsv" in cc3m_train_download_list.py and run it for generating the training download list.
Filling the path of "Validation_GCC-1.1.0-Validation.tsv" in cc3m_val_download_list.py and run it for generating the validation download list.
Running the script cc3m_train_download.sh and cc3m_val_download.sh to download the cc3m images.
Zip (without compression) "train_image", "val_image" by:
zip -0 ../train_image.zip ./* zip -0 ../val_image.zip ./*
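JPEG files are already compressed, so storing them (-0) avoids pointless recompression and lets a loader read entries straight out of the archive. A minimal illustration of reading one image back (not the repository's actual loader):

```python
import io
import zipfile

from PIL import Image

with zipfile.ZipFile("train_image.zip") as zf:
    # Entry names match the file names inside the original folder.
    img = Image.open(io.BytesIO(zf.read("00000000.jpg"))).convert("RGB")
print(img.size)
```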
a) You can run make_cc3m_train_json.py and make_cc3m_val_json.py to get the annotation files.
b) Or you can directly download the generated json files train_spacy.json and val_spacy.json we provide.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── cc3m/
    ├── train_spacy.json
    ├── val_spacy.json
    ├── train_image
    │   ├── 00000000.jpg
    │   └── ...
    └── val_image
        ├── 00000000.jpg
        └── ...
```
## CC12M
Please download "cc12m.tsv" from here
Filling the path of "cc12m.tsv" in cc12m_train_download_list.py and run it for generating the training download list.
Running the script cc12m_train_download.sh to download the cc12m images.
Zip (without compression) "train_image" by:
zip -0 ../train_image.zip ./*
a) You can run make_cc12m_train_json.py to get the annotation file.
b) Or you can directly download the generated json file train_available.json we provide.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── cc12m/
    ├── train_available.json
    └── train_image
        ├── 00000000.jpg
        └── ...
```
## Kinetics-400 & Kinetics-700
Please download the Kinetics-400 & Kinetics-700 videos according to the instructions here.
a)
i. Fill in the paths of the downloaded K400 "training" and "validation" folders in k400_construct_csv.py and run it to generate the K400 related files (K400_val.csv, K400_train.csv, categories.txt, annotation.json).
ii. Fill in the paths of the downloaded K700 "training" and "validation" folders in k700_construct_csv.py and run it to generate the K700 related files (K700_val.csv, K700_train.csv, categories.txt, annotation.json).
iii. Run the script video_categories.ipynb to generate "category_mapping.txt".
b) Or you can directly download the processed files we provide: K400, K700
a) Tokenize the K400 and K700 class names to generate "k400_class_name_CLIP_with_endoftext.pkl" and "k700_class_name_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated files we provide: K400-CLIP and K700-CLIP.
Organize the files into the following structure:

```
DATA_PATH/
├── k400_class_name_CLIP_with_endoftext.pkl
└── K400/
    ├── training
    │   ├── abseiling
    │   │   ├── _4YTwq0-73Y_000044_000054.mp4
    │   │   └── ...
    │   ├── air_drumming
    │   └── ...
    ├── validation/
    │   ├── abseiling
    │   │   ├── __NrybzYzUg.mkv
    │   │   └── ...
    │   ├── air_drumming
    │   └── ...
    ├── annotation.json
    ├── category_mapping.txt
    ├── categories.txt
    ├── K400_train.csv
    └── K400_val.csv
```
K700 is similar.
## MomentsInTime
Please download the Moments in Time videos according to the instructions on the Official Website.
a)
i. Fill in the path of the downloaded "training" folder in moments_construct_csv.py and run it to generate the training files (moments_train.csv, categories.txt, annotation.json).
ii. Run the script video_categories.ipynb to generate "category_mapping.txt".
b) Or you can directly download the processed files we provide: moments.
a) Tokenize the Moments in Time class names to generate "MiT_class_name_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated file we provide: MiT-CLIP.
Organize the files into the following structure:

```
DATA_PATH/
├── MiT_class_name_CLIP_with_endoftext.pkl
└── MomentsInTime/
    ├── training
    │   ├── adult+female+singing
    │   │   ├── 0a2b81cb0ec5fde79b8c.mp4
    │   │   └── ...
    │   ├── adult+female+speaking
    │   └── ...
    ├── annotation.json
    ├── categories.txt
    ├── category_mapping.txt
    └── moments_train.csv
```
## MSVD
Download the MSVD videos "YouTubeClips.tar" from here and the preprocessed "txt_labels" from here.
a) Fill in the paths of the downloaded files in msvd_preprocess.py to generate the annotation files (caption_msvd_train_cocostyle.json, caption_msvd_val_cocostyle.json, caption_msvd_test_cocostyle.json).
b) Or directly download the annotation files we provide: new_annotations.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── msvd_dataset/
    ├── new_annotations
    │   ├── caption_msvd_test_cocostyle.json
    │   ├── caption_msvd_train_cocostyle.json
    │   └── caption_msvd_val_cocostyle.json
    ├── txt_labels
    │   ├── sents_test_lc_nopunc.txt
    │   ├── sents_train_lc_nopunc.txt
    │   ├── sents_val_lc_nopunc.txt
    │   └── youtube_mapping.txt
    └── YouTubeClips
        ├── _0nX-El-ySo_83_93.avi
        └── ...
```
## MSR-VTT
Download the MSR-VTT videos ("train_val_videos.zip", "test_videos.zip") and annotation files ("train_val_annotation.zip", "test_videodatainfo.zip") from here, and download the dataset split info from here.
a) Unzip the files downloaded above, fill in the paths of "test_videodatainfo.json", "train_val_videodatainfo.json", "MSRVTT_train.9k.csv" and "MSRVTT_JSFUSION_test.csv" in msrvtt_dataprocess_1k.ipynb, then run it (see the sketch below).
b) Or directly download the annotation files ("caption_msrvtt_1k_trainval_cocostyle.json", "caption_msrvtt_1k_test_cocostyle.json") we provide: annotations_new.
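The conversion in option a) is essentially reshaping MSR-VTT's sentences list into COCO-style caption records; a rough sketch assuming the standard videodatainfo layout (the notebook also applies the 9k/1k split from the csv files, and the output field names here are assumptions):

```python
import json

info = json.load(open("train_val_videodatainfo.json"))

# Each sentence pairs a caption string with its video id.
annotations = [
    {"image_id": s["video_id"], "caption": s["caption"]}
    for s in info["sentences"]
]
with open("caption_msrvtt_trainval_cocostyle.json", "w") as f:
    json.dump({"annotations": annotations}, f)
```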
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── msrvtt_dataset/
    ├── annotations_new
    │   ├── caption_msrvtt_1k_trainval_cocostyle.json
    │   └── caption_msrvtt_1k_test_cocostyle.json
    └── videos
        ├── video0.mp4
        └── ...
```
## VQA
Download the VQA meta data from the datalink vilbert provides, including the following files:
- dictionary.pkl
- train_ids.pkl
- val_ids.pkl
- train_target.pkl
- trainval_ans2label.pkl
- val_target.pkl
- trainval_label2ans.pkl
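These pickles define the fixed answer vocabulary and the per-question targets. A quick sanity check (the exact contents are an assumption based on the vilbert naming):

```python
import pickle

with open("trainval_ans2label.pkl", "rb") as f:
    ans2label = pickle.load(f)   # answer string -> label index (assumed)
with open("trainval_label2ans.pkl", "rb") as f:
    label2ans = pickle.load(f)   # label index -> answer string (assumed)

print(len(ans2label), "candidate answers; label 0 ->", label2ans[0])
```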
Download the VG questions and answers from here.
Download the VQA annotations from the link xmodaler provides, including the following files:
- vg_target.pkl
- VG_questions2.json
- VG_annotations.json
Download the VQA annotations from the VQA website, including the following files:
- v2_OpenEnded_mscoco_test2015_questions.json
- v2_OpenEnded_mscoco_train2014_questions.json
- v2_OpenEnded_mscoco_val2014_questions.json
a) Tokenize all the possible answers using generate_target_sets.ipynb.
b) Or you can use the tokenized answers we provide: VQA_Answers.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
├── mscoco_dataset/
│   └── coco_origin
│       ├── train2014
│       │   ├── COCO_train2014_000000000009.jpg
│       │   ├── COCO_train2014_000000000025.jpg
│       │   ├── COCO_train2014_000000000030.jpg
│       │   └── ...
│       └── val2014
│           ├── COCO_val2014_000000000042.jpg
│           ├── COCO_val2014_000000000073.jpg
│           ├── COCO_val2014_000000000074.jpg
│           └── ...
└── VQA
    ├── trainval_ans2label.pkl
    ├── trainval_label2ans.pkl
    ├── v2_OpenEnded_mscoco_train2014_questions.json
    ├── v2_OpenEnded_mscoco_val2014_questions.json
    ├── v2_OpenEnded_mscoco_test-dev2015_questions.json
    ├── val_target.pkl
    ├── VG_questions2.json
    ├── vg_target.pkl
    └── coco_map.json
```
## GLUE
Follow the instructions here to download the GLUE benchmark data, and refer to fairseq for preprocessing the datasets.
a) Tokenize the GLUE datasets using generate_target_sets.ipynb.
b) Or you can use the tokenized class names we provide: GLUE_classnames.
Organize the files into the following structure:

```
DATA_PATH/
├── GLUE_classnames
└── bert_pretrain_data/
    └── glue_data
        ├── CoLA
        ├── CoLA-bin
        ├── diagnostic
        ├── MNLI
        ├── MNLI-bin
        ├── MRPC
        ├── MRPC-bin
        ├── QNLI
        ├── QNLI-bin
        ├── QQP
        ├── QQP-bin
        ├── RTE
        ├── RTE-bin
        ├── SST-2
        ├── SST-2-bin
        ├── STS-B
        ├── STS-B-bin
        └── WNLI
```