# Prepare Data
By default, all training data used by this repository is looked up under the directory `DATA_PATH`. Please specify or change your data location before running the code:

```bash
export DATA_PATH='/mnt/lustre/share_data/zhujinguo/open_source_dataset'
```
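The training code reads this location from the environment; a minimal illustration of the lookup (not the repository's actual code, and `imagenet_root` is just an example):

```python
import os

# The fallback here is only for illustration; the repository expects
# DATA_PATH to be exported as shown above.
DATA_PATH = os.environ.get("DATA_PATH", "./open_source_dataset")
imagenet_root = os.path.join(DATA_PATH, "imagenet")
```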
To make it easy for you to run the training code, we provide a toy dataset, which is a small subset of our pretraining data. Download it and unzip it into `DATA_PATH`. With this subset, you can train Uni-Perceiver with the config file configs/BERT_L12_H192_experiments/4tasks_training_small_datasets.yaml. Please refer to pretraining.md for training usage.
For tasks with fixed candidate target sets, such as image / video classification (where the target sets are the category labels) and masked language modeling (where the target set is the vocabulary), you also need to prepare the target-set files. Please refer to the jupyter notebook tools/generate_target_sets.ipynb for details; a sketch of the idea follows.
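For intuition, a target-set file is just every candidate target run through the CLIP BPE tokenizer (which wraps each string with `<|startoftext|>`/`<|endoftext|>`) and pickled. A minimal sketch under that assumption; the exact on-disk format produced by the notebook may differ:

```python
import pickle

import clip  # pip install git+https://github.com/openai/CLIP.git

# Toy target set: for ImageNet this would be the 1000 class names,
# for masked language modeling the whole vocabulary.
class_names = ["goldfish", "tench", "tiger shark"]

# clip.tokenize() returns the token ids, including <|endoftext|>.
tokens = {name: clip.tokenize(name) for name in class_names}

# "toy_..." is a hypothetical file name for this example.
with open("toy_class_name_CLIP_with_endoftext.pkl", "wb") as f:
    pickle.dump(tokens, f)
```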
For the complete datasets used to train our models, please download them according to the instructions below:
## Different datasets
Todo List:
- Imagenet-22k and Imagenet-1k
- books&wiki
- MSCOCO Caption
- YFCC
- CC12M
- CC3M
- Visual Genome
- SBU
- Kinetics-400 & Kinetics-700
- Moments in Time
- Flickr30k
- MSVD
- MSR-VTT
- GLUE
- VQA
## Imagenet-1k
Please download the images of the ImageNet dataset from the official website Imagenet.
We provide the annotation files (train.txt, val.txt and test.txt) in meta.
a) Tokenize the ImageNet class names to generate "imagenet_class_name_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated file we provide here.
Organize them as follows:
```
DATA_PATH/
└── imagenet/
    ├── imagenet_class_name_CLIP_with_endoftext.pkl
    ├── meta
    │   ├── test.txt
    │   ├── train.txt
    │   └── val.txt
    ├── test
    │   ├── ILSVRC2012_test_00000001.JPEG
    │   ├── ILSVRC2012_test_00000002.JPEG
    │   ├── ILSVRC2012_test_00000003.JPEG
    │   ├── ILSVRC2012_test_00000004.JPEG
    │   └── ...
    ├── train
    │   ├── n01440764
    │   │   ├── n01440764_10026.JPEG
    │   │   ├── n01440764_10027.JPEG
    │   │   ├── n01440764_10029.JPEG
    │   │   └── ...
    │   ├── n01443537
    │   │   └── ...
    │   ├── n01484850
    │   │   └── ...
    │   └── ...
    └── val
        ├── ILSVRC2012_val_00000001.JPEG
        ├── ILSVRC2012_val_00000002.JPEG
        ├── ILSVRC2012_val_00000003.JPEG
        └── ...
```
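The meta files are plain-text lists, one image per line; assuming each line is a `relative_path label` pair (an assumption about the provided files, not verified), they can be read as:

```python
# Hypothetical reader for meta/train.txt, assuming "path label" per line.
with open("meta/train.txt") as f:
    samples = [(path, int(label)) for path, label in
               (line.split() for line in f)]
```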
## Imagenet-22k
Please refer to the Imagenet-1k section above; the layout is the same.
The meta file is provided here.
The Imagenet-22k class name file for tokenization in generate_target_sets.ipynb is provided here, or you can directly use the CLIP-tokenized Imagenet-22k class name file provided here.
## Books&wiki
Please download the files wiki.doc and bc1g.doc, then concatenate them into one file:

```bash
cat wiki.doc bc1g.doc > bookswiki.doc
```
a) Tokenize the vocabulary to generate "vocabulary_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated file we provide here.
Then put these files in `DATA_PATH`:
```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── bert_pretrain_data/
    └── bookswiki/
        └── bookswiki.doc
```
You can also download the plain-text datasets from huggingface.co/datasets/wikipedia and huggingface.co/datasets/bookcorpus.
## MSCOCO
Please download the images of COCO 2014 from MSCOCO.
Download the preprocessed COCO captions from Karpathy's homepage: link, and extract "dataset_coco.json" from the zip file.
a) You can run coco_preprocess.py to split dataset_coco.json into train, val and test parts (see the sketch after option b):
- go into the /data/preprocess folder and open coco_preprocess.py;
- fill the 'original_json' variable with the path of the downloaded dataset_coco.json file;
- fill 'savepath' with the path where you want to save the split json files;
- run coco_preprocess.py.
b) Or you can directly use the generated json files we provide here.
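For reference, the split in option a) boils down to routing each image by the `split` field of the Karpathy annotations; a minimal sketch with placeholder paths (coco_preprocess.py remains the authoritative version):

```python
import json

original_json = "/path/to/dataset_coco.json"  # the downloaded Karpathy file
savepath = "/path/to/new_annotations"         # where the split files go

data = json.load(open(original_json))
splits = {"train": [], "val": [], "test": []}
for img in data["images"]:
    # Karpathy's "restval" images are conventionally folded into train.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img)

for name, imgs in splits.items():
    with open(f"{savepath}/captions_{name}.json", "w") as f:
        json.dump({"images": imgs}, f)
```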
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── mscoco_dataset/
    ├── new_annotations
    │   ├── captions_test5k.json
    │   ├── captions_train113k.json
    │   ├── captions_val5k.json
    │   └── dataset_coco.json
    └── coco_origin
        ├── train2014
        │   ├── COCO_train2014_000000000009.jpg
        │   ├── COCO_train2014_000000000025.jpg
        │   ├── COCO_train2014_000000000030.jpg
        │   └── ...
        └── val2014
            ├── COCO_val2014_000000000042.jpg
            ├── COCO_val2014_000000000073.jpg
            ├── COCO_val2014_000000000074.jpg
            └── ...
```
## Visual Genome
Please download the images and region descriptions of Visual Genome from VG.
a) You can run region_descriptions.ipynb to preprocess the downloaded "region_descriptions.json" file (see the sketch after option b):
- go into the /data/preprocess folder and open region_descriptions.ipynb;
- fill in the path of the downloaded 'region_descriptions.json' and the path where you want to save the processed file;
- run region_descriptions.ipynb.
b) Or you can directly use the generated json file we provide here.
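The preprocessing in option a) essentially flattens every region description into a caption-style record; a rough sketch assuming the standard VG region_descriptions.json layout (the notebook additionally filters regions, presumably where the 128filter name comes from):

```python
import json

entries = json.load(open("/path/to/region_descriptions.json"))

# Each entry holds one image id and a list of regions with text phrases.
captions = [
    {"image_id": entry["id"], "caption": region["phrase"]}
    for entry in entries
    for region in entry["regions"]
]
json.dump(captions, open("/path/to/vg_captions.json", "w"))
```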
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── visual_genome/
    ├── annotations
    │   ├── region_descriptions.json
    │   └── vg_captions_128filter.json
    └── images
        ├── VG_100K
        │   ├── 2.jpg
        │   ├── 3.jpg
        │   ├── 4.jpg
        │   └── ...
        └── VG_100K_2
            ├── 1.jpg
            ├── 51.jpg
            ├── 52.jpg
            └── ...
```
## Flickr30k
Please download the images of Flickr30k according to the instructions of Flickr30k.
Download flickr_jsons, which provides the annotations of the Flickr30k images.
a) You can run process_flickr_caption_json.py to preprocess the json files:
- go into the /data/preprocess folder and open process_flickr_caption_json.py;
- fill in the paths of the downloaded json files and the path where you want to save the processed json files;
- run process_flickr_caption_json.py.
b) Or you can directly use the generated json files (captions_test.json, captions_train.json and captions_val.json) we provide here.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
├── flickr30k/
│   ├── captions_test.json
│   ├── captions_train.json
│   └── captions_val.json
└── flickr30k_images
    └── flickr30k_images
        └── flickr30k_images
            ├── 36979.jpg
            ├── 65567.jpg
            └── ...
```
## SBU
Fill in the paths of the downloaded SBU url and caption files in sbu_download_list.py and run it to generate the download list.
Run the script sbu_download.sh to download the SBU images.
a) You can run make_sbu_json.py to get the annotation file.
b) Or you can directly download the generated json file sbucaption.json we provide.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── sbucaption/
    ├── annotations
    │   └── sbucaptions.json
    └── images
        ├── 4385058960_b0f291553e.jpg
        ├── 5148648301_1174ef59bc.jpg
        └── ...
```
## CC3M
Please download "Train_GCC-training.tsv" and "Validation_GCC-1.1.0-Validation.tsv" from here
Filling the path of "Train_GCC-training.tsv" in cc3m_train_download_list.py and run it for generating the training download list.
Filling the path of "Validation_GCC-1.1.0-Validation.tsv" in cc3m_val_download_list.py and run it for generating the validation download list.
Running the script cc3m_train_download.sh and cc3m_val_download.sh to download the cc3m images.
Zip (without compression) "train_image", "val_image" by:
zip -0 ../train_image.zip ./* zip -0 ../val_image.zip ./*
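JPEG files are already compressed, so storing them (-0) avoids pointless recompression and lets a loader read entries straight out of the archive. A minimal illustration of reading one image back (not the repository's actual loader):

```python
import io
import zipfile

from PIL import Image

with zipfile.ZipFile("train_image.zip") as zf:
    # Entry names match the file names inside the original folder.
    img = Image.open(io.BytesIO(zf.read("00000000.jpg"))).convert("RGB")
print(img.size)
```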
a) You can run make_cc3m_train_json.py and make_cc3m_val_json.py to get the annotation files.
b) Or you can directly download the generated json files train_spacy.json and val_spacy.json we provide.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── cc3m/
    ├── train_spacy.json
    ├── val_spacy.json
    ├── train_image
    │   ├── 00000000.jpg
    │   └── ...
    └── val_image
        ├── 00000000.jpg
        └── ...
```
## CC12M
Please download "cc12m.tsv" from here
Filling the path of "cc12m.tsv" in cc12m_train_download_list.py and run it for generating the training download list.
Running the script cc12m_train_download.sh to download the cc12m images.
Zip (without compression) "train_image" by:
zip -0 ../train_image.zip ./*
a) You can run make_cc12m_train_json.py to get the annotation file.
b) Or you can directly download the generated json file train_available.json we provide.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── cc12m/
    ├── train_available.json
    └── train_image
        ├── 00000000.jpg
        └── ...
```
## Kinetics-400 & Kinetics-700
Please download the Kinetics-400 & Kinetics-700 videos according to the instructions here.
a)
i. Fill in the paths of the downloaded K400 "training" and "validation" folders in k400_construct_csv.py and run it to generate the K400 related files (K400_val.csv, K400_train.csv, categories.txt, annotation.json).
ii. Fill in the paths of the downloaded K700 "training" and "validation" folders in k700_construct_csv.py and run it to generate the K700 related files (K700_val.csv, K700_train.csv, categories.txt, annotation.json).
iii. Run the script video_categories.ipynb to generate "category_mapping.txt".
b) Or you can directly download the processed files we provide: K400, K700
a) Tokenize the K400 and K700 class names to generate "k400_class_name_CLIP_with_endoftext.pkl" and "k700_class_name_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated files we provide: K400-CLIP and K700-CLIP.
Organize the files into the following structure:

```
DATA_PATH/
├── k400_class_name_CLIP_with_endoftext.pkl
└── K400/
    ├── training
    │   ├── abseiling
    │   │   ├── _4YTwq0-73Y_000044_000054.mp4
    │   │   └── ...
    │   ├── air_drumming
    │   └── ...
    ├── validation/
    │   ├── abseiling
    │   │   ├── __NrybzYzUg.mkv
    │   │   └── ...
    │   ├── air_drumming
    │   └── ...
    ├── annotation.json
    ├── category_mapping.txt
    ├── categories.txt
    ├── K400_train.csv
    └── K400_val.csv
```
K700 is similar.
## MomentsInTime
Please download the Moments in Time videos according to the instructions on the Official Website.
a)
i. Fill in the path of the downloaded "training" folder in moments_construct_csv.py and run it to generate the training files (moments_train.csv, categories.txt, annotation.json).
ii. Run the script video_categories.ipynb to generate "category_mapping.txt".
b) Or you can directly download the processed files we provide: moments.
a) Tokenize the Moments in Time class names to generate "MiT_class_name_CLIP_with_endoftext.pkl" using generate_target_sets.ipynb.
b) Or use the generated file we provide: MiT-CLIP.
Organize the files into the following structure:

```
DATA_PATH/
├── MiT_class_name_CLIP_with_endoftext.pkl
└── MomentsInTime/
    ├── training
    │   ├── adult+female+singing
    │   │   ├── 0a2b81cb0ec5fde79b8c.mp4
    │   │   └── ...
    │   ├── adult+female+speaking
    │   └── ...
    ├── annotation.json
    ├── categories.txt
    ├── category_mapping.txt
    └── moments_train.csv
```
## MSVD
Download the MSVD videos "YouTubeClips.tar" from here and the preprocessed "txt_labels" from here.
a) Fill in the paths of the downloaded files in msvd_preprocess.py to generate the annotation files (caption_msvd_train_cocostyle.json, caption_msvd_val_cocostyle.json, caption_msvd_test_cocostyle.json).
b) Or directly download the annotation files we provide: new_annotations.
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── msvd_dataset/
    ├── new_annotations
    │   ├── caption_msvd_test_cocostyle.json
    │   ├── caption_msvd_train_cocostyle.json
    │   └── caption_msvd_val_cocostyle.json
    ├── txt_labels
    │   ├── sents_test_lc_nopunc.txt
    │   ├── sents_train_lc_nopunc.txt
    │   ├── sents_val_lc_nopunc.txt
    │   └── youtube_mapping.txt
    └── YouTubeClips
        ├── _0nX-El-ySo_83_93.avi
        └── ...
```
## MSR-VTT
Download the MSR-VTT videos ("train_val_videos.zip", "test_videos.zip") and annotation files ("train_val_annotation.zip", "test_videodatainfo.zip") from here, and download the dataset split info from here.
a) Unzip the files downloaded above, fill in the paths of "test_videodatainfo.json", "train_val_videodatainfo.json", "MSRVTT_train.9k.csv" and "MSRVTT_JSFUSION_test.csv" in msrvtt_dataprocess_1k.ipynb, then run it (see the sketch below).
b) Or directly download the annotation files ("caption_msrvtt_1k_trainval_cocostyle.json", "caption_msrvtt_1k_test_cocostyle.json") we provide: annotations_new.
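The conversion in option a) is essentially reshaping MSR-VTT's sentences list into COCO-style caption records; a rough sketch assuming the standard videodatainfo layout (the notebook also applies the 9k/1k split from the csv files, and the output field names here are assumptions):

```python
import json

info = json.load(open("train_val_videodatainfo.json"))

# Each sentence pairs a caption string with its video id.
annotations = [
    {"image_id": s["video_id"], "caption": s["caption"]}
    for s in info["sentences"]
]
with open("caption_msrvtt_trainval_cocostyle.json", "w") as f:
    json.dump({"annotations": annotations}, f)
```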
Generate the tokenized vocabulary as described in the Books&wiki part.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
└── msrvtt_dataset/
    ├── annotations_new
    │   ├── caption_msrvtt_1k_trainval_cocostyle.json
    │   └── caption_msrvtt_1k_test_cocostyle.json
    └── videos
        ├── video0.mp4
        └── ...
```
## VQA
Download the VQA meta data from the datalink vilbert provides, including the following files:
- dictionary.pkl
- train_ids.pkl
- val_ids.pkl
- train_target.pkl
- trainval_ans2label.pkl
- val_target.pkl
- trainval_label2ans.pkl
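These pickles define the fixed answer vocabulary and the per-question targets. A quick sanity check (the exact contents are an assumption based on the vilbert naming):

```python
import pickle

with open("trainval_ans2label.pkl", "rb") as f:
    ans2label = pickle.load(f)   # answer string -> label index (assumed)
with open("trainval_label2ans.pkl", "rb") as f:
    label2ans = pickle.load(f)   # label index -> answer string (assumed)

print(len(ans2label), "candidate answers; label 0 ->", label2ans[0])
```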
Download the VG questions and answers from here.
Download the VQA annotations from the link xmodaler provides, including the following files:
- vg_target.pkl
- VG_questions2.json
- VG_annotations.json
Download the VQA annotations from the VQA website, including the following files:
- v2_OpenEnded_mscoco_test2015_questions.json
- v2_OpenEnded_mscoco_train2014_questions.json
- v2_OpenEnded_mscoco_val2014_questions.json
a) Tokenize all the possible answers using generate_target_sets.ipynb.
b) Or you can use the tokenized answers we provide: VQA_Answers.
Organize the files into the following structure:

```
DATA_PATH/
├── vocabulary_CLIP_with_endoftext.pkl
├── mscoco_dataset/
│   └── coco_origin
│       ├── train2014
│       │   ├── COCO_train2014_000000000009.jpg
│       │   ├── COCO_train2014_000000000025.jpg
│       │   ├── COCO_train2014_000000000030.jpg
│       │   └── ...
│       └── val2014
│           ├── COCO_val2014_000000000042.jpg
│           ├── COCO_val2014_000000000073.jpg
│           ├── COCO_val2014_000000000074.jpg
│           └── ...
└── VQA
    ├── trainval_ans2label.pkl
    ├── trainval_label2ans.pkl
    ├── v2_OpenEnded_mscoco_train2014_questions.json
    ├── v2_OpenEnded_mscoco_val2014_questions.json
    ├── v2_OpenEnded_mscoco_test-dev2015_questions.json
    ├── val_target.pkl
    ├── VG_questions2.json
    ├── vg_target.pkl
    └── coco_map.json
```
## GLUE
Follow the instructions here to download the GLUE benchmark data, and refer to fairseq for preprocessing the datasets.
a) Tokenize the GLUE datasets using generate_target_sets.ipynb.
b) Or you can use the tokenized class names we provide: GLUE_classnames.
Organize the files into the following structure:

```
DATA_PATH/
├── GLUE_classnames
└── bert_pretrain_data/
    └── glue_data
        ├── CoLA
        ├── CoLA-bin
        ├── diagnostic
        ├── MNLI
        ├── MNLI-bin
        ├── MRPC
        ├── MRPC-bin
        ├── QNLI
        ├── QNLI-bin
        ├── QQP
        ├── QQP-bin
        ├── RTE
        ├── RTE-bin
        ├── SST-2
        ├── SST-2-bin
        ├── STS-B
        ├── STS-B-bin
        └── WNLI
```