
Data Preparation

Create a new directory data to store all the datasets.
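
The skeleton can be created up front, for example (the subdirectory names follow the layouts shown in the sections below):

```shell
# Assumed skeleton: one subdirectory per dataset, matching the
# directory trees shown in the sections below.
mkdir -p data/coco \
         data/ref-youtube-vos \
         data/ref-davis \
         data/a2d_sentences \
         data/jhmdb_sentences
```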

Ref-COCO

Download the images from the official website COCO. RefCOCO/+/g use the COCO 2014 train split, so only train2014 is needed. Download the annotation files from github.

Convert the annotation files:

python3 tools/data/convert_refexp_to_coco.py

Finally, we expect the directory structure to be the following:

ReferFormer
├── data
│   ├── coco
│   │   ├── train2014
│   │   ├── refcoco
│   │   │   ├── instances_refcoco_train.json
│   │   │   ├── instances_refcoco_val.json
│   │   ├── refcoco+
│   │   │   ├── instances_refcoco+_train.json
│   │   │   ├── instances_refcoco+_val.json
│   │   ├── refcocog
│   │   │   ├── instances_refcocog_train.json
│   │   │   ├── instances_refcocog_val.json
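
As a sanity check, a small shell sketch can verify this layout; the paths are copied from the tree above, so adjust them if your root directory differs:

```shell
# Report any expected RefCOCO entries that are still missing.
# Paths are taken from the directory tree above.
missing=0
for f in data/coco/train2014 \
         data/coco/refcoco/instances_refcoco_train.json \
         data/coco/refcoco/instances_refcoco_val.json \
         data/coco/refcoco+/instances_refcoco+_train.json \
         data/coco/refcoco+/instances_refcoco+_val.json \
         data/coco/refcocog/instances_refcocog_train.json \
         data/coco/refcocog/instances_refcocog_val.json; do
  if [ ! -e "$f" ]; then
    echo "missing: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing of 7 entries missing"
```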

Ref-Youtube-VOS

Download the dataset from the competition's website here. Then, extract and organize the files. We expect the directory structure to be the following:

ReferFormer
├── data
│   ├── ref-youtube-vos
│   │   ├── meta_expressions
│   │   ├── train
│   │   │   ├── JPEGImages
│   │   │   ├── Annotations
│   │   │   ├── meta.json
│   │   ├── valid
│   │   │   ├── JPEGImages
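
Assuming the layout above, a hypothetical consistency check is to compare the number of training video directories that have frames against those that have annotations; after a correct extraction the two counts should match:

```shell
# Compare video-directory counts under JPEGImages and Annotations;
# after a correct extraction the two counts should match.
n_frames=$(ls data/ref-youtube-vos/train/JPEGImages 2>/dev/null | wc -l)
n_masks=$(ls data/ref-youtube-vos/train/Annotations 2>/dev/null | wc -l)
echo "videos with frames: $n_frames, videos with masks: $n_masks"
[ "$n_frames" -eq "$n_masks" ] || echo "frame/mask count mismatch"
```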

Ref-DAVIS17

Download the DAVIS2017 dataset from the website. Note that you only need to download the two zip files DAVIS-2017-Unsupervised-trainval-480p.zip and DAVIS-2017_semantics-480p.zip. Download the text annotations from the website. Then, put the zip files in the directory as follows.

ReferFormer
├── data
│   ├── ref-davis
│   │   ├── DAVIS-2017_semantics-480p.zip
│   │   ├── DAVIS-2017-Unsupervised-trainval-480p.zip
│   │   ├── davis_text_annotations.zip

Unzip these zip files.

unzip -o davis_text_annotations.zip
unzip -o DAVIS-2017_semantics-480p.zip
unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip

Preprocess the dataset into the Ref-Youtube-VOS format. (Make sure you run the command from the main project directory.)

python tools/data/convert_davis_to_ytvos.py

Finally, unzip the file DAVIS-2017-Unsupervised-trainval-480p.zip again (the preprocessing step moves files with mv for efficiency, so the original extracted frames need to be restored).

unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip
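
To confirm the frames were restored, a quick check can be sketched as follows; the DAVIS/JPEGImages/480p path is an assumption about how the archive extracts, so adjust it if your layout differs:

```shell
# Assumed extraction path for the 480p frames; adjust if your
# archive unpacks to a different location.
frames=data/ref-davis/DAVIS/JPEGImages/480p
if [ -d "$frames" ]; then
  echo "ok: $(ls "$frames" | wc -l) sequence directories"
else
  echo "not restored yet: $frames"
fi
```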

A2D-Sentences

Follow the instructions and download the dataset from the website here. Then, extract the files. Additionally, we use the same json annotation files generated by MTTR. Please download these files from onedrive. We expect the directory structure to be the following:

ReferFormer
├── data
│   ├── a2d_sentences
│   │   ├── Release
│   │   ├── text_annotations
│   │   │   ├── a2d_annotation_with_instances
│   │   │   ├── a2d_annotation.txt
│   │   │   ├── a2d_missed_videos.txt
│   │   ├── a2d_sentences_single_frame_test_annotations.json
│   │   ├── a2d_sentences_single_frame_train_annotations.json
│   │   ├── a2d_sentences_test_annotations_in_coco_format.json
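
A minimal sketch to confirm the three MTTR annotation files are in place (file names taken from the tree above):

```shell
# Count the three MTTR-generated annotation files
# (names from the directory tree above).
found=0
for j in data/a2d_sentences/a2d_sentences_single_frame_test_annotations.json \
         data/a2d_sentences/a2d_sentences_single_frame_train_annotations.json \
         data/a2d_sentences/a2d_sentences_test_annotations_in_coco_format.json; do
  [ -f "$j" ] && found=$((found + 1))
done
echo "found $found of 3 annotation files"
```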

JHMDB-Sentences

Follow the instructions and download the dataset from the website here. Then, extract the files. Additionally, we use the same json annotation files generated by MTTR. Please download these files from onedrive. We expect the directory structure to be the following:

ReferFormer
├── data
│   ├── jhmdb_sentences
│   │   ├── Rename_Images
│   │   ├── puppet_mask
│   │   ├── jhmdb_annotation.txt
│   │   ├── jhmdb_sentences_samples_metadata.json
│   │   ├── jhmdb_sentences_gt_annotations_in_coco_format.json
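
A minimal sketch to confirm the two metadata files are present and parse as valid JSON (file names taken from the tree above):

```shell
# Check that the two metadata files exist and parse as JSON
# (file names from the directory tree above).
valid=0
for j in data/jhmdb_sentences/jhmdb_sentences_samples_metadata.json \
         data/jhmdb_sentences/jhmdb_sentences_gt_annotations_in_coco_format.json; do
  if [ -f "$j" ] && python3 -c "import json, sys; json.load(open(sys.argv[1]))" "$j"; then
    valid=$((valid + 1))
  else
    echo "missing or invalid: $j"
  fi
done
echo "$valid of 2 metadata files valid"
```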