AnsenH committed
Commit 9a680a5 · Parent: 271ed35

add application code

Files changed (1): README.md (+161 -5)
README.md CHANGED
@@ -1,13 +1,169 @@
  ---
- title: Highlight Detection With MomentDETR
- emoji: 💻
- colorFrom: yellow
  colorTo: yellow
  sdk: gradio
- sdk_version: 3.41.2
  app_file: app.py
  pinned: false
  license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Highlight Detection with MomentDETR
+ emoji: ✍️
+ colorFrom: purple
  colorTo: yellow
  sdk: gradio
+ sdk_version: 3.34.0
+ python_version: 3.7
  app_file: app.py
  pinned: false
  license: apache-2.0
  ---

+ # Moment-DETR
+
+ [QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries](https://arxiv.org/abs/2107.09609), NeurIPS 2021
+
+ [Jie Lei](http://www.cs.unc.edu/~jielei/),
+ [Tamara L. Berg](http://tamaraberg.com/), [Mohit Bansal](http://www.cs.unc.edu/~mbansal/)
+
+ This repo contains a copy of the QVHighlights dataset for moment retrieval and highlight detection. For details, please check [data/README.md](data/README.md).
+ This repo also hosts the Moment-DETR model (see overview below), a new model that predicts moment coordinates and saliency scores end-to-end for a given text query. The released code supports pre-training, fine-tuning, and evaluation of Moment-DETR on the QVHighlights dataset. It also supports running predictions on your own raw videos and text queries.
+
+
+ ![model](./res/model_overview.png)
+
+
+ ## Table of Contents
+
+ * [Getting Started](#getting-started)
+ * [Prerequisites](#prerequisites)
+ * [Training](#training)
+ * [Inference](#inference)
+ * [Pretraining and Finetuning](#pretraining-and-finetuning)
+ * [Evaluation and Codalab Submission](#evaluation-and-codalab-submission)
+ * [Train Moment-DETR on your own dataset](#train-moment-detr-on-your-own-dataset)
+ * [Demo: Run predictions on your own videos and queries](#run-predictions-on-your-own-videos-and-queries)
+ * [Acknowledgement](#acknowledgement)
+ * [LICENSE](#license)
+
+
+
+ ## Getting Started
+
+ ### Prerequisites
+ 0. Clone this repo
+
+ ```
+ git clone https://github.com/jayleicn/moment_detr.git
+ cd moment_detr
+ ```
+
+ 1. Prepare feature files
+
+ Download [moment_detr_features.tar.gz](https://drive.google.com/file/d/1Hiln02F1NEpoW8-iPZurRyi-47-W2_B9/view?usp=sharing) (8GB),
+ and extract it under the project root directory:
+ ```
+ tar -xf path/to/moment_detr_features.tar.gz
+ ```
+ The features are extracted using Linjie's [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor).
+ If you want to use your own choice of video features, please download the raw videos from this [link](https://nlp.cs.unc.edu/data/jielei/qvh/qvhilights_videos.tar.gz).
+
+ 2. Install dependencies.
+
+ This code requires Python 3.7, PyTorch, and a few other Python libraries.
+ We recommend creating a conda environment and installing all the dependencies as follows:
+ ```
+ # create conda env
+ conda create --name moment_detr python=3.7
+ # activate env
+ conda activate moment_detr
+ # install pytorch with CUDA 11.0
+ conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
+ # install other python packages
+ pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas
+ ```
+ The PyTorch version we tested is `1.9.0`.
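A quick way to verify the environment is to print the installed PyTorch version and confirm that CUDA is visible; the one-liner below is only a convenience check, not part of the repository's scripts:

```bash
# optional check: should print the installed PyTorch version (e.g. 1.9.0)
# and True if a CUDA-capable GPU is visible to PyTorch
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```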
+
+ ### Training
+
+ Training can be launched by running the following command:
+ ```
+ bash moment_detr/scripts/train.sh
+ ```
+ This will train Moment-DETR for 200 epochs on the QVHighlights train split, with SlowFast and OpenAI CLIP features. The training is very fast; it can be done within 4 hours using a single RTX 2080Ti GPU. The checkpoints and other experiment log files will be written into `results`. For training under different settings, you can append additional command line flags to the command above. For example, to train the model without the saliency loss (by setting the corresponding loss weight to 0):
+ ```
+ bash moment_detr/scripts/train.sh --lw_saliency 0
+ ```
+ For more configurable options, please check out our config file [moment_detr/config.py](moment_detr/config.py).
+
+ ### Inference
+ Once the model is trained, you can use the following command for inference:
+ ```
+ bash moment_detr/scripts/inference.sh CHECKPOINT_PATH SPLIT_NAME
+ ```
+ where `CHECKPOINT_PATH` is the path to the saved checkpoint and `SPLIT_NAME` is the split to run inference on, either `val` or `test`.
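For example, to evaluate a trained model on the `val` split (the checkpoint path below is only a placeholder; substitute the path of a checkpoint written under `results` by your own training run):

```bash
# placeholder path -- replace with the checkpoint saved by your training run
bash moment_detr/scripts/inference.sh results/your_experiment_dir/model_best.ckpt val
```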
+
+ ### Pretraining and Finetuning
+ Moment-DETR utilizes ASR captions for weakly supervised pretraining. To launch pretraining, run:
+ ```
+ bash moment_detr/scripts/pretrain.sh
+ ```
+ This will pretrain the Moment-DETR model on the ASR captions for 100 epochs; the pretrained checkpoints and other experiment log files will be written into `results`. With a pretrained checkpoint `PRETRAIN_CHECKPOINT_PATH`, finetuning can then be launched as:
+ ```
+ bash moment_detr/scripts/train.sh --resume ${PRETRAIN_CHECKPOINT_PATH}
+ ```
+ Note that this finetuning process is the same as standard training, except that it initializes weights from a pretrained checkpoint.
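Concretely, `--resume` only needs the path to a checkpoint saved by the pretraining run; the path below is a placeholder for illustration:

```bash
# placeholder path -- point this at a checkpoint written under results/ by pretraining
PRETRAIN_CHECKPOINT_PATH=results/your_pretrain_exp/model_best.ckpt
bash moment_detr/scripts/train.sh --resume ${PRETRAIN_CHECKPOINT_PATH}
```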
+
+
+ ### Evaluation and Codalab Submission
+ Please check [standalone_eval/README.md](standalone_eval/README.md) for details.
+
+
+ ### Train Moment-DETR on your own dataset
+ To train Moment-DETR on your own dataset, please prepare your dataset annotations following the format
+ of the QVHighlights annotations in [data](./data), and extract features using [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor).
+ Next, copy the script [moment_detr/scripts/train.sh](./moment_detr/scripts/train.sh) and modify the dataset-specific parameters
+ such as the annotation and feature paths. You are then ready to use this script for training as described in [Training](#training).
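A minimal sketch of that workflow, assuming you keep the same script layout (the `train_mydata.sh` name is just a placeholder, not a file in this repo):

```bash
# copy the provided training script under a new name
cp moment_detr/scripts/train.sh moment_detr/scripts/train_mydata.sh
# edit train_mydata.sh so that its annotation and feature path settings
# point at your own dataset, then launch training the same way as before
bash moment_detr/scripts/train_mydata.sh
```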
+
+
+ ## Run predictions on your own videos and queries
+ You may also want to run the Moment-DETR model on your own videos and queries.
+ First, you need to add a few libraries for feature extraction to your environment. Before this, you should have already installed PyTorch and the other libraries for running Moment-DETR following the instructions in the previous sections.
+ ```bash
+ pip install ffmpeg-python ftfy regex
+ ```
+ Next, run the example provided in this repo:
+ ```bash
+ PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py
+ ```
+ This will load the Moment-DETR model [checkpoint](run_on_video/moment_detr_ckpt/model_best.ckpt) trained with CLIP image and text features, and make predictions for the video [RoripwjYFp8_60.0_210.0.mp4](run_on_video/example/RoripwjYFp8_60.0_210.0.mp4) with its associated query in [run_on_video/example/queries.jsonl](run_on_video/example/queries.jsonl).
+ The output will look like the following:
+ ```
+ Build models...
+ Loading feature extractors...
+ Loading CLIP models
+ Loading trained Moment-DETR model...
+ Run prediction...
+ ------------------------------idx0
+ >> query: Chef makes pizza and cuts it up.
+ >> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
+ >> GT moments: [[106, 122]]
+ >> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
+ [49.967, 64.9129, 0.9421],
+ [66.4396, 81.0731, 0.9271],
+ [105.9434, 122.0372, 0.9234],
+ [93.2057, 103.3713, 0.2222],
+ ...,
+ [45.3834, 52.2183, 0.0005]
+ ]
+ >> GT saliency scores (only localized 2-sec clips):
+ [[2, 3, 3], [2, 3, 3], ...]
+ >> Predicted saliency scores (for all 2-sec clip):
+ [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]
+ ```
+ You can see that the 3rd ranked moment `[105.9434, 122.0372]` matches the ground truth of `[106, 122]` quite well, with a confidence score of `0.9234`.
+ You may want to refer to [data/README.md](data/README.md) for more info about how the ground truth is organized.
+ Your predictions might differ slightly from the ones shown here, depending on your environment.
+
+ To run predictions on your own videos and queries, please take a look at the `run_example` function inside the [run_on_video/run.py](run_on_video/run.py) file.
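As a rough outline of that workflow (how `run_example` reads its inputs is defined in `run_on_video/run.py` itself, so treat the steps below as a sketch rather than a documented interface):

```bash
# 1. place your video somewhere accessible, e.g. alongside run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
# 2. describe your text queries in a jsonl file, following run_on_video/example/queries.jsonl
# 3. point run_example at the new video path and query file, then rerun:
PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py
```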
+
+
+ ## Acknowledgement
+ We thank [Linjie Li](https://scholar.google.com/citations?user=WR875gYAAAAJ&hl=en) for the helpful discussions.
+ This code is based on [detr](https://github.com/facebookresearch/detr) and [TVRetrieval XML](https://github.com/jayleicn/TVRetrieval). We used resources from [mdetr](https://github.com/ashkamath/mdetr), [MMAction2](https://github.com/open-mmlab/mmaction2), [CLIP](https://github.com/openai/CLIP), [SlowFast](https://github.com/facebookresearch/SlowFast) and [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor). We thank the authors for their awesome open-source contributions.
+
+ ## LICENSE
+ The annotation files are under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license, see [./data/LICENSE](data/LICENSE). All the code is under the [MIT](https://opensource.org/licenses/MIT) license, see [LICENSE](./LICENSE).
+