letuan committed · Commit f2429d6 · verified · 1 Parent(s): 57957a0

Upload README.md

Files changed (1): README.md (+76 -11)

README.md CHANGED
@@ -1,11 +1,76 @@
- ---
- license: mit
- language:
- - vi
- - en
- base_model:
- - Gregor/mblip-mt0-xl
- pipeline_tag: visual-question-answering
- ---
-
- Vietnamese Visual Reading Comprehension
# Joint Training and Feature Augmentation for Vietnamese Visual Reading Comprehension

Datasets and methodologies for the Vietnamese Visual Question Answering task are still limited. The VLSP 2023 challenge on Visual Reading Comprehension for Vietnamese uses the [OpenViVQA](https://arxiv.org/abs/2305.04183) dataset as its benchmark. It is a challenging dataset with a wide variety of questions covering both the content of the scene and the text within the images. Notably, all images were captured in Vietnam, giving the dataset characteristics and cultural features unique to the country, and the text appearing in the images is mainly Vietnamese, so a VQA system must be able to recognize and comprehend Vietnamese text within images. Answers are open-ended, so the system must generate them rather than select them from a predefined list, which makes the task considerably more challenging.

To address this task, we propose an approach built on three models: a Scene Text Recognition model, a Vision model, and a Language model. The Scene Text Recognition model extracts scene text from the image, the Vision model extracts visual features from the image, and the Language model takes the outputs of the two as input and generates the answer to the question. Our approach achieved a CIDEr score of 3.6384 on the private test set, ranking first among the competing teams.
<p align="center">
<img width="800" alt="overview" src="figures/overview.png"><br>
Diagram of our proposed model
</p>
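For orientation, here is a minimal inference sketch, not the repository's actual code: since mBLIP is a BLIP-2-style model, we assume the checkpoint loads with the standard `transformers` BLIP-2 classes, and that the recognized scene text is injected into the text prompt together with the question. The prompt template, file name, and scene-text string below are illustrative placeholders; the exact format is defined in this repo.

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Assumption: the fine-tuned checkpoint is BLIP-2-compatible, like its mBLIP base.
processor = Blip2Processor.from_pretrained("letuan/mblip-mt0-xl-vivqa")
model = Blip2ForConditionalGeneration.from_pretrained(
    "letuan/mblip-mt0-xl-vivqa", torch_dtype=torch.float16
).to("cuda")

image = Image.open("street_scene.jpg").convert("RGB")  # placeholder image path
scene_text = "PHỞ HÒA"             # placeholder for the scene-text model's output
question = "Quán này bán món gì?"  # "What does this place sell?"

# The prompt template is an assumption; the repo defines the exact format used in training.
prompt = f"Scene text: {scene_text}. Question: {question} Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])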
```

## Contents
1. [Install](#setup) <br>
2. [Train model](#train_model) <br>
3. [Evaluate model](#evaluate_model) <br>
4. [Examples](#examples) <br>
Our model is available at [letuan/mblip-mt0-xl-vivqa](https://huggingface.co/letuan/mblip-mt0-xl-vivqa). Please download the model:
```bash
huggingface-cli download letuan/mblip-mt0-xl-vivqa --local-dir <the folder on your computer to store the model>
```
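If you prefer Python, the same download can be done through the `huggingface_hub` API; the target folder below is just an example.

```python
from huggingface_hub import snapshot_download

# Download every file in the model repo into a local folder (path is an example).
snapshot_download(repo_id="letuan/mblip-mt0-xl-vivqa", local_dir="models/mblip-mt0-xl-vivqa")
```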

## 1. Install <a name="setup"></a>
**Clone the project:**
```bash
git clone https://github.com/tuanlt175/mblip_stqa.git
cd mblip_stqa/
```

**Build the Docker image:**
```bash
sudo docker build -t vivrc_mblip:dev -f Dockerfile .
```

## 2. Train model <a name="train_model"></a>
**Run a Docker container:**
```bash
sudo docker run --gpus all --network host \
    --volume ${PWD}/icvrc:/code/icvrc \
    --volume ${PWD}/data:/code/data \
    --volume ${PWD}/models:/code/models \
    --volume ${PWD}/deepspeed_train_mblip_bloomz.sh:/code/deepspeed_train_mblip_bloomz.sh \
    --volume ${PWD}/deepspeed_train_mblip_mt0.sh:/code/deepspeed_train_mblip_mt0.sh \
    --volume ${PWD}/deepspeed_config.json:/code/deepspeed_config.json \
    -it vivrc_mblip:dev /bin/bash
```

Then, inside the container, run the training script:
```bash
chmod +x deepspeed_train_mblip_mt0.sh
./deepspeed_train_mblip_mt0.sh
```

## 3. Evaluate model <a name="evaluate_model"></a>
**Run a Docker container:**
```bash
sudo docker run --gpus all --network host \
    --volume ${PWD}/icvrc:/code/icvrc \
    --volume ${PWD}/data:/code/data \
    --volume <folder containing the model you just downloaded>:/code/models \
    --volume ${PWD}/evaluate.sh:/code/evaluate.sh \
    -it vivrc_mblip:dev /bin/bash
```

Then, inside the container, run the evaluation script:
```bash
chmod +x evaluate.sh
./evaluate.sh
```
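The competition metric is CIDEr, and `evaluate.sh` handles scoring inside the container. As a standalone sanity check, a CIDEr score can also be computed with the `pycocoevalcap` package; the toy question ids and answers below are made up.

```python
from pycocoevalcap.cider.cider import Cider

# Toy data: each question id maps to a list of reference / candidate answers.
gts = {"q1": ["quán bán phở"], "q2": ["màu đỏ"]}     # ground-truth answers
res = {"q1": ["quán bán phở bò"], "q2": ["màu đỏ"]}  # model predictions

score, per_question = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.4f}")
```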

## 4. Examples <a name="examples"></a>

<p align="center">
<img width="400" alt="examples" src="figures/examples.png"><br>
Generated VQA answers of the proposed model in comparison with those of the baselines.
</p>