# IterVM: Iterative Vision Modeling Module for Scene Text Recognition

The official code of [IterNet](https://arxiv.org/abs/2204.02630).

We propose IterVM, an iterative approach to visual feature extraction that significantly improves scene text recognition accuracy.
IterVM repeatedly uses the high-level visual feature extracted in the previous iteration to enhance the multi-level features extracted in the subsequent iteration.


![framework](./figures/framework.png)
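The iteration scheme can be sketched in a few lines of plain Python. This is a conceptual sketch only: `extract_features`, `fuse`, the number of feature levels, and the iteration count are hypothetical stand-ins, not the repository's actual API.

```python
# Conceptual sketch of IterVM-style iterative refinement (illustrative only;
# `extract_features`, `fuse`, and the string encoding below are hypothetical
# stand-ins, not this repository's actual modules).

def extract_features(image, feedback=None):
    """Hypothetical multi-level feature extractor.

    When high-level feedback from a previous iteration is given, it is
    fused into every level, mimicking how IterVM reuses the high-level
    visual feature to enhance the multi-level features.
    """
    levels = [f"level{i}({image})" for i in range(3)]
    if feedback is not None:
        levels = [f"fuse({lvl}, {feedback})" for lvl in levels]
    return levels

def itervm(image, num_iters=3):
    feedback = None
    for _ in range(num_iters):
        levels = extract_features(image, feedback)
        feedback = levels[-1]  # the highest-level feature feeds the next pass
    return feedback
```

In the real model the fusion is a learned operation and the number of iterations is a hyperparameter; see the paper for details.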


## Runtime Environment
```
pip install -r requirements.txt
```
Note: `fastai==1.0.60` is required.

## Datasets
<details>
  <summary>Training datasets (Click to expand) </summary>

1. [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ):
    - Use `tools/create_lmdb_dataset.py` to convert images into an LMDB dataset.
    - [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
2. [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST):
    - Use `tools/crop_by_word_bb.py` to crop images from the original [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) dataset, then convert the crops into an LMDB dataset with `tools/create_lmdb_dataset.py`.
    - [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
3. [WikiText103](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), which is only used for pre-training the language model:
    - Use `notebooks/prepare_wikitext103.ipynb` to convert the text into CSV format.
    - [CSV dataset BaiduNetdisk(passwd:dk01)](https://pan.baidu.com/s/1yabtnPYDKqhBb_Ie9PGFXA)
</details>

<details>
  <summary>Evaluation datasets (Click to expand) </summary>

The evaluation LMDB datasets can be downloaded from [BaiduNetdisk(passwd:1dbv)](https://pan.baidu.com/s/1RUg3Akwp7n8kZYJ55rU5LQ) or [GoogleDrive](https://drive.google.com/file/d/1dTI0ipu14Q1uuK4s4z32DqbqF3dJPdkk/view?usp=sharing):

1. ICDAR 2013 (IC13)
2. ICDAR 2015 (IC15)
3. IIIT5K Words (IIIT)
4. Street View Text (SVT)
5. Street View Text-Perspective (SVTP)
6. CUTE80 (CUTE)
</details>

<details>
  <summary>The structure of the `data` directory (Click to expand) </summary>

The structure of the `data` directory is

```
data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
```
</details>
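Before training, it can be handy to verify that the `data` directory matches this layout. The snippet below is a small helper (not part of the repository); the path list simply mirrors the tree above.

```python
import os

# Expected entries under the `data` directory, mirroring the README's tree.
EXPECTED = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]

def missing_paths(root="data", expected=EXPECTED):
    """Return the expected entries that are absent under `root`."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    missing = missing_paths()
    print("OK" if not missing else f"Missing: {missing}")
```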

## Pretrained Models

Get the pretrained models from [GoogleDrive](https://drive.google.com/drive/folders/1C8NMI8Od8mQUMlsnkHNLkYj73kbAQ7Bl?usp=sharing). The performance of the pretrained models is summarized as follows:

|Model|IC13|SVT|IIIT|IC15|SVTP|CUTE|AVG|
|-|-|-|-|-|-|-|-|
|IterNet|97.9|95.1|96.9|87.7|90.9|91.3|93.8|
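The AVG column appears to be a mean weighted by test-set size rather than a plain average (a plain average of the six numbers gives 93.3). A quick check, assuming the standard benchmark split sizes (857, 647, 3000, 1811, 645, and 288 images respectively, matching the directory names above where given):

```python
# Reproduce the AVG column as an accuracy mean weighted by test-set size.
# The sizes for SVT (647), SVTP (645), and CUTE (288) are the standard
# benchmark sizes, assumed here; IC13_857, IC15_1811, and IIIT5k_3000
# sizes match the directory names above.
accuracy = {"IC13": 97.9, "SVT": 95.1, "IIIT": 96.9,
            "IC15": 87.7, "SVTP": 90.9, "CUTE": 91.3}
num_images = {"IC13": 857, "SVT": 647, "IIIT": 3000,
              "IC15": 1811, "SVTP": 645, "CUTE": 288}

total = sum(num_images.values())
avg = sum(accuracy[k] * num_images[k] for k in accuracy) / total
print(round(avg, 1))  # 93.8, matching the table
```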

## Training

1. Pre-train vision model
    ```
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --config=configs/pretrain_vm.yaml
    ```
2. Pre-train language model
    ```
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml
    ```
3. Train IterNet
    ```
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --config=configs/train_iternet.yaml
    ```
Note:
- You can set the `checkpoint` path for the vision model (vm) and the language model separately to load specific pretrained models, or set it to `None` to train from scratch.
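As an illustration, a config fragment for pointing the sub-models at specific checkpoints might look like the following. This is a hypothetical sketch in the style of ABINet's configs (which this project builds on); check the YAML files under `configs/` for the actual keys and paths.

```yaml
# Hypothetical fragment; verify the actual keys in configs/train_iternet.yaml.
model:
  vision:
    checkpoint: workdir/pretrain-vm/best-pretrain-vm.pth   # or None to train from scratch
  language:
    checkpoint: workdir/pretrain-language-model/pretrain-language-model.pth
```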


## Evaluation

```
CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_iternet.yaml --phase test --image_only
```
Additional flags:
- `--checkpoint /path/to/checkpoint` set the path of the model to evaluate
- `--test_root /path/to/dataset` set the path of the evaluation dataset
- `--model_eval [alignment|vision]` select which sub-model to evaluate
- `--image_only` disable dumping visualizations of attention masks

## Run Demo
[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1XmZGJzFF95uafmARtJMudPLLKBO2eXLv?usp=sharing)

```
python demo.py --config=configs/train_iternet.yaml --input=figures/demo
```
Additional flags:
- `--config /path/to/config` set the path of the configuration file
- `--input /path/to/image-directory` set the path of an image directory or a wildcard path, e.g., `--input='figs/test/*.png'`
- `--checkpoint /path/to/checkpoint` set the path of the trained model
- `--cuda [-1|0|1|2|3...]` set the CUDA device id; the default is -1, which stands for CPU
- `--model_eval [alignment|vision]` select which sub-model to use
- `--image_only` disable dumping visualizations of attention masks


## Citation
If you find our method useful for your research, please cite:
```bibtex
@article{chu2022itervm,
  title={IterVM: Iterative Vision Modeling Module for Scene Text Recognition},
  author={Chu, Xiaojie and Wang, Yongtao},
  journal={arXiv preprint arXiv:2204.02630},
  year={2022}
}
```

## License
This project is free for academic research only; commercial use requires authorization. For commercial permission, please contact [email protected].

## Acknowledgements
This project is based on [ABINet](https://github.com/FangShancheng/ABINet.git).
Thanks for their great work.