sangu.han committed
Commit · e21f16e
1 Parent(s): bc2e9bd
[update]README.md
README.md CHANGED
@@ -1,16 +1,29 @@
 ---
 license: apache-2.0
 language:
-- ko
+- ko
+- en
 pipeline_tag: image-text-to-text
+
 tags:
 - multimodal
-
-
+- ocr
+- quantization
+- awq
+
+base_model: Qwen/Qwen2-VL-72B-Instruct
 ---

 # ko-ocr-qwen2-vl-awq

+## Model Summary
+
+**ko-ocr-qwen2-vl-awq** is a fine-tuned and quantized version of [Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct), optimized for Korean OCR tasks. The model was trained with supervised fine-tuning (SFT) and further compressed using [AWQ (Activation-aware Weight Quantization)](https://arxiv.org/abs/2306.00978) for efficient inference with minimal performance loss.
+
+### Intended Use
+
+This model is designed for **OCR tasks on Korean images**, capable of recognizing text in natural scenes, scanned documents, and mixed-language content. It also supports general visual-language understanding, such as image captioning and question answering.
+
 ## Requirements
 The code of Qwen2-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter the following error:
 ```

@@ -73,18 +86,4 @@ messages = [
 ],
 }
 ]
-```
-
-## Limitations
-
-While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
-
-1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
-2. Data timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
-3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
-4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
-5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
-6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
-
-These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
-
+```
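The Model Summary added in this commit describes an SFT fine-tune of Qwen2-VL-72B-Instruct compressed with AWQ. As a minimal, non-authoritative sketch of how such an AWQ checkpoint is typically loaded with transformers (assuming a source build of transformers as the Requirements section advises and that the `autoawq` package is installed; the repo id below is a placeholder, since the actual Hub path is not shown in this diff):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Placeholder repo id -- substitute the actual Hub path of ko-ocr-qwen2-vl-awq.
model_id = "your-org/ko-ocr-qwen2-vl-awq"

# AWQ weights load through the regular from_pretrained path once autoawq is installed;
# device_map="auto" spreads the layers across the available GPUs.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```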
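The README's usage snippet is truncated in this diff (only the tail of the `messages` list is visible). Continuing from the loading sketch above, a hedged reconstruction of a single-image OCR call in the Qwen2-VL chat format from the upstream model card; the helper `process_vision_info` comes from the `qwen-vl-utils` package, and the image path and prompt are placeholders:

```python
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Single-image OCR request in the Qwen2-VL chat format; path and prompt are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/korean_document.png"},
            {"type": "text", "text": "Extract all of the text in this image."},
        ],
    }
]

# Render the chat template and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```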