sangu.han committed
Commit · e21f16e
1 Parent(s): bc2e9bd
[update]README.md
README.md CHANGED

@@ -1,16 +1,29 @@
 ---
 license: apache-2.0
 language:
 - ko
+- en
 pipeline_tag: image-text-to-text
+
 tags:
 - multimodal
-
-
+- ocr
+- quantization
+- awq
+
+base_model: Qwen/Qwen2-VL-72B-Instruct
 ---

 # ko-ocr-qwen2-vl-awq

+## Model Summary
+
+**ko-ocr-qwen2-vl-awq** is a fine-tuned and quantized version of [Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct), optimized for Korean OCR tasks. The model was trained with supervised fine-tuning (SFT) and further compressed using [AWQ (Activation-aware Weight Quantization)](https://arxiv.org/abs/2306.00978) for efficient inference with minimal performance loss.
+
+### Intended Use
+
+This model is designed for **OCR tasks on Korean images** and can recognize text in natural scenes, scanned documents, and mixed-language content. It also supports general visual-language understanding, such as image captioning and question answering.
+
 ## Requirements
 The code for Qwen2-VL is included in the latest Hugging Face Transformers; we advise you to build from source with `pip install git+https://github.com/huggingface/transformers`, otherwise you might encounter the following error:
 ```

@@ -73,18 +86,4 @@ messages = [
         ],
     }
 ]
 ```
-
-## Limitations
-
-While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
-
-1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
-2. Data timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
-3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
-4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
-5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
-6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
-
-These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
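The Model Summary added in this commit describes an AWQ-compressed Qwen2-VL checkpoint. For reference, here is a minimal loading sketch under the usual Transformers conventions for Qwen2-VL AWQ releases; the repo id is a placeholder, and it assumes the source build of `transformers` from the Requirements section (plus `autoawq` for the quantized kernels) is installed:

```python
# Minimal loading sketch (assumption: the checkpoint loads like other
# Qwen2-VL AWQ releases; the repo id below is a hypothetical placeholder).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

MODEL_ID = "<owner>/ko-ocr-qwen2-vl-awq"  # placeholder repo id

# AWQ quantization is picked up from the checkpoint's quantization config;
# autoawq must be installed for the quantized kernels.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```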
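The `],` / `}` / `]` context lines in the second hunk are the tail of the README's `messages` example. Below is a sketch of how such a messages list is typically consumed, following the upstream Qwen2-VL quickstart rather than this repo's elided example; the model id, image path, and prompt are placeholders, and `qwen_vl_utils` is the upstream helper package (`pip install qwen-vl-utils`):

```python
# Sketch of running OCR with the messages structure whose tail appears in the
# diff above; follows the upstream Qwen2-VL quickstart, so details may differ
# from this repo's elided example.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "<owner>/ko-ocr-qwen2-vl-awq"  # placeholder repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/korean_document.png"},  # placeholder
            {"type": "text", "text": "Extract all the text in this image."},
        ],
    }
]

# Render the chat template, pack the vision inputs, and generate.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the OCR output.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```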