sangu.han committed
Commit e21f16e · 1 Parent(s): bc2e9bd

[update]README.md

Files changed (1):
  1. README.md +17 -18
README.md CHANGED
@@ -1,16 +1,29 @@
 ---
 license: apache-2.0
 language:
-- ko, en
+- ko
+- en
 pipeline_tag: image-text-to-text
+
 tags:
 - multimodal
-base_model:
-- Qwen/Qwen2-VL-72B-Instruct
+- ocr
+- quantization
+- awq
+
+base_model: Qwen/Qwen2-VL-72B-Instruct
 ---
 
 # ko-ocr-qwen2-vl-awq
 
+## Model Summary
+
+**ko-ocr-qwen2-vl-awq** is a fine-tuned and quantized version of [Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct), optimized for Korean OCR tasks. The model was trained with supervised fine-tuning (SFT) and further compressed using [AWQ (Activation-aware Weight Quantization)](https://arxiv.org/abs/2306.00978) for efficient inference with minimal performance loss.
+
+### Intended Use
+
+This model is designed for **OCR tasks on Korean images**, capable of recognizing text in natural scenes, scanned documents, and mixed-language content. It also supports general visual-language understanding, such as image captioning and question answering.
+
 ## Requirements
 The code for Qwen2-VL is included in the latest Hugging Face transformers; we advise you to build from source with `pip install git+https://github.com/huggingface/transformers`, or you may encounter the following error:
 ```
@@ -73,18 +86,4 @@ messages = [
     ],
   }
 ]
-```
-
-## Limitations
-
-While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
-
-1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
-2. Data timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
-3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
-4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
-5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
-6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
-
-These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
-
+```
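
For reference, a minimal inference sketch matching the usage the updated card describes. This is a hedged sketch, not the card's own snippet (which is only partially visible in the hunks above): the repo id and image path below are placeholders, and it assumes the AWQ checkpoint loads through the standard Qwen2-VL classes (`Qwen2VLForConditionalGeneration`, `AutoProcessor`) with `qwen-vl-utils` and `autoawq` installed.

```python
# Minimal inference sketch; repo id and image path are placeholders.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "ko-ocr-qwen2-vl-awq"  # hypothetical: replace with the actual repo id

# device_map="auto" dispatches the quantized weights across available devices.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: an image plus an OCR instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "document.png"},  # placeholder path
            {"type": "text", "text": "Extract all text from this image."},
        ],
    }
]

# Render the chat template, collect vision inputs, and tokenize everything.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

With an AWQ checkpoint, `from_pretrained` should pick up the quantization config stored alongside the weights, so no extra quantization flags are needed at load time.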