sangu.han committed
Commit e21f16e · 1 Parent(s): bc2e9bd

[update]README.md

Files changed (1):
  1. README.md +17 -18
README.md CHANGED
@@ -1,16 +1,29 @@
 ---
 license: apache-2.0
 language:
-- ko, en
+- ko
+- en
 pipeline_tag: image-text-to-text
+
 tags:
 - multimodal
-base_model:
-- Qwen/Qwen2-VL-72B-Instruct
+- ocr
+- quantization
+- awq
+
+base_model: Qwen/Qwen2-VL-72B-Instruct
 ---
 
 # ko-ocr-qwen2-vl-awq
 
+## Model Summary
+
+**ko-ocr-qwen2-vl-awq** is a fine-tuned and quantized version of [Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct), optimized for Korean OCR tasks. The model was trained with supervised fine-tuning (SFT) and further compressed using [AWQ (Activation-aware Weight Quantization)](https://arxiv.org/abs/2306.00978) for efficient inference with minimal performance loss.
+
+### Intended Use
+
+This model is designed for **OCR tasks on Korean images**, capable of recognizing text in natural scenes, scanned documents, and mixed-language content. It also supports general visual-language understanding, such as image captioning and question answering.
+
 ## Requirements
 The code for Qwen2-VL is included in the latest Hugging Face transformers; we advise you to build from source with `pip install git+https://github.com/huggingface/transformers`, or you may encounter the following error:
 ```
@@ -73,18 +86,4 @@ messages = [
     ],
   }
 ]
-```
-
-## Limitations
-
-While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
-
-1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
-2. Data timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
-3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
-4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
-5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
-6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
-
-These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
-
+```
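
For reference, a minimal inference sketch matching the usage the updated card describes. This is a hedged sketch, not the card's own snippet (which is only partially visible in the hunks above): the repo id and image path below are placeholders, and it assumes the AWQ checkpoint loads through the standard Qwen2-VL classes (`Qwen2VLForConditionalGeneration`, `AutoProcessor`) with `qwen-vl-utils` and `autoawq` installed.

```python
# Minimal inference sketch; repo id and image path are placeholders.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "ko-ocr-qwen2-vl-awq"  # hypothetical: replace with the actual repo id

# device_map="auto" dispatches the quantized weights across available devices.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: an image plus an OCR instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "document.png"},  # placeholder path
            {"type": "text", "text": "Extract all text from this image."},
        ],
    }
]

# Render the chat template, collect vision inputs, and tokenize everything.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

With an AWQ checkpoint, `from_pretrained` should pick up the quantization config stored alongside the weights, so no extra quantization flags are needed at load time.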