Model Card for AJNG/qwen-vl-2.5-3B-finetuned-cheque
Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is a Vision-Language Model (VLM) optimized for extracting structured financial information from cheque images. It processes cheque visuals and generates JSON-formatted outputs containing key details such as check number, beneficiary, amount, and issue dates.
Model Details
Model Description
Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is a Vision-Language Model (VLM) designed for extracting structured financial details from cheque images. It processes cheque visuals and outputs structured JSON containing key details such as check number, beneficiary, total amount, and issue dates. The model follows the ChatML format and has been fine-tuned on a cheque-specific dataset to improve accuracy in financial document processing.
This is the model card of a 🤗 Transformers model that has been pushed to the Hub.
- Developed by: AJNG (independent fine-tune of Qwen/Qwen2.5-VL-3B-Instruct)
- Model type: Vision-Language Model for cheque information extraction
- Language(s) (NLP): Primarily English (optimized for financial terminology)
- License: [More Information Needed]
- Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct
Uses
The Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is intended for automated cheque processing and structured data extraction. It is designed to analyze cheque images and generate JSON-formatted outputs containing key financial details. The model can be used in:
- Banking and Financial Services: Automating cheque verification and processing.
- Accounting and Payroll Systems: Extracting financial details for record-keeping.
- AI-powered OCR Pipelines: Enhancing traditional OCR systems with structured output.
- Enterprise Document Management: Automating financial data extraction from scanned cheques.
Direct Use
The model can be further fine-tuned or integrated into larger applications such as:
- Custom AI-powered financial processing tools
- Multi-document parsing workflows for financial institutions
- Intelligent chatbots for banking automation
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
- General OCR applications unrelated to cheques: The model is optimized specifically for cheque image processing and may not perform well on other document types.
- Handwritten cheque recognition: The model primarily targets printed cheques and may struggle with cursive handwriting.
- Non-English cheque processing: The model is trained on English-language financial content and may not generalize well to cheques in other languages.
How to Get Started with the Model
```bash
pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```
Using 🤗 Transformers to Chat
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "AJNG/qwen-vl-2.5-3B-finetuned-cheque"

# Load the fine-tuned model in bfloat16 and let accelerate place it on available devices
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Constrain the number of visual tokens produced per image
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2_5_VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)

# ChatML-style conversation: one cheque image plus the extraction prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/kaggle/input/testch/Handwritten-legal-amount.png",
            },
            {"type": "text", "text": "extract in json"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
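For a cheque image, output_text should contain a JSON string following the schema used during fine-tuning. An illustrative example of the expected structure (the values below are made up, not an actual model output):

```json
{
  "check_reference": 123456,
  "beneficiary": "John Doe",
  "total_amount": 1500.00,
  "customer_issue_date": "2024-03-15",
  "date_issued_by_bank": "2024-03-16"
}
```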
Training Details
Training Data
The dataset consists of cheque images and corresponding JSON annotations in the following format:
```json
{
  "image": "1.png",
  "prefix": "Format the json as shown below",
  "suffix": "{\"check_reference\": , \"beneficiary\": \"\", \"total_amount\": , \"customer_issue_date\": \"\", \"date_issued_by_bank\": \"\"}"
}
```
- Images Folder: Contains the corresponding cheque images.
- Annotations: Structured JSON specifying cheque details such as check number, beneficiary, amount, customer issue date, and bank issue date.
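As a reference only (the actual training script is not included in this card), each record can be mapped to a ChatML-style conversation in which the user turn carries the image and the prefix prompt, and the assistant turn carries the filled-in suffix JSON. The helper below and the images directory name are assumptions:

```python
# Sketch: map one annotation record to a chat example for supervised fine-tuning.
# `build_example` and the "images" directory are hypothetical, not from this card.
def build_example(record, images_dir="images"):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"{images_dir}/{record['image']}"},
                {"type": "text", "text": record["prefix"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": record["suffix"]}],
        },
    ]
```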
Training Procedure
The model configuration sets the minimum and maximum pixel limits for image processing, ensuring compatibility with the Qwen2.5-VLProcessor. The processor is initialized with these constraints using a pre-trained model ID.The Qwen2.5-VLForConditionalGeneration model is then loaded with Torch data type set to bfloat16 for optimized performance.
Finally, LoRA (Low-Rank Adaptation) is applied to the model using get_peft_model, reducing memory overhead while fine-tuning specific layers.
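The exact LoRA hyperparameters are not listed in this card; the snippet below is a representative sketch of applying get_peft_model, with rank, alpha, dropout, and target modules chosen as illustrative placeholders:

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings (not taken from this card)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```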
```python
config = {
    "max_epochs": 4,
    "batch_size": 1,
    "lr": 2e-4,
    "check_val_every_n_epoch": 2,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
    "result_path": "qwen2.5-3b-instruct-cheque-manifest"
}
```
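These keys mirror PyTorch Lightning Trainer arguments; with batch_size 1 and accumulate_grad_batches 8, the effective batch size is 8 examples per optimizer step. Below is a minimal sketch, assuming a Lightning training loop; the LightningModule wrapping the LoRA-adapted model is not shown in this card and is hypothetical, and lr / warmup_steps would be consumed by its optimizer setup:

```python
import lightning as L  # pytorch_lightning works equivalently

# Hypothetical: `module` is a LightningModule wrapping the LoRA-adapted model and
# its dataloaders; it would use config["lr"] and config["warmup_steps"] internally.
trainer = L.Trainer(
    max_epochs=config["max_epochs"],
    check_val_every_n_epoch=config["check_val_every_n_epoch"],
    gradient_clip_val=config["gradient_clip_val"],
    accumulate_grad_batches=config["accumulate_grad_batches"],
    num_nodes=config["num_nodes"],
    accelerator="gpu",
    devices=1,
    default_root_dir=config["result_path"],
)
# trainer.fit(module)
```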
Compute Infrastructure
GPU: NVIDIA A100
Citation
If you find this work helpful, feel free to cite it.
```bibtex
@misc{qwen2.5-VL,
  title  = {Qwen2.5-VL},
  url    = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month  = {January},
  year   = {2025}
}

@article{Qwen2VL,
  title   = {Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author  = {Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal = {arXiv preprint arXiv:2409.12191},
  year    = {2024}
}

@article{Qwen-VL,
  title   = {Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author  = {Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal = {arXiv preprint arXiv:2308.12966},
  year    = {2023}
}
```