Model Card for Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset

Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is a Vision-Language Model (VLM) optimized for extracting structured financial information from cheque images. It processes cheque visuals and generates JSON-formatted outputs containing key details such as check number, beneficiary, amount, and issue dates.

Model Details

Model Description

Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is a Vision-Language Model (VLM) designed for extracting structured financial details from cheque images. It processes cheque visuals and outputs structured JSON containing key details such as check number, beneficiary, total amount, and issue dates. The model follows the ChatML format and has been fine-tuned on a cheque-specific dataset to improve accuracy in financial document processing.
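For illustration, a typical response for a printed cheque might look like the following. The field names follow the training annotation schema described under Training Data; the values here are hypothetical:

{
  "check_reference": 123456,
  "beneficiary": "John Doe",
  "total_amount": 1250.00,
  "customer_issue_date": "2024-03-15",
  "date_issued_by_bank": "2024-03-16"
}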

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

  • Developed by: Independent fine-tune of Qwen/Qwen2.5-VL-3B-Instruct
  • Model type: Vision-Language Model for cheque information extraction
  • Language(s) (NLP): Primarily English (optimized for financial terminology)
  • License: [More Information Needed]
  • Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct


Uses

The Qwen2.5-VL-3B-Instruct Fine-tuned on Personal Cheque Dataset is intended for automated cheque processing and structured data extraction. It is designed to analyze cheque images and generate JSON-formatted outputs containing key financial details. The model can be used in:

  • Banking and Financial Services – Automating cheque verification and processing.
  • Accounting and Payroll Systems – Extracting financial details for record-keeping.
  • AI-powered OCR Pipelines – Enhancing traditional OCR systems with structured output.
  • Enterprise Document Management – Automating financial data extraction from scanned cheques.

Direct Use

The model can be further fine-tuned or integrated into larger applications such as:

  • Custom AI-powered financial processing tools
  • Multi-document parsing workflows for financial institutions
  • Intelligent chatbots for banking automation

Downstream Use

[More Information Needed]

Out-of-Scope Use

  • General OCR applications unrelated to cheques – The model is optimized specifically for cheque image processing and may not perform well on other document types.
  • Handwritten cheque recognition – The model primarily works with printed cheques and may struggle with cursive handwriting.
  • Non-English cheque processing – The model was trained on English-language financial content and may not generalize well to cheques in other languages.

How to Get Started with the Model

  pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8

Using 🤗 Transformers to Chat

The following snippet shows how to use the fine-tuned model with transformers and qwen_vl_utils:

from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info
import torch

MODEL_ID = "AJNG/qwen-vl-2.5-3B-finetuned-cheque"

# Load the fine-tuned model in bfloat16 and place it across available devices
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Constrain the visual token budget (Qwen2.5-VL works on 28x28 image patches)
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2_5_VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/kaggle/input/testch/Handwritten-legal-amount.png",
            },
            {"type": "text", "text": "extract in json"},
        ],
    }
]

Preparation for inference

# Build the ChatML prompt, collect the image inputs, and move everything to the model device
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

Generation

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
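The model returns the extracted fields as a JSON string. Below is a minimal sketch of turning that string into a Python dict; it assumes the model emits plain JSON without surrounding text, which may not hold for every input.

import json

# output_text is a list with one decoded string per input sample
raw = output_text[0].strip()
try:
    cheque_fields = json.loads(raw)
    print(cheque_fields.get("beneficiary"), cheque_fields.get("total_amount"))
except json.JSONDecodeError:
    # Fall back to inspecting the raw string if the model added extra text
    print("Could not parse model output as JSON:", raw)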

Training Details

Training Data

The dataset consists of cheque images and corresponding JSON annotations in the following format:

{
  "image": "1.png", 
  "prefix": "Format the json as shown below",  
  "suffix": "{\"check_reference\": , \"beneficiary\": \"\", \"total_amount\": , \"customer_issue_date\": \"\", \"date_issued_by_bank\": \"\"}"
}

Images Folder: Contains corresponding cheque images.

Annotations: Structured JSON specifying cheque details such as the cheque reference number, beneficiary, total amount, customer issue date, and bank issue date.
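As a rough sketch of how one such record can be mapped to the ChatML format used for fine-tuning, the helper below pairs the image and the prefix instruction as the user turn and the suffix JSON as the assistant target. The function name and directory layout are assumptions for illustration, not the released training code.

def build_chatml_sample(record, images_dir="images"):
    """Convert one {image, prefix, suffix} annotation into ChatML-style messages (illustrative)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"{images_dir}/{record['image']}"},
                {"type": "text", "text": record["prefix"]},
            ],
        },
        {
            # The assistant turn carries the target JSON the model should learn to emit
            "role": "assistant",
            "content": [{"type": "text", "text": record["suffix"]}],
        },
    ]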

Training Procedure

The training configuration sets the minimum and maximum pixel limits for image processing, and the Qwen2_5_VLProcessor is initialized with these constraints from the pre-trained model ID. The Qwen2_5_VLForConditionalGeneration model is then loaded with the torch dtype set to bfloat16 for reduced memory use.

Finally, LoRA (Low-Rank Adaptation) is applied to the model using get_peft_model, reducing memory overhead while fine-tuning specific layers.
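The card does not list the exact LoRA hyperparameters, so the following is only a minimal sketch of attaching LoRA adapters with peft; the rank, alpha, dropout, and target modules shown are assumptions.

from peft import LoraConfig, get_peft_model

# Hypothetical LoRA settings -- the actual rank/alpha/target modules are not documented
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how few parameters LoRA actually trains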

config = {
    "max_epochs": 4,
    "batch_size": 1,                  # per-device batch size
    "lr": 2e-4,
    "check_val_every_n_epoch": 2,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,     # effective batch size of 8 with batch_size 1
    "num_nodes": 1,
    "warmup_steps": 50,
    "result_path": "qwen2.5-3b-instruct-cheque-manifest"
}

Compute Infrastructure

GPU: NVIDIA A100

Citation

If you find this work helpful, please consider citing the original Qwen reports:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}