metadata

library_name: transformers
tags:
  - document-question-answering
  - layoutlmv3
  - ocr
  - document-understanding
  - paddleocr
  - multilingual
  - layout-aware
  - lakshya-singh
license: apache-2.0
language:
  - en
base_model:
  - microsoft/layoutlmv3-base
datasets:
  - nielsr/docvqa_1200_examples

Document QA Model

This is a document question-answering model fine-tuned from microsoft/layoutlmv3-base. It takes a document image together with the OCR tokens and bounding boxes produced by PaddleOCR, and answers questions by extracting answer spans from the document text while taking the page layout into account.

Model Details

Model Description

Model Name: document-qa-model
Base Model: microsoft/layoutlmv3-base
Fine-tuned by: Lakshya Singh (solo contributor)
Languages: English, Spanish, Chinese
License: Apache-2.0 (inherited from base model)
Intended Use: Extract answers to structured queries from scanned documents
Funding: None; this project was completed independently.

Model Sources

Repository: https://github.com/Lakshyasinghrawat12/DocumentQA-lakshya-rawat-document-qa-model.git
Trained on: Adapted version of nielsr/docvqa_1200_examples
Model metrics: See

Uses

Direct Use

This model can be used for:

Question answering on document images such as scanned PDFs, invoices, and utility bills
Information extraction tasks using OCR and layout-aware understanding

Out-of-Scope Use

Not suitable for conversational (multi-turn) QA
Not suitable for images without OCR-extracted text, since the model relies on OCR tokens and bounding boxes as input

Training Details

Dataset

The dataset consisted of:

Images of utility bills and documents
OCR data with bounding boxes (from PaddleOCR)
Queries in English, Spanish, and Chinese
Answer spans with match scores and positions
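
The last item above pairs each answer with a position in the OCR output and a match score. This card does not say how those were computed, so the following is only a minimal sketch of one possible approach: slide a window over the OCR words and score each window against the answer text with `difflib.SequenceMatcher`. The function name `find_answer_span` and the 0.8 threshold are hypothetical, not taken from the author's code.

```python
from difflib import SequenceMatcher


def find_answer_span(words, answer, min_score=0.8):
    """Locate `answer` in a list of OCR words.

    Returns (start_index, end_index, score) for the best-matching window of
    words (end index inclusive), or None if no window reaches `min_score`.
    Illustrative heuristic only; not the procedure used for this model.
    """
    target = answer.lower().split()
    if not target or not words:
        return None
    n = len(target)
    target_text = " ".join(target)
    best = None
    # Slide a window with the same number of words as the answer.
    for start in range(0, len(words) - n + 1):
        window = " ".join(w.lower() for w in words[start:start + n])
        score = SequenceMatcher(None, window, target_text).ratio()
        if best is None or score > best[2]:
            best = (start, start + n - 1, score)
    if best is None or best[2] < min_score:
        return None
    return best


ocr_words = ["Account", "No:", "12345", "Amount", "Due", "$48.20"]
print(find_answer_span(ocr_words, "$48.20"))      # (5, 5, 1.0)
print(find_answer_span(ocr_words, "amount due"))  # (3, 4, 1.0)
```

A fuzzy score like this tolerates small OCR errors in the answer text, at the cost of occasionally matching the wrong occurrence when a value appears more than once on the page.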

Training Procedure

Preprocessing: PaddleOCR was used to extract tokens, positions, and structure
Model: LayoutLMv3-base
Epochs: 4
Learning rate schedule: See the loss and learning rate chart under Training Metrics
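
LayoutLM-family models expect each word's bounding box as `[x0, y0, x1, y1]` integers scaled to a 0-1000 range relative to the page size, whereas OCR engines such as PaddleOCR typically report a four-point polygon in pixel coordinates. The preprocessing step therefore needs a conversion along these lines. This is a minimal sketch assuming polygon input of the form `[[x, y], [x, y], [x, y], [x, y]]`; the helper name `normalize_box` is hypothetical and the code is not taken from the author's pipeline.

```python
def normalize_box(polygon, page_width, page_height):
    """Convert a pixel-space polygon to a 0-1000 [x0, y0, x1, y1] box."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    # Use the axis-aligned rectangle that encloses the polygon.
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    def scale(value, size):
        # Clamp so that boxes slightly outside the page stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [scale(x0, page_width), scale(y0, page_height),
            scale(x1, page_width), scale(y1, page_height)]


polygon = [[100, 50], [300, 50], [300, 90], [100, 90]]
print(normalize_box(polygon, page_width=1000, page_height=500))
# [100, 100, 300, 180]
```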

Training Metrics

[Figure: F1 score on the validation set]
[Figure: training loss and learning rate chart]

Evaluation

Metrics Used

F1 score
Match score of predicted spans
Token overlap vs ground truth
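
Token-overlap F1 is commonly computed as in SQuAD-style evaluation: precision and recall are taken over the multiset of tokens shared by the prediction and the ground truth. The sketch below assumes lowercasing and whitespace tokenization; the exact normalization used for this model is not stated, and the function name `token_f1` is illustrative.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted answer and the reference answer."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Amount due $48.20", "$48.20"))  # 0.5
```

In the example the prediction contains the correct value plus two extra words, so recall is 1 but precision is 1/3, giving an F1 of 0.5.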

Summary

The model performs well on document-style QA tasks, especially with:

Clearly structured OCR results
Document types similar to utility bills, invoices, and forms

How to Use

Usage code is available in the GitHub repository listed under Model Sources.
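
For orientation, inference with the Transformers library might look roughly like the sketch below. Several things here are assumptions rather than facts from this card: the Hub identifier `lakshya-rawat/document-qa-model`, the use of the base model's processor with `apply_ocr=False`, and the helper names `best_span` and `answer_question`. The words and boxes are expected to come from a separate OCR step (for example PaddleOCR), with boxes already normalized to the 0-1000 range.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) token indices with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


def answer_question(image, question, words, boxes,
                    model_id="lakshya-rawat/document-qa-model"):
    """Run extractive QA on one page.

    `image` is a PIL image; `words` and `boxes` come from an external OCR
    step. `model_id` is an assumed Hub identifier, not confirmed by the card.
    """
    # Imported lazily so best_span can be used without these packages.
    import torch
    from transformers import (LayoutLMv3ForQuestionAnswering,
                              LayoutLMv3Processor)

    # Assumption: the base model's processor is compatible with this
    # checkpoint. apply_ocr=False because OCR is done outside the processor.
    processor = LayoutLMv3Processor.from_pretrained(
        "microsoft/layoutlmv3-base", apply_ocr=False)
    model = LayoutLMv3ForQuestionAnswering.from_pretrained(model_id)
    model.eval()

    encoding = processor(image, question, words, boxes=boxes,
                         truncation=True, max_length=512,
                         return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoding)

    start, end = best_span(outputs.start_logits[0].tolist(),
                           outputs.end_logits[0].tolist())
    answer_ids = encoding["input_ids"][0][start:end + 1]
    return processor.tokenizer.decode(
        answer_ids, skip_special_tokens=True).strip()
```

Searching over start and end scores jointly, rather than taking two independent argmax positions, avoids spans whose end precedes their start. The author's repository remains the reference for the actual inference code.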