AI Bookkeeper: Enhancing Accounting Document Understanding Through Supervised Fine-Tuning
Abstract
We present the Ark series, a group of large vision-language models (LVLMs) fine-tuned specifically for accounting and financial document understanding. Through extensive experimentation, we demonstrate significant improvements in document understanding, data extraction, and document intelligence for bookkeeping tasks. Our models achieve state-of-the-art performance across multiple bookkeeping-specific benchmarks while maintaining high accuracy. This is a first step towards building an AI Bookkeeper: we start by delegating manual data entry and ledger coding, before moving on to other end-to-end operational and administrative tasks such as document chase-down and retrieval, bulk editing, and bulk publishing, with the aim of reducing friction and manual intervention.
1. Introduction
Automating bookkeeping requires a robust system that not only understands and processes financial documents with high accuracy, but also reduces the need for manual intervention in the bookkeeping workflow. This publication focuses on Ark’s training on manual data entry and ledger coding tasks. While existing LVLMs show promise in general document understanding, the specialised nature of accounting documents presents unique challenges that require domain-specific optimisation.
2. Dataset
2.1 Data Collection
Our dataset comprises two main sources:
- Historical records from expert accountant annotations
- New annotations from in-house and outsourced accounting experts
The dataset includes:
- Invoice processing samples
- Receipt analysis samples
- Personalised ledger data
2.2 Annotation Process
Annotations were collected from three sources:
- AI annotation (40%)
- In-house accounting and bookkeeping experts (40%)
- Outsourced accounting experts (20%)
2.3 Prompting Strategy
We employ Chain of Thought (CoT) and Tree of Thought (ToT) prompting methods to guide structured extraction, categorisation, and decision-making processes:
- Chain of Thought (CoT): Used for tasks requiring sequential reasoning, including VAT calculations, arithmetic checks, and accounting document type classification.
- Tree of Thought (ToT): Used for ambiguous or hierarchical tasks such as line item extraction from nested tables, resolving supplier discrepancies, and maintaining context across a single multi-page document.
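The sequential VAT and arithmetic checks that the CoT prompts walk through can be sketched as a plain verification routine. This is a minimal illustration under stated assumptions, not the Ark pipeline: the function name, the field names (`net`, `vat_rate`, `vat`, `gross`) and the one-cent tolerance are all hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def check_invoice_totals(net, vat_rate, vat, gross, tolerance=CENT):
    """Check an invoice's totals step by step.

    Returns (steps, ok): a list of readable reasoning steps and a verdict.
    Field names and the tolerance are hypothetical, chosen for illustration.
    """
    net, vat_rate, vat, gross = (
        Decimal(str(x)) for x in (net, vat_rate, vat, gross)
    )
    steps = []

    # Step 1: recompute VAT from the net amount and the stated rate.
    expected_vat = (net * vat_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    vat_ok = abs(expected_vat - vat) <= tolerance
    steps.append(
        f"net {net} x rate {vat_rate} = {expected_vat}; "
        f"stated VAT {vat}: {'ok' if vat_ok else 'mismatch'}"
    )

    # Step 2: check that net + VAT matches the stated gross total.
    expected_gross = net + vat
    gross_ok = abs(expected_gross - gross) <= tolerance
    steps.append(
        f"net {net} + VAT {vat} = {expected_gross}; "
        f"stated gross {gross}: {'ok' if gross_ok else 'mismatch'}"
    )

    return steps, vat_ok and gross_ok
```

For example, `check_invoice_totals("100.00", "0.20", "20.00", "120.00")` passes both steps, while changing the gross to `"125.00"` fails the second step. Recording each intermediate step mirrors the ordered reasoning that CoT prompting asks the model to produce.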
[Figure 1: Document Processing Flow (CoT + ToT)]
3. Methodology
[Figure 2: Training Pipeline Diagram]
3.1 Low-Rank Adaptation (LoRA)
Configuration:
- Backbone (Vision Encoder): rank r = 16
- LLM: rank r = 16
Parameter efficiency:
- Trainable parameters: 6,291,456 (2.0275% of all parameters)
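As a rough guide to where such a count comes from, a LoRA adapter on a linear layer with input width d_in and output width d_out adds two low-rank matrices, for r × (d_in + d_out) extra parameters. The sketch below works through that arithmetic. The layer width of 4096 and the count of 48 adapted matrices are purely illustrative assumptions; the report does not state which modules were adapted or their shapes.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_in x d_out linear layer.

    LoRA adds matrix A (rank x d_in) and matrix B (d_out x rank).
    """
    return rank * (d_in + d_out)


def trainable_percent(trainable: int, total: int) -> float:
    """Share of trainable parameters, as a percentage of all parameters."""
    return 100.0 * trainable / total


# Hypothetical shape: 48 square projections of width 4096, each at rank 16.
# These dimensions are assumptions for illustration, not taken from the report.
per_layer = lora_trainable_params(4096, 4096, rank=16)
total_lora = 48 * per_layer

# Total parameter count implied by the two figures reported above.
reported_trainable = 6_291_456
reported_percent = 2.0275153367147256
implied_total = reported_trainable / (reported_percent / 100.0)
```

This hypothetical shape happens to reproduce the reported count of 6,291,456, but many other combinations of layer widths and module counts would as well, so it should not be read as the actual configuration.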
3.2 Supervised Fine-tuning (SFT)
Training parameters:
Model Name | Param Size | Learning Rate | Batch Size | Gradient Accumulation Steps | Epochs | Warmup Ratio | Weight Decay |
---|---|---|---|---|---|---|---|
Ark I | 8B | 2e-5 | 1 | 4 | 5 | 0.03 | 0.05 |
Ark I | 8B | 5e-5 | 1 | 2 | 8 | 0.05 | 0.03 |
Ark I | 8B | 4.5e-5 | 1 | 2 | 6 | 0.04 | 0.04 |
Ark II | 26B | 1e-6 | 1 | 2 | 6 | 0.04 | 0.04 |
Ark II | 26B | 5e-5 | 1 | 2 | 6 | 0.05 | 0.03 |
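With a batch size of 1, the number of samples contributing to each optimizer update is set by gradient accumulation. The sketch below computes this effective batch size for each row of the table, assuming the batch size is per device. The `Run` class and the `num_devices` parameter are hypothetical names introduced for illustration; the report does not state how many devices were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One row of the training-parameter table (subset of columns)."""
    model: str
    learning_rate: float
    batch_size: int
    grad_accum_steps: int
    epochs: int

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # Samples per optimizer update: per-device batch size times
        # accumulation steps times device count (device count is an assumption).
        return self.batch_size * self.grad_accum_steps * num_devices


RUNS = [
    Run("Ark I", 2e-5, 1, 4, 5),
    Run("Ark I", 5e-5, 1, 2, 8),
    Run("Ark I", 4.5e-5, 1, 2, 6),
    Run("Ark II", 1e-6, 1, 2, 6),
    Run("Ark II", 5e-5, 1, 2, 6),
]

# Effective batch size per run on a single device.
effective = [run.effective_batch_size() for run in RUNS]
```

On a single device, the first Ark I run updates on 4 samples at a time and every other run on 2, which is worth keeping in mind when comparing learning rates across rows.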
[Figure 3: Loss Convergence Graph]
4. Results
[Figure 4: Model Performance by Category]
4.1 Model Performance Analysis
The Ark series demonstrates progressive improvements across key metrics as seen in our leaderboard:
- Ark I (8B): Established baseline performance with 64.1% accuracy in accounting document classification
- Ark II (26B): Achieved 71.8% accuracy with enhanced comprehension of complex accounting document structures
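The gap between the two accuracy figures can be stated either in percentage points or as a relative change, and the two readings differ noticeably. A small sketch of the arithmetic, using only the two accuracies listed above (the function name is hypothetical):

```python
def accuracy_gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# Ark I (8B) versus Ark II (26B) on accounting document classification.
points, percent = accuracy_gain(64.1, 71.8)
# points == 7.7 (percentage points), percent == 12.0 (relative to Ark I)
```

Ark II therefore improves on Ark I by 7.7 percentage points, or about 12.0% relative to the Ark I baseline.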
4.2 Comparative Analysis
- Document Understanding: 15% improvement over GPT-4o in accounting document comprehension
- Processing Speed: 2.5x faster document processing compared to human benchmarks
5. Discussion
5.1 Technical Advancements
The results demonstrate significant improvements in:
Accounting Document Processing:
- Enhanced transaction classification accuracy
- Adaptive document type recognition
Workflow Integration:
- Streamlined processing pipeline
- Multi-document context understanding
5.2 Current Limitations
- Complex multi-page multi-document handling
- Cross-reference validation
6. Conclusion
The Ark series demonstrates the effectiveness of SFT in specialised document understanding tasks. With Ark III on the horizon, future work will focus on:
- Advanced reinforcement learning (RL) integration
- Enhanced workflow automation
This technical report establishes a foundation for next-generation autonomous bookkeeping systems. Planned developments aim to improve on current benchmark performance through RL techniques and workflow automation.