🧾 DOCINTEL — Document AI (Donut-based)

DOCINTEL extracts structured information from scanned PDFs and images using the naver-clova-ix/donut-base checkpoint (Donut). It provides an OCR fallback, regex-based entity extraction, and Donut-based summarization of page images.

⚠️ Install the system dependencies first: poppler (required by pdf2image) and tesseract (required by pytesseract).
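
One way to confirm that both tools are reachable before running the pipeline is to look for their executables on PATH. The sketch below assumes poppler is exposed through its pdftoppm binary and that tesseract is installed under its usual command name. The helper name missing_system_deps is illustrative and not part of this repo.

```python
import shutil

# Executable names are assumptions: pdftoppm ships with poppler,
# and tesseract is the usual name of the Tesseract CLI.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def missing_system_deps(tools=REQUIRED_TOOLS):
    """Return the names of dependencies whose executable is not on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = missing_system_deps()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("poppler and tesseract found on PATH")
```

If either name is reported as missing, install it with your platform's package manager before installing the Python requirements.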

Quickstart

Create a virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate
pip install -r requirements.txt

Run the API server:

uvicorn app:app --host 0.0.0.0 --port 8000

Upload a PDF and call endpoints (see examples/demo_commands.txt).

Files

ocr_extractor.py — PDF→images→OCR pipeline
pdf_loader.py — extract embedded text from PDFs
entity_tagger.py — regex-based entity extraction
summarize_doc.py — Donut-based summarizer for page images
app.py — FastAPI server with upload/summary endpoints
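
To illustrate what regex-based entity extraction can look like, here is a minimal sketch. The patterns, labels, and the tag_entities function are hypothetical and chosen for illustration only; the actual contents of entity_tagger.py may differ.

```python
import re

# Hypothetical patterns for illustration; the real entity_tagger.py may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO dates only
    "AMOUNT": re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def tag_entities(text):
    """Return (label, matched_text, start_offset) tuples sorted by offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    sample = "Invoice dated 2024-03-15 for $1,250.00, contact billing@example.com."
    for label, value, offset in tag_entities(sample):
        print(label, value, offset)
```

Regexes like these are cheap to run on OCR output but are sensitive to recognition errors, so real patterns usually need tuning against sample documents.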

Notes

Donut runs vision-encoder-decoder inference, which may need a GPU for acceptable speed.
For text-only PDFs, consider using extract_text_from_pdf followed by a text summarizer instead of Donut.
This repo is a prototype/demo. Validate on your data before production use.
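
The choice between the embedded-text route and the image-based Donut route described in the notes above could be automated roughly as follows. This is a sketch under stated assumptions: it takes a list of per-page strings, which may not match what extract_text_from_pdf actually returns, and the choose_pipeline helper and its 50-character threshold are hypothetical.

```python
def choose_pipeline(page_texts, min_chars_per_page=50):
    """Return "text" if every page has enough embedded text, else "donut".

    page_texts is assumed to be a list of strings, one per page. The real
    return type of extract_text_from_pdf is not documented here, so adapt
    this to whatever that function produces.
    """
    if not page_texts:
        # No extractable pages at all: treat the file as scanned.
        return "donut"
    for text in page_texts:
        if len(text.strip()) < min_chars_per_page:
            # At least one page looks like an image-only scan.
            return "donut"
    return "text"
```

A per-document decision keeps the sketch simple. A per-page decision would avoid running Donut on pages that already carry usable text, at the cost of merging two kinds of output.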
