🧾 DOCINTEL — Document AI (Donut-based)
DOCINTEL extracts structured insights from scanned PDFs and images using naver-clova-ix/donut-base (Donut). It supports OCR fallback, entity extraction, and document summarization via Donut on page images.
⚠️ Install system dependencies: poppler and tesseract, required by pdf2image and pytesseract respectively.
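Before starting the server, it can help to confirm that both tools are actually on your PATH. The check below is a minimal sketch and not part of this repo: the function name `missing_system_tools` is hypothetical, and it assumes the Poppler install exposes the `pdftoppm` executable (one of the binaries pdf2image calls) and that Tesseract is installed as `tesseract`.

```python
import shutil

# Assumed executable names: pdf2image shells out to Poppler's pdftoppm,
# and pytesseract invokes the tesseract binary.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def missing_system_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the package names whose executable cannot be found on PATH."""
    return [pkg for pkg, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_system_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("poppler and tesseract found on PATH")
```

On Debian/Ubuntu the packages are typically named poppler-utils and tesseract-ocr; on macOS with Homebrew, poppler and tesseract.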
Quickstart
- Create venv & install dependencies:
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
- Run API server:
uvicorn app:app --host 0.0.0.0 --port 8000
- Upload a PDF and call endpoints (see examples/demo_commands.txt).
Files
- ocr_extractor.py: PDF → images → OCR pipeline
- pdf_loader.py: extract embedded text from PDFs
- entity_tagger.py: regex-based entity extraction
- summarize_doc.py: Donut-based summarizer for page images
- app.py: FastAPI server with upload/summary endpoints
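To illustrate what regex-based entity extraction looks like in practice, here is a minimal sketch. It is not the code in entity_tagger.py: the function name `tag_entities`, the labels, and the patterns are all hypothetical, chosen only to show the technique on OCR or embedded text.

```python
import re

# Hypothetical patterns for illustration only; the repo's entity_tagger.py
# may define different labels and expressions.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),            # ISO dates only
    "amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),   # US dollar amounts
}


def tag_entities(text):
    """Return {label: [matches in order of appearance]} for every pattern."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


if __name__ == "__main__":
    sample = "Invoice dated 2024-03-15 for $1,250.00, contact billing@example.com."
    print(tag_entities(sample))
```

Patterns like these are fast and transparent but brittle on noisy OCR output, so they are worth validating against your own documents.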
Notes
- Donut runs vision-encoder-decoder inference, which can be slow on CPU; a GPU may be needed for acceptable speed.
- For text-only PDFs, consider using extract_text_from_pdf followed by a text summarizer instead of Donut.
- This repo is a prototype/demo. Validate on your data before production use.
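The choice between the text path and Donut can be made per document. The sketch below shows one possible heuristic and is not taken from this repo: `choose_pipeline` and the 50-character threshold are hypothetical, and it assumes you already have a list of per-page strings from embedded-text extraction (the actual return type of extract_text_from_pdf may differ).

```python
def choose_pipeline(page_texts, min_chars_per_page=50):
    """Return "text" if every page has enough embedded text, else "donut".

    page_texts: list of strings, one per page (assumed input shape).
    min_chars_per_page: hypothetical cutoff; tune it on your own documents.
    """
    if not page_texts:
        # No extractable pages at all: treat the file as scanned images.
        return "donut"
    has_text = all(len(t.strip()) >= min_chars_per_page for t in page_texts)
    return "text" if has_text else "donut"


if __name__ == "__main__":
    print(choose_pipeline(["A" * 200, "B" * 80]))  # all pages have text
    print(choose_pipeline(["A" * 200, "   "]))     # one page is effectively blank
```

Requiring every page to pass keeps mixed documents (some scanned pages, some digital) on the image path, at the cost of slower inference for those files.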