metadata

title: Medical PDF Ingestion System
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
short_description: RAG system for medical PDFs with multimodal embeddings
tags:
  - medical
  - pdf
  - rag
  - embeddings
  - chromadb
  - multimodal
suggested_hardware: cpu-upgrade

Medical PDF Ingestion System

This Gradio Space provides a PDF ingestion and querying interface for building a searchable medical document library. It indexes both the text and the images extracted from each PDF.

Features

PDF Upload & Ingestion: Upload PDF files and extract text and images using unstructured.io
Intelligent Chunking: Automatically chunks documents for optimal retrieval
Vector Embeddings: Uses BAAI/bge-m3 model for high-quality text embeddings
Image Processing: Extracts and embeds images using CLIP models
Deduplication: Prevents re-ingesting a file that is already in the library by comparing SHA-256 hashes
Semantic Search: Query your document library using natural language
Persistent Storage: ChromaDB database persists between sessions
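
This README does not say which chunking strategy the app uses. As a minimal sketch, assuming fixed-size character windows with a small overlap, chunking could look like the following. The function name chunk_text and the default sizes are hypothetical and are not taken from app.py.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so the
        # tail is not emitted again as a shorter duplicate chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk. A production pipeline would more likely split on document elements such as titles and paragraphs than on raw character counts.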

Usage

Upload PDFs: Use the file upload interface to add PDF documents to your library
Ingest Documents: Click "Ingest PDFs" to process and add them to the vector database
Query Library: Use natural language queries to search through your ingested documents
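
Conceptually, a query is embedded into a vector and compared against the stored chunk vectors, and the closest matches are returned. The sketch below illustrates that idea with tiny hand-made vectors and pure-Python cosine similarity. It is a stand-in for what the vector database does, not the Space's code: the names cosine_similarity and top_k are hypothetical, real vectors would come from the embedding model, and the distance metric the app actually configures is not stated in this README.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], library: dict[str, list[float]], k: int = 3):
    """Return up to k (chunk_id, score) pairs, most similar first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in library.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy library of 2-dimensional "embeddings" keyed by chunk id.
library = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
results = top_k([1.0, 0.0], library, k=2)
```

Here results lists chunk-a first, because its vector points in the same direction as the query, followed by chunk-c.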

Technical Details

Vector Database: ChromaDB for efficient similarity search
Text Embeddings: BAAI/bge-m3 (1024-dimensional dense vectors, multilingual)
Image Embeddings: CLIP ViT-B/32 (512-dimensional)
PDF Processing: unstructured.io for robust document parsing with OCR
UI Framework: Gradio for interactive web interface
Deduplication: SHA-256 hash-based system
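
The hash-based deduplication can be sketched with the standard library alone. This is an illustration under assumptions, not the app's implementation: it assumes the hash is taken over the raw file bytes, and the names file_sha256 and SeenHashes are hypothetical.

```python
import hashlib


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


class SeenHashes:
    """Hypothetical in-memory registry of digests that were already ingested."""

    def __init__(self) -> None:
        self._digests: set[str] = set()

    def add_if_new(self, digest: str) -> bool:
        """Record the digest; return True only if it was not seen before."""
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

Because the digest depends only on the bytes, a renamed copy of an already ingested PDF is still recognized as a duplicate. For deduplication to survive a restart, the digests would have to be stored persistently, for example alongside the vector database; the in-memory set above does not do that.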

Requirements

Embedding generation is compute-intensive, and large documents can take a while to process. Suggested hardware: cpu-upgrade or higher.

Built with ❤️ using Hugging Face Transformers, ChromaDB, and Gradio.