metadata
title: Medical PDF Ingestion System
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
short_description: RAG system for medical PDFs with multimodal embeddings
tags:
- medical
- pdf
- rag
- embeddings
- chromadb
- multimodal
suggested_hardware: cpu-upgrade
Medical PDF Ingestion System
This Gradio Space provides a PDF ingestion and querying interface for building a searchable library of medical documents, covering both text and images.
Features
- PDF Upload & Ingestion: Upload PDF files and extract text and images using unstructured.io
- Intelligent Chunking: Automatically chunks documents for optimal retrieval
- Vector Embeddings: Uses the BAAI/bge-m3 model for text embeddings
- Image Processing: Extracts and embeds images using CLIP models
- Deduplication: Prevents duplicate ingestion of the same files using SHA-256 hashing
- Semantic Search: Query your document library using natural language
- Persistent Storage: ChromaDB database persists between sessions
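The chunking step listed above is not specified in detail here, so the following is only a minimal sketch of one common approach: fixed-size character windows with overlap. The function name `chunk_text` and the `chunk_size` and `overlap` defaults are assumptions for illustration, not the values this Space uses.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper: the Space's real chunking strategy may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk consists only of repeated overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of storing some text twice.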
Usage
- Upload PDFs: Use the file upload interface to add PDF documents to your library
- Ingest Documents: Click "Ingest PDFs" to process and add them to the vector database
- Query Library: Use natural language queries to search through your ingested documents
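Conceptually, the query step ranks stored chunk vectors by their similarity to the embedded query. In the Space this search is handled by ChromaDB; the toy sketch below only illustrates the idea in plain Python. It assumes cosine similarity as the metric (the metric configured in the Space is not stated here), and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          library: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, chunk_text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in library]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A real vector database avoids this linear scan by using an approximate nearest-neighbor index, which matters once the library grows beyond a few thousand chunks.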
Technical Details
- Vector Database: ChromaDB for efficient similarity search
- Text Embeddings: BAAI/bge-m3 (1024-dimensional, multilingual)
- Image Embeddings: CLIP ViT-B/32 (512-dimensional)
- PDF Processing: unstructured.io for robust document parsing with OCR
- UI Framework: Gradio for interactive web interface
- Deduplication: SHA-256 hash-based system
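As a rough illustration of the SHA-256 deduplication listed above, the sketch below hashes file contents and skips anything whose digest has been seen before. It uses only the standard library; the names `sha256_of_stream`, `sha256_of_bytes`, and `register_if_new`, and the use of an in-memory `set` as the registry, are assumptions. How the Space actually stores seen hashes is not described here.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, block_size: int = 1 << 20) -> str:
    """Hash a binary stream in blocks so large PDFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def sha256_of_bytes(data: bytes) -> str:
    """Convenience wrapper for content that is already in memory."""
    return sha256_of_stream(io.BytesIO(data))


def register_if_new(data: bytes, seen_hashes: set[str]) -> bool:
    """Record the content's digest and return True if it is new, False if it is a duplicate."""
    digest = sha256_of_bytes(data)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

For a file on disk, `sha256_of_stream(open(path, "rb"))` inside a `with` block gives the same digest. Hashing content rather than file names means a renamed copy of an already ingested PDF is still detected.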
Requirements
This Space needs substantial compute for embedding generation, and large documents may take a while to process. Suggested hardware: cpu-upgrade or higher.
Built with ❤️ using Hugging Face Transformers, ChromaDB, and Gradio.