---
title: Medical PDF Ingestion System
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
short_description: RAG system for medical PDFs with multimodal embeddings
tags:
- medical
- pdf
- rag
- embeddings
- chromadb
- multimodal
suggested_hardware: cpu-upgrade
---
# Medical PDF Ingestion System
This Gradio Space provides a PDF ingestion and querying interface for building a searchable medical document library, with embeddings for both text and images.
## Features
- **PDF Upload & Ingestion**: Upload PDF files and extract text and images using unstructured.io
- **Intelligent Chunking**: Automatically chunks documents for optimal retrieval
- **Vector Embeddings**: Uses the BAAI/bge-m3 model for text embeddings
- **Image Processing**: Extracts and embeds images using CLIP models
- **Deduplication**: Skips files that have already been ingested, detected by SHA-256 hash
- **Semantic Search**: Query your document library using natural language
- **Persistent Storage**: ChromaDB database persists between sessions
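
The chunking step can be pictured with a minimal sketch. This README does not state the chunking strategy the Space uses, so the fixed-size character window with overlap below is an assumption for illustration, and `chunk_text`, `max_chars`, and `overlap` are hypothetical names rather than identifiers from `app.py`.

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Illustrative only: the Space's real chunking logic is not documented here.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap is produced.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.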
## Usage
1. **Upload PDFs**: Use the file upload interface to add PDF documents to your library
2. **Ingest Documents**: Click "Ingest PDFs" to process and add them to the vector database
3. **Query Library**: Use natural language queries to search through your ingested documents
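
Conceptually, the query step embeds the question and returns the stored chunks whose vectors are closest to it. The toy sketch below shows that ranking in pure Python using cosine similarity. It is not the Space's code: the actual lookup is delegated to ChromaDB, whose configured distance metric is not stated in this README, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: dict[str, list[float]], k: int = 3):
    """Return up to k (chunk_id, score) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in indexed.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

A vector database performs the same kind of ranking, but with an index that avoids comparing the query against every stored vector.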
## Technical Details
- **Vector Database**: ChromaDB for efficient similarity search
- **Text Embeddings**: BAAI/bge-m3 (1024-dimensional, multilingual)
- **Image Embeddings**: CLIP ViT-B/32 (512-dimensional)
- **PDF Processing**: unstructured.io for robust document parsing with OCR
- **UI Framework**: Gradio for the interactive web interface
- **Deduplication**: SHA-256 hash-based system
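
The hash-based deduplication can be sketched with the standard library alone. This is a minimal illustration, assuming the hash is computed over the raw file bytes and that seen hashes are kept in a set; how the Space actually stores them is not described here. `file_sha256` and `filter_new_files` are hypothetical names.

```python
import hashlib


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def filter_new_files(paths: list[str], seen_hashes: set[str]) -> list[str]:
    """Return the paths whose content hash is not yet in seen_hashes.

    seen_hashes is updated in place, so a second upload of the same bytes
    (even under a different filename) is skipped.
    """
    new_files = []
    for path in paths:
        file_hash = file_sha256(path)
        if file_hash in seen_hashes:
            continue  # same content was already ingested
        seen_hashes.add(file_hash)
        new_files.append(path)
    return new_files
```

Hashing the content rather than the filename means a renamed copy is still detected, while a re-exported PDF with different bytes is treated as a new file.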
## Requirements
Embedding generation is computationally intensive, and large documents may take a while to process. Suggested hardware: `cpu-upgrade` or higher.
---
Built with ❤️ using Hugging Face Transformers, ChromaDB, and Gradio.