---
title: Medical PDF Ingestion System
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
short_description: RAG system for medical PDFs with multimodal embeddings
tags:
  - medical
  - pdf
  - rag
  - embeddings
  - chromadb
  - multimodal
suggested_hardware: cpu-upgrade
---

# Medical PDF Ingestion System

This Gradio Space provides a PDF ingestion and querying interface for building a searchable medical document library that covers both text and images.

## Features

- **PDF Upload & Ingestion**: Upload PDF files and extract text and images using unstructured.io
- **Intelligent Chunking**: Automatically splits documents into chunks sized for retrieval
- **Vector Embeddings**: Uses the BAAI/bge-m3 model for text embeddings
- **Image Processing**: Extracts images and embeds them with a CLIP model
- **Deduplication**: Uses SHA-256 hashing to prevent the same file from being ingested twice
- **Semantic Search**: Query your document library using natural language
- **Persistent Storage**: The ChromaDB database persists between sessions

## Usage

1. **Upload PDFs**: Use the file upload interface to add PDF documents to your library
2. **Ingest Documents**: Click "Ingest PDFs" to process the uploaded files and add them to the vector database
3. **Query Library**: Search your ingested documents with natural language queries

## Technical Details

- **Vector Database**: ChromaDB for similarity search
- **Text Embeddings**: BAAI/bge-m3 (1024-dimensional, multilingual)
- **Image Embeddings**: CLIP ViT-B/32 (512-dimensional)
- **PDF Processing**: unstructured.io for document parsing with OCR
- **UI Framework**: Gradio for the interactive web interface
- **Deduplication**: SHA-256 hash-based system

## Requirements

Embedding generation is computationally expensive, so this Space needs significant compute and large documents may take a while to process. Suggested hardware: `cpu-upgrade` or higher.
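
## Example: SHA-256 Deduplication

The hash-based deduplication listed under Features can be sketched roughly as follows. This is a minimal illustration and not necessarily how `app.py` implements it: the function names `file_sha256` and `filter_new_files`, and the in-memory `seen_hashes` set, are assumptions made for this example. A real deployment would more likely keep the hashes alongside the ChromaDB collection so they survive restarts.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in blocks so large PDFs are never loaded into memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def filter_new_files(paths, seen_hashes):
    """Return the paths whose content hash is not yet in seen_hashes.

    seen_hashes (a hypothetical in-memory set) is updated in place, so a
    second call with the same files returns an empty list.
    """
    new_files = []
    for path in paths:
        file_hash = file_sha256(Path(path))
        if file_hash in seen_hashes:
            continue  # identical content was already ingested
        seen_hashes.add(file_hash)
        new_files.append(path)
    return new_files
```

Because the hash is computed over file content rather than the file name, a renamed copy of an already ingested PDF is still detected as a duplicate.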

---

Built with ❤️ using Hugging Face Transformers, ChromaDB, and Gradio.