---
title: Medical PDF Ingestion System
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
short_description: RAG system for medical PDFs with multimodal embeddings
tags:
  - medical
  - rag
  - embeddings
  - chromadb
  - multimodal
suggested_hardware: cpu-upgrade
---
# Medical PDF Ingestion System

This Gradio Space provides a PDF ingestion and query interface for building a searchable library of medical documents, with embeddings for both text and images.
## Features

- **PDF Upload & Ingestion**: Upload PDF files and extract their text and images using unstructured.io
- **Intelligent Chunking**: Automatically splits documents into chunks for retrieval
- **Vector Embeddings**: Uses the BAAI/bge-m3 model for text embeddings
- **Image Processing**: Extracts images and embeds them using a CLIP model
- **Deduplication**: Skips files that have already been ingested, detected by SHA-256 hash
- **Semantic Search**: Query your document library using natural language
- **Persistent Storage**: The ChromaDB database persists between sessions
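
The deduplication feature above can be sketched as follows. This is a minimal illustration using only the standard library, not the Space's actual code: the function names `file_sha256` and `filter_new_files` and the in-memory `seen_hashes` set are assumptions made for the example. A real deployment would presumably keep the hashes alongside the ChromaDB data so they survive restarts.

```python
import hashlib


def file_sha256(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def filter_new_files(paths, seen_hashes):
    """Return the paths whose content hash is not yet in seen_hashes.

    seen_hashes (a set, hypothetical store) is updated in place, so a file
    that appears twice in the same batch is only returned once.
    """
    new_paths = []
    for path in paths:
        file_hash = file_sha256(path)
        if file_hash in seen_hashes:
            continue  # identical content was already ingested
        seen_hashes.add(file_hash)
        new_paths.append(path)
    return new_paths
```

Hashing the file contents rather than the file name means a renamed copy of an already ingested PDF is still recognized as a duplicate.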
## Usage

1. **Upload PDFs**: Use the file upload interface to add PDF documents to your library
2. **Ingest Documents**: Click "Ingest PDFs" to process and add them to the vector database
3. **Query Library**: Use natural language queries to search through your ingested documents
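
During the ingest step, extracted text is split into chunks before it is embedded. This README does not describe the chunking strategy the Space uses, so the sketch below assumes the simplest common approach: fixed-size character windows with a small overlap. The function name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`. Structure-aware chunking (by title or section) is another option when the parser exposes document elements.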
## Technical Details

- **Vector Database**: ChromaDB for efficient similarity search
- **Text Embeddings**: BAAI/bge-m3 (1024-dimensional dense vectors, multilingual)
- **Image Embeddings**: CLIP ViT-B/32 (512-dimensional)
- **PDF Processing**: unstructured.io for robust document parsing with OCR
- **UI Framework**: Gradio for the interactive web interface
- **Deduplication**: SHA-256 hash-based system
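
To make the similarity search concrete, here is a minimal pure-Python sketch of top-k retrieval. It assumes cosine similarity as the metric; the metric actually configured in this Space's ChromaDB collections is not stated here and may differ. The names `cosine_similarity` and `top_k` and the dict-based `index` are illustrative, not part of any library. Because the text and image vectors have different dimensions, they cannot be compared with each other, so they would presumably be stored in separate collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        # e.g. comparing a text vector with an image vector
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to query.

    index is a hypothetical in-memory mapping of chunk id -> embedding.
    """
    scored = [
        (chunk_id, cosine_similarity(query, vector))
        for chunk_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A vector database replaces this linear scan with an approximate nearest-neighbor index, which is what keeps queries fast as the library grows.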
## Requirements

This Space needs substantial compute for embedding generation, and large documents may take a while to process. Suggested hardware: `cpu-upgrade` or higher.

---

Built with ❤️ using Hugging Face Transformers, ChromaDB, and Gradio.