---
title: Medical PDF Ingestion System
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
short_description: RAG system for medical PDFs with multimodal embeddings
tags:
  - medical
  - pdf
  - rag
  - embeddings
  - chromadb
  - multimodal
suggested_hardware: cpu-upgrade
---

# Medical PDF Ingestion System

This Gradio Space provides a PDF ingestion and querying interface for building a searchable medical document library from both the text and the images in your documents.

## Features

- **PDF Upload & Ingestion**: Upload PDF files and extract text and images using unstructured.io
- **Intelligent Chunking**: Automatically chunks documents for optimal retrieval
- **Vector Embeddings**: Uses the BAAI/bge-m3 model for text embeddings
- **Image Processing**: Extracts and embeds images using CLIP models
- **Deduplication**: Prevents duplicate ingestion of the same files using SHA-256 hashing
- **Semantic Search**: Query your document library using natural language
- **Persistent Storage**: The ChromaDB collection persists between sessions
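
To give a rough idea of what chunking for retrieval involves, here is a minimal sketch of a fixed-size sliding window with overlap. It is an illustration only: the function name `chunk_text` and the `max_chars` and `overlap` parameters are hypothetical, and the Space's own chunking strategy may differ (for example, it may split on document structure rather than character counts).

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; not the Space's actual chunker.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", max_chars=4, overlap=1):
        print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.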

## Usage

1. **Upload PDFs**: Use the file upload interface to add PDF documents to your library
2. **Ingest Documents**: Click "Ingest PDFs" to process and add them to the vector database
3. **Query Library**: Use natural language queries to search through your ingested documents
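
Conceptually, the query step embeds the question and ranks stored chunks by vector similarity. The sketch below shows that ranking with plain cosine similarity over toy vectors. It is a stand-in under stated assumptions: in the Space, ChromaDB performs the search and the vectors come from the embedding models, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], library: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    library = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], library, k=2))
```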

## Technical Details

- **Vector Database**: ChromaDB for efficient similarity search
- **Text Embeddings**: BAAI/bge-m3 (1024-dimensional dense vectors, multilingual)
- **Image Embeddings**: CLIP ViT-B/32 (512-dimensional)
- **PDF Processing**: unstructured.io for robust document parsing with OCR
- **UI Framework**: Gradio for interactive web interface
- **Deduplication**: SHA-256 hash-based system
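
Hash-based deduplication can be sketched as follows: compute the SHA-256 digest of each file's bytes and skip any file whose digest has already been seen. The `DedupRegistry` class and its in-memory set are assumptions made for illustration; how the Space actually stores seen hashes is not specified here.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupRegistry:
    """Hypothetical in-memory record of file hashes already ingested."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, data: bytes) -> bool:
        """Record the content hash; return False if it was already present."""
        digest = sha256_of_bytes(data)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


if __name__ == "__main__":
    registry = DedupRegistry()
    for name, content in [("a.pdf", b"report"), ("copy.pdf", b"report"), ("b.pdf", b"other")]:
        status = "ingest" if registry.add_if_new(content) else "skip (duplicate)"
        print(name, status)
```

Because the hash is computed over file contents rather than file names, a renamed copy of the same PDF is still detected as a duplicate.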

## Requirements

Embedding generation is compute-intensive, and large documents can take a while to process. Suggested hardware: `cpu-upgrade` or higher.

---

Built with ❤️ using Hugging Face Transformers, ChromaDB, and Gradio.