---
title: Docsifer
emoji: π» / π
colorFrom: green
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---
# Docsifer: Efficient Data Conversion to Markdown
**Docsifer** is a FastAPI + Gradio service that converts a range of data formats (PDF, PowerPoint, Word, Excel, images, audio, HTML, and others) to Markdown. It builds on the [MarkItDown](https://github.com/microsoft/markitdown) library and can optionally use LLMs (via OpenAI) for richer extraction (OCR, speech-to-text, etc.).
## Key Features
- **Comprehensive Format Support**:
  - **PDF**: Extracts text and structure effectively.
  - **PowerPoint**: Converts slides into Markdown-friendly content.
  - **Word**: Processes `.docx` files with precision.
  - **Excel**: Extracts tabular data as Markdown tables.
  - **Images**: Reads **EXIF metadata** and applies **OCR** for text extraction.
  - **Audio**: Retrieves **EXIF metadata** and performs **speech transcription**.
  - **HTML**: Transforms web pages into Markdown.
  - **Text-Based Formats**: Handles CSV, JSON, XML with ease.
  - **ZIP Files**: Iterates over contents for batch processing.
- **LLM Integration**: Leverages OpenAI's GPT-4 for enhanced extraction quality and contextual understanding.
- **Efficient and Fast**: Optimized for speed while maintaining high accuracy.
- **Easy Deployment**: Dockerized for hassle-free setup and scalability.
- **Interactive Playground**: Test conversion processes interactively using a **Gradio-powered interface**.
- **Usage Analytics**: Tracks token usage and access statistics via Upstash Redis.
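
To make the "tabular data as Markdown tables" and "iterates over ZIP contents" ideas concrete, here is a minimal standard-library sketch. It is not Docsifer's or MarkItDown's actual code: the helper names `csv_to_markdown` and `convert_zip` are hypothetical, only `.csv` members are handled, and cell contents are not escaped.

```python
import csv
import io
import zipfile


def csv_to_markdown(text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def convert_zip(data):
    """Walk a ZIP archive held in memory and convert each .csv member.

    Returns a dict mapping member name to Markdown. Other member types are
    skipped here; a real converter would dispatch on many more extensions.
    """
    results = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                results[name] = csv_to_markdown(archive.read(name).decode("utf-8"))
    return results
```

Keying the results by member name keeps the batch output traceable to its source file, which matters once an archive mixes several formats.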
## Use Cases
- **Knowledge Indexing**: Convert various document formats into Markdown for indexing and search.
- **Text Analysis**: Prepare data for semantic analysis and NLP tasks.
- **Content Transformation**: Simplify content preparation for blogs, documentation, or databases.
- **Metadata Extraction**: Extract meaningful metadata from images and audio for categorization and tagging.
## Getting Started
### 1. Clone the Repository
```bash
git clone https://github.com/lh0x00/docsifer.git
cd docsifer
```
### 2. Build and Run with Docker
Make sure Docker is installed and running on your machine.
```bash
docker build -t docsifer .
docker run -p 7860:7860 docsifer
```
The API will now be accessible at `http://localhost:7860`.
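
The container may take a few seconds to start listening. A small readiness check like the one below can be handy in scripts. This is a sketch, not part of Docsifer: the helper name `wait_until_up` is hypothetical, and probing `/docs` assumes the Swagger UI path mentioned in this README is enabled.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url="http://localhost:7860/docs", timeout=30.0, interval=1.0):
    """Poll `url` until the server answers or `timeout` seconds elapse.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server responded, even if this path returned an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not ready yet, try again.
            time.sleep(interval)
    return False
```

Calling `wait_until_up()` right after `docker run` lets a script block until the service is reachable before sending conversion requests.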
## API Overview
### Endpoints
- **`/v1/convert`**: Convert a file to Markdown. Supports both file uploads and file path inputs. Accepts optional OpenAI parameters to enable LLM-based enhancements.
- **`/v1/stats`**: Retrieve usage statistics, including access counts and token usage.
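
The two endpoints above could be called from Python roughly as follows. Treat this as a sketch under stated assumptions rather than the documented contract: it assumes `/v1/convert` accepts a multipart `POST` with the upload in a field named `file` and the optional OpenAI settings as a JSON string in a field named `openai`, that `/v1/stats` answers `GET`, and that both return JSON. The helper names are hypothetical. Check the interactive docs for the real schema.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:7860"  # address from the Docker step


def build_convert_request(filename, data, openai=None, base_url=BASE_URL):
    """Build (without sending) a multipart POST for /v1/convert.

    Field names "file" and "openai" are assumptions, not confirmed API.
    The filename is inserted as-is, so avoid quotes or newlines in it.
    """
    boundary = uuid.uuid4().hex
    parts = []
    if openai is not None:
        parts.append((
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="openai"\r\n\r\n'
            f"{json.dumps(openai)}\r\n"
        ).encode())
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/v1/convert",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def convert(filename, data, openai=None, base_url=BASE_URL):
    """Send the conversion request and decode the (assumed) JSON response."""
    request = build_convert_request(filename, data, openai, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def get_stats(base_url=BASE_URL):
    """Fetch usage statistics, assuming GET /v1/stats returns JSON."""
    with urllib.request.urlopen(f"{base_url}/v1/stats") as resp:
        return json.load(resp)
```

Splitting request construction from sending makes it easy to inspect the payload before any network call. The function can then be adjusted once the real field names are confirmed.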
### Interactive Docs
- Visit the [Swagger UI](http://localhost:7860/docs) for detailed, interactive documentation.
- Explore additional resources with [ReDoc](http://localhost:7860/redoc).
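
FastAPI applications serve a machine-readable OpenAPI document at `/openapi.json` by default. Assuming Docsifer keeps that default, the sketch below lists the available operations so that request fields can be checked against the real schema. The helper names `fetch_spec` and `list_operations` are hypothetical.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_spec(base_url="http://localhost:7860"):
    """Download the OpenAPI document (assumes the default /openapi.json path)."""
    with urllib.request.urlopen(f"{base_url}/openapi.json") as resp:
        return json.load(resp)


def list_operations(spec):
    """Return sorted (METHOD, path) pairs found in an OpenAPI document."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations)
```

With the server running, `list_operations(fetch_spec())` gives a quick overview of which HTTP methods each path accepts.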
## Playground
### Interactive Conversion
- Test file conversion directly in the browser using the **Gradio interface**.
- Simply visit `http://localhost:7860` after starting the server to access the playground.
### Features
- **File Upload**: Upload a file directly or provide a local file path.
- **OpenAI Integration**: Optionally provide OpenAI API details to enhance conversion with LLM capabilities.
- **Conversion Result**: View the resulting Markdown output instantly.
- **Usage Statistics**: Monitor access and token usage through the Gradio interface.
## Resources
- **Documentation**: [Explore full documentation](https://lamhieu-docsifer.hf.space/docs)
- **Hugging Face Space**: [Try the live demo](https://huggingface.co/spaces/lh0x00/docsifer)
- **GitHub Repository**: [View source code](https://github.com/lh0x00/docsifer)
## Why Docsifer?
1. **Versatile and Comprehensive**: Handles a wide range of formats, making it a one-stop solution for content conversion.
2. **AI-Powered**: Uses OpenAI's GPT-4 to enhance extraction accuracy and adapt to complex data structures.
3. **User-Friendly**: Offers intuitive APIs and a built-in interactive interface for experimentation.
4. **Scalable and Efficient**: Optimized for performance with Docker support and asynchronous processing.
5. **Transparent Analytics**: Tracks usage metrics to help monitor and manage service consumption.
## Contributors
- **lamhieu / lh0x00**: Creator and Maintainer ([GitHub](https://github.com/lh0x00), [HuggingFace](https://huggingface.co/lamhieu))

Contributions are welcome! Check out the [contribution guidelines](https://github.com/lh0x00/docsifer/blob/main/CONTRIBUTING.md).
## License
This project is licensed under the **MIT License**. See the [LICENSE](https://github.com/lh0x00/docsifer/blob/main/LICENSE) file for details.