---
title: Docsifer
emoji: 👻 / 📚
colorFrom: green
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---
# 📄 Docsifer: Efficient Data Conversion to Markdown
**Docsifer** is a powerful FastAPI + Gradio service for converting various data formats (PDF, PowerPoint, Word, Excel, Images, Audio, HTML, etc.) to Markdown. It leverages the [MarkItDown](https://github.com/microsoft/markitdown) library and can optionally use LLMs (via OpenAI) for richer extraction (OCR, speech-to-text, etc.).
## ✨ Key Features
- **Comprehensive Format Support**:
  - **PDF**: Extracts text and structure effectively.
  - **PowerPoint**: Converts slides into Markdown-friendly content.
  - **Word**: Processes `.docx` files with precision.
  - **Excel**: Extracts tabular data as Markdown tables.
  - **Images**: Reads **EXIF metadata** and applies **OCR** for text extraction.
  - **Audio**: Retrieves **EXIF metadata** and performs **speech transcription**.
  - **HTML**: Transforms web pages into Markdown.
  - **Text-Based Formats**: Handles CSV, JSON, XML with ease.
  - **ZIP Files**: Iterates over contents for batch processing.
- **LLM Integration**: Optionally uses OpenAI's GPT-4 for richer extraction, such as OCR and speech-to-text.
- **Efficient and Fast**: Optimized for speed while maintaining high accuracy.
- **Easy Deployment**: Dockerized for hassle-free setup and scalability.
- **Interactive Playground**: Test conversion processes interactively using a **Gradio-powered interface**.
- **Usage Analytics**: Tracks token usage and access statistics via Upstash Redis.
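
As a rough illustration of the conversion step behind these features, the sketch below calls MarkItDown directly, using the `MarkItDown(...)` constructor, `convert()` method, and `text_content` attribute described in MarkItDown's own README. That Docsifer wraps the library in a similar way is an assumption, and the helper name `convert_to_markdown` is hypothetical, not part of Docsifer.

```python
def convert_to_markdown(path, llm_client=None, llm_model=None):
    """Convert one local file to Markdown text using MarkItDown.

    Hypothetical helper for illustration; not Docsifer's actual code.
    """
    # Imported lazily so this module still loads where markitdown is absent.
    from markitdown import MarkItDown

    kwargs = {}
    if llm_client is not None and llm_model is not None:
        # MarkItDown accepts an OpenAI-style client plus a model name
        # to enable its optional LLM-assisted extraction.
        kwargs = {"llm_client": llm_client, "llm_model": llm_model}

    converter = MarkItDown(**kwargs)
    result = converter.convert(str(path))
    return result.text_content
```

Calling `convert_to_markdown("report.pdf")` would return the Markdown as a string; passing an OpenAI client and a model name opts in to the LLM path.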
## 🚀 Use Cases
- **Knowledge Indexing**: Convert various document formats into Markdown for indexing and search.
- **Text Analysis**: Prepare data for semantic analysis and NLP tasks.
- **Content Transformation**: Simplify content preparation for blogs, documentation, or databases.
- **Metadata Extraction**: Extract meaningful metadata from images and audio for categorization and tagging.
## 🛠️ Getting Started
### 1. Clone the Repository
```bash
git clone https://github.com/lh0x00/docsifer.git
cd docsifer
```
### 2. Build and Run with Docker
Make sure Docker is installed and running on your machine.
```bash
# Build the image from the repository's Dockerfile
docker build -t docsifer .
# Run the container and expose the service on port 7860
docker run -p 7860:7860 docsifer
```
The API will now be accessible at `http://localhost:7860`.
## 📖 API Overview
### Endpoints
- **`/v1/convert`**: Convert a file to Markdown. Supports both file uploads and file path inputs. Accepts optional OpenAI parameters to enable LLM-based enhancements.
- **`/v1/stats`**: Retrieve usage statistics, including access counts and token usage.
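
The two endpoints above could be called from Python roughly as follows. This is a minimal sketch using only the standard library, and it rests on several assumptions not stated in this README: that `/v1/convert` accepts a multipart `POST`, that `/v1/stats` answers a `GET`, and that the form fields are named `file` and `openai` (with the OpenAI details sent as a JSON string). Check the Swagger UI for the actual schema before relying on any of these names.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:7860"  # address used in the Docker step


def build_convert_request(filename, data, openai=None, base_url=BASE_URL):
    """Build (but do not send) a multipart POST for /v1/convert.

    ASSUMPTION: the field names "file" and "openai" are illustrative only.
    """
    boundary = uuid.uuid4().hex
    parts = [
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode() + data + b"\r\n"
    ]
    if openai is not None:
        # Optional OpenAI settings, serialized as a JSON form field (assumed).
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="openai"\r\n\r\n'
                f"{json.dumps(openai)}\r\n"
            ).encode()
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/v1/convert",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def build_stats_request(base_url=BASE_URL):
    """Build (but do not send) a GET for /v1/stats (HTTP method assumed)."""
    return urllib.request.Request(f"{base_url}/v1/stats", method="GET")
```

With the server running, either request can be sent using `urllib.request.urlopen(req)` and the response body read from the returned object.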
### Interactive Docs
- Visit the [Swagger UI](http://localhost:7860/docs) for detailed, interactive documentation.
- Explore additional resources with [ReDoc](http://localhost:7860/redoc).
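
Both documentation pages are rendered from an OpenAPI schema, which FastAPI serves at `/openapi.json` by default. Assuming Docsifer keeps that default (this README does not say either way), a short script along these lines could list the available operations without opening a browser. The function names are hypothetical.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_operations(schema):
    """Return (METHOD, path) pairs from an OpenAPI schema dict, sorted by path."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations, key=lambda op: (op[1], op[0]))


def fetch_schema(base_url="http://localhost:7860"):
    """Download the schema; assumes the default FastAPI /openapi.json route."""
    with urllib.request.urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)
```

With the server running, `list_operations(fetch_schema())` would return the method and path of every documented route.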
## 🔬 Playground
### Interactive Conversion
- Test file conversion directly in the browser using the **Gradio interface**.
- Simply visit `http://localhost:7860` after starting the server to access the playground.
### Features
- **File Upload**: Upload a file directly or provide a local file path.
- **OpenAI Integration**: Optionally provide OpenAI API details to enhance conversion with LLM capabilities.
- **Conversion Result**: View the resulting Markdown output instantly.
- **Usage Statistics**: Monitor access and token usage through the Gradio interface.
## 🌐 Resources
- **Documentation**: [Explore full documentation](https://lamhieu-docsifer.hf.space/docs)
- **Hugging Face Space**: [Try the live demo](https://huggingface.co/spaces/lamhieu/docsifer)
- **GitHub Repository**: [View source code](https://github.com/lh0x00/docsifer)
## 💡 Why Docsifer?
1. **Versatile and Comprehensive**: Handles a wide range of formats, making it a one-stop solution for content conversion.
2. **AI-Powered**: Can optionally use OpenAI's GPT-4 to improve extraction from harder inputs such as images and audio.
3. **User-Friendly**: Offers intuitive APIs and a built-in interactive interface for experimentation.
4. **Scalable and Efficient**: Optimized for performance with Docker support and asynchronous processing.
5. **Transparent Analytics**: Tracks usage metrics to help monitor and manage service consumption.
## 👥 Contributors
- **lamhieu / lh0x00** – Creator and Maintainer ([GitHub](https://github.com/lh0x00), [HuggingFace](https://huggingface.co/lamhieu))
Contributions are welcome! Check out the [contribution guidelines](https://github.com/lh0x00/docsifer/blob/main/CONTRIBUTING.md).
## 📜 License
This project is licensed under the **MIT License**. See the [LICENSE](https://github.com/lh0x00/docsifer/blob/main/LICENSE) file for details.