---
title: LLM Structured Output Docker
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
  - llama-cpp
  - gguf
  - json-schema
  - structured-output
  - llm
  - docker
  - gradio
  - grammar
  - gbnf
---

# 🤖 LLM Structured Output (Docker Version)

Dockerized application for getting structured responses from local GGUF language models in a specified JSON format.

## ✨ Key Features

- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON schema support** for structured output
- **🔗 Grammar-based structured output** (GBNF) for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models

## 🚀 Deployment on HuggingFace Spaces

This version is specifically designed for HuggingFace Spaces with the Docker SDK:

1. Clone this repository
2. Push to HuggingFace Spaces with `sdk: docker` in README.md
3. The application will automatically build and deploy

## 🐳 Local Docker Usage

### Build the image:

```bash
docker build -t llm-structured-output .
```

### Run the container:

```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```

### With custom configuration:

```bash
docker run -p 7860:7860 \
  -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
  -e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
  -e N_CTX="4096" \
  -e MAX_NEW_TOKENS="512" \
  llm-structured-output
```

## 🌐 Application Access

- **Web interface**: http://localhost:7860
- **API**: Available through the same port
- **Health check**: http://localhost:7860/health (when API mode is enabled)

## 📝 Environment Variables

Configure the application using environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_REPO` | `lmstudio-community/gemma-3n-E4B-it-text-GGUF` | HuggingFace model repository |
| `MODEL_FILENAME` | `gemma-3n-E4B-it-Q8_0.gguf` | Model file name |
| `N_CTX` | `4096` | Context window size |
| `N_GPU_LAYERS` | `0` | GPU layers (0 for CPU-only) |
| `N_THREADS` | `4` | CPU threads |
| `MAX_NEW_TOKENS` | `256` | Maximum response length |
| `TEMPERATURE` | `0.1` | Generation temperature |
| `HUGGINGFACE_TOKEN` | *(empty)* | HF token for private models |

## 📋 Usage Examples

### Example JSON Schema:

```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```

### Example Prompt:

```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```

## 🔧 Docker Optimizations

This Docker version includes several optimizations:

- **Reduced memory usage** with smaller context window and batch sizes
- **CPU-optimized** configuration by default
- **Efficient layer caching** for faster builds
- **Security**: Runs as a non-root user
- **Multi-stage build** capabilities for production

## 🏗️ Architecture

- **Base Image**: Python 3.10 slim
- **ML Backend**: llama-cpp-python with OpenBLAS
- **Web Interface**: Gradio 4.x
- **API**: FastAPI with automatic documentation
- **Model Storage**: Downloaded on first run to `/app/models/`

## 💡 Performance Tips

1. **Memory**: Start with smaller models (7B or less)
2. **CPU**: Adjust `N_THREADS` based on available cores
3. **Context**: Reduce `N_CTX` if experiencing memory issues
4. **Batch size**: Lower `N_BATCH` for memory-constrained environments

## 🔗 Grammar Mode (GBNF)

This project now supports **grammar-based structured output** using GBNF (GGML BNF, llama.cpp's grammar format based on Backus-Naur Form) for more precise JSON generation:

### ✨ What is Grammar Mode?

Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure. This provides:

- **100% valid JSON** - No parsing errors
- **Schema compliance** - Guaranteed structure adherence
- **Consistent output** - Reliable format every time
- **Better performance** - Fewer retry attempts needed
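For a concrete picture of what this conversion produces, here is a hand-written sketch of a GBNF grammar for the example sentiment schema above (simplified to the two required fields; the grammar the app actually generates will differ in detail):

```gbnf
# Matches: {"summary": "...", "sentiment": "positive" | "negative" | "neutral"}
root      ::= "{" ws "\"summary\"" ws ":" ws string "," ws "\"sentiment\"" ws ":" ws sentiment ws "}"
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
string    ::= "\"" ([^"\\] | "\\" ["\\bfnrt])* "\""
ws        ::= [ \t\n]*
```

During sampling, any token that would violate the grammar is masked out, which is why the output is valid JSON by construction rather than by luck.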
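llama-cpp-python can perform this schema-to-grammar conversion and apply the result directly. A minimal standalone sketch, assuming a recent llama-cpp-python version that provides `LlamaGrammar.from_json_schema` (the model path and parameters here are illustrative, not this app's actual code):

```python
import json
from llama_cpp import Llama, LlamaGrammar

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "sentiment"],
}

# Compile the JSON Schema into a GBNF grammar object.
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

# Illustrative model path; this app downloads the file to /app/models/ on first run.
llm = Llama(model_path="models/gemma-3n-E4B-it-Q8_0.gguf", n_ctx=4096, n_threads=4)

result = llm(
    'Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."',
    grammar=grammar,   # sampling can only produce grammar-legal tokens
    max_tokens=256,
    temperature=0.1,
)
print(result["choices"][0]["text"])  # valid JSON matching the schema
```

If your llama-cpp-python version lacks `from_json_schema`, `LlamaGrammar.from_string` with a hand-written grammar like the sketch above works the same way.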
### 🎛️ Usage

**In Gradio Interface:**

- Toggle the "🔗 Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results

**In API:**

```json
{
  "prompt": "Your prompt here",
  "json_schema": { your_schema },
  "use_grammar": true
}
```

**In Python:**

```python
result = llm_client.generate_structured_response(
    prompt="Your prompt",
    json_schema=schema,
    use_grammar=True  # Enable grammar mode
)
```

### 🔄 Mode Comparison

| Feature | Grammar Mode | Schema Guidance Mode |
|---------|--------------|----------------------|
| JSON validity | 100% guaranteed | High, but may need parsing |
| Schema compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |

### 🛠️ Supported Schema Features

- ✅ Objects with required/optional properties
- ✅ Arrays with typed items
- ✅ String enums
- ✅ Numbers and integers
- ✅ Booleans
- ✅ Nested objects and arrays
- ⚠️ Complex conditionals (simplified)

## 🔍 Troubleshooting

### Container fails to start:

- Check available memory (minimum 4GB recommended)
- Verify model repository accessibility
- Ensure proper environment variable formatting

### Model download issues:

- Check internet connectivity in the container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space

### Performance issues:

- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models

## 📄 License

MIT License - see LICENSE file for details.

---

For more information about HuggingFace Spaces Docker configuration, see:
https://huggingface.co/docs/hub/spaces-config-reference