---
title: LLM Structured Output Docker
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
  - llama-cpp
  - gguf
  - json-schema
  - structured-output
  - llm
  - docker
  - gradio
  - grammar
  - gbnf
---

# πŸ€– LLM Structured Output (Docker Version)

A Dockerized application for getting structured responses from local GGUF language models in a specified JSON format.

## ✨ Key Features

- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON Schema support** for structured output
- πŸ”— **Grammar-based structured output (GBNF)** for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models

## πŸš€ Deployment on HuggingFace Spaces

This version is designed specifically for HuggingFace Spaces with the Docker SDK:

1. Clone this repository
2. Push to HuggingFace Spaces with `sdk: docker` in the README.md front matter
3. The application will automatically build and deploy

## 🐳 Local Docker Usage

Build the image:

```bash
docker build -t llm-structured-output .
```

Run the container:

```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```

With custom configuration:

```bash
docker run -p 7860:7860 \
  -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
  -e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
  -e N_CTX="4096" \
  -e MAX_NEW_TOKENS="512" \
  llm-structured-output
```

## 🌐 Application Access

Once the container is running, the Gradio web interface is served on port 7860: http://localhost:7860 locally, or your Space URL on HuggingFace Spaces.

πŸ“ Environment Variables

Configure the application using environment variables:

Variable Default Description
MODEL_REPO lmstudio-community/gemma-3n-E4B-it-text-GGUF HuggingFace model repository
MODEL_FILENAME gemma-3n-E4B-it-Q8_0.gguf Model file name
N_CTX 4096 Context window size
N_GPU_LAYERS 0 GPU layers (0 for CPU-only)
N_THREADS 4 CPU threads
MAX_NEW_TOKENS 256 Maximum response length
TEMPERATURE 0.1 Generation temperature
HUGGINGFACE_TOKEN `` HF token for private models
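
As a rough illustration of how these variables are typically consumed, a minimal sketch in Python is shown below. The variable names and defaults match the table; the actual parsing logic in this project's code is an assumption.

```python
import os

# Hypothetical sketch: read the documented environment variables with the
# defaults from the table above. The application's real parsing may differ.
MODEL_REPO = os.getenv("MODEL_REPO", "lmstudio-community/gemma-3n-E4B-it-text-GGUF")
MODEL_FILENAME = os.getenv("MODEL_FILENAME", "gemma-3n-E4B-it-Q8_0.gguf")
N_CTX = int(os.getenv("N_CTX", "4096"))
N_GPU_LAYERS = int(os.getenv("N_GPU_LAYERS", "0"))  # 0 = CPU-only
N_THREADS = int(os.getenv("N_THREADS", "4"))
MAX_NEW_TOKENS = int(os.getenv("MAX_NEW_TOKENS", "256"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN") or None  # only for private repos
```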

## πŸ“‹ Usage Examples

**Example JSON Schema:**

```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```

**Example Prompt:**

```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```
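
As an end-to-end illustration, the schema and prompt above can be sent to the container's REST API from Python. This is a sketch: the `/generate` route is an assumption, so consult the API documentation exposed by the container for the real endpoint and field names.

```python
import json

import requests

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "sentiment"],
}

payload = {
    "prompt": 'Analyze this review: "The product exceeded my expectations! '
              'Great quality and fast delivery."',
    "json_schema": schema,
    "use_grammar": True,  # request body fields as documented in the API section below
}

# NOTE: "/generate" is a hypothetical route; check the docs served by the
# container for the actual endpoint.
resp = requests.post("http://localhost:7860/generate", json=payload, timeout=120)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```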

## πŸ”§ Docker Optimizations

This Docker version includes several optimizations:

- **Reduced memory usage** with a smaller context window and batch sizes
- **CPU-optimized configuration** by default
- **Efficient layer caching** for faster builds
- **Security**: runs as a non-root user
- **Multi-stage build** capabilities for production

πŸ—οΈ Architecture

  • Base Image: Python 3.10 slim
  • ML Backend: llama-cpp-python with OpenBLAS
  • Web Interface: Gradio 4.x
  • API: FastAPI with automatic documentation
  • Model Storage: Downloaded on first run to /app/models/
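
For reference, the download step boils down to a `huggingface_hub` call along these lines. This is a sketch, not the project's exact code; the arguments mirror the default environment variables above.

```python
from huggingface_hub import hf_hub_download

# Hypothetical sketch of the model download driven by MODEL_REPO /
# MODEL_FILENAME; the project's actual download code may differ.
model_path = hf_hub_download(
    repo_id="lmstudio-community/gemma-3n-E4B-it-text-GGUF",
    filename="gemma-3n-E4B-it-Q8_0.gguf",
    local_dir="/app/models",
    token=None,  # pass HUGGINGFACE_TOKEN here for private repositories
)
print(model_path)  # e.g. /app/models/gemma-3n-E4B-it-Q8_0.gguf
```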

## πŸ’‘ Performance Tips

1. **Memory**: start with smaller models (7B parameters or less)
2. **CPU**: adjust `N_THREADS` to match the available cores
3. **Context**: reduce `N_CTX` if experiencing memory issues
4. **Batch size**: lower `N_BATCH` for memory-constrained environments

## πŸ”— Grammar Mode (GBNF)

This project supports grammar-based structured output using GBNF (GGML BNF, llama.cpp's dialect of Backus-Naur Form) for more precise JSON generation.

### ✨ What is Grammar Mode?

Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure (a minimal sketch follows the list). This provides:

- **100% valid JSON**: no parsing errors
- **Schema compliance**: guaranteed structure adherence
- **Consistent output**: reliable format every time
- **Better performance**: fewer retry attempts needed
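
Under the hood this relies on llama-cpp-python's JSON-Schema-to-GBNF converter. A minimal sketch of the technique, assuming a model file is already present in `/app/models/` (the model path and prompt are placeholders):

```python
import json

from llama_cpp import Llama
from llama_cpp.llama_grammar import LlamaGrammar

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["sentiment"],
}

# Convert the JSON Schema into a GBNF grammar object.
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

# "/app/models/model.gguf" is a placeholder path.
llm = Llama(model_path="/app/models/model.gguf", n_ctx=4096)
out = llm(
    'Classify: "Great quality and fast delivery."\nJSON:',
    grammar=grammar,  # constrains sampling to schema-valid JSON
    max_tokens=256,
    temperature=0.1,
)
print(out["choices"][0]["text"])
```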

### πŸŽ›οΈ Usage

**In the Gradio interface:**

- Toggle the "πŸ”— Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results

**In the API:**

```json
{
  "prompt": "Your prompt here",
  "json_schema": { your_schema },
  "use_grammar": true
}
```

**In Python:**

```python
# llm_client here is the application's model client instance.
result = llm_client.generate_structured_response(
    prompt="Your prompt",
    json_schema=schema,
    use_grammar=True,  # enable grammar mode
)
```

### πŸ”„ Mode Comparison

| Feature | Grammar Mode | Schema Guidance Mode |
|---------|--------------|----------------------|
| JSON validity | 100% guaranteed | High, but may need parsing |
| Schema compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |

### πŸ› οΈ Supported Schema Features

- βœ… Objects with required/optional properties
- βœ… Arrays with typed items
- βœ… String enums
- βœ… Numbers and integers
- βœ… Booleans
- βœ… Nested objects and arrays
- ⚠️ Complex conditionals (simplified)
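
As an illustration, a single schema can combine several of the supported features above (all field names here are invented for the example):

```python
# Illustrative schema only: nested object, typed array, enum, boolean,
# integer, and required/optional properties in one place.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "author": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "verified": {"type": "boolean"},
            },
            "required": ["name"],
        },
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["title", "sentiment"],
}
```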

## πŸ” Troubleshooting

**Container fails to start:**

- Check available memory (minimum 4 GB recommended)
- Verify that the model repository is accessible
- Ensure proper environment variable formatting

**Model download issues:**

- Check internet connectivity in the container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space

**Performance issues:**

- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models

## πŸ“„ License

MIT License - see the LICENSE file for details.


For more information about HuggingFace Spaces Docker configuration, see: https://huggingface.co/docs/hub/spaces-config-reference