---
title: LLM Structured Output Docker
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
  - llama-cpp
  - gguf
  - json-schema
  - structured-output
  - llm
  - docker
  - gradio
  - grammar
  - gbnf
---

# πŸ€– LLM Structured Output (Docker Version)

A Dockerized application for getting structured responses from local GGUF language models in a specified JSON format.

## ✨ Key Features

- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON Schema support** for structured output
- πŸ”— **Grammar-based structured output (GBNF)** for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models

## πŸš€ Deployment on HuggingFace Spaces

This version is designed specifically for HuggingFace Spaces with the Docker SDK:

1. Clone this repository
2. Push to HuggingFace Spaces with `sdk: docker` in the README.md front matter
3. The application will automatically build and deploy

## 🐳 Local Docker Usage

Build the image:

```bash
docker build -t llm-structured-output .
```

Run the container:

```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```

With custom configuration:

```bash
docker run -p 7860:7860 \
  -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
  -e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
  -e N_CTX="4096" \
  -e MAX_NEW_TOKENS="512" \
  llm-structured-output
```

## 🌐 Application Access

Once the container is running, the Gradio web interface is served on port 7860: http://localhost:7860 locally, or your Space URL on HuggingFace Spaces.

πŸ“ Environment Variables

Configure the application using environment variables:

Variable Default Description
MODEL_REPO lmstudio-community/gemma-3n-E4B-it-text-GGUF HuggingFace model repository
MODEL_FILENAME gemma-3n-E4B-it-Q8_0.gguf Model file name
N_CTX 4096 Context window size
N_GPU_LAYERS 0 GPU layers (0 for CPU-only)
N_THREADS 4 CPU threads
MAX_NEW_TOKENS 256 Maximum response length
TEMPERATURE 0.1 Generation temperature
HUGGINGFACE_TOKEN `` HF token for private models
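
As a rough illustration of how these variables are typically consumed, a minimal sketch in Python is shown below. The variable names and defaults match the table; the actual parsing logic in this project's code is an assumption.

```python
import os

# Hypothetical sketch: read the documented environment variables with the
# defaults from the table above. The application's real parsing may differ.
MODEL_REPO = os.getenv("MODEL_REPO", "lmstudio-community/gemma-3n-E4B-it-text-GGUF")
MODEL_FILENAME = os.getenv("MODEL_FILENAME", "gemma-3n-E4B-it-Q8_0.gguf")
N_CTX = int(os.getenv("N_CTX", "4096"))
N_GPU_LAYERS = int(os.getenv("N_GPU_LAYERS", "0"))  # 0 = CPU-only
N_THREADS = int(os.getenv("N_THREADS", "4"))
MAX_NEW_TOKENS = int(os.getenv("MAX_NEW_TOKENS", "256"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN") or None  # only for private repos
```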

## πŸ“‹ Usage Examples

**Example JSON Schema:**

```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```

**Example Prompt:**

```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```
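
As an end-to-end illustration, the schema and prompt above can be sent to the container's REST API from Python. This is a sketch: the `/generate` route is an assumption, so consult the API documentation exposed by the container for the real endpoint and field names.

```python
import json

import requests

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "sentiment"],
}

payload = {
    "prompt": 'Analyze this review: "The product exceeded my expectations! '
              'Great quality and fast delivery."',
    "json_schema": schema,
    "use_grammar": True,  # request body fields as documented in the API section below
}

# NOTE: "/generate" is a hypothetical route; check the docs served by the
# container for the actual endpoint.
resp = requests.post("http://localhost:7860/generate", json=payload, timeout=120)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```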

## πŸ”§ Docker Optimizations

This Docker version includes several optimizations:

- **Reduced memory usage** with a smaller context window and batch sizes
- **CPU-optimized configuration** by default
- **Efficient layer caching** for faster builds
- **Security**: runs as a non-root user
- **Multi-stage build** capabilities for production

πŸ—οΈ Architecture

  • Base Image: Python 3.10 slim
  • ML Backend: llama-cpp-python with OpenBLAS
  • Web Interface: Gradio 4.x
  • API: FastAPI with automatic documentation
  • Model Storage: Downloaded on first run to /app/models/
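
For reference, the download step boils down to a `huggingface_hub` call along these lines. This is a sketch, not the project's exact code; the arguments mirror the default environment variables above.

```python
from huggingface_hub import hf_hub_download

# Hypothetical sketch of the model download driven by MODEL_REPO /
# MODEL_FILENAME; the project's actual download code may differ.
model_path = hf_hub_download(
    repo_id="lmstudio-community/gemma-3n-E4B-it-text-GGUF",
    filename="gemma-3n-E4B-it-Q8_0.gguf",
    local_dir="/app/models",
    token=None,  # pass HUGGINGFACE_TOKEN here for private repositories
)
print(model_path)  # e.g. /app/models/gemma-3n-E4B-it-Q8_0.gguf
```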

## πŸ’‘ Performance Tips

1. **Memory**: start with smaller models (7B parameters or less)
2. **CPU**: adjust `N_THREADS` to match the available cores
3. **Context**: reduce `N_CTX` if experiencing memory issues
4. **Batch size**: lower `N_BATCH` for memory-constrained environments

## πŸ”— Grammar Mode (GBNF)

This project supports grammar-based structured output using GBNF (GGML BNF, llama.cpp's dialect of Backus-Naur Form) for more precise JSON generation.

### ✨ What is Grammar Mode?

Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure (a minimal sketch follows the list). This provides:

- **100% valid JSON**: no parsing errors
- **Schema compliance**: guaranteed structure adherence
- **Consistent output**: reliable format every time
- **Better performance**: fewer retry attempts needed
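
Under the hood this relies on llama-cpp-python's JSON-Schema-to-GBNF converter. A minimal sketch of the technique, assuming a model file is already present in `/app/models/` (the model path and prompt are placeholders):

```python
import json

from llama_cpp import Llama
from llama_cpp.llama_grammar import LlamaGrammar

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["sentiment"],
}

# Convert the JSON Schema into a GBNF grammar object.
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

# "/app/models/model.gguf" is a placeholder path.
llm = Llama(model_path="/app/models/model.gguf", n_ctx=4096)
out = llm(
    'Classify: "Great quality and fast delivery."\nJSON:',
    grammar=grammar,  # constrains sampling to schema-valid JSON
    max_tokens=256,
    temperature=0.1,
)
print(out["choices"][0]["text"])
```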

### πŸŽ›οΈ Usage

**In the Gradio interface:**

- Toggle the "πŸ”— Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results

**In the API:**

```json
{
  "prompt": "Your prompt here",
  "json_schema": { your_schema },
  "use_grammar": true
}
```

**In Python:**

```python
# llm_client here is the application's model client instance.
result = llm_client.generate_structured_response(
    prompt="Your prompt",
    json_schema=schema,
    use_grammar=True,  # enable grammar mode
)
```

### πŸ”„ Mode Comparison

| Feature | Grammar Mode | Schema Guidance Mode |
|---------|--------------|----------------------|
| JSON validity | 100% guaranteed | High, but may need parsing |
| Schema compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |

### πŸ› οΈ Supported Schema Features

- βœ… Objects with required/optional properties
- βœ… Arrays with typed items
- βœ… String enums
- βœ… Numbers and integers
- βœ… Booleans
- βœ… Nested objects and arrays
- ⚠️ Complex conditionals (simplified)
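
As an illustration, a single schema can combine several of the supported features above (all field names here are invented for the example):

```python
# Illustrative schema only: nested object, typed array, enum, boolean,
# integer, and required/optional properties in one place.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "author": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "verified": {"type": "boolean"},
            },
            "required": ["name"],
        },
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["title", "sentiment"],
}
```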

## πŸ” Troubleshooting

**Container fails to start:**

- Check available memory (minimum 4 GB recommended)
- Verify that the model repository is accessible
- Ensure proper environment variable formatting

**Model download issues:**

- Check internet connectivity in the container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space

**Performance issues:**

- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models

## πŸ“„ License

MIT License - see the LICENSE file for details.


For more information about HuggingFace Spaces Docker configuration, see: https://huggingface.co/docs/hub/spaces-config-reference