---
title: LLM Structured Output Docker
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
  - llama-cpp
  - gguf
  - json-schema
  - structured-output
  - llm
  - docker
  - gradio
  - grammar
  - gbnf
---

# 🤖 LLM Structured Output (Docker Version)

Dockerized application for getting structured responses from local GGUF language models in a specified JSON format.

## ✨ Key Features

- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON schema support** for structured output
- **🔗 Grammar-based structured output** (GBNF) for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models

## 🚀 Deployment on HuggingFace Spaces

This version is specifically designed for HuggingFace Spaces with the Docker SDK:

1. Clone this repository
2. Push to HuggingFace Spaces with `sdk: docker` in README.md
3. The application will automatically build and deploy

## 🐳 Local Docker Usage

### Build the image:

```bash
docker build -t llm-structured-output .
```

### Run the container:

```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```

### With custom configuration:

```bash
docker run -p 7860:7860 \
  -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
  -e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
  -e N_CTX="4096" \
  -e MAX_NEW_TOKENS="512" \
  llm-structured-output
```

## 🌐 Application Access

- **Web interface**: http://localhost:7860
- **API**: Available through the same port
- **Health check**: http://localhost:7860/health (when API mode is enabled)

## 📝 Environment Variables

Configure the application using environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_REPO` | `lmstudio-community/gemma-3n-E4B-it-text-GGUF` | HuggingFace model repository |
| `MODEL_FILENAME` | `gemma-3n-E4B-it-Q8_0.gguf` | Model file name |
| `N_CTX` | `4096` | Context window size |
| `N_GPU_LAYERS` | `0` | GPU layers (0 for CPU-only) |
| `N_THREADS` | `4` | CPU threads |
| `MAX_NEW_TOKENS` | `256` | Maximum response length |
| `TEMPERATURE` | `0.1` | Generation temperature |
| `HUGGINGFACE_TOKEN` | *(empty)* | HF token for private models |

## 📋 Usage Examples

### Example JSON Schema:

```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```

### Example Prompt:

```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```

## 🔧 Docker Optimizations

This Docker version includes several optimizations:

- **Reduced memory usage** with smaller context window and batch sizes
- **CPU-optimized** configuration by default
- **Efficient layer caching** for faster builds
- **Security**: Runs as a non-root user
- **Multi-stage build** capabilities for production

## 🏗️ Architecture

- **Base Image**: Python 3.10 slim
- **ML Backend**: llama-cpp-python with OpenBLAS
- **Web Interface**: Gradio 4.x
- **API**: FastAPI with automatic documentation
- **Model Storage**: Downloaded on first run to `/app/models/`

## 💡 Performance Tips

1. **Memory**: Start with smaller models (7B or less)
2. **CPU**: Adjust `N_THREADS` based on available cores
3. **Context**: Reduce `N_CTX` if experiencing memory issues
4. **Batch size**: Lower `N_BATCH` for memory-constrained environments

## 🔗 Grammar Mode (GBNF)

This project now supports **grammar-based structured output** using GBNF (GGML BNF, llama.cpp's grammar format based on Backus-Naur Form) for more precise JSON generation:

### ✨ What is Grammar Mode?

Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure. This provides:

- **100% valid JSON** - No parsing errors
- **Schema compliance** - Guaranteed structure adherence
- **Consistent output** - Reliable format every time
- **Better performance** - Fewer retry attempts needed
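For a concrete picture of what this conversion produces, here is a hand-written sketch of a GBNF grammar for the example sentiment schema above (simplified to the two required fields; the grammar the app actually generates will differ in detail):

```gbnf
# Matches: {"summary": "...", "sentiment": "positive" | "negative" | "neutral"}
root      ::= "{" ws "\"summary\"" ws ":" ws string "," ws "\"sentiment\"" ws ":" ws sentiment ws "}"
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
string    ::= "\"" ([^"\\] | "\\" ["\\bfnrt])* "\""
ws        ::= [ \t\n]*
```

During sampling, any token that would violate the grammar is masked out, which is why the output is valid JSON by construction rather than by luck.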
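llama-cpp-python can perform this schema-to-grammar conversion and apply the result directly. A minimal standalone sketch, assuming a recent llama-cpp-python version that provides `LlamaGrammar.from_json_schema` (the model path and parameters here are illustrative, not this app's actual code):

```python
import json
from llama_cpp import Llama, LlamaGrammar

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "sentiment"],
}

# Compile the JSON Schema into a GBNF grammar object.
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

# Illustrative model path; this app downloads the file to /app/models/ on first run.
llm = Llama(model_path="models/gemma-3n-E4B-it-Q8_0.gguf", n_ctx=4096, n_threads=4)

result = llm(
    'Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."',
    grammar=grammar,   # sampling can only produce grammar-legal tokens
    max_tokens=256,
    temperature=0.1,
)
print(result["choices"][0]["text"])  # valid JSON matching the schema
```

If your llama-cpp-python version lacks `from_json_schema`, `LlamaGrammar.from_string` with a hand-written grammar like the sketch above works the same way.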
### 🎛️ Usage

**In Gradio Interface:**

- Toggle the "🔗 Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results

**In API:**

```json
{
  "prompt": "Your prompt here",
  "json_schema": { your_schema },
  "use_grammar": true
}
```

**In Python:**

```python
result = llm_client.generate_structured_response(
    prompt="Your prompt",
    json_schema=schema,
    use_grammar=True  # Enable grammar mode
)
```

### 🔄 Mode Comparison

| Feature | Grammar Mode | Schema Guidance Mode |
|---------|--------------|----------------------|
| JSON validity | 100% guaranteed | High, but may need parsing |
| Schema compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |

### 🛠️ Supported Schema Features

- ✅ Objects with required/optional properties
- ✅ Arrays with typed items
- ✅ String enums
- ✅ Numbers and integers
- ✅ Booleans
- ✅ Nested objects and arrays
- ⚠️ Complex conditionals (simplified)

## 🔍 Troubleshooting

### Container fails to start:

- Check available memory (minimum 4GB recommended)
- Verify model repository accessibility
- Ensure proper environment variable formatting

### Model download issues:

- Check internet connectivity in the container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space

### Performance issues:

- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models

## 📄 License

MIT License - see LICENSE file for details.

---

For more information about HuggingFace Spaces Docker configuration, see:
https://huggingface.co/docs/hub/spaces-config-reference