---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models with llama.cpp
---

This Streamlit app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`.
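
Under the hood, inference goes through the OpenAI-style chat API that `llama-cpp-python` exposes. Below is a minimal sketch of that call, not the app's exact code; the model path is an illustrative placeholder for a downloaded GGUF file:

```python
from llama_cpp import Llama

# Load a quantized GGUF model; the path is a placeholder for a file
# downloaded from one of the repos listed below.
llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q2_k.gguf",
    n_ctx=4096,  # modest context length to stay within limited RAM
)

# llama-cpp-python exposes an OpenAI-style chat completion API.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GGUF in one sentence."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```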

### 🚀 Supported Models:

- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q4_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q4_K_M.gguf`
- `MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF` → `Meta-Llama-3.1-8B-Instruct.Q2_K.gguf`
- `unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF` → `DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf`
- `MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF` → `Mistral-7B-Instruct-v0.3.IQ3_XS.gguf`
- `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`
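
Each repo/filename pair above can be fetched with `huggingface_hub` (the app downloads models automatically; this sketch shows the equivalent manual call, with `local_dir` as an illustrative choice):

```python
from huggingface_hub import hf_hub_download

# Download one repo/filename pair from the list above.
model_path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q2_k.gguf",
    local_dir="models",  # illustrative target directory
)
print(model_path)  # local path to the .gguf file
```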

### ⚙️ Features:

- Model selection in the sidebar
- Customizable system prompt and generation parameters
- Chat-style UI with streaming responses
- **Markdown output rendering** for readable, styled output
- **DeepSeek-compatible `<think>` tag handling**: shows model reasoning in a collapsible expander (sketched after this list)
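
The `<think>` handling can be done with a small post-processing step. A sketch of one way to do it (function and expander label are illustrative, not the app's exact code):

```python
import re
import streamlit as st

def render_reply(reply: str) -> None:
    """Render a model reply, folding any <think>...</think> block
    into a collapsible expander."""
    match = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if match:
        # Show the model's reasoning in a collapsible section.
        with st.expander("Model reasoning"):
            st.markdown(match.group(1).strip())
        # Strip the reasoning block from the visible answer.
        reply = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)
    st.markdown(reply.strip())  # Markdown rendering of the answer
```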

### 🧠 Memory-Safe Design (for HuggingFace Spaces):

- Loads only **one model at a time** to prevent memory bloat
- Uses **manual unloading and `gc.collect()`** to free memory when switching models (sketched after this list)
- Adjusts the `n_ctx` context length to operate within a 16 GB RAM limit
- Automatically downloads models as needed
- Limits history to the **last 8 user-assistant turns** to prevent context overflow
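
A sketch of the unload-then-load pattern and the history cap, assuming the model handle lives in `st.session_state` (the key and helper names are illustrative):

```python
import gc
import streamlit as st
from llama_cpp import Llama

def switch_model(model_path: str) -> Llama:
    # Drop any previously loaded model first, so only one model
    # occupies RAM at a time.
    old = st.session_state.pop("llm", None)
    if old is not None:
        del old
        gc.collect()  # reclaim the freed llama.cpp buffers
    st.session_state["llm"] = Llama(model_path=model_path, n_ctx=4096)
    return st.session_state["llm"]

def truncate_history(messages: list[dict]) -> list[dict]:
    # Keep only the last 8 user-assistant turns (16 messages)
    # to avoid overflowing the context window.
    return messages[-16:]
```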

Ideal for deploying multiple GGUF chat models on **free-tier HuggingFace Spaces**!

Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference