---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models with llama.cpp
---

This Streamlit app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`.
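
Under the hood, inference goes through the OpenAI-style chat API that `llama-cpp-python` exposes. Below is a minimal sketch of that call, not the app's exact code; the model path is an illustrative placeholder for a downloaded GGUF file:

```python
from llama_cpp import Llama

# Load a quantized GGUF model; the path is a placeholder for a file
# downloaded from one of the repos listed below.
llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q2_k.gguf",
    n_ctx=4096,  # modest context length to stay within limited RAM
)

# llama-cpp-python exposes an OpenAI-style chat completion API.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GGUF in one sentence."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```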

### 🚀 Supported Models:

- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q4_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q4_K_M.gguf`
- `MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF` → `Meta-Llama-3.1-8B-Instruct.Q2_K.gguf`
- `unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF` → `DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf`
- `MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF` → `Mistral-7B-Instruct-v0.3.IQ3_XS.gguf`
- `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`
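
Each repo/filename pair above can be fetched with `huggingface_hub` (the app downloads models automatically; this sketch shows the equivalent manual call, with `local_dir` as an illustrative choice):

```python
from huggingface_hub import hf_hub_download

# Download one repo/filename pair from the list above.
model_path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q2_k.gguf",
    local_dir="models",  # illustrative target directory
)
print(model_path)  # local path to the .gguf file
```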

### ⚙️ Features:

- Model selection in the sidebar
- Customizable system prompt and generation parameters
- Chat-style UI with streaming responses
- **Markdown output rendering** for readable, styled output
- **DeepSeek-compatible `<think>` tag handling**: shows model reasoning in a collapsible expander (sketched after this list)
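
The `<think>` handling can be done with a small post-processing step. A sketch of one way to do it (function and expander label are illustrative, not the app's exact code):

```python
import re
import streamlit as st

def render_reply(reply: str) -> None:
    """Render a model reply, folding any <think>...</think> block
    into a collapsible expander."""
    match = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if match:
        # Show the model's reasoning in a collapsible section.
        with st.expander("Model reasoning"):
            st.markdown(match.group(1).strip())
        # Strip the reasoning block from the visible answer.
        reply = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)
    st.markdown(reply.strip())  # Markdown rendering of the answer
```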

### 🧠 Memory-Safe Design (for HuggingFace Spaces):

- Loads only **one model at a time** to prevent memory bloat
- Uses **manual unloading and `gc.collect()`** to free memory when switching models (sketched after this list)
- Adjusts the `n_ctx` context length to operate within a 16 GB RAM limit
- Automatically downloads models as needed
- Limits history to the **last 8 user-assistant turns** to prevent context overflow
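
A sketch of the unload-then-load pattern and the history cap, assuming the model handle lives in `st.session_state` (the key and helper names are illustrative):

```python
import gc
import streamlit as st
from llama_cpp import Llama

def switch_model(model_path: str) -> Llama:
    # Drop any previously loaded model first, so only one model
    # occupies RAM at a time.
    old = st.session_state.pop("llm", None)
    if old is not None:
        del old
        gc.collect()  # reclaim the freed llama.cpp buffers
    st.session_state["llm"] = Llama(model_path=model_path, n_ctx=4096)
    return st.session_state["llm"]

def truncate_history(messages: list[dict]) -> list[dict]:
    # Keep only the last 8 user-assistant turns (16 messages)
    # to avoid overflowing the context window.
    return messages[-16:]
```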

Ideal for deploying multiple GGUF chat models on **free-tier HuggingFace Spaces**!

Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference