gemma-2-9b-it-fix-system-role-awq
A modified, 4-bit AWQ-quantized version of gemma-2-9b-it with an updated chat_template that adds support for the system role, avoiding errors such as:
- Conversation roles must alternate user/assistant/user/assistant/...
- System role not supported
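As a quick sanity check, you can render the chat template directly (no server needed) and confirm that a system message no longer raises an error. This is a minimal sketch; the exact rendered prompt depends on the template shipped in this repo's tokenizer_config.json:

from transformers import AutoTokenizer

model_id = "dangvansam/gemma-2-9b-it-fix-system-role-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# With the stock gemma-2-9b-it template this call raises
# "System role not supported"; here it should render a prompt string.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)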
Model Overview
- Model Architecture: Gemma 2
- Input: Text
- Output: Text
- Release Date: 04/12/2024
- Version: 1.0
- Quantization Method: 4-bit AutoAWQ
Deployment
Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
With the CLI:
vllm serve dangvansam/gemma-2-9b-it-fix-system-role-awq
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "dangvansam/gemma-2-9b-it-fix-system-role-awq",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"}
]
}'
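From Python, any OpenAI-compatible client can talk to the same server. The snippet below uses the official openai package; the base_url and the placeholder api_key value are assumptions that depend on how you launched the server:

from openai import OpenAI

# Point the client at the local vLLM server (adjust host/port as needed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dangvansam/gemma-2-9b-it-fix-system-role-awq",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)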
With Python:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "dangvansam/gemma-2-9b-it-fix-system-role-awq"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# The updated chat_template accepts the system role directly.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)
outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
With docker:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model dangvansam/gemma-2-9b-it-fix-system-role-awq
With docker-compose:
services:
  vllm:
    image: vllm/vllm-openai:latest
    restart: always
    shm_size: '48gb'
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    entrypoint: python3
    command: -m vllm.entrypoints.openai.api_server --port=8000 --host=0.0.0.0 --model dangvansam/gemma-2-9b-it-fix-system-role-awq --gpu-memory-utilization 1.0 --max-model-len 4096
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://0.0.0.0:8000/v1/models" ]
      interval: 30s
      timeout: 10s
      retries: 10
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    ports:
      - 8000:8000
Model tree for dangvansam/gemma-2-9b-it-fix-system-role-awq
- Base model: google/gemma-2-9b
- Finetuned: google/gemma-2-9b-it
- Finetuned: dangvansam/gemma-2-9b-it-fix-system-role