Update README.md
README.md (CHANGED)
@@ -11,7 +11,7 @@ language:
 
 # AWQ-INT4 google/gemma-3-12b-it model
 
-- **Developed by:**
+- **Developed by:** pytorch
 - **License:** apache-2.0
 - **Quantized from Model :** google/gemma-3-12b-it
 - **Quantization Method :** AWQ-INT4

@@ -33,14 +33,14 @@ pip install torchao
 Then we can serve with the following command:
 ```Shell
 # Server
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-AWQ-INT4
 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3
 ```
 
 ```Shell
 # Client
 curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-  "model": "
+  "model": "pytorch/gemma-3-12b-it-AWQ-INT4",
   "messages": [
     {"role": "user", "content": "Give me a short introduction to large language models."}
   ],

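Editor's aside (not part of this change): the server started above exposes an OpenAI-compatible API, so the same request can also be sent from Python. A minimal sketch, assuming the `openai` package is installed and vLLM is listening on the default port 8000; the client code below is not taken from the README.

```Python
# Hypothetical client-side sketch (not from the README): talks to the
# OpenAI-compatible endpoint that `vllm serve` exposes on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="pytorch/gemma-3-12b-it-AWQ-INT4",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
)
print(response.choices[0].message.content)
```
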
@@ -69,7 +69,7 @@ Example:
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_name = "
+model_name = "pytorch/gemma-3-12b-it-AWQ-INT4"
 
 # load the tokenizer and the model
 tokenizer = AutoTokenizer.from_pretrained(model_name)

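Editor's aside (not part of this change): the hunk above shows only the head of the Transformers example; the rest lies outside the diff context. The sketch below is one plausible end-to-end continuation, mirroring the loading lines visible in the diff and assuming a single CUDA device, the tokenizer's built-in chat template, and an arbitrary `max_new_tokens`; the README's actual code may differ.

```Python
# Illustrative sketch only: loading mirrors the hunk above; the chat-template
# and generation steps are assumptions about how the example continues.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pytorch/gemma-3-12b-it-AWQ-INT4"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cuda:0", torch_dtype=torch.bfloat16
)

# build a chat-formatted prompt and generate
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
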
@@ -240,7 +240,7 @@ lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it --tasks mmlu --
 
 ## AWQ-INT4
 ```Shell
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-AWQ-INT4
 lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 --batch_size 8
 ```
 </details>

@@ -268,8 +268,8 @@ We can use the following code to get a sense of peak memory usage during inference
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 
-# use "google/gemma-3-12b-it" or "
-model_id = "
+# use "google/gemma-3-12b-it" or "pytorch/gemma-3-12b-it-AWQ-INT4"
+model_id = "pytorch/gemma-3-12b-it-AWQ-INT4"
 quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", torch_dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 

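Editor's aside (not part of this change): the measurement part of this peak-memory check falls below the hunk and is not shown in the diff. One plausible way to complete it, assuming a single CUDA device and using the standard `torch.cuda` memory-statistics API; the prompt and `max_new_tokens` are placeholders, and the README's actual code may differ.

```Python
# Illustrative sketch only: loading mirrors the hunk above; the peak-memory
# measurement below is an assumption about how the check is completed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/gemma-3-12b-it-AWQ-INT4"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()  # start counting from a clean slate

prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak CUDA memory during inference: {peak_gb:.2f} GB")
```
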
@@ -349,7 +349,7 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 
 ### AWQ-INT4
 ```Shell
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-AWQ-INT4
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
 </details>