Update README.md
README.md CHANGED
@@ -150,7 +150,8 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-
+# AWQ only works for H100 INT4 so far
+base_config = Int4WeightOnlyConfig(group_size=128)
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
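The `base_config = Int4WeightOnlyConfig(group_size=128)` line added above groups weight values into blocks of 128, each sharing one quantization scale. For intuition only, here is a hedged pure-Python sketch of symmetric group-wise INT4 quantization; the function names are hypothetical and this is not torchao's actual implementation (which runs fused kernels on packed tensors):

```python
# Illustrative sketch of group-wise symmetric INT4 quantization, similar in
# spirit to Int4WeightOnlyConfig(group_size=128). Hypothetical helper names;
# not torchao's implementation.

def quantize_int4_groupwise(weights, group_size=128):
    """Quantize a flat list of floats to INT4, one scale per group."""
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric scheme: map the group's max magnitude to 7 (INT4 max).
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # Round to nearest and clamp to the signed INT4 range [-8, 7].
        qvals.append([max(-8, min(7, round(w / scale))) for w in group])
    return qvals, scales

def dequantize(qvals, scales):
    return [q * s for qgroup, s in zip(qvals, scales) for q in qgroup]

weights = [0.5, -1.2, 0.03, 0.7, -0.9, 1.1, -0.2, 0.4]
q, s = quantize_int4_groupwise(weights, group_size=4)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

A smaller `group_size` gives each scale fewer values to cover, so round-trip error shrinks at the cost of storing more scales; 128 is a common accuracy/size trade-off.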
@@ -216,10 +217,13 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run on mmlu for sanity check.
 
-| Benchmark | | |
-|----------------------------------|------------------------|--------------------------------|
-| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 |
-| philosophy | 79.10 | 75.56 |
+| Benchmark | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| philosophy | 79.10 | 75.56 | 76.85 |
+
+
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100 optimized checkpoint for INT4
 
 
 <details>
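A quick arithmetic check of the mmlu "philosophy" scores quoted above: plain INT4 costs about 3.5 points versus the bf16 baseline, while the AWQ checkpoint recovers roughly 1.3 of those points. A minimal sketch of the deltas (scores taken directly from the table):

```python
# Accuracy deltas for the mmlu "philosophy" subset, from the table above.
baseline = 79.10   # google/gemma-3-12b-it (bf16)
int4 = 75.56       # jerryzh168/gemma-3-12b-it-INT4
awq_int4 = 76.85   # pytorch/gemma-3-12b-it-AWQ-INT4

for name, score in [("INT4", int4), ("AWQ-INT4", awq_int4)]:
    drop = baseline - score
    print(f"{name}: -{drop:.2f} points ({drop / baseline:.1%} relative)")
```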
@@ -247,11 +251,12 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 ## Results
 
-| Benchmark | | |
-|----------------------------------|------------------------|--------------------------------|
-| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 |
-| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) |
+| Benchmark | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) | 12.71 (48% reduction) |
 
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100 optimized checkpoint for INT4
 
 <details>
 <summary> Reproduce Peak Memory Usage Results </summary>
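The reduction percentages in the peak-memory table follow directly from the measured numbers; a minimal sketch verifying them (values taken from the table):

```python
# Sanity-check the reduction percentages quoted in the peak-memory table.
baseline_gb = 24.50  # google/gemma-3-12b-it (bf16)
checkpoints = {
    "jerryzh168/gemma-3-12b-it-INT4": 8.57,
    "pytorch/gemma-3-12b-it-AWQ-INT4": 12.71,
}
for name, peak_gb in checkpoints.items():
    reduction = 1 - peak_gb / baseline_gb
    print(f"{name}: {peak_gb} GB ({reduction:.0%} reduction)")
```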