jerryzh168 committed
Commit fb96386 · verified · 1 Parent(s): f483bc1

Update README.md

Files changed (1):
  1. README.md +14 -9
README.md CHANGED

@@ -150,7 +150,8 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-base_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")
+# AWQ only works for H100 INT4 so far
+base_config = Int4WeightOnlyConfig(group_size=128)
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
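
Aside: the hunk above swaps the explicit `int4_packing_format="tile_packed_to_4d"` / `int4_choose_qparams_algorithm="hqq"` options for the default `Int4WeightOnlyConfig`, and shows only the `step="prepare"` half of the AWQ flow. Below is a minimal sketch of the full prepare → calibrate → convert sequence, assuming torchao's prototype `AWQConfig` API and hypothetical calibration data (neither the import path nor the calibration set is part of this commit):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig, quantize_
from torchao.prototype.awq import AWQConfig  # assumed import path

model_id = "google/gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# AWQ only works for H100 INT4 so far (comment carried over from the hunk)
base_config = Int4WeightOnlyConfig(group_size=128)

# Step 1: wrap weights with observers that record activation statistics.
quantize_(model, AWQConfig(base_config, step="prepare"))

# Step 2: run a few calibration batches so the observers see real activations.
calibration_texts = ["What are we having for dinner?"]  # hypothetical calibration set
for text in calibration_texts:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        model(**inputs)

# Step 3: fold the collected scales into the final INT4 weight quantization.
quantize_(model, AWQConfig(base_config, step="convert"))
```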
@@ -216,10 +217,13 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run on mmlu for sanity check.
 
-| Benchmark | | | |
-|----------------------------------|------------------------|-----------------------------|---------------------------------|
-| | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
-| philosophy | 79.10 | 75.56 | 76.85 |
+| Benchmark | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| philosophy | 79.10 | 75.56 | 76.85 |
+
+
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100 optimized checkpoint for INT4
 
 
 <details>
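
Aside: the `philosophy` row above is a single MMLU subtask. A minimal sketch of reproducing such a number through lm-evaluation-harness's Python API follows; the task name `mmlu_philosophy`, batch size, and checkpoint choice are illustrative assumptions, and the repo's own (truncated) CLI command appears in the next hunk header:

```python
import lm_eval

# Hypothetical invocation of lm-evaluation-harness's documented Python entry point.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/gemma-3-12b-it-AWQ-INT4",
    tasks=["mmlu_philosophy"],  # assumed name of the MMLU philosophy subtask
    device="cuda:0",
    batch_size=8,
)
print(results["results"])
```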
@@ -247,11 +251,12 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 ## Results
 
-| Benchmark | | | |
-|----------------------------------|------------------------|-----------------------------|---------------------------------|
-| | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
-| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) | 12.71 (48% reduction) |
+| Benchmark | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) | 12.71 (48% reduction) |
+
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100 optimized checkpoint for INT4
 
 <details>
 <summary> Reproduce Peak Memory Usage Results </summary>
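
Aside: peak-memory numbers like the ones in this hunk are commonly collected from `torch.cuda`'s allocator statistics around a generation call. A minimal sketch under that assumption (the repo's exact script lives behind the details block above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading the quantized checkpoint assumes torchao is installed.
model_id = "pytorch/gemma-3-12b-it-AWQ-INT4"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Reset the allocator's high-water mark before the measured region.
torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

# Report the peak allocation observed during generation, in GB.
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```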
 