Update README.md
README.md CHANGED
@@ -150,7 +150,8 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-
+# AWQ only works for H100 INT4 so far
+base_config = Int4WeightOnlyConfig(group_size=128)
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
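The `base_config = Int4WeightOnlyConfig(group_size=128)` line added above groups weight values into blocks of 128, each sharing one quantization scale. For intuition only, here is a hedged pure-Python sketch of symmetric group-wise INT4 quantization; the function names are hypothetical and this is not torchao's actual implementation (which runs fused kernels on packed tensors):

```python
# Illustrative sketch of group-wise symmetric INT4 quantization, similar in
# spirit to Int4WeightOnlyConfig(group_size=128). Hypothetical helper names;
# not torchao's implementation.

def quantize_int4_groupwise(weights, group_size=128):
    """Quantize a flat list of floats to INT4, one scale per group."""
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric scheme: map the group's max magnitude to 7 (INT4 max).
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # Round to nearest and clamp to the signed INT4 range [-8, 7].
        qvals.append([max(-8, min(7, round(w / scale))) for w in group])
    return qvals, scales

def dequantize(qvals, scales):
    return [q * s for qgroup, s in zip(qvals, scales) for q in qgroup]

weights = [0.5, -1.2, 0.03, 0.7, -0.9, 1.1, -0.2, 0.4]
q, s = quantize_int4_groupwise(weights, group_size=4)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

A smaller `group_size` gives each scale fewer values to cover, so round-trip error shrinks at the cost of storing more scales; 128 is a common accuracy/size trade-off.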
@@ -216,10 +217,13 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run on mmlu for sanity check.
 
-| Benchmark | | |
-|----------------------------------|------------------------|--------------------------------|
-| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 |
-| philosophy | 79.10 | 75.56 |
+| Benchmark | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| philosophy | 79.10 | 75.56 | 76.85 |
+
+
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100 optimized checkpoint for INT4
 
 
 <details>
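A quick arithmetic check of the mmlu "philosophy" scores quoted above: plain INT4 costs about 3.5 points versus the bf16 baseline, while the AWQ checkpoint recovers roughly 1.3 of those points. A minimal sketch of the deltas (scores taken directly from the table):

```python
# Accuracy deltas for the mmlu "philosophy" subset, from the table above.
baseline = 79.10   # google/gemma-3-12b-it (bf16)
int4 = 75.56       # jerryzh168/gemma-3-12b-it-INT4
awq_int4 = 76.85   # pytorch/gemma-3-12b-it-AWQ-INT4

for name, score in [("INT4", int4), ("AWQ-INT4", awq_int4)]:
    drop = baseline - score
    print(f"{name}: -{drop:.2f} points ({drop / baseline:.1%} relative)")
```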
@@ -247,11 +251,12 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 ## Results
 
-| Benchmark | | |
-|----------------------------------|------------------------|--------------------------------|
-| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 |
-| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) |
+| Benchmark | | | |
+|----------------------------------|------------------------|--------------------------------|---------------------------------|
+| | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+| Peak Memory (GB) | 24.50 | 8.57 (65% reduction) | 12.71 (48% reduction) |
 
+Note: jerryzh168/gemma-3-12b-it-INT4 is the H100 optimized checkpoint for INT4
 
 <details>
 <summary> Reproduce Peak Memory Usage Results </summary>
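The reduction percentages in the peak-memory table follow directly from the measured numbers; a minimal sketch verifying them (values taken from the table):

```python
# Sanity-check the reduction percentages quoted in the peak-memory table.
baseline_gb = 24.50  # google/gemma-3-12b-it (bf16)
checkpoints = {
    "jerryzh168/gemma-3-12b-it-INT4": 8.57,
    "pytorch/gemma-3-12b-it-AWQ-INT4": 12.71,
}
for name, peak_gb in checkpoints.items():
    reduction = 1 - peak_gb / baseline_gb
    print(f"{name}: {peak_gb} GB ({reduction:.0%} reduction)")
```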