Lin-K76 committed (verified)
Commit 02ba8ec · Parent: 97ae4de

Update README.md

Files changed (1):
  1. README.md +47 -26
README.md CHANGED
@@ -33,7 +33,7 @@ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
-It achieves an average score of 73.81 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 74.17.
+It achieves an average score of 73.56 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.79.
 
 ### Model Optimizations
 
@@ -117,11 +117,11 @@ model_stub = "meta-llama/Meta-Llama-3.1-8B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16
+    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
-    model_stub, torch_dtype=torch.float16, device_map=device_map
+    model_stub, torch_dtype="auto", device_map=device_map
 )
 
 output_dir = f"./{model_name}-FP8-dynamic"
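Both changed lines above belong to the model-creation script in the card, which follows the llm-compressor one-shot flow (`calculate_offload_device_map`, `SparseAutoModelForCausalLM`, `oneshot`). Below is a minimal sketch of how those pieces fit together with the new `torch_dtype="auto"` setting; the imports and the FP8_DYNAMIC recipe are reconstructed assumptions, not the card's full script.

```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
from llmcompressor.modifiers.quantization import QuantizationModifier

model_stub = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_name = model_stub.split("/")[-1]

# torch_dtype="auto" keeps the dtype stored in the checkpoint config (bfloat16 for
# Llama 3.1) instead of forcing float16, which is what this commit changes.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Assumed FP8 dynamic recipe: FP8 weights, dynamic per-token activation scales,
# lm_head left unquantized. No calibration data is needed for this scheme.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

output_dir = f"./{model_name}-FP8-dynamic"
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=output_dir,
    save_compressed=True,  # argument list abbreviated relative to the card's script
)
```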
@@ -139,7 +139,7 @@ oneshot(
 
 The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
+This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
 ### Accuracy
 
@@ -158,71 +158,81 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
  <tr>
   <td>MMLU (5-shot)
   </td>
-  <td>67.94
+  <td>67.95
   </td>
-  <td>68.09
+  <td>68.02
   </td>
-  <td>100.2%
+  <td>100.1%
   </td>
  </tr>
+ <tr>
+  <td>MMLU-cot (0-shot)
+  </td>
+  <td>71.24
+  </td>
+  <td>71.64
+  </td>
+  <td>100.5%
+  </td>
+ </tr>
  <tr>
   <td>ARC Challenge (0-shot)
   </td>
-  <td>83.11
+  <td>82.00
   </td>
-  <td>82.34
+  <td>81.23
   </td>
-  <td>99.07%
+  <td>99.06%
   </td>
  </tr>
  <tr>
-  <td>GSM-8K (CoT, 8-shot, strict-match)
+  <td>GSM-8K-cot (8-shot, strict-match)
   </td>
-  <td>82.03
+  <td>81.96
   </td>
-  <td>82.34
+  <td>82.03
   </td>
-  <td>100.3%
+  <td>100.0%
   </td>
  </tr>
  <tr>
   <td>Hellaswag (10-shot)
   </td>
-  <td>80.01
+  <td>80.46
   </td>
-  <td>79.68
+  <td>80.04
   </td>
-  <td>99.59%
+  <td>99.48%
   </td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)
   </td>
-  <td>77.90
+  <td>78.45
   </td>
-  <td>77.03
+  <td>77.66
   </td>
-  <td>98.88%
+  <td>98.99%
   </td>
  </tr>
  <tr>
   <td>TruthfulQA (0-shot, mc2)
   </td>
-  <td>54.04
+  <td>54.5
   </td>
-  <td>53.37
+  <td>54.28
   </td>
-  <td>98.76%
+  <td>99.60%
   </td>
  </tr>
  <tr>
   <td><strong>Average</strong>
   </td>
-  <td><strong>74.17</strong>
+  <td><strong>73.79</strong>
   </td>
-  <td><strong>73.81</strong>
+  <td><strong>73.56</strong>
   </td>
-  <td><strong>99.48%</strong>
+  <td><strong>99.70%</strong>
   </td>
  </tr>
 </table>
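The third column of the table is recovery, i.e. the quantized score divided by the unquantized score. A small sanity check on the updated numbers (values copied from the table above; the card appears to compute recovery before rounding the per-task scores, so the last digit can differ slightly):

```python
# Recovery = 100 * (FP8-dynamic score / unquantized score).
# Values copied from the updated accuracy table above.
scores = {
    #                       (unquantized, FP8-dynamic)
    "MMLU (5-shot)":        (67.95, 68.02),
    "Winogrande (5-shot)":  (78.45, 77.66),
    "Average":              (73.79, 73.56),
}

for task, (baseline, fp8) in scores.items():
    recovery = 100 * fp8 / baseline
    print(f"{task}: {recovery:.1f}% recovery")
# MMLU (5-shot): 100.1% recovery
# Winogrande (5-shot): 99.0% recovery
# Average: 99.7% recovery
```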
@@ -241,6 +251,17 @@ lm_eval \
   --batch_size auto
 ```
 
+#### MMLU-cot
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
 #### ARC-Challenge
 ```
 lm_eval \
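All of the evaluation commands run the quantized checkpoint through vLLM. For completeness, here is a minimal inference sketch against the same model id using the vLLM Python API; this is an assumed quick-start rather than a snippet from the card, and the prompt and sampling settings are placeholders.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic"

# The FP8 scheme is read from the checkpoint's quantization config, so the model
# loads like any other vLLM model (assumption based on the eval setup above).
llm = LLM(model=model_id, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain FP8 dynamic quantization in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate(prompt, SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128))
print(outputs[0].outputs[0].text)
```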