alexmarques committed
Commit c639d39 · verified · 1 Parent(s): 7514275

Update README.md

Files changed (1): README.md (+120 -49)
README.md CHANGED
@@ -32,7 +32,7 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
  - **Model Developers:** Neural Magic

  Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
- It achieves an average score of 78.54 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 78.67.
+ It achieves scores within 1% of those of the unquantized model for MMLU, ARC-Challenge, GSM-8K, Hellaswag, and Winogrande, and within 3.2% for TruthfulQA.

  ### Model Optimizations

@@ -131,14 +131,11 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w4a16")

  ## Evaluation

- The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
- ```
- lm_eval \
- --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
- --tasks openllm \
- --batch_size auto
- ```
+ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA.
+ Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+ This version of lm-evaluation-harness includes versions of MMLU, ARC-Challenge, and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
+
+ **Note:** Results have been updated after Meta modified the chat template.

  ### Accuracy

@@ -148,96 +145,170 @@ lm_eval \
  <td><strong>Benchmark</strong></td>
  <td><strong>Meta-Llama-3.1-70B-Instruct</strong></td>
- <td><strong>hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4</strong></td>
  <td><strong>Meta-Llama-3.1-70B-Instruct-quantized.w4a16 (this model)</strong></td>
- <td><strong>Recovery (this model)</strong></td>
+ <td><strong>Recovery</strong></td>
  </tr>
  <tr>
  <td>MMLU (5-shot)</td>
- <td>82.21</td>
- <td>81.42</td>
- <td>81.84</td>
- <td>99.55%</td>
+ <td>83.94</td>
+ <td>83.55</td>
+ <td>99.5%</td>
  </tr>
+ <tr>
+ <td>MMLU (CoT, 0-shot)</td>
+ <td>86.23</td>
+ <td>85.57</td>
+ <td>99.2%</td>
+ </tr>
  <tr>
- <td>ARC Challenge (25-shot)</td>
- <td>70.65</td>
- <td>70.13</td>
- <td>70.05</td>
- <td>99.15%</td>
+ <td>ARC Challenge (0-shot)</td>
+ <td>93.34</td>
+ <td>92.83</td>
+ <td>99.5%</td>
  </tr>
  <tr>
- <td>GSM-8K (5-shot, strict-match)</td>
- <td>87.95</td>
- <td>90.59</td>
- <td>89.84</td>
- <td>102.15%</td>
+ <td>GSM-8K (CoT, 8-shot, strict-match)</td>
+ <td>95.38</td>
+ <td>94.39</td>
+ <td>99.0%</td>
  </tr>
  <tr>
  <td>Hellaswag (10-shot)</td>
- <td>86.33</td>
- <td>86.23</td>
- <td>86.24</td>
- <td>99.90%</td>
+ <td>86.66</td>
+ <td>86.06</td>
+ <td>99.3%</td>
  </tr>
  <tr>
  <td>Winogrande (5-shot)</td>
- <td>85.00</td>
- <td>84.53</td>
- <td>84.53</td>
- <td>99.45%</td>
+ <td>85.32</td>
+ <td>85.16</td>
+ <td>99.8%</td>
  </tr>
  <tr>
- <td>TruthfulQA (0-shot)</td>
- <td>59.90</td>
- <td>59.62</td>
+ <td>TruthfulQA (0-shot, mc2)</td>
+ <td>60.65</td>
  <td>58.74</td>
- <td>98.06%</td>
+ <td>96.8%</td>
  </tr>
  <tr>
  <td><strong>Average</strong></td>
- <td><strong>78.67</strong></td>
- <td><strong>78.75</strong></td>
- <td><strong>78.54</strong></td>
- <td><strong>99.83%</strong></td>
+ <td><strong>84.50</strong></td>
+ <td><strong>83.76</strong></td>
+ <td><strong>99.1%</strong></td>
  </tr>
  </table>
+
+ ### Reproduction
+
+ The results were obtained using the following commands:
+
+ #### MMLU
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+ --tasks mmlu_llama_3.1_instruct \
+ --fewshot_as_multiturn \
+ --apply_chat_template \
+ --num_fewshot 5 \
+ --batch_size auto
+ ```
+
+ #### MMLU-CoT
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+ --tasks mmlu_cot_0shot_llama_3.1_instruct \
+ --apply_chat_template \
+ --num_fewshot 0 \
+ --batch_size auto
+ ```
+
+ #### ARC-Challenge
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+ --tasks arc_challenge_llama_3.1_instruct \
+ --apply_chat_template \
+ --num_fewshot 0 \
+ --batch_size auto
+ ```
+
+ #### GSM-8K
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+ --tasks gsm8k_cot_llama_3.1_instruct \
+ --fewshot_as_multiturn \
+ --apply_chat_template \
+ --num_fewshot 8 \
+ --batch_size auto
+ ```
+
+ #### Hellaswag
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks hellaswag \
+ --num_fewshot 10 \
+ --batch_size auto
+ ```
+
+ #### Winogrande
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks winogrande \
+ --num_fewshot 5 \
+ --batch_size auto
+ ```
+
+ #### TruthfulQA
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks truthfulqa \
+ --num_fewshot 0 \
+ --batch_size auto
+ ```
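The `--apply_chat_template` flag in the commands above asks the harness to wrap every prompt in the model's chat template, which is exactly the part of the evaluation that Meta's template change affected. Below is a minimal sketch of what that flag does, using the `transformers` tokenizer; the message content is a made-up placeholder, not an actual benchmark item:

```
from transformers import AutoTokenizer

# The tokenizer ships with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(
    "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16"
)

# Placeholder question; lm-evaluation-harness builds the real MMLU/ARC/GSM-8K prompts.
messages = [{"role": "user", "content": "Answer with A, B, C, or D: ..."}]

# Wraps the message in Llama 3.1's special-token chat format, mirroring what
# --apply_chat_template makes the harness do for each request.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```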
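Judging by the table, the Recovery column is the quantized model's score expressed as a percentage of the unquantized model's score. A quick spot-check against two rows of the accuracy table; small ±0.1% mismatches (e.g., TruthfulQA) come from the reported scores themselves being rounded:

```
def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline (unquantized) score."""
    return 100.0 * quantized / baseline

# MMLU (5-shot): 83.55 vs. 83.94 -> 99.5%
print(f"{recovery(83.55, 83.94):.1f}%")
# Average: 83.76 vs. 84.50 -> 99.1%
print(f"{recovery(83.76, 84.50):.1f}%")
```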
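All of the commands above evaluate through the vLLM engine, so the same engine can serve the checkpoint directly. Here is a minimal offline-generation sketch; `tensor_parallel_size=2` and the sampling settings are assumptions for illustration, not values taken from this model card:

```
from vllm import LLM, SamplingParams

# Load the INT4 weight-quantized checkpoint. Even at w4a16, a 70B model can
# exceed a single GPU's memory; 2-way tensor parallelism is an assumption here.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",
    tensor_parallel_size=2,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain INT4 weight-only quantization briefly."], params)
print(outputs[0].outputs[0].text)
```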