nm-research committed
Commit 9b4266a · verified · 1 Parent(s): 7a7c635

Update README.md

Files changed (1):
  1. README.md +111 -30
README.md CHANGED
@@ -178,7 +178,7 @@ OpenLLM Leaderboard V2:
  ```
  lm_eval \
    --model vllm \
- --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
    --tasks leaderboard \
    --write_out \
    --batch_size auto \
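
The hunk above corrects the checkpoint name in the OpenLLM Leaderboard V2 command: the card describes the w8a8 model, so `pretrained` should point at the w8a8 repo rather than w4a16. For reference, roughly the same run can be expressed through lm-evaluation-harness's Python entry point. This is a minimal sketch, not part of the commit; it assumes `lm_eval >= 0.4` installed with the vLLM extra and the bundled `leaderboard` task group.

```python
# Sketch only (not from this commit): the CLI invocation above,
# re-expressed via lm-evaluation-harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/granite-3.1-2b-instruct-quantized.w8a8,"
        "dtype=auto,add_bos_token=True,max_model_len=4096,"
        "tensor_parallel_size=1,gpu_memory_utilization=0.8,"
        "enable_chunked_prefill=True,trust_remote_code=True"
    ),
    tasks=["leaderboard"],
    batch_size="auto",
)
print(results["results"])  # per-task scores as aggregated by the harness
```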
@@ -212,35 +212,116 @@ evalplus.evaluate \
 
  ### Accuracy
 
- #### OpenLLM Leaderboard V1 evaluation scores
-
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 55.63 | 55.12 |
- | GSM8K (Strict-Match, 5-shot) | 60.96 | 60.58 |
- | HellaSwag (Acc-Norm, 10-shot) | 75.21 | 74.60 |
- | MMLU (Acc, 5-shot) | 54.38 | 54.12 |
- | TruthfulQA (MC2, 0-shot) | 55.93 | 54.87 |
- | Winogrande (Acc, 5-shot) | 69.67 | 70.80 |
- | **Average Score** | **61.98** | **61.68** |
- | **Recovery** | **100.00** | **99.51** |
-
- #### OpenLLM Leaderboard V2 evaluation scores
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | IFEval (Inst Level Strict Acc, 0-shot)| 67.99 | 67.03 |
- | BBH (Acc-Norm, 3-shot) | 44.11 | 43.53 |
- | Math-Hard (Exact-Match, 4-shot) | 8.66 | 8.04 |
- | GPQA (Acc-Norm, 0-shot) | 28.30 | 27.60 |
- | MUSR (Acc-Norm, 0-shot) | 35.12 | 34.58 |
- | MMLU-Pro (Acc, 5-shot) | 26.87 | 26.89 |
- | **Average Score** | **35.17** | **34.61** |
- | **Recovery** | **100.00** | **98.40** |
-
- #### HumanEval pass@1 scores
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval Pass@1 | 53.40 | 54.9 |
+ <table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>ibm-granite/granite-3.1-2b-instruct</th>
+ <th>neuralmagic/granite-3.1-2b-instruct-quantized.w8a8</th>
+ <th>Recovery (%)</th>
+ </tr>
+ </thead>
+ <tbody>
+ <!-- OpenLLM Leaderboard V1 -->
+ <tr>
+ <td rowspan="7"><b>OpenLLM V1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>55.63</td>
+ <td>55.12</td>
+ <td>99.08</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>60.96</td>
+ <td>60.58</td>
+ <td>99.38</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>75.21</td>
+ <td>74.60</td>
+ <td>99.19</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>54.38</td>
+ <td>54.12</td>
+ <td>99.52</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>55.93</td>
+ <td>54.87</td>
+ <td>98.10</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>69.67</td>
+ <td>70.80</td>
+ <td>101.62</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>61.98</b></td>
+ <td><b>61.68</b></td>
+ <td><b>99.51</b></td>
+ </tr>
+ <!-- OpenLLM Leaderboard V2 -->
+ <tr>
+ <td rowspan="7"><b>OpenLLM V2</b></td>
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+ <td>67.99</td>
+ <td>67.03</td>
+ <td>98.59</td>
+ </tr>
+ <tr>
+ <td>BBH (Acc-Norm, 3-shot)</td>
+ <td>44.11</td>
+ <td>43.53</td>
+ <td>98.69</td>
+ </tr>
+ <tr>
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
+ <td>8.66</td>
+ <td>8.04</td>
+ <td>92.89</td>
+ </tr>
+ <tr>
+ <td>GPQA (Acc-Norm, 0-shot)</td>
+ <td>28.30</td>
+ <td>27.60</td>
+ <td>97.52</td>
+ </tr>
+ <tr>
+ <td>MUSR (Acc-Norm, 0-shot)</td>
+ <td>35.12</td>
+ <td>34.58</td>
+ <td>98.46</td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro (Acc, 5-shot)</td>
+ <td>26.87</td>
+ <td>26.89</td>
+ <td>100.07</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>35.17</b></td>
+ <td><b>34.61</b></td>
+ <td><b>98.40</b></td>
+ </tr>
+ <!-- HumanEval -->
+ <tr>
+ <td rowspan="1"><b>HumanEval</b></td>
+ <td>HumanEval Pass@1</td>
+ <td>53.40</td>
+ <td>54.90</td>
+ <td><b>102.81</b></td>
+ </tr>
+ </tbody>
+ </table>
+
 
 
  ## Inference Performance
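
The rewritten table folds all three benchmark suites into one HTML table and adds a per-metric Recovery (%) column. Recovery appears to be simply the quantized score divided by the baseline score, times 100; a quick sketch (hypothetical helper, not part of the commit) reproduces the OpenLLM V1 rows:

```python
# Sketch: reproduce the Recovery (%) column, assuming
# recovery = quantized / baseline * 100, shown to two decimals.
rows = {
    # metric: (baseline ibm-granite, quantized w8a8)
    "ARC-Challenge (25-shot)": (55.63, 55.12),
    "GSM8K (5-shot)": (60.96, 60.58),
    "HellaSwag (10-shot)": (75.21, 74.60),
    "MMLU (5-shot)": (54.38, 54.12),
    "TruthfulQA (0-shot)": (55.93, 54.87),
    "Winogrande (5-shot)": (69.67, 70.80),
}

for metric, (base, quant) in rows.items():
    print(f"{metric}: {quant / base * 100:.2f}%")
# ARC-Challenge (25-shot): 99.08% ... Winogrande (5-shot): 101.62%
```

A few V2 rows (e.g. Math-Hard) differ from this computation in the second decimal, presumably because the published percentages were derived from unrounded scores.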