shubhrapandit committed on
Commit 76efd07 · verified · 1 Parent(s): 1206da1

Update README.md

Files changed (1): README.md (+118 -1)
README.md CHANGED
@@ -140,18 +140,135 @@ oneshot(
 
  ## Evaluation
 
- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated using [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks. The evaluations were conducted using the following commands:
 
  <details>
  <summary>Evaluation Commands</summary>
+
+ ### Vision Tasks
+ - vqav2
+ - docvqa
+ - mathvista
+ - mmmu
+ - chartqa
+
+ ```
+ vllm serve neuralmagic/pixtral-12b-quantized.w8a8 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7
 
+ python -m eval.run eval_vllm \
+ --model_name neuralmagic/pixtral-12b-quantized.w8a8 \
+ --url http://0.0.0.0:8000 \
+ --output_dir ~/tmp \
+ --eval_name <vision_task_name>
  ```
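
Once the server above is running, it can be sanity-checked through vLLM's OpenAI-compatible endpoint before launching the full evaluation. The snippet below is a minimal sketch and is not part of the evaluation commands; the prompt and image URL are placeholders.

```
# Minimal sketch: query the vLLM server started above via its OpenAI-compatible API.
# The prompt and image URL are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/pixtral-12b-quantized.w8a8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```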
+
+ ### Text-based Tasks
+ #### MMLU
+
  ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks mmlu \
+ --num_fewshot 5 \
+ --batch_size auto \
+ --output_path output_dir
+ ```
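
The same MMLU run can also be driven from Python through the harness's `simple_evaluate` entry point. This is a sketch under the assumption that the installed lm-evaluation-harness exposes that API with the arguments shown; exact result keys may differ by version.

```
# Sketch: running the MMLU evaluation programmatically with lm-evaluation-harness.
# Mirrors the CLI command above; assumes a single-GPU vLLM backend.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/pixtral-12b-quantized.w8a8,"
        "dtype=auto,add_bos_token=True,max_model_len=4096,"
        "tensor_parallel_size=1,gpu_memory_utilization=0.8,"
        "enable_chunked_prefill=True,trust_remote_code=True"
    ),
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["mmlu"])  # aggregate MMLU metrics (key name may vary by version)
```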
+
+ #### HumanEval
+
+ ##### Generation
+ ```
+ python3 codegen/generate.py \
+ --model neuralmagic/pixtral-12b-quantized.w8a8 \
+ --bs 16 \
+ --temperature 0.2 \
+ --n_samples 50 \
+ --root "." \
+ --dataset humaneval
+ ```
+ ##### Sanitization
+ ```
+ python3 evalplus/sanitize.py \
+ humaneval/neuralmagic/pixtral-12b-quantized.w8a8_vllm_temp_0.2
+ ```
+ ##### Evaluation
+ ```
+ evalplus.evaluate \
+ --dataset humaneval \
+ --samples humaneval/neuralmagic/pixtral-12b-quantized.w8a8_vllm_temp_0.2-sanitized
+ ```
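
For context, the pass@1 value reported in the table below is the standard unbiased pass@k estimator computed from the 50 samples generated per problem. A small sketch of that calculation follows; the per-problem pass counts are illustrative, not actual results.

```
# Sketch of the unbiased pass@k estimator used for HumanEval-style scoring.
# n = samples generated per problem, c = samples that pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples (out of n) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts for three problems with n=50 samples each (not real results).
per_problem_passes = [34, 50, 12]
score = sum(pass_at_k(50, c, 1) for c in per_problem_passes) / len(per_problem_passes)
print(f"pass@1 = {score:.4f}")
```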
  </details>
 
  ### Accuracy
 
+ <table border="1">
+   <thead>
+     <tr>
+       <th>Category</th>
+       <th>Metric</th>
+       <th>mgoin/pixtral-12b</th>
+       <th>neuralmagic/pixtral-12b-quantized.w8a8</th>
+       <th>Recovery (%)</th>
+     </tr>
+   </thead>
+   <tbody>
+     <tr>
+       <td rowspan="6"><b>Vision</b></td>
+       <td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
+       <td>48.00</td>
+       <td>46.22</td>
+       <td>96.29%</td>
+     </tr>
+     <tr>
+       <td>VQAv2 (val)<br><i>vqa_match</i></td>
+       <td>78.71</td>
+       <td>78.00</td>
+       <td>99.10%</td>
+     </tr>
+     <tr>
+       <td>DocVQA (val)<br><i>anls</i></td>
+       <td>89.47</td>
+       <td>89.35</td>
+       <td>99.87%</td>
+     </tr>
+     <tr>
+       <td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
+       <td>81.68</td>
+       <td>81.60</td>
+       <td>99.90%</td>
+     </tr>
+     <tr>
+       <td>Mathvista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
+       <td>56.50</td>
+       <td>57.30</td>
+       <td>101.42%</td>
+     </tr>
+     <tr>
+       <td><b>Average Score</b></td>
+       <td><b>70.07</b></td>
+       <td><b>70.09</b></td>
+       <td><b>100.03%</b></td>
+     </tr>
+     <tr>
+       <td rowspan="2"><b>Text</b></td>
+       <td>HumanEval<br><i>pass@1</i></td>
+       <td>68.40</td>
+       <td>66.39</td>
+       <td>97.06%</td>
+     </tr>
+     <tr>
+       <td>MMLU (5-shot)</td>
+       <td>71.40</td>
+       <td>70.50</td>
+       <td>98.74%</td>
+     </tr>
+   </tbody>
+ </table>
+
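
Recovery (%) in the table above is the quantized model's score expressed as a percentage of the baseline score; the reported values are consistent with that definition, for example:

```
# Recovery (%) = quantized score / baseline score * 100, per row of the table above.
def recovery(baseline: float, quantized: float) -> float:
    return quantized / baseline * 100

print(f"{recovery(48.00, 46.22):.2f}%")  # MMMU: 96.29%
print(f"{recovery(78.71, 78.00):.2f}%")  # VQAv2: 99.10%
print(f"{recovery(68.40, 66.39):.2f}%")  # HumanEval pass@1: 97.06%
```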
  ## Inference Performance