mihaimasala committed
Commit 785a5ad · verified · 1 Parent(s): 66aced8

Update README.md

Files changed (1)
  1. README.md +566 -8
README.md CHANGED
@@ -4,6 +4,480 @@ language:
  - ro
  base_model:
  - OpenLLM-Ro/RoLlama2-7b-Base

  ---

@@ -73,16 +547,100 @@ outputs = model.generate(input_ids=inputs, max_new_tokens=128)
  print(tokenizer.decode(outputs[0]))
  ```
 
- ## Benchmarks
-
- | Model | Average | ARC | MMLU | Winogrande | HellaSwag | GSM8k | TruthfulQA |
- |--------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
- | Llama-2-7b-chat | 36.84 | 37.03 | 33.81 | 55.87 | 45.36 | 4.90 | 44.09 |
- | *RoLlama2-7b-Instruct* | ***45.71*** | ***43.66*** | ***39.70*** | ***70.34*** | *57.36* | ***18.78*** | *44.44* |
- | RoLlama2-7b-Chat | 43.82 | 41.92 | 37.29 | 66.68 | **57.91** | 13.47 | **45.65** |
-
 
  ## Romanian MT-Bench
 
 
4
  - ro
5
  base_model:
6
  - OpenLLM-Ro/RoLlama2-7b-Base
7
+ model-index:
8
+ - name: OpenLLM-Ro/RoLlama2-7b-Instruct
9
+ results:
10
+ - task:
11
+ type: text-generation
12
+ dataset:
13
+ name: RoMT-Bench
14
+ type: RoMT-Bench
15
+ metrics:
16
+ - name: Score
17
+ type: Score
18
+ value: 3.86
19
+ - task:
20
+ type: text-generation
21
+ dataset:
22
+ name: RoCulturaBench
23
+ type: RoCulturaBench
24
+ metrics:
25
+ - name: Score
26
+ type: Score
27
+ value: 3.77
28
+ - task:
29
+ type: text-generation
30
+ dataset:
31
+ name: Romanian_Academic_Benchmarks
32
+ type: Romanian_Academic_Benchmarks
33
+ metrics:
34
+ - name: Average accuracy
35
+ type: accuracy
36
+ value: 45.71
37
+ - task:
38
+ type: text-generation
39
+ dataset:
40
+ name: OpenLLM-Ro/ro_arc_challenge
41
+ type: OpenLLM-Ro/ro_arc_challenge
42
+ metrics:
43
+ - name: Average accuracy
44
+ type: accuracy
45
+ value: 43.66
46
+ - task:
47
+ type: text-generation
48
+ dataset:
49
+ name: OpenLLM-Ro/ro_mmlu
50
+ type: OpenLLM-Ro/ro_mmlu
51
+ metrics:
52
+ - name: Average accuracy
53
+ type: accuracy
54
+ value: 39.70
55
+ - task:
56
+ type: text-generation
57
+ dataset:
58
+ name: OpenLLM-Ro/ro_winogrande
59
+ type: OpenLLM-Ro/ro_winogrande
60
+ metrics:
61
+ - name: Average accuracy
62
+ type: accuracy
63
+ value: 70.34
64
+ - task:
65
+ type: text-generation
66
+ dataset:
67
+ name: OpenLLM-Ro/ro_hellaswag
68
+ type: OpenLLM-Ro/ro_hellaswag
69
+ metrics:
70
+ - name: Average accuracy
71
+ type: accuracy
72
+ value: 57.36
73
+ - task:
74
+ type: text-generation
75
+ dataset:
76
+ name: OpenLLM-Ro/ro_gsm8k
77
+ type: OpenLLM-Ro/ro_gsm8k
78
+ metrics:
79
+ - name: Average accuracy
80
+ type: accuracy
81
+ value: 18.78
82
+ - task:
83
+ type: text-generation
84
+ dataset:
85
+ name: OpenLLM-Ro/ro_truthfulqa
86
+ type: OpenLLM-Ro/ro_truthfulqa
87
+ metrics:
88
+ - name: Average accuracy
89
+ type: accuracy
90
+ value: 44.44
91
+ - task:
92
+ type: text-generation
93
+ dataset:
94
+ name: LaRoSeDa_binary
95
+ type: LaRoSeDa_binary
96
+ metrics:
97
+ - name: Average macro-f1
98
+ type: macro-f1
99
+ value: 97.48
100
+ - task:
101
+ type: text-generation
102
+ dataset:
103
+ name: LaRoSeDa_multiclass
104
+ type: LaRoSeDa_multiclass
105
+ metrics:
106
+ - name: Average macro-f1
107
+ type: macro-f1
108
+ value: 65.26
109
+ - task:
110
+ type: text-generation
111
+ dataset:
112
+ name: LaRoSeDa_binary_finetuned
113
+ type: LaRoSeDa_binary_finetuned
114
+ metrics:
115
+ - name: Average macro-f1
116
+ type: macro-f1
117
+ value: 98.83
118
+ - task:
119
+ type: text-generation
120
+ dataset:
121
+ name: LaRoSeDa_multiclass_finetuned
122
+ type: LaRoSeDa_multiclass_finetuned
123
+ metrics:
124
+ - name: Average macro-f1
125
+ type: macro-f1
126
+ value: 87.28
127
+ - task:
128
+ type: text-generation
129
+ dataset:
130
+ name: WMT_EN-RO
131
+ type: WMT_EN-RO
132
+ metrics:
133
+ - name: Average bleu
134
+ type: bleu
135
+ value: 27.38
136
+ - task:
137
+ type: text-generation
138
+ dataset:
139
+ name: WMT_RO-EN
140
+ type: WMT_RO-EN
141
+ metrics:
142
+ - name: Average bleu
143
+ type: bleu
144
+ value: 10.32
145
+ - task:
146
+ type: text-generation
147
+ dataset:
148
+ name: WMT_EN-RO_finetuned
149
+ type: WMT_EN-RO_finetuned
150
+ metrics:
151
+ - name: Average bleu
152
+ type: bleu
153
+ value: 27.59
154
+ - task:
155
+ type: text-generation
156
+ dataset:
157
+ name: WMT_RO-EN_finetuned
158
+ type: WMT_RO-EN_finetuned
159
+ metrics:
160
+ - name: Average bleu
161
+ type: bleu
162
+ value: 40.13
163
+ - task:
164
+ type: text-generation
165
+ dataset:
166
+ name: XQuAD
167
+ type: XQuAD
168
+ metrics:
169
+ - name: Average exact_match
170
+ type: exact_match
171
+ value: 44.52
172
+ - task:
173
+ type: text-generation
174
+ dataset:
175
+ name: XQuAD
176
+ type: XQuAD
177
+ metrics:
178
+ - name: Average f1
179
+ type: f1
180
+ value: 64.75
181
+ - task:
182
+ type: text-generation
183
+ dataset:
184
+ name: XQuAD_finetuned
185
+ type: XQuAD_finetuned
186
+ metrics:
187
+ - name: Average exact_match
188
+ type: exact_match
189
+ value: 54.96
190
+ - task:
191
+ type: text-generation
192
+ dataset:
193
+ name: XQuAD_finetuned
194
+ type: XQuAD_finetuned
195
+ metrics:
196
+ - name: Average f1
197
+ type: f1
198
+ value: 70.20
199
+ - task:
200
+ type: text-generation
201
+ dataset:
202
+ name: STS
203
+ type: STS
204
+ metrics:
205
+ - name: Average spearman
206
+ type: spearman
207
+ value: 65.50
208
+ - task:
209
+ type: text-generation
210
+ dataset:
211
+ name: STS
212
+ type: STS
213
+ metrics:
214
+ - name: Average pearson
215
+ type: pearson
216
+ value: 67.79
217
+ - task:
218
+ type: text-generation
219
+ dataset:
220
+ name: STS_finetuned
221
+ type: STS_finetuned
222
+ metrics:
223
+ - name: Average spearman
224
+ type: spearman
225
+ value: 84.44
226
+ - task:
227
+ type: text-generation
228
+ dataset:
229
+ name: STS_finetuned
230
+ type: STS_finetuned
231
+ metrics:
232
+ - name: Average pearson
233
+ type: pearson
234
+ value: 84.76
235
+ - task:
236
+ type: text-generation
237
+ dataset:
238
+ name: RoMT-Bench
239
+ type: RoMT-Bench
240
+ metrics:
241
+ - name: First turn
242
+ type: Score
243
+ value: 4.67
244
+ - name: Second turn
245
+ type: Score
246
+ value: 3.04
247
+ - task:
248
+ type: text-generation
249
+ dataset:
250
+ name: OpenLLM-Ro/ro_arc_challenge
251
+ type: OpenLLM-Ro/ro_arc_challenge
252
+ metrics:
253
+ - name: 0-shot
254
+ type: accuracy
255
+ value: 41.73
256
+ - name: 1-shot
257
+ type: accuracy
258
+ value: 42.16
259
+ - name: 3-shot
260
+ type: accuracy
261
+ value: 43.53
262
+ - name: 5-shot
263
+ type: accuracy
264
+ value: 44.90
265
+ - name: 10-shot
266
+ type: accuracy
267
+ value: 44.99
268
+ - name: 25-shot
269
+ type: accuracy
270
+ value: 44.64
271
+ - task:
272
+ type: text-generation
273
+ dataset:
274
+ name: OpenLLM-Ro/ro_mmlu
275
+ type: OpenLLM-Ro/ro_mmlu
276
+ metrics:
277
+ - name: 0-shot
278
+ type: accuracy
279
+ value: 38.54
280
+ - name: 1-shot
281
+ type: accuracy
282
+ value: 39.36
283
+ - name: 3-shot
284
+ type: accuracy
285
+ value: 40.82
286
+ - name: 5-shot
287
+ type: accuracy
288
+ value: 40.07
289
+ - task:
290
+ type: text-generation
291
+ dataset:
292
+ name: OpenLLM-Ro/ro_winogrande
293
+ type: OpenLLM-Ro/ro_winogrande
294
+ metrics:
295
+ - name: 0-shot
296
+ type: accuracy
297
+ value: 72.61
298
+ - name: 1-shot
299
+ type: accuracy
300
+ value: 69.93
301
+ - name: 3-shot
302
+ type: accuracy
303
+ value: 70.40
304
+ - name: 5-shot
305
+ type: accuracy
306
+ value: 68.43
307
+ - task:
308
+ type: text-generation
309
+ dataset:
310
+ name: OpenLLM-Ro/ro_hellaswag
311
+ type: OpenLLM-Ro/ro_hellaswag
312
+ metrics:
313
+ - name: 0-shot
314
+ type: accuracy
315
+ value: 56.90
316
+ - name: 1-shot
317
+ type: accuracy
318
+ value: 57.07
319
+ - name: 3-shot
320
+ type: accuracy
321
+ value: 57.56
322
+ - name: 5-shot
323
+ type: accuracy
324
+ value: 57.35
325
+ - name: 10-shot
326
+ type: accuracy
327
+ value: 57.93
328
+ - task:
329
+ type: text-generation
330
+ dataset:
331
+ name: OpenLLM-Ro/ro_gsm8k
332
+ type: OpenLLM-Ro/ro_gsm8k
333
+ metrics:
334
+ - name: 0-shot
335
+ type: accuracy
336
+ value: 11.22
337
+ - name: 1-shot
338
+ type: accuracy
339
+ value: 21.38
340
+ - name: 3-shot
341
+ type: accuracy
342
+ value: 23.73
343
+ - task:
344
+ type: text-generation
345
+ dataset:
346
+ name: LaRoSeDa_binary
347
+ type: LaRoSeDa_binary
348
+ metrics:
349
+ - name: 0-shot
350
+ type: macro-f1
351
+ value: 97.67
352
+ - name: 1-shot
353
+ type: macro-f1
354
+ value: 96.77
355
+ - name: 3-shot
356
+ type: macro-f1
357
+ value: 97.60
358
+ - name: 5-shot
359
+ type: macro-f1
360
+ value: 97.87
361
+ - task:
362
+ type: text-generation
363
+ dataset:
364
+ name: LaRoSeDa_multiclass
365
+ type: LaRoSeDa_multiclass
366
+ metrics:
367
+ - name: 0-shot
368
+ type: macro-f1
369
+ value: 61.82
370
+ - name: 1-shot
371
+ type: macro-f1
372
+ value: 58.84
373
+ - name: 3-shot
374
+ type: macro-f1
375
+ value: 68.67
376
+ - name: 5-shot
377
+ type: macro-f1
378
+ value: 71.71
379
+ - task:
380
+ type: text-generation
381
+ dataset:
382
+ name: WMT_EN-RO
383
+ type: WMT_EN-RO
384
+ metrics:
385
+ - name: 0-shot
386
+ type: bleu
387
+ value: 19.71
388
+ - name: 1-shot
389
+ type: bleu
390
+ value: 29.62
391
+ - name: 3-shot
392
+ type: bleu
393
+ value: 30.11
394
+ - name: 5-shot
395
+ type: bleu
396
+ value: 30.10
397
+ - task:
398
+ type: text-generation
399
+ dataset:
400
+ name: WMT_RO-EN
401
+ type: WMT_RO-EN
402
+ metrics:
403
+ - name: 0-shot
404
+ type: bleu
405
+ value: 1.86
406
+ - name: 1-shot
407
+ type: bleu
408
+ value: 4.41
409
+ - name: 3-shot
410
+ type: bleu
411
+ value: 14.95
412
+ - name: 5-shot
413
+ type: bleu
414
+ value: 20.07
415
+ - task:
416
+ type: text-generation
417
+ dataset:
418
+ name: XQuAD_EM
419
+ type: XQuAD_EM
420
+ metrics:
421
+ - name: 0-shot
422
+ type: exact_match
423
+ value: 34.87
424
+ - name: 1-shot
425
+ type: exact_match
426
+ value: 44.96
427
+ - name: 3-shot
428
+ type: exact_match
429
+ value: 48.40
430
+ - name: 5-shot
431
+ type: exact_match
432
+ value: 49.83
433
+ - task:
434
+ type: text-generation
435
+ dataset:
436
+ name: XQuAD_F1
437
+ type: XQuAD_F1
438
+ metrics:
439
+ - name: 0-shot
440
+ type: f1
441
+ value: 58.07
442
+ - name: 1-shot
443
+ type: f1
444
+ value: 63.93
445
+ - name: 3-shot
446
+ type: f1
447
+ value: 67.89
448
+ - name: 5-shot
449
+ type: f1
450
+ value: 69.10
451
+ - task:
452
+ type: text-generation
453
+ dataset:
454
+ name: STS
455
+ type: STS
456
+ metrics:
457
+ - name: 0-shot
458
+ type: spearman
459
+ value: 61.14
460
+ - name: 1-shot
461
+ type: spearman
462
+ value: 66.91
463
+ - name: 3-shot
464
+ type: spearman
465
+ value: 68.46
466
+ - task:
467
+ type: text-generation
468
+ dataset:
469
+ name: STS
470
+ type: STS
471
+ metrics:
472
+ - name: 0-shot
473
+ type: pearson
474
+ value: 61.88
475
+ - name: 1-shot
476
+ type: pearson
477
+ value: 70.04
478
+ - name: 3-shot
479
+ type: pearson
480
+ value: 71.46
481
 
482
  ---
483
 
 
547
  print(tokenizer.decode(outputs[0]))
548
  ```
549
 
550
+ ## Academic Benchmarks
551
+
552
+ <table>
553
+ <tbody>
554
+ <tr>
555
+ <td><strong>Model</strong></td>
556
+ <td><strong><center>Average</center></strong></td>
557
+ <td><strong><center>ARC</center></strong></td>
558
+ <td><strong><center>MMLU</center></strong></td>
559
+ <td><strong><center>Winogrande</center></strong></td>
560
+ <td><strong><center>HellaSwag</center></strong></td>
561
+ <td><strong><center>GSM8k</center></strong></td>
562
+ <td><strong><center>TruthfulQA</center></strong></td>
563
+ </tr>
564
+ <tr>
565
+ <td>Llama-2-7b-chat</td><td><center>36.84</center></td><td><center>37.03</center></td><td><center>33.80</center></td><td><center>55.87</center></td><td><center>45.36</center></td><td><center>4.90</center></td><td><center>44.09</center></td>
566
+ </tr>
567
+ <tr>
568
+ <td><em>RoLlama2-7b-Instruct</em></td><td><center><em><strong>45.71</strong></em></center></td><td><center><em><strong>43.66</strong></em></center></td><td><center><em><strong>39.70</strong></em></center></td><td><center><em><strong>70.34</strong></em></center></td><td><center><em><strong>57.36</strong></em></center></td><td><center><em><strong>18.78</strong></em></center></td><td><center><em><strong>44.44</strong></em></center></td>
569
+ </tr>
570
+ </tbody>
571
+ </table>
572
+
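The Average column above appears to be the unweighted arithmetic mean of the six task scores (ARC, MMLU, Winogrande, HellaSwag, GSM8k, TruthfulQA); a quick sanity check with the values hard-coded from the table:

```python
# Values hard-coded from the Academic Benchmarks table above; the Average
# column appears to be the plain mean of the six task scores.
scores = {
    "Llama-2-7b-chat":      [37.03, 33.80, 55.87, 45.36, 4.90, 44.09],
    "RoLlama2-7b-Instruct": [43.66, 39.70, 70.34, 57.36, 18.78, 44.44],
}
for model, task_scores in scores.items():
    print(model, round(sum(task_scores) / len(task_scores), 2))
# -> Llama-2-7b-chat 36.84
# -> RoLlama2-7b-Instruct 45.71
```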
573
+ ## Downstream tasks
574
 
575
 
576
+ <table>
577
+ <tbody>
578
+ <tr>
579
+ <td></td>
580
+ <td colspan="4"><center><strong>LaRoSeDa</strong></center></td>
581
+ <td colspan="4"><center><strong>WMT</strong></center></td>
582
+ </tr>
583
+ <tr>
584
+ <td></td>
585
+ <td colspan="2"><center><strong>Few-shot</strong></center></td>
586
+ <td colspan="2"><center><strong>Finetuned</strong></center></td>
587
+ <td colspan="2"><center><strong>Few-shot</strong></center></td>
588
+ <td colspan="2"><center><strong>Finetuned</strong></center></td>
589
+ </tr>
590
+ <tr>
591
+ <td><strong>Model</strong></td>
592
+ <td><center><strong>Binary<br>(Macro F1)</strong></center></td>
593
+ <td><center><strong>Multiclass<br>(Macro F1)</strong></center></td>
594
+ <td><center><strong>Binary<br>(Macro F1)</strong></center></td>
595
+ <td><center><strong>Multiclass<br>(Macro F1)</strong></center></td>
596
+ <td><center><strong>EN-RO<br>(Bleu)</strong></center></td>
597
+ <td><center><strong>RO-EN<br>(Bleu)</strong></center></td>
598
+ <td><center><strong>EN-RO<br>(Bleu)</strong></center></td>
599
+ <td><center><strong>RO-EN<br>(Bleu)</strong></center></td>
600
+ </tr>
601
+ <tr>
602
+ <td>Llama-2-7b-chat</td><td><center>87.78</center></td><td><center>52.81</center></td><td><center>97.27</center></td><td><center>82.02</center></td><td><center>15.55</center></td><td><center><strong>28.53</strong></center></td><td><center>19.99</center></td><td><center>31.48</center></td>
603
+ </tr>
604
+ <tr>
605
+ <td><em>RoLlama2-7b-Instruct</em></td><td><center><em><strong>97.48</strong></em></center></td><td><center><em><strong>65.26</strong></em></center></td><td><center><em><strong>98.83</strong></em></center></td><td><center><em><strong>87.28</strong></em></center></td><td><center><em><strong>27.38</strong></em></center></td><td><center><em>10.32</em></center></td><td><center><em><strong>27.59</strong></em></center></td><td><center><em><strong>40.13</strong></em></center></td>
606
+ </tr>
607
+ </tbody>
608
+ </table>
609
+
610
+
611
+ <table>
612
+ <tbody>
613
+ <tr>
614
+ <td></td>
615
+ <td colspan="4"><center><strong>XQuAD</strong></center></td>
616
+ <td colspan="4"><center><strong>STS</strong></center></td>
617
+ </tr>
618
+ <tr>
619
+ <td></td>
620
+ <td colspan="2"><center><strong>Few-shot</strong></center></td>
621
+ <td colspan="2"><center><strong>Finetuned</strong></center></td>
622
+ <td colspan="2"><center><strong>Few-shot</strong></center></td>
623
+ <td colspan="2"><center><strong>Finetuned</strong></center></td>
624
+ </tr>
625
+ <tr>
626
+ <td><strong>Model</strong></td>
627
+ <td><center><strong>(EM)</strong></center></td>
628
+ <td><center><strong>(F1)</strong></center></td>
629
+ <td><center><strong>(EM)</strong></center></td>
630
+ <td><center><strong>(F1)</strong></center></td>
631
+ <td><center><strong>(Spearman)</strong></center></td>
632
+ <td><center><strong>(Pearson)</strong></center></td>
633
+ <td><center><strong>(Spearman)</strong></center></td>
634
+ <td><center><strong>(Pearson)</strong></center></td>
635
+ </tr>
636
+ <tr>
637
+ <td>Llama-2-7b-chat</td><td><center>32.35</center></td><td><center>54.00</center></td><td><center><strong>60.34</strong></center></td><td><center><strong>75.98</strong></center></td><td><center>32.56</center></td><td><center>31.99</center></td><td><center>74.08</center></td><td><center>72.64</center></td>
638
+ </tr>
639
+ <tr>
640
+ <td><em>RoLlama2-7b-Instruct</em></td><td><center><em><strong>44.52</strong></em></center></td><td><center><em><strong>64.75</strong></em></center></td><td><center><em>54.96</em></center></td><td><center><em>70.20</em></center></td><td><center><em><strong>65.50</strong></em></center></td><td><center><em><strong>67.79</strong></em></center></td><td><center><em><strong>84.44</strong></em></center></td><td><center><em><strong>84.76</strong></em></center></td>
641
+ </tr>
642
+ </tbody>
643
+ </table>
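The few-shot figures in the two tables above are aggregates; judging from the per-shot metrics recorded in the metadata block (0/1/3/5-shot, plus 10- and 25-shot runs for some academic tasks), each appears to be the mean over the shot settings that were run. A small check for the LaRoSeDa binary few-shot score, with the per-shot values hard-coded from the metadata:

```python
# Per-shot macro-F1 values for LaRoSeDa binary, hard-coded from the
# model-index metadata above (0-, 1-, 3- and 5-shot runs).
laroseda_binary_shots = [97.67, 96.77, 97.60, 97.87]
average = sum(laroseda_binary_shots) / len(laroseda_binary_shots)
# The table above reports 97.48; the mean agrees within rounding.
assert abs(average - 97.48) < 0.01
print(f"mean over shots: {average:.4f}")
```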
644
 
645
  ## Romanian MT-Bench
646