Create README.md
Browse files
README.md
ADDED
@@ -0,0 +1,83 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
language:
|
4 |
+
- ru
|
5 |
+
- en
|
6 |
+
base_model:
|
7 |
+
- Qwen/Qwen2.5-7B
|
8 |
+
---
|
9 |
+
|
10 |
+
This is an instruction following model (based on Qwen2.5-7B base) optimized for Russian language.
|
11 |
+
|
12 |
+
The model was trained in two phases: SFT (training data composition is similar to kolibri-mistral-0427) and RLHF.
|
13 |
+
|
14 |
+
Current RLHF pipeline leads to degradation on IFEval, but the overall 'vibe' of the model improves significantly.
|
15 |
+
I am currently investigating the causes of this degradation and exploring methods to further enhance instruction-following capabilities.
|
16 |
+
|
17 |
+
The model uses ChatML template. Adding a system prompt will likely improve the model's performance on your tasks (experiment with it).
|
18 |
+
|
19 |
+
## Instruction following evals
|
20 |
+
The model was tested using the following benchmarks:
|
21 |
+
- [ruIFEval](https://github.com/NLP-Core-Team/ruIFEval)
|
22 |
+
- [ifeval](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
|
23 |
+
|
24 |
+
| Eval name |Strict Value| Loose Value
|
25 |
+
|---------------------------------|----|----|
|
26 |
+
|Avg. |*43.00*|*49.17*|
|
27 |
+
|ifeval-prompt-level |38.63|46.21|
|
28 |
+
|ifeval-instruction-level |51.20|57.5|
|
29 |
+
|ru-ifeval-prompt-level |35.30|40.48|
|
30 |
+
|ru-ifeval-instruction-level |46.88|52.52|
|
31 |
+
|
32 |
+
## Russian LLM Arena (proxy eval via JINA)
|
33 |
+
|
34 |
+
The table below approximates [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb)
|
35 |
+
scores using the [JINA Judge model](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-041024).
|
36 |
+
Take it with a grain of salt.
|
37 |
+
|
38 |
+
| Model Name | Score | 95% CI | Avg Tokens |
|
39 |
+
|--------------------------------------------------|--------|---------------------|------------|
|
40 |
+
| gpt-4-1106-preview | 82.8 | (-2.8, 2.6) | 541 |
|
41 |
+
| gpt-4o-mini | 75.3 | (-2.2, 2.8) | 448 |
|
42 |
+
| qwen-2.5-72b-it | 73.1 | (-3.0, 3.1) | 557 |
|
43 |
+
| gemma-2-9b-it-sppo-iter3 | 70.6 | (-3.7, 3.0) | 509 |
|
44 |
+
| gemma-2-27b-it | 68.7 | (-2.9, 3.8) | 472 |
|
45 |
+
| t-lite-instruct-0.1 | 67.5 | (-4.2, 2.7) | 810 |
|
46 |
+
| gemma-2-9b-it | 67.0 | (-3.0, 3.8) | 459 |
|
47 |
+
| suzume-llama-3-8B-multilingual-orpo-borda-half | 62.4 | (-3.0, 3.3) | 682 |
|
48 |
+
| glm-4-9b-chat | 61.5 | (-3.9, 3.3) | 568 |
|
49 |
+
| phi-3-medium-4k-instruct | 60.4 | (-3.8, 3.6) | 566 |
|
50 |
+
| sfr-iterative-dpo-llama-3-8b-r | 57.2 | (-3.8, 4.0) | 516 |
|
51 |
+
| **kolibri-qwen2.5-7b-060225-rlhf-1** | 55.4 | (-3.1, 4.4) | 383 |
|
52 |
+
| c4ai-command-r-v01 | 55.0 | (-3.7, 4.4) | 529 |
|
53 |
+
| suzume-llama-3-8b-multilingual | 51.9 | (-3.1, 3.4) | 641 |
|
54 |
+
| mistral-nemo-instruct-2407 | 51.9 | (-3.0, 3.0) | 403 |
|
55 |
+
| yandex_gpt_pro | 50.3 | (-3.5, 3.0) | 345 |
|
56 |
+
| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
|
57 |
+
| hermes-2-theta-llama-3-8b | 49.3 | (-3.2, 3.7) | 485 |
|
58 |
+
| starling-lm-7b-beta | 48.3 | (-3.7, 3.9) | 629 |
|
59 |
+
| llama-3-8b-saiga-suzume-ties | 47.9 | (-3.9, 5.0) | 763 |
|
60 |
+
| llama-3-smaug-8b | 47.6 | (-4.3, 2.9) | 524 |
|
61 |
+
| **vikhr-it-5.4-fp16-orpo-v2** | 46.8 | (-2.4, 2.2) | 379 |
|
62 |
+
| aya-23-8b | 46.1 | (-3.3, 3.6) | 554 |
|
63 |
+
| **saiga_llama3_8b_v6** | 44.8 | (-2.9, 3.2) | 471 |
|
64 |
+
| qwen2-7b-instruct | 43.6 | (-3.5, 3.0) | 340 |
|
65 |
+
| vikhr-it-5.2-fp16-cp | 43.6 | (-3.6, 3.3) | 543 |
|
66 |
+
| openchat-3.5-0106 | 42.8 | (-2.5, 3.8) | 492 |
|
67 |
+
| **kolibri-mistral-0427-upd** | 42.3 | (-4.1, 4.0) | 551 |
|
68 |
+
| paralex-llama-3-8b-sft | 41.8 | (-3.7, 3.9) | 688 |
|
69 |
+
| llama-3-instruct-8b-sppo-iter3 | 41.7 | (-4.0, 3.6) | 502 |
|
70 |
+
| gpt-3.5-turbo-1106 | 41.5 | (-2.7, 2.5) | 191 |
|
71 |
+
| mistral-7b-instruct-v0.3 | 41.1 | (-4.1, 2.9) | 469 |
|
72 |
+
| gigachat_pro | 40.9 | (-3.2, 2.8) | 294 |
|
73 |
+
| openchat-3.6-8b-20240522 | 39.1 | (-2.9, 3.8) | 428 |
|
74 |
+
| vikhr-it-5.3-fp16-32k | 38.8 | (-3.2, 3.3) | 519 |
|
75 |
+
| hermes-2-pro-llama-3-8b | 38.4 | (-3.9, 3.9) | 463 |
|
76 |
+
| kolibri-vikhr-mistral-0427 | 34.5 | (-2.9, 3.1) | 489 |
|
77 |
+
| vikhr-it-5.3-fp16 | 33.5 | (-3.0, 3.8) | 523 |
|
78 |
+
| llama-3-instruct-8b-simpo | 32.7 | (-3.2, 2.7) | 417 |
|
79 |
+
| meta-llama-3-8b-instruct | 32.1 | (-3.6, 4.2) | 450 |
|
80 |
+
| neural-chat-7b-v3-3 | 25.9 | (-3.1, 3.2) | 927 |
|
81 |
+
| gigachat_lite | 25.4 | (-3.5, 2.7) | 276 |
|
82 |
+
| snorkel-mistral-pairrm-dpo | 10.3 | (-2.3, 2.6) | 773 |
|
83 |
+
| storm-7b | 3.7 | (-1.9, 1.7) | 419 |
|