Update README.md
adding Arena Hard benchmark
README.md
CHANGED
@@ -82,6 +82,24 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q
 
 ![hexplot.png](hexplot_llama3-tenyxchat-70b.png)
 
+## Arena Hard
+
+Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 challenging user queries. It prompts GPT-4-1106-preview as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
+
+| Model name | Score | 95% CI |
+| --- | --- | --- |
+| gpt-4-0125-preview | 78.0 | (-1.8, 2.2) |
+| claude-3-opus-20240229 | 60.4 | (-2.6, 2.1) |
+| gpt-4-0314 | 50.0 | (0.0, 0.0) |
+| **tenyx/Llama3-TenyxChat-70B** | 49.0 | (-3.0, 2.4) |
+| meta-llama/Meta-Llama-3-70B-Instruct | 47.3 | (-1.7, 2.6) |
+| claude-3-sonnet-20240229 | 46.8 | (-2.7, 2.3) |
+| claude-3-haiku-20240307 | 41.5 | (-2.4, 2.5) |
+| gpt-4-0613 | 37.9 | (-2.1, 2.2) |
+| mistral-large-2402 | 37.7 | (-2.9, 2.8) |
+| Qwen1.5-72B-Chat | 36.1 | (-2.1, 2.4) |
+| command-r-plus | 33.1 | (-2.0, 1.9) |
+
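To make the judging protocol concrete, here is a minimal sketch of a pairwise LLM-as-judge loop of the kind Arena-Hard describes, assuming an OpenAI-style chat API. The judge prompt wording, the tie handling, and the `generate`, `judge_pair`, and `score` helpers are illustrative assumptions, not the official Arena-Hard harness.

```python
# Illustrative sketch of a pairwise LLM-as-judge evaluation in the spirit
# of Arena-Hard. NOT the official harness: the judge prompt, scoring, and
# helper names below are assumptions made for clarity.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE = "gpt-4-1106-preview"  # judge model named in the README
BASELINE = "gpt-4-0314"       # default baseline named in the README

JUDGE_PROMPT = (
    "You are an impartial judge. Given a user question and two answers, "
    "reply with exactly 'A' if assistant A is better, 'B' if assistant B "
    "is better, or 'TIE'.\n\n"
    "Question: {q}\n\nAssistant A:\n{a}\n\nAssistant B:\n{b}"
)


def generate(model: str, query: str) -> str:
    """Get a model's answer to one benchmark query."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": query}]
    )
    return resp.choices[0].message.content


def judge_pair(q: str, a: str, b: str) -> str:
    """Ask the judge which of two answers is better; returns 'A', 'B', or 'TIE'."""
    resp = client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=q, a=a, b=b)}],
        temperature=0,  # deterministic verdicts
    )
    return resp.choices[0].message.content.strip().upper()


def score(model: str, queries: list[str]) -> float:
    """Win rate (0-100) of `model` against BASELINE; ties count as half a win."""
    points = sum(
        {"A": 1.0, "TIE": 0.5}.get(
            judge_pair(q, generate(model, q), generate(BASELINE, q)), 0.0
        )
        for q in queries
    )
    return 100.0 * points / len(queries)
```

The full Arena-Hard pipeline goes further than this sketch: each pair is also judged with the answer positions swapped to reduce order bias, and the 95% CI column above comes from bootstrapping the judged outcomes. By construction, the baseline GPT-4-0314 scores exactly 50.0 with a (0.0, 0.0) interval.
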
 # Limitations
 
 Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven't fine-tuned the model explicitly to align with **human** safety preferences, so it is capable of producing undesirable outputs, particularly when adversarially prompted. In our observations, the model still struggles with tasks that involve reasoning and math. In some instances, it may generate verbose or extraneous content.