Romain-Cosentino committed cb2002c (verified; parent: ac053aa)

Update README.md: adding Arena Hard benchmark

Files changed (1): README.md (+18, -0)

README.md CHANGED
@@ -82,6 +82,24 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q
 
 ![hexplot.png](hexplot_llama3-tenyxchat-70b.png)
 
+
+ ## Arena Hard
+
+ Arena-Hard is an evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging user queries. GPT-4-1106-preview is prompted as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
+
+ | Model name | Score | 95% CI |
+ | --- | --- | --- |
+ | gpt-4-0125-preview | 78.0 | (-1.8, 2.2) |
+ | claude-3-opus-20240229 | 60.4 | (-2.6, 2.1) |
+ | gpt-4-0314 | 50.0 | (0.0, 0.0) |
+ | **tenyx/Llama3-TenyxChat-70B** | 49.0 | (-3.0, 2.4) |
+ | meta-llama/Meta-Llama-3-70B-In | 47.3 | (-1.7, 2.6) |
+ | claude-3-sonnet-20240229 | 46.8 | (-2.7, 2.3) |
+ | claude-3-haiku-20240307 | 41.5 | (-2.4, 2.5) |
+ | gpt-4-0613 | 37.9 | (-2.1, 2.2) |
+ | mistral-large-2402 | 37.7 | (-2.9, 2.8) |
+ | Qwen1.5-72B-Chat | 36.1 | (-2.1, 2.4) |
+ | command-r-plus | 33.1 | (-2.0, 1.9) |
+
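The scores above are pairwise win rates against the baseline, with bootstrapped 95% confidence intervals reported as offsets from the point estimate. A minimal sketch of that computation, assuming per-battle judge verdicts of `"win"`/`"tie"`/`"loss"` (the function names and the tie-as-half-win convention are illustrative assumptions, not the benchmark's exact implementation):

```python
import random

def win_rate(judgments):
    # Score = percentage of battles won against the baseline,
    # counting each tie as half a win (an assumed convention).
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(points[j] for j in judgments) / len(judgments)

def bootstrap_ci(judgments, n_resamples=1000, seed=0):
    # Resample the battles with replacement and take the 2.5th/97.5th
    # percentiles; report the CI as offsets from the point estimate,
    # matching the table's "(lo, hi)" format.
    rng = random.Random(seed)
    estimate = win_rate(judgments)
    scores = sorted(
        win_rate([rng.choice(judgments) for _ in judgments])
        for _ in range(n_resamples)
    )
    lo = scores[int(0.025 * n_resamples)]
    hi = scores[int(0.975 * n_resamples)]
    return estimate, (round(lo - estimate, 1), round(hi - estimate, 1))
```

Under this scoring, the baseline judged against itself ties every battle, giving 50.0 with CI (0.0, 0.0), consistent with the gpt-4-0314 row.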
 # Limitations
 
 Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences, so it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observations, the model still tends to struggle with reasoning and math questions, and in some instances it generates verbose or extraneous content.