Update README.md
README.md (changed)
@@ -87,18 +87,19 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q
 
 Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 challenging user queries. It prompts GPT-4-1106-preview as a judge to compare the models' responses against a baseline model (default: GPT-4-0314).
 
-| Model-name | Score |
-| claude-3-
+| Model-name                     | Score  |                     |
+|--------------------------------|--------|---------------------|
+| gpt-4-0125-preview             | 78.0   | 95% CI: (-1.8, 2.2) |
+| claude-3-opus-20240229         | 60.4   | 95% CI: (-2.6, 2.1) |
+| gpt-4-0314                     | 50.0   | 95% CI: (0.0, 0.0)  |
+| **tenyx/Llama3-TenyxChat-70B** | 49.0   | 95% CI: (-3.0, 2.4) |
+| meta-llama/Meta-Llama-3-70B-In | 47.3   | 95% CI: (-1.7, 2.6) |
+| claude-3-sonnet-20240229       | 46.8   | 95% CI: (-2.7, 2.3) |
+| claude-3-haiku-20240307        | 41.5   | 95% CI: (-2.4, 2.5) |
+| gpt-4-0613                     | 37.9   | 95% CI: (-2.1, 2.2) |
+| mistral-large-2402             | 37.7   | 95% CI: (-2.9, 2.8) |
+| Qwen1.5-72B-Chat               | 36.1   | 95% CI: (-2.1, 2.4) |
+| command-r-plus                 | 33.1   | 95% CI: (-2.0, 1.9) |
 
 # Limitations
 
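For context on how scores like those in the added table are produced, below is a minimal sketch of the pairwise LLM-as-judge comparison that Arena-Hard performs, assuming an OpenAI-compatible Python client. The prompt wording, verdict format, and the `judge_pair` helper are illustrative assumptions, not Arena-Hard's actual templates or scoring pipeline; the real benchmark aggregates many such judgments across all 500 queries into the scores and confidence intervals shown above.

```python
# Sketch of a pairwise LLM-as-judge comparison (not Arena-Hard's real
# prompt or scoring code; names and prompt text are assumptions).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_MODEL = "gpt-4-1106-preview"  # judge model named in the README
BASELINE = "gpt-4-0314"             # default baseline named in the README


def judge_pair(query: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer to `query` is better: 'A', 'B', or 'TIE'."""
    prompt = (
        "You are an impartial judge. Compare the two assistant answers to "
        "the user query below and reply with exactly one token: A, B, or TIE.\n\n"
        f"[Query]\n{query}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}"
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep verdicts as deterministic as possible
    )
    return resp.choices[0].message.content.strip()
```

In practice, judges of this kind are usually run twice per query with the answer order swapped to reduce position bias, and the per-query verdicts are then aggregated into a single leaderboard score against the baseline.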