Romain-Cosentino committed
Commit d37e3d2 · verified · 1 Parent(s): cb2002c

Update README.md

Files changed (1)
  1. README.md +13 -12
README.md CHANGED
@@ -87,18 +87,19 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q
 
 Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 challenging user queries. They prompt GPT-4-1106-preview as judge to compare the models' responses against a baseline model (default: GPT-4-0314).
 
-| Model-name | Score |
-| gpt-4-0125-preview | 78.0 | 95% CI: (-1.8, 2.2)
-| claude-3-opus-20240229 | 60.4 | 95% CI: (-2.6, 2.1)
-| gpt-4-0314 | 50.0 | 95% CI: (0.0, 0.0)
-| **tenyx/Llama3-TenyxChat-70B** | 49.0 | 95% CI: (-3.0, 2.4)
-| meta-llama/Meta-Llama-3-70B-In | 47.3 | 95% CI: (-1.7, 2.6)
-| claude-3-sonnet-20240229 | 46.8 | 95% CI: (-2.7, 2.3)
-| claude-3-haiku-20240307 | 41.5 | 95% CI: (-2.4, 2.5)
-| gpt-4-0613 | 37.9 | 95% CI: (-2.1, 2.2)
-| mistral-large-2402 | 37.7 | 95% CI: (-2.9, 2.8)
-| Qwen1.5-72B-Chat | 36.1 | 95% CI: (-2.1, 2.4)
-| command-r-plus | 33.1 | 95% CI: (-2.0, 1.9)
+| Model-name                     | Score  |                     |
+|--------------------------------|--------|---------------------|
+| gpt-4-0125-preview             | 78.0   | 95% CI: (-1.8, 2.2) |
+| claude-3-opus-20240229         | 60.4   | 95% CI: (-2.6, 2.1) |
+| gpt-4-0314                     | 50.0   | 95% CI: (0.0, 0.0)  |
+| **tenyx/Llama3-TenyxChat-70B** | 49.0   | 95% CI: (-3.0, 2.4) |
+| meta-llama/Meta-Llama-3-70B-In | 47.3   | 95% CI: (-1.7, 2.6) |
+| claude-3-sonnet-20240229       | 46.8   | 95% CI: (-2.7, 2.3) |
+| claude-3-haiku-20240307        | 41.5   | 95% CI: (-2.4, 2.5) |
+| gpt-4-0613                     | 37.9   | 95% CI: (-2.1, 2.2) |
+| mistral-large-2402             | 37.7   | 95% CI: (-2.9, 2.8) |
+| Qwen1.5-72B-Chat               | 36.1   | 95% CI: (-2.1, 2.4) |
+| command-r-plus                 | 33.1   | 95% CI: (-2.0, 1.9) |
 
 # Limitations
 
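As context for the table this commit fixes: Arena-Hard scores come from pairwise comparisons, where the judge (GPT-4-1106-preview) sees the same query answered by both the baseline (GPT-4-0314) and the candidate model, and the verdicts are aggregated into a win-rate-style score; this is why the baseline itself sits at exactly 50.0 with a (0.0, 0.0) interval. The sketch below illustrates the shape of a single judgment. It is a minimal illustration under stated assumptions, not the official arena-hard-auto code: the prompt wording, the `judge_pair` helper, and the A/B/tie verdict format are all hypothetical.

```python
# Minimal sketch of an Arena-Hard-style pairwise judgment (illustrative only,
# not the official arena-hard-auto pipeline).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE = "gpt-4-1106-preview"  # judge model named in the README
BASELINE = "gpt-4-0314"       # default baseline named in the README

def judge_pair(query: str, baseline_answer: str, candidate_answer: str) -> str:
    """Ask the judge which of two answers better addresses the query.

    Hypothetical helper: the real Arena-Hard judge prompt asks for a graded
    verdict rather than the simple A/B/tie used here.
    """
    prompt = (
        f"User query:\n{query}\n\n"
        f"Assistant A:\n{baseline_answer}\n\n"
        f"Assistant B:\n{candidate_answer}\n\n"
        "Which answer is better? Reply with exactly one of: A, B, tie."
    )
    response = client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

A full run would apply `judge_pair` to all 500 queries, typically judging each pair in both answer orders to control for position bias, then aggregate the verdicts into the score and confidence interval shown in the table.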