Update README.md
adding Arena Hard benchmark
README.md
CHANGED
@@ -82,6 +82,24 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q
 
 ![hexplot.png](hexplot_llama3-tenyxchat-70b.png)
 
+## Arena Hard
+
+Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 challenging user queries. It prompts GPT-4-1106-preview as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
+
+| Model name | Score | 95% CI |
+| --- | --- | --- |
+| gpt-4-0125-preview | 78.0 | (-1.8, 2.2) |
+| claude-3-opus-20240229 | 60.4 | (-2.6, 2.1) |
+| gpt-4-0314 | 50.0 | (0.0, 0.0) |
+| **tenyx/Llama3-TenyxChat-70B** | 49.0 | (-3.0, 2.4) |
+| meta-llama/Meta-Llama-3-70B-Instruct | 47.3 | (-1.7, 2.6) |
+| claude-3-sonnet-20240229 | 46.8 | (-2.7, 2.3) |
+| claude-3-haiku-20240307 | 41.5 | (-2.4, 2.5) |
+| gpt-4-0613 | 37.9 | (-2.1, 2.2) |
+| mistral-large-2402 | 37.7 | (-2.9, 2.8) |
+| Qwen1.5-72B-Chat | 36.1 | (-2.1, 2.4) |
+| command-r-plus | 33.1 | (-2.0, 1.9) |
+
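To make the judging protocol concrete, here is a minimal sketch of a pairwise LLM-as-judge loop of the kind Arena-Hard describes, assuming an OpenAI-style chat API. The judge prompt wording, the tie handling, and the `generate`, `judge_pair`, and `score` helpers are illustrative assumptions, not the official Arena-Hard harness.

```python
# Illustrative sketch of a pairwise LLM-as-judge evaluation in the spirit
# of Arena-Hard. NOT the official harness: the judge prompt, scoring, and
# helper names below are assumptions made for clarity.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE = "gpt-4-1106-preview"  # judge model named in the README
BASELINE = "gpt-4-0314"       # default baseline named in the README

JUDGE_PROMPT = (
    "You are an impartial judge. Given a user question and two answers, "
    "reply with exactly 'A' if assistant A is better, 'B' if assistant B "
    "is better, or 'TIE'.\n\n"
    "Question: {q}\n\nAssistant A:\n{a}\n\nAssistant B:\n{b}"
)


def generate(model: str, query: str) -> str:
    """Get a model's answer to one benchmark query."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": query}]
    )
    return resp.choices[0].message.content


def judge_pair(q: str, a: str, b: str) -> str:
    """Ask the judge which of two answers is better; returns 'A', 'B', or 'TIE'."""
    resp = client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=q, a=a, b=b)}],
        temperature=0,  # deterministic verdicts
    )
    return resp.choices[0].message.content.strip().upper()


def score(model: str, queries: list[str]) -> float:
    """Win rate (0-100) of `model` against BASELINE; ties count as half a win."""
    points = sum(
        {"A": 1.0, "TIE": 0.5}.get(
            judge_pair(q, generate(model, q), generate(BASELINE, q)), 0.0
        )
        for q in queries
    )
    return 100.0 * points / len(queries)
```

The full Arena-Hard pipeline goes further than this sketch: each pair is also judged with the answer positions swapped to reduce order bias, and the 95% CI column above comes from bootstrapping the judged outcomes. By construction, the baseline GPT-4-0314 scores exactly 50.0 with a (0.0, 0.0) interval.
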
 # Limitations
 
 Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven't fine-tuned the model explicitly to align with **human** safety preferences, so it is capable of producing undesirable outputs, particularly when adversarially prompted. In our observations, the model still struggles with tasks that involve reasoning and math. In some instances, it may generate verbose or extraneous content.