Commit 2703078
Parent(s): aa9912a
Update README.md

README.md CHANGED
@@ -19,11 +19,11 @@ GodziLLa-30B is an experimental combination of various proprietary Maya LoRAs wi
 ## Open LLM Leaderboard Metrics
 | Metric | Value |
 |-----------------------|-------|
-| MMLU (5-shot) |
-| ARC (25-shot) |
-| HellaSwag (10-shot) |
-| TruthfulQA (0-shot) | 55.
-| Average |
+| MMLU (5-shot) | 54.2 |
+| ARC (25-shot) | 61.5 |
+| HellaSwag (10-shot) | 82.1 |
+| TruthfulQA (0-shot) | 55.9 |
+| Average | 63.4 |
 
 According to the leaderboard description, here are the benchmarks used for the evaluation:
 - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
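The "Average" row added by this commit appears to be the plain arithmetic mean of the four benchmark scores. A minimal sketch in Python, using only the values from the + rows above, to confirm the reported number:

```python
# Scores from the updated leaderboard table (the + rows above).
scores = {
    "MMLU (5-shot)": 54.2,
    "ARC (25-shot)": 61.5,
    "HellaSwag (10-shot)": 82.1,
    "TruthfulQA (0-shot)": 55.9,
}

# The "Average" row is consistent with an unweighted mean, rounded to one decimal.
average = sum(scores.values()) / len(scores)
print(f"{average:.3f}")  # 63.425 -> reported as 63.4 in the table
```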
@@ -31,6 +31,11 @@ According to the leaderboard description, here are the benchmarks used for the e
 - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
 
+## Leaderboard Highlights (as of July 21, 2023)
+- GodziLLa-30B is on par with [Falcon-40B-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) (June 2023's Rank #1).
+- GodziLLa-30B outperforms Meta AI's LLaMA [30B and 65B](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) models.
+- GodziLLa-30B ranks 3rd worldwide in the [TruthfulQA](https://arxiv.org/abs/2109.07958) metric, the standard LLM benchmark to measure whether a language model is truthful in generating answers to questions.
+
 ## Recommended Prompt Format
 Alpaca's instruction format is the recommended prompt format, but Vicuna's instruction format may also work.
 
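To make the prompt-format note concrete, here is a minimal sketch of an Alpaca-style instruction prompt driving a Hugging Face `transformers` generation call. The repo id `MayaPH/GodziLLa-30B` and the exact template wording are assumptions for illustration; the commit itself only says that Alpaca's instruction format is recommended and Vicuna's may also work.

```python
# Minimal sketch of the recommended Alpaca-style prompt format.
# Assumptions: the model is published on the Hub as "MayaPH/GodziLLa-30B",
# and the template below is the common Alpaca wording (not quoted from this commit).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MayaPH/GodziLLa-30B"  # assumed repo id

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = ALPACA_TEMPLATE.format(
    instruction="Explain what the TruthfulQA benchmark measures in one sentence."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

A Vicuna-style "USER: ... ASSISTANT:" prompt may also work, per the note above; only the template string would change.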