jaspercatapang committed
Commit 2703078 · 1 Parent(s): aa9912a

Update README.md

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -19,11 +19,11 @@ GodziLLa-30B is an experimental combination of various proprietary Maya LoRAs wi
  ## Open LLM Leaderboard Metrics
  | Metric | Value |
  |-----------------------|-------|
- | MMLU (5-shot) | 53.3 |
- | ARC (25-shot) | 54.2 |
- | HellaSwag (10-shot) | 79.7 |
- | TruthfulQA (0-shot) | 55.1 |
- | Average | 60.6 |
+ | MMLU (5-shot) | 54.2 |
+ | ARC (25-shot) | 61.5 |
+ | HellaSwag (10-shot) | 82.1 |
+ | TruthfulQA (0-shot) | 55.9 |
+ | Average | 63.4 |

  According to the leaderboard description, here are the benchmarks used for the evaluation:
  - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
@@ -31,6 +31,11 @@ According to the leaderboard description, here are the benchmarks used for the evaluation:
  - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
  - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.

+ ## Leaderboard Highlights (as of July 21, 2023)
+ - GodziLLa-30B is on par with [Falcon-40B-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) (June 2023's Rank #1).
+ - GodziLLa-30B outperforms Meta AI's LLaMA [30B and 65B](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) models.
+ - GodziLLa-30B ranks 3rd worldwide in the [TruthfulQA](https://arxiv.org/abs/2109.07958) metric, the standard LLM benchmark to measure whether a language model is truthful in generating answers to questions.
+
  ## Recommended Prompt Format
  Alpaca's instruction is the recommended prompt format, but Vicuna's instruction format may also work.
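For readers unfamiliar with the Alpaca instruction format recommended in the updated README, here is a minimal sketch. The template wording follows the public Stanford Alpaca convention and is an assumption here; the function name and example instruction are illustrative placeholders, not part of this commit.

```python
# Minimal sketch of an Alpaca-style instruction prompt (assumed template wording,
# following the public Stanford Alpaca convention; not quoted from this commit).

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca-style template before generation."""
    return ALPACA_TEMPLATE.format(instruction=instruction)

if __name__ == "__main__":
    # Example: format a prompt string to pass to the model's text-generation call.
    print(build_prompt("Explain what the Open LLM Leaderboard measures."))
```

Per the README, a Vicuna-style conversational format may also work, though this commit does not specify its exact wrapper text.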