jaspercatapang committed
Commit 2703078 · 1 Parent(s): aa9912a

Update README.md

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -19,11 +19,11 @@ GodziLLa-30B is an experimental combination of various proprietary Maya LoRAs wi
  ## Open LLM Leaderboard Metrics
  | Metric | Value |
  |-----------------------|-------|
- | MMLU (5-shot) | 53.3 |
- | ARC (25-shot) | 54.2 |
- | HellaSwag (10-shot) | 79.7 |
- | TruthfulQA (0-shot) | 55.1 |
- | Average | 60.6 |
+ | MMLU (5-shot) | 54.2 |
+ | ARC (25-shot) | 61.5 |
+ | HellaSwag (10-shot) | 82.1 |
+ | TruthfulQA (0-shot) | 55.9 |
+ | Average | 63.4 |

  According to the leaderboard description, here are the benchmarks used for the evaluation:
  - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
@@ -31,6 +31,11 @@ According to the leaderboard description, here are the benchmarks used for the evaluation:
  - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
  - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.

+ ## Leaderboard Highlights (as of July 21, 2023)
+ - GodziLLa-30B is on par with [Falcon-40B-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) (June 2023's Rank #1).
+ - GodziLLa-30B outperforms Meta AI's LLaMA [30B and 65B](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) models.
+ - GodziLLa-30B ranks 3rd worldwide in the [TruthfulQA](https://arxiv.org/abs/2109.07958) metric, the standard LLM benchmark to measure whether a language model is truthful in generating answers to questions.
+
  ## Recommended Prompt Format
  Alpaca's instruction is the recommended prompt format, but Vicuna's instruction format may also work.
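For readers unfamiliar with the Alpaca instruction format recommended in the updated README, here is a minimal sketch. The template wording follows the public Stanford Alpaca convention and is an assumption here; the function name and example instruction are illustrative placeholders, not part of this commit.

```python
# Minimal sketch of an Alpaca-style instruction prompt (assumed template wording,
# following the public Stanford Alpaca convention; not quoted from this commit).

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca-style template before generation."""
    return ALPACA_TEMPLATE.format(instruction=instruction)

if __name__ == "__main__":
    # Example: format a prompt string to pass to the model's text-generation call.
    print(build_prompt("Explain what the Open LLM Leaderboard measures."))
```

Per the README, a Vicuna-style conversational format may also work, though this commit does not specify its exact wrapper text.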