Update benchmark scores and averages for 1.1
README.md
@@ -362,7 +362,7 @@ metrics to cover different aspects of text generation:
 | [MMLU](https://arxiv.org/abs/2009.03300) | 5-shot, top-1 | 42.3 | 64.3 |
 | [HellaSwag](https://arxiv.org/abs/1905.07830) | 0-shot | 71.4 | 81.2 |
 | [PIQA](https://arxiv.org/abs/1911.11641) | 0-shot | 77.3 | 81.2 |
-| [SocialIQA](https://arxiv.org/abs/1904.09728) | 0-shot |
+| [SocialIQA](https://arxiv.org/abs/1904.09728) | 0-shot | 49.7 | 51.8 |
 | [BoolQ](https://arxiv.org/abs/1905.10044) | 0-shot | 69.4 | 83.2 |
 | [WinoGrande](https://arxiv.org/abs/1907.10641) | partial score | 65.4 | 72.3 |
 | [CommonsenseQA](https://arxiv.org/abs/1811.00937) | 7-shot | 65.3 | 71.3 |
@@ -370,7 +370,7 @@ metrics to cover different aspects of text generation:
 | [ARC-e](https://arxiv.org/abs/1911.01547) | | 73.2 | 81.5 |
 | [ARC-c](https://arxiv.org/abs/1911.01547) | | 42.1 | 53.2 |
 | [TriviaQA](https://arxiv.org/abs/1705.03551) | 5-shot | 53.2 | 63.4 |
-| [Natural Questions](https://github.com/google-research-datasets/natural-questions) | 5-shot |
+| [Natural Questions](https://github.com/google-research-datasets/natural-questions) | 5-shot | 12.5 | 23 |
 | [HumanEval](https://arxiv.org/abs/2107.03374) | pass@1 | 22.0 | 32.3 |
 | [MBPP](https://arxiv.org/abs/2108.07732) | 3-shot | 29.2 | 44.4 |
 | [GSM8K](https://arxiv.org/abs/2110.14168) | maj@1 | 17.7 | 46.4 |
@@ -378,7 +378,8 @@ metrics to cover different aspects of text generation:
 | [AGIEval](https://arxiv.org/abs/2304.06364) | | 24.2 | 41.7 |
 | [BIG-Bench](https://arxiv.org/abs/2206.04615) | | 35.2 | 55.1 |
 | ------------------------------ | ------------- | ----------- | --------- |
-| **Average** | | **
+| **Average** | | **45.0** | **56.9** |
+
 
 ## Ethics and Safety
 