suryabhupa committed on
Commit 64779af · verified · 1 Parent(s): fdc848b

Update benchmark scores and averages for 1.1

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -362,7 +362,7 @@ metrics to cover different aspects of text generation:
  | [MMLU](https://arxiv.org/abs/2009.03300) | 5-shot, top-1 | 42.3 | 64.3 |
  | [HellaSwag](https://arxiv.org/abs/1905.07830) | 0-shot |71.4 | 81.2 |
  | [PIQA](https://arxiv.org/abs/1911.11641) | 0-shot | 77.3 | 81.2 |
- | [SocialIQA](https://arxiv.org/abs/1904.09728) | 0-shot | 59.7 | 51.8 |
+ | [SocialIQA](https://arxiv.org/abs/1904.09728) | 0-shot | 49.7 | 51.8 |
  | [BooIQ](https://arxiv.org/abs/1905.10044) | 0-shot | 69.4 | 83.2 |
  | [WinoGrande](https://arxiv.org/abs/1907.10641) | partial score | 65.4 | 72.3 |
  | [CommonsenseQA](https://arxiv.org/abs/1811.00937) | 7-shot | 65.3 | 71.3 |
@@ -370,7 +370,7 @@ metrics to cover different aspects of text generation:
  | [ARC-e](https://arxiv.org/abs/1911.01547) | | 73.2 | 81.5 |
  | [ARC-c](https://arxiv.org/abs/1911.01547) | | 42.1 | 53.2 |
  | [TriviaQA](https://arxiv.org/abs/1705.03551) | 5-shot | 53.2 | 63.4 |
- | [Natural Questions](https://github.com/google-research-datasets/natural-questions) | 5-shot | - | 23 |
+ | [Natural Questions](https://github.com/google-research-datasets/natural-questions) | 5-shot | 12.5 | 23 |
  | [HumanEval](https://arxiv.org/abs/2107.03374) | pass@1 | 22.0 | 32.3 |
  | [MBPP](https://arxiv.org/abs/2108.07732) | 3-shot | 29.2 | 44.4 |
  | [GSM8K](https://arxiv.org/abs/2110.14168) | maj@1 | 17.7 | 46.4 |
@@ -378,7 +378,8 @@ metrics to cover different aspects of text generation:
  | [AGIEval](https://arxiv.org/abs/2304.06364) | | 24.2 | 41.7 |
  | [BIG-Bench](https://arxiv.org/abs/2206.04615) | | 35.2 | 55.1 |
  | ------------------------------ | ------------- | ----------- | --------- |
- | **Average** | | **54.0** | **56.4** |
+ | **Average** | | **45.0** | **56.9** |
+
 
  ## Ethics and Safety
 
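
Since this commit changes the Average row, one way to sanity-check it is to recompute the column means from the table rows. The following is a minimal sketch in Python, assuming the rows are pasted in as Markdown text; `VISIBLE_ROWS` and `parse_scores` are hypothetical names used for illustration and are not part of the repository. Only the rows shown in the hunks above (new side) are included. README lines 369 and 377 fall outside the diff, so the means computed here cover 16 rows and are not expected to equal the reported averages.

```python
from statistics import mean

# Rows copied from the new side of the diff hunks. README lines 369 and 377
# are not part of the diff, so two table rows are missing from this sample.
VISIBLE_ROWS = """
| [MMLU](https://arxiv.org/abs/2009.03300) | 5-shot, top-1 | 42.3 | 64.3 |
| [HellaSwag](https://arxiv.org/abs/1905.07830) | 0-shot |71.4 | 81.2 |
| [PIQA](https://arxiv.org/abs/1911.11641) | 0-shot | 77.3 | 81.2 |
| [SocialIQA](https://arxiv.org/abs/1904.09728) | 0-shot | 49.7 | 51.8 |
| [BooIQ](https://arxiv.org/abs/1905.10044) | 0-shot | 69.4 | 83.2 |
| [WinoGrande](https://arxiv.org/abs/1907.10641) | partial score | 65.4 | 72.3 |
| [CommonsenseQA](https://arxiv.org/abs/1811.00937) | 7-shot | 65.3 | 71.3 |
| [ARC-e](https://arxiv.org/abs/1911.01547) | | 73.2 | 81.5 |
| [ARC-c](https://arxiv.org/abs/1911.01547) | | 42.1 | 53.2 |
| [TriviaQA](https://arxiv.org/abs/1705.03551) | 5-shot | 53.2 | 63.4 |
| [Natural Questions](https://github.com/google-research-datasets/natural-questions) | 5-shot | 12.5 | 23 |
| [HumanEval](https://arxiv.org/abs/2107.03374) | pass@1 | 22.0 | 32.3 |
| [MBPP](https://arxiv.org/abs/2108.07732) | 3-shot | 29.2 | 44.4 |
| [GSM8K](https://arxiv.org/abs/2110.14168) | maj@1 | 17.7 | 46.4 |
| [AGIEval](https://arxiv.org/abs/2304.06364) | | 24.2 | 41.7 |
| [BIG-Bench](https://arxiv.org/abs/2206.04615) | | 35.2 | 55.1 |
| ------------------------------ | ------------- | ----------- | --------- |
| **Average** | | **45.0** | **56.9** |
"""


def parse_scores(markdown_rows: str) -> list[tuple[float, float]]:
    """Return the two score cells of every benchmark row as floats."""
    scores = []
    for line in markdown_rows.strip().splitlines():
        # Drop the outer pipes, then split into the four table cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4 or "Average" in cells[0]:
            continue  # skip malformed rows and the summary row itself
        try:
            scores.append((float(cells[2]), float(cells[3])))
        except ValueError:
            continue  # separator rows or placeholder cells such as "-"
    return scores


rows = parse_scores(VISIBLE_ROWS)
mean_first = mean(score[0] for score in rows)
mean_second = mean(score[1] for score in rows)
print(f"{len(rows)} rows: {mean_first:.1f} / {mean_second:.1f}")
```

Running the same function over the full table from README.md, rather than the excerpt visible here, would be the way to compare against the Average row in the commit.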