ldwang committed · verified · Commit 0833bde · 1 Parent(s): 7622e34

Update README.md

Files changed (1): README.md (+19, -1)

README.md CHANGED

@@ -7,7 +7,25 @@ We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter
 Our training curves have been recorded in Weights & Biases [Aquila-1_4B-A0_4B-Baseline](https://wandb.ai/aquila3/OpenSeek-3B-v0.1/runs/Aquila-1_4B-A0_4B-Baseline-rank-31).
 
 ## Evaluation
-
+ We used the LightEval library for model evaluation, following the same setup as in FineWeb and CCI3-HQ.
+ All evaluations were conducted in a zero-shot setting, except GSM8K, which uses 5-shot. To compare performance across datasets directly, we report an overall Average, computed as the mean of the English-benchmark average and the Chinese-benchmark average.
+ | Metric              | Score     |
+ |---------------------|-----------|
+ | HellaSwag           | 42.09     |
+ | ARC (Average)       | 40.11     |
+ | PIQA                | 67.14     |
+ | MMLU (cloze)        | 31.29     |
+ | CommonsenseQA       | 28.17     |
+ | TriviaQA            | 6.51      |
+ | WinoGrande          | 51.38     |
+ | OpenBookQA          | 33.00     |
+ | GSM8K (5-shot)      | 6.67      |
+ | SIQA                | 41.86     |
+ | CEval               | 30.19     |
+ | CMMLU               | 30.25     |
+ | **Average-English** | **34.82** |
+ | **Average-Chinese** | **30.22** |
+ | **Overall Average** | **32.52** |
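
For reference, the aggregate rows can be reproduced from the per-benchmark scores in the table. Below is a minimal sketch of that arithmetic, assuming (as the reported averages imply) that CEval and CMMLU form the Chinese group, the remaining benchmarks form the English group, and the overall figure is the mean of the two group averages rather than of all twelve scores.

```python
# Reproduce the aggregate rows of the evaluation table above.
# Grouping assumption: CEval and CMMLU are the Chinese benchmarks,
# everything else is English; the overall average is the mean of
# the two per-language averages (macro-average), not of all 12 scores.

english = {
    "HellaSwag": 42.09, "ARC (Average)": 40.11, "PIQA": 67.14,
    "MMLU (cloze)": 31.29, "CommonsenseQA": 28.17, "TriviaQA": 6.51,
    "WinoGrande": 51.38, "OpenBookQA": 33.00, "GSM8K (5-shot)": 6.67,
    "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)   # 34.82
avg_zh = sum(chinese.values()) / len(chinese)   # 30.22
overall = (avg_en + avg_zh) / 2                 # 32.52

print(f"Average-English: {avg_en:.2f}")
print(f"Average-Chinese: {avg_zh:.2f}")
print(f"Overall Average: {overall:.2f}")
```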
 
 ## Usage Instructions
 ```python