Update README.md
README.md
CHANGED
@@ -7,7 +7,25 @@ We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-paramet
Our training curves have been recorded in Weights & Biases: [Aquila-1_4B-A0_4B-Baseline](https://wandb.ai/aquila3/OpenSeek-3B-v0.1/runs/Aquila-1_4B-A0_4B-Baseline-rank-31).
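
For reference, the logged curves can also be pulled programmatically with the W&B public API. The sketch below uses the entity/project/run path from the URL above; the metric key `"loss"` is an assumption, so substitute whatever keys the run actually logs.

```python
import wandb

# Fetch the baseline run referenced above via the W&B public API.
api = wandb.Api()
run = api.run("aquila3/OpenSeek-3B-v0.1/Aquila-1_4B-A0_4B-Baseline-rank-31")

# "loss" is an assumed metric key; inspect run.history().columns for the real ones.
history = run.history(keys=["loss"])  # pandas DataFrame of logged steps
print(history.tail())
```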
## Evaluation

We used the LightEval library for model evaluation, following the same setup as in FineWeb and CCI3-HQ.

All evaluations were conducted in a zero-shot setting, except GSM8K, which is evaluated 5-shot. To directly compare performance across different datasets, we report Average-English and Average-Chinese, the mean scores over the English and Chinese benchmarks respectively, and an Overall Average, the mean of those two group averages (see the sketch after the table).

| Benchmark           | Score     |
|---------------------|-----------|
| HellaSwag           | 42.09     |
| ARC (Average)       | 40.11     |
| PIQA                | 67.14     |
| MMLU (cloze)        | 31.29     |
| CommonsenseQA       | 28.17     |
| TriviaQA            | 6.51      |
| Winogrande          | 51.38     |
| OpenBookQA          | 33.00     |
| GSM8K (5-shot)      | 6.67      |
| SIQA                | 41.86     |
| CEval               | 30.19     |
| CMMLU               | 30.25     |
| **Average-English** | **34.82** |
| **Average-Chinese** | **30.22** |
| **Overall Average** | **32.52** |
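
The three summary rows can be reproduced directly from the per-benchmark scores. The grouping below (ten English benchmarks, two Chinese benchmarks, Overall as the mean of the two group averages) is our reading of the table and matches the reported numbers; a minimal sketch:

```python
# Recompute the summary rows of the table above from the per-benchmark scores.
english = {
    "HellaSwag": 42.09, "ARC (Average)": 40.11, "PIQA": 67.14,
    "MMLU (cloze)": 31.29, "CommonsenseQA": 28.17, "TriviaQA": 6.51,
    "Winogrande": 51.38, "OpenBookQA": 33.00, "GSM8K (5-shot)": 6.67,
    "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)   # -> 34.82
avg_zh = sum(chinese.values()) / len(chinese)   # -> 30.22
overall = (avg_en + avg_zh) / 2                 # -> 32.52

print(f"Average-English: {avg_en:.2f}")
print(f"Average-Chinese: {avg_zh:.2f}")
print(f"Overall Average: {overall:.2f}")
```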
## Usage Instructions
```python