Update README.md
README.md
CHANGED
@@ -7,7 +7,25 @@ We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-paramet
Our training curves have been recorded in Weights & Biases: [Aquila-1_4B-A0_4B-Baseline](https://wandb.ai/aquila3/OpenSeek-3B-v0.1/runs/Aquila-1_4B-A0_4B-Baseline-rank-31).
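
For reference, the logged curves can also be pulled programmatically with the W&B public API. The sketch below uses the entity/project/run path from the URL above; the metric key `"loss"` is an assumption, so substitute whatever keys the run actually logs.

```python
import wandb

# Fetch the baseline run referenced above via the W&B public API.
api = wandb.Api()
run = api.run("aquila3/OpenSeek-3B-v0.1/Aquila-1_4B-A0_4B-Baseline-rank-31")

# "loss" is an assumed metric key; inspect run.history().columns for the real ones.
history = run.history(keys=["loss"])  # pandas DataFrame of logged steps
print(history.tail())
```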
## Evaluation

We used the LightEval library for model evaluation, following the same setup as in FineWeb and CCI3-HQ.

All evaluations were conducted in a zero-shot setting, except GSM8K, which is evaluated 5-shot. To directly compare performance across different datasets, we report Average-English and Average-Chinese, the mean scores over the English and Chinese benchmarks respectively, and an Overall Average, the mean of those two group averages (see the sketch after the table).

| Benchmark           | Score     |
|---------------------|-----------|
| HellaSwag           | 42.09     |
| ARC (Average)       | 40.11     |
| PIQA                | 67.14     |
| MMLU (cloze)        | 31.29     |
| CommonsenseQA       | 28.17     |
| TriviaQA            | 6.51      |
| Winogrande          | 51.38     |
| OpenBookQA          | 33.00     |
| GSM8K (5-shot)      | 6.67      |
| SIQA                | 41.86     |
| CEval               | 30.19     |
| CMMLU               | 30.25     |
| **Average-English** | **34.82** |
| **Average-Chinese** | **30.22** |
| **Overall Average** | **32.52** |
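
The three summary rows can be reproduced directly from the per-benchmark scores. The grouping below (ten English benchmarks, two Chinese benchmarks, Overall as the mean of the two group averages) is our reading of the table and matches the reported numbers; a minimal sketch:

```python
# Recompute the summary rows of the table above from the per-benchmark scores.
english = {
    "HellaSwag": 42.09, "ARC (Average)": 40.11, "PIQA": 67.14,
    "MMLU (cloze)": 31.29, "CommonsenseQA": 28.17, "TriviaQA": 6.51,
    "Winogrande": 51.38, "OpenBookQA": 33.00, "GSM8K (5-shot)": 6.67,
    "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)   # -> 34.82
avg_zh = sum(chinese.values()) / len(chinese)   # -> 30.22
overall = (avg_en + avg_zh) / 2                 # -> 32.52

print(f"Average-English: {avg_en:.2f}")
print(f"Average-Chinese: {avg_zh:.2f}")
print(f"Overall Average: {overall:.2f}")
```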
## Usage Instructions
```python