jiazhengli/Llama-3.1-8B-RoleMRC-sft


jiazhengli committed 1 day ago

Commit f59fb3b (verified) · 1 parent: 869895d

Update README.md

Files changed (1)

README.md +6 -6

README.md

@@ -27,12 +27,12 @@ Reference-based Evaluation Result
 General Benchmark
-| Model                                  | GSM8K 8-shot | Math 4-shot | GPQA 0-shot | IFEval 3-shot | MMLU-Pro 5-shot | MMLU 0-shot | PiQA 3-shot | MUSR 0-shot | TruthfulQA 3-shot / Avg. |
-|----------------------------------------|-------------|------------|-------------|--------------|---------------|-----------|-----------|-----------|------------------------|
-| LLAMA3.1-8B                            | 48.98       | 17.78      | 12.5        | 16.67        | 35.21         | 63.27     | 81.77     | 38.1      | 28.52                  |
-| LLAMA3.1-8B-INSTRUCT                   | 77.41       | 34.1       | 12.72       | 57.67        | 40.77         | 68.1      | 82.1      | 39.81     | 36.47                  |
-| **LLaMA3.1-8B-RoleMRC-SFT**                | 56.18       | 12.78      | 19.64       | 42.09        | 31.58         | 59.3      | 82.64     | 40.34     | 35.01                  |
-| LLaMA3.1-8B-RoleMRC-DPO                | 58.53       | 13.5       | 20.09       | 46.64        | 31.8          | 59.96     | 82.7      | 39.42     | 37.33                  |
+| Model                                  | GSM8K 8-shot | Math 4-shot | GPQA 0-shot | IFEval 3-shot | MMLU-Pro 5-shot | MMLU 0-shot | PiQA 3-shot | MUSR 0-shot | TruthfulQA 3-shot| / Avg. |
+|----------------------------------------|-------------|------------|-------------|--------------|---------------|-----------|-----------|-----------|------------------------|------|
+| LLAMA3.1-8B                            | 48.98       | 17.78      | 12.5        | 16.67        | 35.21         | 63.27     | 81.77     | 38.1      | 28.52                  | 38.09 |
+| LLAMA3.1-8B-INSTRUCT                   | 77.41       | 34.1       | 12.72       | 57.67        | 40.77         | 68.1      | 82.1      | 39.81     | 36.47                  | 49.91 |
+| **LLaMA3.1-8B-RoleMRC-SFT**                | 56.18       | 12.78      | 19.64       | 42.09        | 31.58         | 59.3      | 82.64     | 40.34     | 35.01                  | 42.17 |
+| LLaMA3.1-8B-RoleMRC-DPO                | 58.53       | 13.5       | 20.09       | 46.64        | 31.8          | 59.96     | 82.7      | 39.42     | 37.33                  | 43.33 |
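
Review note: the new `/ Avg.` column appears to be the unweighted mean of the nine benchmark scores in each row. The README does not state how the average is computed, so that is an assumption. The sketch below recomputes it under that assumption from the values in the added rows; the names `scores`, `reported_avg`, and `mean_score` are illustrative and not part of the repository.

```python
# Scores copied from the added table rows, in column order:
# GSM8K, Math, GPQA, IFEval, MMLU-Pro, MMLU, PiQA, MUSR, TruthfulQA.
scores = {
    "LLAMA3.1-8B": [48.98, 17.78, 12.5, 16.67, 35.21, 63.27, 81.77, 38.1, 28.52],
    "LLAMA3.1-8B-INSTRUCT": [77.41, 34.1, 12.72, 57.67, 40.77, 68.1, 82.1, 39.81, 36.47],
    "LLaMA3.1-8B-RoleMRC-SFT": [56.18, 12.78, 19.64, 42.09, 31.58, 59.3, 82.64, 40.34, 35.01],
    "LLaMA3.1-8B-RoleMRC-DPO": [58.53, 13.5, 20.09, 46.64, 31.8, 59.96, 82.7, 39.42, 37.33],
}

# Values from the new "/ Avg." column in this commit.
reported_avg = {
    "LLAMA3.1-8B": 38.09,
    "LLAMA3.1-8B-INSTRUCT": 49.91,
    "LLaMA3.1-8B-RoleMRC-SFT": 42.17,
    "LLaMA3.1-8B-RoleMRC-DPO": 43.33,
}


def mean_score(values):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


for model, values in scores.items():
    computed = mean_score(values)
    # Compare with a small tolerance to sidestep floating-point noise.
    matches = abs(computed - reported_avg[model]) < 0.005
    print(f"{model}: computed={computed:.2f} reported={reported_avg[model]:.2f} match={matches}")
```

Under this assumption all four recomputed means agree with the reported column to two decimals.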
 ## Evaluation Details
 Five conditional benchmarks, using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):