hawei committed
Commit 6fb290f · verified · 1 Parent(s): 8da1cd7

Add paper link

Files changed (1)
  1. README.md +117 -114
README.md CHANGED
@@ -1,114 +1,117 @@
---
license: llama3.1
datasets:
- nvidia/OpenMathInstruct-2
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
model-index:
- name: Control-LLM-Llama3.1-8B-Math16
  results:
  - task:
      type: code-evaluation
    dataset:
      type: mixed
      name: Code Evaluation Dataset
    metrics:
    - name: pass_at_1,n=1 (code_instruct)
      type: pass_at_1
      value: 0.7840083073727934
      stderr: 0.013257237506304915
      verified: false
    - name: pass_at_1,n=1 (humaneval_greedy_instruct)
      type: pass_at_1
      value: 0.8170731707317073
      stderr: 0.03028135999593353
      verified: false
    - name: pass_at_1,n=1 (humaneval_plus_greedy_instruct)
      type: pass_at_1
      value: 0.7439024390243902
      stderr: 0.03418746588364997
      verified: false
    - name: pass_at_1,n=1 (mbpp_plus_0shot_instruct)
      type: pass_at_1
      value: 0.8042328042328042
      stderr: 0.0204357309715418
      verified: false
    - name: pass_at_1,n=1 (mbpp_sanitized_0shot_instruct)
      type: pass_at_1
      value: 0.7587548638132295
      stderr: 0.02673991635681605
      verified: false
  - task:
      type: original-capability
    dataset:
      type: meta/Llama-3.1-8B-Instruct-evals
      name: Llama-3.1-8B-Instruct-evals Dataset
      dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
      dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
    metrics:
    - name: exact_match,strict-match (original_capability_instruct)
      type: exact_match
      value: 0.5630801459168563
      stderr: 0.0028483348465514185
      verified: false
    - name: exact_match,strict-match (meta_arc_0shot_instruct)
      type: exact_match
      value: 0.8248927038626609
      stderr: 0.01113972223585952
      verified: false
    - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
      type: exact_match
      value: 0.296875
      stderr: 0.021609729061250887
      verified: false
    - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
      type: exact_match
      value: 0.6815980629539952
      stderr: 0.003931452244804845
      verified: false
    - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
      type: exact_match
      value: 0.4093251329787234
      stderr: 0.004482884901882547
      verified: false
---
# Control-LLM-Llama3.1-8B-Math16
This is a fine-tuned version of Llama-3.1-8B-Instruct for mathematical tasks, trained on the OpenCoder SFT dataset.

+ ## Linked Paper
+ This model is associated with the paper [Control-LLM](https://arxiv.org/abs/2501.10979).
+
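## Usage
A minimal usage sketch, assuming the checkpoint is published with the standard `transformers` causal-LM interface; the repo id below is an assumption based on the model name, so substitute the actual one.

```python
# Minimal usage sketch (not from the original card).
# Assumes a standard transformers causal-LM checkpoint; the repo id is
# an assumption -- replace it with the actual Hugging Face repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hawei/Control-LLM-Llama3.1-8B-Math16"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama-3.1-Instruct models expect the chat template, not raw text.
messages = [{"role": "user", "content": "Compute 17 * 24 step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
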
## Evaluation Results
Here is an overview of the evaluation results and findings:

### Benchmark Results and Catastrophic Forgetting on OpenCoder
The following plot illustrates benchmark results and the mitigation of catastrophic forgetting on the OpenCoder SFT dataset.

![Catastrophic Forgetting](plots/catastrophic_forgetting_opencoder.png)

### Benchmark Results Table
The table below summarizes evaluation results across coding tasks and original capabilities.

| **Model**          | **MB+** | **MS**  | **HE+** | **HE**  | **C-Avg** | **ARC** | **GP**  | **MLU** | **MLUP** | **O-Avg** | **Overall** |
|--------------------|---------|---------|---------|---------|-----------|---------|---------|---------|----------|-----------|-------------|
| Llama3.1-8B-Ins    | 70.4    | 67.7    | 66.5    | 70.7    | 69.1      | 83.4    | 29.9    | 72.4    | 46.7     | 60.5      | 64.8        |
| OpenCoder-8B-Ins   | 81.2    | 76.3    | 78.0    | 82.3    | 79.5      | 8.2     | 25.4    | 37.4    | 11.3     | 24.6      | 52.1        |
| **Full Param Tune**| 75.1    | 69.6    | 71.3    | 76.8    | 73.3      | 24.4    | 21.9    | 43.0    | 19.2     | 31.5      | 52.4        |
| Partial Param Tune | 75.7    | 71.6    | 74.4    | 79.3    | 75.0      | 70.2    | 28.1    | 60.7    | 32.4     | 48.3      | 61.7        |
| Stack Expansion    | 77.2    | 72.8    | 73.2    | 78.7    | 75.6      | 80.0    | 26.3    | 66.6    | 38.2     | 54.2      | 64.9        |
| Hybrid Expansion*  | 77.5    | 73.5    | **76.2**| **82.3**| 77.1      | 80.9    | **32.6**| 68.1    | 40.3     | 56.0      | 66.6        |
| **Control LLM***   | **80.4**| **75.9**| 74.4    | 81.1    | **78.3**  | **82.5**| 29.7    | **68.2**| **40.9** | **56.3**  | **67.3**    |
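
The coding columns are greedy pass@1 scores, matching the `pass_at_1,n=1` entries in the metadata above. For reference, here is a small sketch of the standard unbiased pass@k estimator that this metric instantiates with n = k = 1; the function is illustrative, not part of this repo's tooling.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), where n is
    the number of sampled completions per problem and c the number that
    pass all tests. With n = k = 1 it reduces to the plain pass rate."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Greedy decoding (n = 1): pass@1 is the fraction of problems solved.
assert pass_at_k(1, 1, 1) == 1.0
assert pass_at_k(1, 0, 1) == 0.0
```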

---

### Explanation
- **MB+**: MBPP Plus
- **MS**: MBPP Sanitized
- **HE+**: HumanEval Plus
- **HE**: HumanEval
- **C-Avg**: Coding - size-weighted average across MB+, MS, HE+, and HE (recomputed in the sketch after this list)
- **ARC**: ARC benchmark
- **GP**: GPQA benchmark
- **MLU**: MMLU (Massive Multitask Language Understanding)
- **MLUP**: MMLU Pro
- **O-Avg**: Original Capability - size-weighted average across ARC, GPQA, MMLU, and MMLU Pro
- **Overall**: Combined average across all tasks, i.e. the mean of C-Avg and O-Avg
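
As a worked check of the averaging scheme, the sketch below recomputes the Control LLM row. The per-benchmark problem counts are assumptions inferred from the stderr values in the metadata (stderr ~ sqrt(p(1-p)/n)); they match the usual test-set sizes for these benchmarks.

```python
# Recompute C-Avg, O-Avg, and Overall for the Control LLM row.
# Problem counts are assumptions inferred from the metadata stderrs.
coding = {           # benchmark: (score %, assumed problem count)
    "MB+":  (80.4, 378),
    "MS":   (75.9, 257),
    "HE+":  (74.4, 164),
    "HE":   (81.1, 164),
}
original = {
    "ARC":  (82.5, 1165),
    "GP":   (29.7, 448),
    "MLU":  (68.2, 14042),
    "MLUP": (40.9, 12032),
}

def size_weighted_avg(results: dict[str, tuple[float, int]]) -> float:
    total = sum(n for _, n in results.values())
    return sum(score * n for score, n in results.values()) / total

c_avg = size_weighted_avg(coding)     # ~78.3, matches C-Avg
o_avg = size_weighted_avg(original)   # ~56.3, matches O-Avg
overall = (c_avg + o_avg) / 2         # ~67.3, matches Overall
print(f"C-Avg={c_avg:.1f}  O-Avg={o_avg:.1f}  Overall={overall:.1f}")
```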