---
license: llama3.1
datasets:
- nvidia/OpenMathInstruct-2
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
model-index:
- name: Control-LLM-Llama3.1-8B-Math16
  results:
  - task:
      type: math-evaluation
    dataset:
      type: parquet
      name: Math, Math Hard, GSM8K
      dataset_kwargs:
        data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
    metrics:
    - name: exact_match,none
      type: exact_match
      value: 0.6205678398534606
      stderr: 0.005249520342473376
      verified: false
    - name: exact_match,none (gsm8k_0shot_instruct)
      type: exact_match
      value: 0.8968915845337376
      stderr: 0.008376436987507811
      verified: false
    - name: exact_match,none (meta_math_0shot_instruct)
      type: exact_match
      value: 0.6166
      stderr: 0.006876797660918556
      verified: false
    - name: exact_match,none (meta_math_hard_0shot_instruct)
      type: exact_match
      value: 0.36027190332326287
      stderr: 0.013198755610252931
      verified: false
  - task:
      type: original-capability
    dataset:
      type: meta/Llama-3.1-8B-Instruct-evals
      name: Llama-3.1-8B-Instruct-evals Dataset
      dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
      dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
    metrics:
    - name: exact_match,strict-match
      type: exact_match
      value: 0.6001372485281902
      stderr: 0.002821514831773572
      verified: false
    - name: exact_match,strict-match (meta_arc_0shot_instruct)
      type: exact_match
      value: 0.8248927038626609
      stderr: 0.011139722235859526
      verified: false
    - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
      type: exact_match
      value: 0.3080357142857143
      stderr: 0.021836780796366417
      verified: false
    - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
      type: exact_match
      value: 0.7159948725252813
      stderr: 0.00380556397209409
      verified: false
    - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
      type: exact_match
      value: 0.45403922872340424
      stderr: 0.004539171007529716
      verified: false
library_name: transformers
pipeline_tag: text-generation
---

# Control-LLM-Llama3.1-8B-Math16
This is a fine-tuned version of Llama-3.1-8B-Instruct for mathematical tasks, trained on the OpenMathInstruct-2 (OpenMath2) dataset.
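
## How to Use
The metadata above declares `library_name: transformers` and `pipeline_tag: text-generation`, so the model loads like any Llama 3.1 Instruct checkpoint. Below is a minimal usage sketch; the Hub repo id is assumed from this card's model name and may differ.

```python
# Minimal usage sketch with Hugging Face transformers.
# NOTE: the repo id below is an assumption based on this card's model name;
# replace it with the actual Hub path if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16"  # hypothetical Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit an 8B model on one GPU
    device_map="auto",
)

# Llama 3.1 Instruct checkpoints ship a chat template with the tokenizer.
messages = [{"role": "user", "content": "Solve: if 3x + 5 = 26, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```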

## Linked Paper
This model accompanies the paper: [Control-LLM](https://huggingface.co/papers/2501.10979).

## Linked Open-Source Code: Training, Evaluation, and Benchmarks
The training, evaluation, and benchmark code is available on GitHub: [Control-LLM](https://github.com/linkedin/ControlLLM).

## Evaluation Results
Here is an overview of the evaluation results and findings:
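
The math metrics in the metadata above were computed on the joined parquet file linked there. Below is a minimal sketch for inspecting that file, assuming the GitHub blob link is converted to its raw-download counterpart:

```python
# Sketch: load the joined math eval set referenced in the card metadata.
# The metadata links a GitHub blob page; downloading requires the raw URL.
from datasets import load_dataset

raw_url = (
    "https://raw.githubusercontent.com/linkedin/ControlLLM/main/"
    "src/controlllm/inference/llm_eval_harness/additional_tasks/math/"
    "joined_math.parquet"
)
ds = load_dataset("parquet", data_files=raw_url, split="train")
print(ds)  # inspect the schema and number of rows
```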

### Benchmark Results Table
The table below summarizes evaluation results across mathematical tasks and original capabilities. All values are exact-match accuracy (%).

| **Model**                | **MH**   | **M**    | **G8K**  | **M-Avg** | **ARC**  | **GPQA** | **MMLU** | **MMLU-Pro** | **O-Avg** | **Overall** |
|--------------------------|----------|----------|----------|-----------|----------|----------|----------|--------------|-----------|-------------|
| Llama3.1-8B-Inst         | 23.7     | 50.9     | 85.6     | 52.1      | **83.4** | 29.9     | **72.4** | **46.7**     | **60.5**  | 56.3        |
| **Control LLM** (this model) | **36.0** | **61.7** | **89.7** | **62.5**  | 82.5     | **30.8** | 71.6     | 45.4         | 57.6      | **60.0**    |

Bold marks the better score in each column.

---
### Explanation:
- **MH**: Math Hard
- **M**: Math
- **G8K**: GSM8K
- **M-Avg**: Math - average across Math Hard, Math, and GSM8K
- **ARC**: ARC Challenge benchmark
- **GPQA**: Graduate-level, Google-proof question answering benchmark
- **MMLU**: Massive Multitask Language Understanding
- **MMLU-Pro**: A more challenging, professional-level extension of MMLU
- **O-Avg**: Original capability - average across ARC, GPQA, MMLU, and MMLU-Pro
- **Overall**: Combined average (the mean of M-Avg and O-Avg; see the sanity check below)
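
As referenced in the list above, here is a small sanity check on the aggregate columns. It assumes simple (unweighted) means, which reproduces the Control LLM row; the baseline row's M-Avg (52.1) differs from the simple mean (53.4), so a different weighting may have been used there.

```python
# Sanity check: reproduce the aggregate columns of the Control LLM row,
# assuming simple (unweighted) means. Values are exact-match accuracy (%).
math_scores = [36.0, 61.7, 89.7]        # MH, M, G8K
orig_scores = [82.5, 30.8, 71.6, 45.4]  # ARC, GPQA, MMLU, MMLU-Pro

m_avg = sum(math_scores) / len(math_scores)  # 62.47 -> 62.5 in the table
o_avg = sum(orig_scores) / len(orig_scores)  # 57.58 -> 57.6 in the table
overall = (m_avg + o_avg) / 2                # 60.02 -> 60.0 in the table

print(f"M-Avg: {m_avg:.1f}, O-Avg: {o_avg:.1f}, Overall: {overall:.1f}")
```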

### Catastrophic Forgetting on OpenMath
The plot below illustrates and compares how well catastrophic forgetting is mitigated during training.

![Catastrophic Forgetting](plots/ControlLLM_CF_Plot_Math.png)

### Alignment Result
The plot below highlights the alignment result of the model trained with Control LLM.

![Alignment](plots/alignment_best.png)