	---
	license: llama3.1
	datasets:
	- nvidia/OpenMathInstruct-2
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- meta-llama/Llama-3.1-8B-Instruct
	model-index:
	- name: Control-LLM-Llama3.1-8B-Math16
	  results:
	  - task:
	      type: math-evaluation
	    dataset:
	      type: parquet
	      name: Math, Math Hard, GSM8K
	      dataset_kwargs:
	        data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
	    metrics:
	    - name: exact_match,none
	      type: exact_match
	      value: 0.6327358367133324
	      stderr: 0.0052245703347459605
	      verified: false
	    - name: exact_match,none (gsm8k_0shot_instruct)
	      type: exact_match
	      value: 0.9052312357846853
	      stderr: 0.008067791560015407
	      verified: false
	    - name: exact_match,none (meta_math_0shot_instruct)
	      type: exact_match
	      value: 0.6276
	      stderr: 0.006837616441401548
	      verified: false
	    - name: exact_match,none (meta_math_hard_0shot_instruct)
	      type: exact_match
	      value: 0.3806646525679758
	      stderr: 0.013349170720370741
	      verified: false
	  - task:
	      type: original-capability
	    dataset:
	      type: meta/Llama-3.1-8B-Instruct-evals
	      name: Llama-3.1-8B-Instruct-evals Dataset
	      dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
	      dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
	    metrics:
	    - name: exact_match,strict-match
	      type: exact_match
	      value: 0.5723263625528227
	      stderr: 0.002858377993520894
	      verified: false
	    - name: exact_match,strict-match (meta_arc_0shot_instruct)
	      type: exact_match
	      value: 0.7974248927038626
	      stderr: 0.01178043813618557
	      verified: false
	    - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
	      type: exact_match
	      value: 0.25223214285714285
	      stderr: 0.02054139101648797
	      verified: false
	    - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
	      type: exact_match
	      value: 0.6837345107534539
	      stderr: 0.0039243761987253515
	      verified: false
	    - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
	      type: exact_match
	      value: 0.4324301861702128
	      stderr: 0.004516653585262379
	      verified: false
	pipeline_tag: text-generation
	library_name: transformers
	---

	# Control-LLM-Llama3.1-8B-Math16
	This model is Llama-3.1-8B-Instruct fine-tuned for mathematical tasks on the OpenMathInstruct-2 (OpenMath2) dataset, as described in the paper [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).

	## Linked Paper
	This model is associated with the paper: [Control-LLM](https://arxiv.org/abs/2501.10979).

	## Linked Open Source Code: Training, Evaluation, and Benchmarks
	The training, evaluation, and benchmark code for this model is available on GitHub: [Control-LLM](https://github.com/linkedin/ControlLLM).

	## Evaluation Results
	Here is an overview of the evaluation results and findings:

	### Benchmark Results and Catastrophic Forgetting on OpenMath
	The following plot shows benchmark results and the mitigation of catastrophic forgetting on the OpenMath2 dataset.

	![Catastrophic Forgetting](plots/catastrophic_forgetting_openmath.png)

	### Alignment Comparison
	The plot below compares the alignment of models trained with Control LLM and with full-parameter tuning.

	![Alignment Comparison](plots/alignment_comparison.png)

	### Benchmark Results Table
	The table below summarizes evaluation results across mathematical tasks and original capabilities.

	| Model            | MH   | M    | G8K  | M-Avg | ARC  | GPQA | MLU  | MLUP | O-Avg | Overall |
	|------------------|------|------|------|-------|------|------|------|------|-------|---------|
	| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1  | 83.4 | 29.9 | 72.4 | 46.7 | 60.5  | 56.3    |
	| OpenMath2-Llama3 | 38.4 | 64.1 | 90.3 | 64.3  | 45.8 | 1.3  | 4.5  | 19.5 | 12.9  | 38.6    |
	| Full Tune        | 38.5 | 63.7 | 90.2 | 63.9  | 58.2 | 1.1  | 7.3  | 23.5 | 16.5  | 40.1    |
	| Partial Tune     | 36.4 | 61.4 | 89.0 | 61.8  | 66.2 | 6.0  | 25.7 | 30.9 | 29.3  | 45.6    |
	| Stack Exp.       | 35.6 | 61.0 | 90.8 | 61.8  | 69.3 | 18.8 | 61.8 | 43.1 | 53.3  | 57.6    |
	| Hybrid Exp.      | 34.4 | 61.1 | 90.1 | 61.5  | 81.8 | 25.9 | 67.2 | 43.9 | 57.1  | 59.3    |
	| Control LLM*     | 38.1 | 62.7 | 90.4 | 63.2  | 79.7 | 25.2 | 68.1 | 43.6 | 57.2  | 60.2    |

	---
	### Explanation:
	- MH: MathHard
	- M: Math
	- G8K: GSM8K
	- M-Avg: Math - Average across MathHard, Math, and GSM8K
	- ARC: AI2 Reasoning Challenge (ARC-Challenge)
	- GPQA: Graduate-Level Google-Proof Q&A benchmark
	- MLU: MMLU (Massive Multitask Language Understanding)
	- MLUP: MMLU Pro
	- O-Avg: Original Capability - Average across ARC, GPQA, MMLU, and MMLUP
	- Overall: Combined average across all tasks
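
The aggregate math score in this card's metadata (`exact_match,none`, 0.6327) is lower than the unweighted mean of the three math benchmarks, which suggests that the average is weighted by benchmark size rather than taken over the three scores directly. The sketch below checks that reading. The sample counts are not stated in this card: they are inferred from the reported `stderr` values under the assumption that `stderr = sqrt(p * (1 - p) / (n - 1))`, and the helper names are illustrative rather than taken from the Control-LLM code.

```python
# Per-benchmark (accuracy, stderr) pairs copied from this card's metadata.
reported = {
    "gsm8k": (0.9052312357846853, 0.008067791560015407),
    "math": (0.6276, 0.006837616441401548),
    "math_hard": (0.3806646525679758, 0.013349170720370741),
}


def infer_sample_count(accuracy: float, stderr: float) -> int:
    """Estimate n by inverting stderr = sqrt(p * (1 - p) / (n - 1)).

    The formula is an assumption about how stderr was computed,
    not something this card states.
    """
    return round(accuracy * (1.0 - accuracy) / stderr**2 + 1)


def pooled_accuracy(results: dict) -> float:
    """Sample-weighted accuracy: total correct answers over total examples."""
    total_correct = 0
    total_examples = 0
    for accuracy, stderr in results.values():
        n = infer_sample_count(accuracy, stderr)
        total_correct += round(accuracy * n)
        total_examples += n
    return total_correct / total_examples


sample_counts = {
    name: infer_sample_count(p, se) for name, (p, se) in reported.items()
}
simple_mean = sum(p for p, _ in reported.values()) / len(reported)
pooled = pooled_accuracy(reported)

print(sample_counts)
print(f"unweighted mean: {simple_mean:.4f}")
print(f"pooled accuracy: {pooled:.4f}")
```

Under these assumptions the pooled accuracy reproduces the aggregate 0.6327 from the metadata, while the unweighted mean comes out near 0.6378. The M-Avg column in the table above is consistent with the same sample-weighted reading.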