---
license: llama3.1
datasets:
- survivi/Llama-3-SynE-Dataset
- hfl/stem_zh_instruction
- llamafactory/alpaca_zh
- llamafactory/alpaca_gpt4_zh
- hfl/ruozhiba_gpt4
- codingsteven/Llama-3-8B-chat
language:
- zh
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B
model-index:
- name: Control-LLM-Llama3.1-8B-SynE-Full-Parameter-Tuning
  results:
  - task:
      type: pretraining-evaluation
    dataset:
      type: mixed
      name: Pretraining Evaluation Dataset
    metrics:
    - name: exact_match,strict-match (meta_pretrain)
      type: exact_match
      value: 0.45445720757159036
      stderr: 0.0035036029889520047
      verified: false
    - name: exact_match,strict-match (meta_bbh_3shot_cot_pretrain)
      type: exact_match
      value: 0.6482875134387959
      stderr: 0.005918167158231359
      verified: false
    - name: acc,none (meta_mmlu_5shot_pretrain)
      type: accuracy
      value: 0.649480131035465
      stderr: 0.004026616190778244
      verified: false
    - name: exact_match,strict-match (meta_mmlu_pro_5shot_pretrain)
      type: exact_match
      value: 0.34956781914893614
      stderr: 0.004347262544061378
      verified: false
  - task:
      type: chinese-evaluation
    dataset:
      type: mixed
      name: Chinese Evaluation Dataset
    metrics:
    - name: acc,none (ceval-valid)
      type: accuracy
      value: 0.5898959881129272
      stderr: 0.012699457390113113
      verified: false
    - name: exact_match,strict-match (ceval-valid-pretrain-cot_zh)
      type: exact_match
      value: 0.40193164933135217
      stderr: 0.01265090064840271
      verified: false
    - name: acc,none (cmmlu)
      type: accuracy
      value: 0.6018822310481782
      stderr: 0.004420298073040671
      verified: false
    - name: exact_match,strict-match (cmmlu_pretrain_cot_zh)
      type: exact_match
      value: 0.4425833189431877
      stderr: 0.004506238417180843
      verified: false
pipeline_tag: text-generation
library_name: transformers
---
# Control-LLM-Llama3.1-8B-SynE-Full-Parameter-Tuning

This is a full-parameter fine-tune of Llama-3.1-8B for multilingual Chinese tasks, trained on the SynE dataset.
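
Below is a minimal generation sketch using the 🤗 Transformers library. `MODEL_ID` is a placeholder for this repository's full Hub id, and the dtype/device settings are one reasonable choice rather than a requirement:

```python
# Minimal text-generation sketch; MODEL_ID is a placeholder for this
# repository's full Hugging Face Hub id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Control-LLM-Llama3.1-8B-SynE-Full-Parameter-Tuning"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "请用中文简要介绍大语言模型。"  # "Briefly introduce large language models, in Chinese."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
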
## Linked Paper

This model is associated with the paper: [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).
## Linked Open-Source Code: Training, Evaluation, and Benchmarks

This model is associated with the GitHub repository: [Control-LLM](https://github.com/linkedin/ControlLLM).
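
The metric names in this card's metadata (e.g., `acc,none (ceval-valid)`, `exact_match,strict-match`) follow the output format of EleutherAI's lm-evaluation-harness. As a hedged reproduction sketch using the stock harness tasks (the custom `meta_*` and `*_pretrain_cot_zh` tasks are defined in the linked repository; `MODEL_ID` is a placeholder):

```python
# Reproduction sketch with lm-evaluation-harness (pip install lm-eval).
# ceval-valid and cmmlu are stock harness tasks; the meta_* / *_cot_zh
# variants reported in the metadata come from the linked Control-LLM repo,
# so numbers here should be comparable but not identical.
from lm_eval import simple_evaluate

MODEL_ID = "Control-LLM-Llama3.1-8B-SynE-Full-Parameter-Tuning"  # placeholder

results = simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL_ID},dtype=bfloat16",
    tasks=["ceval-valid", "cmmlu"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```
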
## Evaluation Results

Here is an overview of the evaluation results and findings:

### Benchmark Results Table

The table below summarizes evaluation results across Chinese tasks and original capabilities; the best score in each column is bold.

| **Model**            | **CEval** | **CEvalC** | **CMMLU** | **CMMLUC** | **C-Avg** | **BBH**  | **MLU**  | **MLUP** | **O-Avg** | **Overall** |
|----------------------|-----------|------------|-----------|------------|-----------|----------|----------|----------|-----------|-------------|
| Llama3.1-8B          | 48.3      | 12.8       | 51.1      | 14.1       | 13.9      | 65.2     | 65.4     | 35.5     | 45.9      | 29.9        |
| Llama-3-SynE         | 57.7      | 22.3       | 57.1      | 22.8       | 22.8      | 61.9     | 64.0     | 32.6     | 42.9      | 32.9        |
| **Full Param Tune**  | **59.0**  | 40.2       | **60.2**  | 44.3       | 43.8      | 64.8     | 64.9     | 35.0     | 45.4      | 44.6        |
| Stack Expansion      | 56.0      | 32.7       | 55.2      | 33.4       | 33.3      | 62.3     | 65.6     | 35.3     | 44.8      | 39.1        |
| Concat-Lerp*         | 57.1      | 34.8       | 57.0      | 37.4       | 37.1      | 64.4     | 64.6     | 35.8     | 45.9      | 41.5        |
| **Hybrid Expansion** | 58.9      | **44.7**   | 57.9      | 44.3       | 44.4      | 65.1     | **65.7** | 36.9     | 46.8      | 45.6        |
| **Control LLM***     | 57.0      | **44.7**   | 56.0      | **44.9**   | **44.8**  | **68.2** | 65.6     | **37.9** | **48.5**  | **46.7**    |
---

### Explanation
- **CEval**: Chinese Evaluation
- **CEvalC**: Chinese Evaluation (CoT, Chain of Thought)
- **CMMLU**: Chinese MMLU
- **CMMLUC**: Chinese MMLU (CoT)
- **C-Avg**: Chinese capability, size-weighted average across CEval, CEvalC, CMMLU, and CMMLUC
- **BBH**: BigBench Hard
- **MLU**: MMLU (Massive Multitask Language Understanding)
- **MLUP**: MMLU Pro
- **O-Avg**: Original capability, size-weighted average across BBH, MLU, and MLUP (see the weighted-average sketch after this list)
- **Overall**: Combined average across all tasks
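
To make "size-weighted average" concrete, here is a small sketch. The sample counts below are made-up placeholders, so the printed value will not match the table; the real weights are the benchmark sizes:

```python
# Size-weighted average, as used for C-Avg and O-Avg.
def size_weighted_avg(scores: dict, sizes: dict) -> float:
    """Weight each task's score by its number of samples."""
    total = sum(sizes.values())
    return sum(scores[task] * sizes[task] for task in scores) / total

# Full Param Tune scores from the table; the sizes are illustrative only.
scores = {"CEval": 59.0, "CEvalC": 40.2, "CMMLU": 60.2, "CMMLUC": 44.3}
sizes = {"CEval": 1000, "CEvalC": 1000, "CMMLU": 4000, "CMMLUC": 4000}

print(round(size_weighted_avg(scores, sizes), 1))  # 51.7 with these toy sizes
```
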
### Full Parameter Tuning on Chinese-SynE

The following plot illustrates the catastrophic forgetting caused by full-parameter tuning, measured as drift in hidden-state alignment.

![Catastrophic Forgetting](plots/full_param_tune_transformer_alignment.png)