Model Card for Llama-3.1-8B-RoleMRC-sft

This repository provides a version of Llama-3.1-8B fine-tuned on our proposed RoleMRC dataset. We comply with all license terms specified by Llama 3.

Performance

Reference-based Evaluation Result

Model	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	ROUGE-Lsum	METEOR	BERTScore F1
LLaMA3.1-8B-Instruct	0.0226	0.2277	0.0615	0.1509	0.1650	0.2594	0.8478
LLaMA3.1-70B-Instruct	0.0232	0.2258	0.0646	0.1500	0.1661	0.2632	0.8480
LLaMA3.1-8B-RoleMRC-SFT	0.1782	0.4628	0.2676	0.3843	0.3853	0.3975	0.8831
LLaMA3.1-8B-RoleMRC-DPO	0.1056	0.3989	0.1785	0.2988	0.3001	0.4051	0.8805

General Benchmark

Model	GSM8K 8-shot	Math 4-shot	GPQA 0-shot	IFEval 3-shot	MMLU-Pro 5-shot	MMLU 0-shot	PiQA 3-shot	MUSR 0-shot	TruthfulQA 3-shot	Avg.
LLaMA3.1-8B	48.98	17.78	12.5	16.67	35.21	63.27	81.77	38.1	28.52	38.09
LLaMA3.1-8B-Instruct	77.41	34.1	12.72	57.67	40.77	68.1	82.1	39.81	36.47	49.91
LLaMA3.1-8B-RoleMRC-SFT	56.18	12.78	19.64	42.09	31.58	59.3	82.64	40.34	35.01	42.17
LLaMA3.1-8B-RoleMRC-DPO	58.53	13.5	20.09	46.64	31.8	59.96	82.7	39.42	37.33	43.33
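
The Avg. column appears to be the unweighted mean of the nine benchmark scores in each row. The short Python check below recomputes it from the values in the table; the dictionary layout and variable names are only illustrative and are not part of the original evaluation code.

```python
from statistics import mean

# Scores copied from the General Benchmark table, in column order
# (GSM8K, Math, GPQA, IFEval, MMLU-Pro, MMLU, PiQA, MUSR, TruthfulQA).
# The Avg. column itself is excluded.
scores = {
    "LLaMA3.1-8B": [48.98, 17.78, 12.5, 16.67, 35.21, 63.27, 81.77, 38.1, 28.52],
    "LLaMA3.1-8B-Instruct": [77.41, 34.1, 12.72, 57.67, 40.77, 68.1, 82.1, 39.81, 36.47],
    "LLaMA3.1-8B-RoleMRC-SFT": [56.18, 12.78, 19.64, 42.09, 31.58, 59.3, 82.64, 40.34, 35.01],
    "LLaMA3.1-8B-RoleMRC-DPO": [58.53, 13.5, 20.09, 46.64, 31.8, 59.96, 82.7, 39.42, 37.33],
}

# Assumption: a plain arithmetic mean over the nine tasks, rounded to two decimals.
averages = {model: round(mean(values), 2) for model, values in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Under this assumption the recomputed values match the reported averages for all four rows.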

Five conditional benchmarks, using lm-evaluation-harness:

The model is trained to use the following format:

<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
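
As a minimal sketch of how the format above could be assembled by hand, the helper below fills in the user turn and leaves the assistant header open so the model generates the response. The function name `build_prompt` is hypothetical, and the whitespace follows the template exactly as printed here. Whether a beginning-of-text token needs to be prepended is an assumption left to the tokenizer in use.

```python
def build_prompt(user_message: str) -> str:
    """Format a single-turn prompt following the template shown above.

    Hypothetical helper for illustration; it is not part of the released code.
    """
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>\n"
        # The assistant header is left open; the model's output fills {Response}.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_prompt("Stay in character as a ship captain and greet the crew.")
print(prompt)
```

If the tokenizer shipped with the checkpoint defines a chat template, using that template instead of manual string formatting is likely the safer choice, since it keeps special tokens and whitespace consistent with training.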

The following hyperparameters were used during training: