Model Card for Qwen2.5-7B-RoleMRC-dpo

This repository provides a version of Qwen2.5-7B fine-tuned on our proposed RoleMRC dataset. We comply with all licenses stated in Qwen2's work.

Performance

Reference-based Evaluation Result

Model	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	ROUGE-Lsum	METEOR	BERTScore F1
Qwen2.5-7B-Instruct	0.0224	0.2283	0.0621	0.1518	0.1599	0.2490	0.8471
Qwen2.5-72B-Instruct	0.0245	0.2350	0.0656	0.1554	0.1660	0.2579	0.8485
Qwen2.5-7B-RoleMRC-SFT	0.1963	0.4764	0.2744	0.3959	0.3968	0.4337	0.9063
Qwen2.5-7B-RoleMRC-DPO	0.1244	0.4178	0.1916	0.3164	0.3177	0.4205	0.8931

General Benchmark

Model	GSM8K 8-shot	Math 4-shot	GPQA 0-shot	IFEval 3-shot	MMLU-Pro 5-shot	MMLU 0-shot	PiQA 3-shot	MUSR 0-shot	TruthfulQA 3-shot	Avg.
Qwen2.5-7B	78.7	36.78	16.74	38.25	44.87	71.75	81.23	44.31	38.8	50.16
Qwen2.5-7B-Instruct	81.2	40.28	13.39	65.71	40.85	71.76	80.25	42.86	47.86	53.8
Qwen2.5-7B-RoleMRC-SFT	78.54	32.7	16.52	42.81	43.43	71.19	80.63	45.11	37.58	49.83
Qwen2.5-7B-RoleMRC-DPO	79.38	32.72	18.97	47.96	43.39	71.21	80.36	45.37	39.41	50.97
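
The Avg. column appears to be the unweighted mean of the nine benchmark scores in each row. The short sketch below recomputes it from the values in the table; the names SCORES and average are illustrative only and are not part of any released code.

```python
# Scores copied from the General Benchmark table, in column order:
# GSM8K, Math, GPQA, IFEval, MMLU-Pro, MMLU, PiQA, MUSR, TruthfulQA.
SCORES = {
    "Qwen2.5-7B": [78.7, 36.78, 16.74, 38.25, 44.87, 71.75, 81.23, 44.31, 38.8],
    "Qwen2.5-7B-Instruct": [81.2, 40.28, 13.39, 65.71, 40.85, 71.76, 80.25, 42.86, 47.86],
    "Qwen2.5-7B-RoleMRC-SFT": [78.54, 32.7, 16.52, 42.81, 43.43, 71.19, 80.63, 45.11, 37.58],
    "Qwen2.5-7B-RoleMRC-DPO": [79.38, 32.72, 18.97, 47.96, 43.39, 71.21, 80.36, 45.37, 39.41],
}


def average(scores):
    """Return the unweighted mean of the scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {average(scores)}")
```

Under this assumption the recomputed means agree with the reported Avg. values for all four rows.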

Five conditional benchmarks, using lm-evaluation-harness:

The model is trained to use the following format:

<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
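
As a minimal sketch, the format above can be assembled with plain string formatting. The helper name build_prompt is hypothetical, and the exact newline placement is an assumption read off the template as displayed; if the tokenizer shipped with the model defines a chat template, that is likely the safer route.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the prompt format shown above.

    The returned string ends right where the model is expected to
    start generating its response. Whitespace is assumed from the
    template as printed in this card.
    """
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


if __name__ == "__main__":
    # Example message chosen purely for illustration.
    print(build_prompt("Stay in character and answer using the passage."))
```

The generated text then takes the place of {Response} in the template.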

The following hyperparameters were used during training: