---
model-index:
- name: jiazhengli/Qwen2.5-7B-RoleMRC-sft
  results: []
datasets:
- Junrulu/RoleMRC
language:
- en
base_model: Qwen/Qwen2.5-7B
license: llama3
---

# Model Card for Qwen2.5-7B-RoleMRC-sft

This repository provides Qwen2.5-7B fine-tuned on our proposed [RoleMRC dataset](https://huggingface.co/datasets/Junrulu/RoleMRC). We comply with all license terms stated in Qwen 2's work.

## Performance

Reference-based evaluation results

| Model                      | BLEU   | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | METEOR | BERTScore F1 |
|----------------------------|--------|---------|---------|---------|------------|--------|--------------|
| Qwen2.5-7B-Instruct        | 0.0224 | 0.2283  | 0.0621  | 0.1518  | 0.1599     | 0.2490 | 0.8471       |
| Qwen2.5-72B-Instruct       | 0.0245 | 0.2350  | 0.0656  | 0.1554  | 0.1660     | 0.2579 | 0.8485       |
| **Qwen2.5-7B-RoleMRC-SFT** | 0.1963 | 0.4764  | 0.2744  | 0.3959  | 0.3968     | 0.4337 | 0.9063       |
| Qwen2.5-7B-RoleMRC-DPO     | 0.1244 | 0.4178  | 0.1916  | 0.3164  | 0.3177     | 0.4205 | 0.8931       |

General benchmarks

| Model                      | GSM8K 8-shot | Math 4-shot | GPQA 0-shot | IFEval 3-shot | MMLU-Pro 5-shot | MMLU 0-shot | PiQA 3-shot | MUSR 0-shot | TruthfulQA 3-shot | Avg.  |
|----------------------------|--------------|-------------|-------------|---------------|-----------------|-------------|-------------|-------------|-------------------|-------|
| Qwen2.5-7B                 | 78.7         | 36.78       | 16.74       | 38.25         | 44.87           | 71.75       | 81.23       | 44.31       | 38.8              | 50.16 |
| Qwen2.5-7B-Instruct        | 81.2         | 40.28       | 13.39       | 65.71         | 40.85           | 71.76       | 80.25       | 42.86       | 47.86             | 53.8  |
| **Qwen2.5-7B-RoleMRC-SFT** | 78.54        | 32.7        | 16.52       | 42.81         | 43.43           | 71.19       | 80.63       | 45.11       | 37.58             | 49.83 |
| Qwen2.5-7B-RoleMRC-DPO     | 79.38        | 32.72       | 18.97       | 47.96         | 43.39           | 71.21       | 80.36       | 45.37       | 39.41             | 50.97 |
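
The last column of the table is consistent with the unweighted mean of the nine benchmark scores in each row. The sketch below is a quick check of that reading, not part of the evaluation pipeline: the scores are copied from the table, and the `SCORES` dictionary and `average` helper are names introduced here for illustration.

```python
# Scores copied from the general benchmark table, in column order:
# GSM8K, Math, GPQA, IFEval, MMLU-Pro, MMLU, PiQA, MUSR, TruthfulQA.
SCORES = {
    "Qwen2.5-7B": [78.7, 36.78, 16.74, 38.25, 44.87, 71.75, 81.23, 44.31, 38.8],
    "Qwen2.5-7B-Instruct": [81.2, 40.28, 13.39, 65.71, 40.85, 71.76, 80.25, 42.86, 47.86],
    "Qwen2.5-7B-RoleMRC-SFT": [78.54, 32.7, 16.52, 42.81, 43.43, 71.19, 80.63, 45.11, 37.58],
    "Qwen2.5-7B-RoleMRC-DPO": [79.38, 32.72, 18.97, 47.96, 43.39, 71.21, 80.36, 45.37, 39.41],
}


def average(scores):
    """Unweighted mean of the per-benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {average(scores)}")
```

Running this reproduces the reported averages of 50.16, 53.8, 49.83, and 50.97.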

## Evaluation Details
The following five benchmarks were run with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) under these settings:
- GSM8K: 8-shot, report strict match
- IFEval: 3-shot, report instruction-level strict accuracy
- PiQA: 3-shot, report accuracy
- MMLU: 0-shot, report normalized accuracy
- TruthfulQA: 3-shot, report accuracy of single-true mc1 setting

## Input Format

The model is trained to use the following format:
```
<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
```
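
As a minimal sketch, the template above can be assembled with plain string formatting. The helper name `build_prompt` is hypothetical, and the exact whitespace (two newlines after each header, one newline after `<|eot_id|>`) is read directly off the template; whether the tokenizer also prepends a beginning-of-sequence token is not stated here and is left out.

```python
# Special-token strings taken verbatim from the template above.
USER_HEADER = "<|start_header_id|>user<|end_header_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def build_prompt(prompt: str) -> str:
    """Wrap a single user prompt in the format shown above.

    The returned string ends right after the assistant header, so the
    model's generated text fills the {Response} slot.
    """
    return f"{USER_HEADER}\n\n{prompt}{END_OF_TURN}\n{ASSISTANT_HEADER}\n\n"


if __name__ == "__main__":
    print(build_prompt("Who are you?"))
```

The resulting string can be passed to whichever generation API is in use; generation would typically stop at the next `<|eot_id|>`.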

## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-5
- total_train_batch_size: 16
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.04
- num_epochs: 1.0
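
To illustrate how the learning rate, cosine scheduler, and warmup ratio listed above interact, here is a sketch of one common form of the schedule: linear warmup from zero to the peak rate, then cosine decay to zero. The linear warmup shape, the decay to exactly zero, the rounding of the warmup step count, and the `total_steps` value are all assumptions made for illustration; the training framework actually used may differ in these details.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.04):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    base_lr and warmup_ratio default to the values listed above;
    total_steps is a hypothetical input that depends on dataset size.
    """
    # Assumption: the warmup length is the rounded fraction of total steps.
    warmup_steps = int(round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 20, 40, 520, 1000):
        print(s, lr_at_step(s, total))
```

With 1000 total steps, the first 40 steps are warmup, the peak of 1e-5 is reached at step 40, and the rate falls to half the peak midway through the decay phase (step 520).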