---
model-index:
- name: jiazhengli/Llama-3.1-8B-RoleMRC-sft
  results: []
datasets:
- Junrulu/RoleMRC
language:
- en
base_model: meta-llama/Meta-Llama-3.1-8B
license: llama3
---

# Model Card for Llama-3.1-8B-RoleMRC-sft

This repository provides a fine-tuned version of Llama-3.1-8B, trained on our proposed [RoleMRC dataset](https://huggingface.co/datasets/Junrulu/RoleMRC). We comply with all license terms of the Llama 3 release.

## Performance

Reference-based evaluation results

| Model                       | BLEU   | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | METEOR | BERTScore F1 |
|-----------------------------|--------|---------|---------|---------|------------|--------|--------------|
| LLaMA3.1-8B-Instruct        | 0.0226 | 0.2277  | 0.0615  | 0.1509  | 0.1650     | 0.2594 | 0.8478       |
| LLaMA3.1-70B-Instruct       | 0.0232 | 0.2258  | 0.0646  | 0.1500  | 0.1661     | 0.2632 | 0.8480       |
| **LLaMA3.1-8B-RoleMRC-SFT** | 0.1782 | 0.4628  | 0.2676  | 0.3843  | 0.3853     | 0.3975 | 0.8831       |
| LLaMA3.1-8B-RoleMRC-DPO     | 0.1056 | 0.3989  | 0.1785  | 0.2988  | 0.3001     | 0.4051 | 0.8805       |

General benchmark results

| Model                       | GSM8K 8-shot | Math 4-shot | GPQA 0-shot | IFEval 3-shot | MMLU-Pro 5-shot | MMLU 0-shot | PiQA 3-shot | MUSR 0-shot | TruthfulQA 3-shot | Avg.  |
|-----------------------------|--------------|-------------|-------------|---------------|-----------------|-------------|-------------|-------------|-------------------|-------|
| LLaMA3.1-8B                 | 48.98        | 17.78       | 12.5        | 16.67         | 35.21           | 63.27       | 81.77       | 38.1        | 28.52             | 38.09 |
| LLaMA3.1-8B-Instruct        | 77.41        | 34.1        | 12.72       | 57.67         | 40.77           | 68.1        | 82.1        | 39.81       | 36.47             | 49.91 |
| **LLaMA3.1-8B-RoleMRC-SFT** | 56.18        | 12.78       | 19.64       | 42.09         | 31.58           | 59.3        | 82.64       | 40.34       | 35.01             | 42.17 |
| LLaMA3.1-8B-RoleMRC-DPO     | 58.53        | 13.5        | 20.09       | 46.64         | 31.8            | 59.96       | 82.7        | 39.42       | 37.33             | 43.33 |
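
The average column in the table above appears to be the unweighted mean of the nine benchmark scores in each row. The short check below reproduces it under that assumption; the dictionary and function names are illustrative only, and the numbers are copied from the table.

```python
# Scores copied from the general benchmark table, in column order:
# GSM8K, Math, GPQA, IFEval, MMLU-Pro, MMLU, PiQA, MUSR, TruthfulQA.
SCORES = {
    "LLaMA3.1-8B": [48.98, 17.78, 12.5, 16.67, 35.21, 63.27, 81.77, 38.1, 28.52],
    "LLaMA3.1-8B-Instruct": [77.41, 34.1, 12.72, 57.67, 40.77, 68.1, 82.1, 39.81, 36.47],
    "LLaMA3.1-8B-RoleMRC-SFT": [56.18, 12.78, 19.64, 42.09, 31.58, 59.3, 82.64, 40.34, 35.01],
    "LLaMA3.1-8B-RoleMRC-DPO": [58.53, 13.5, 20.09, 46.64, 31.8, 59.96, 82.7, 39.42, 37.33],
}


def unweighted_average(scores):
    """Assumed aggregation: plain mean over all benchmarks, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {unweighted_average(scores)}")
```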

## Evaluation Details
The following five benchmarks are evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
- GSM8K: 8-shot, report strict match
- IFEval: 3-shot, report instruction-level strict accuracy
- PiQA: 3-shot, report accuracy
- MMLU: 0-shot, report normalized accuracy
- TruthfulQA: 3-shot, report accuracy of single-true mc1 setting
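
As a rough sketch of how the settings listed above could be passed to the harness, the snippet below assembles one `lm_eval` command line per benchmark. The task identifiers (`gsm8k`, `ifeval`, `piqa`, `mmlu`, `truthfulqa_mc1`) are assumptions about which harness tasks correspond to the benchmarks named here, and the model id is taken from this card's metadata; the exact invocation used for the reported numbers is not stated in this card.

```python
import shlex

# Assumed mapping from benchmark name to (harness task id, number of shots).
# The shot counts come from the list above; the task ids are assumptions.
BENCHMARKS = {
    "GSM8K": ("gsm8k", 8),
    "IFEval": ("ifeval", 3),
    "PiQA": ("piqa", 3),
    "MMLU": ("mmlu", 0),
    "TruthfulQA": ("truthfulqa_mc1", 3),
}

MODEL_ID = "jiazhengli/Llama-3.1-8B-RoleMRC-sft"


def build_command(task, num_fewshot, model_id=MODEL_ID):
    """Return the argv list for a single lm-evaluation-harness run."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for name, (task, shots) in BENCHMARKS.items():
        print(f"# {name}")
        print(shlex.join(build_command(task, shots)))
```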

## Input Format

The model is trained to use the following format:
```
<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
```
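
A minimal sketch of building a prompt string in the format shown above. The helper name `build_prompt` is hypothetical, and the sketch does not handle any special tokens that a tokenizer may prepend (such as a beginning-of-text token); it only reproduces the template as written in this card.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the template above, ending where the response starts."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


if __name__ == "__main__":
    print(build_prompt("Who are you?"))
```

The returned string stops right after the assistant header, so the model's generated text takes the place of `{Response}`.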

## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-5
- total_train_batch_size: 16
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.04
- num_epochs: 1.0
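
To illustrate what a cosine schedule with a 0.04 warmup ratio implies for the learning rate, here is a small sketch. It assumes linear warmup from zero to the peak rate followed by cosine decay to zero, which is a common convention but is not confirmed by this card; `total_steps` is a hypothetical value, since the number of optimizer steps is not reported.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_ratio=0.04):
    """Learning rate at a given optimizer step (assumed linear warmup, cosine decay to 0)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps = 1000  # hypothetical; not taken from the card
    for step in (0, 20, 40, 520, 1000):
        print(step, lr_at_step(step, total_steps))
```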