---
model-index:
  - name: jiazhengli/Llama-3.1-8B-RoleMRC-sft
    results: []
datasets:
  - Junrulu/RoleMRC
language:
  - en
base_model: meta-llama/Meta-Llama-3.1-8B
license: llama3
---

# Model Card for Llama-3.1-8B-RoleMRC-sft

This repository provides a fine-tuned version of Llama-3.1-8B, trained on our proposed RoleMRC dataset. We comply with all licenses specified in the Llama 3 release.

## Performance

### Reference-based Evaluation Result

| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | METEOR | BERTScore F1 |
|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B-Instruct | 0.0226 | 0.2277 | 0.0615 | 0.1509 | 0.1650 | 0.2594 | 0.8478 |
| LLaMA3.1-70B-Instruct | 0.0232 | 0.2258 | 0.0646 | 0.1500 | 0.1661 | 0.2632 | 0.8480 |
| LLaMA3.1-8B-RoleMRC-SFT | 0.1782 | 0.4628 | 0.2676 | 0.3843 | 0.3853 | 0.3975 | 0.8831 |
| LLaMA3.1-8B-RoleMRC-DPO | 0.1056 | 0.3989 | 0.1785 | 0.2988 | 0.3001 | 0.4051 | 0.8805 |

### General Benchmark

| Model | GSM8K (8-shot) | Math (4-shot) | GPQA (0-shot) | IFEval (3-shot) | MMLU-Pro (5-shot) | MMLU (0-shot) | PiQA (3-shot) | MUSR (0-shot) | TruthfulQA (3-shot) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B | 48.98 | 17.78 | 12.5 | 16.67 | 35.21 | 63.27 | 81.77 | 38.1 | 28.52 | 38.09 |
| LLaMA3.1-8B-Instruct | 77.41 | 34.1 | 12.72 | 57.67 | 40.77 | 68.1 | 82.1 | 39.81 | 36.47 | 49.91 |
| LLaMA3.1-8B-RoleMRC-SFT | 56.18 | 12.78 | 19.64 | 42.09 | 31.58 | 59.3 | 82.64 | 40.34 | 35.01 | 42.17 |
| LLaMA3.1-8B-RoleMRC-DPO | 58.53 | 13.5 | 20.09 | 46.64 | 31.8 | 59.96 | 82.7 | 39.42 | 37.33 | 43.33 |

### Evaluation Details

Five conditional benchmarks, using lm-evaluation-harness (an example invocation is shown after this list):

- GSM8K: 8-shot, report strict match
- IFEval: 3-shot, report instruction-level strict accuracy
- PiQA: 3-shot, report accuracy
- MMLU: 0-shot, report normalized accuracy
- TruthfulQA: 3-shot, report accuracy of the single-true mc1 setting
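
For illustration, a typical lm-evaluation-harness invocation for one of these settings might look like the following; the exact task name, dtype, and batch size are assumptions, not taken from the model card.

```bash
# Hypothetical example: GSM8K, 8-shot, strict match, on this repository's checkpoint.
lm_eval --model hf \
  --model_args pretrained=jiazhengli/Llama-3.1-8B-RoleMRC-sft,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8
```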

One open-ended benchmark, using the official alpaca_eval toolkit (an example invocation is shown after this list):

- AlpacaEval2: win rate (%) of the model's outputs against GPT-4-turbo's responses, judged by GPT-4-turbo
- LC AlpacaEval2: length-debiased win rate (%) of AlpacaEval2
- Length in Tokens: the average output length on AlpacaEval2, measured in tokens with Llama 3's tokenizer
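
A sketch of how such an evaluation is typically run with the alpaca_eval CLI is shown below; the outputs file and annotator config name are illustrative assumptions, not the authors' exact setup.

```bash
# Hypothetical example: judge pre-generated model outputs with the GPT-4-turbo annotator.
# outputs.json is assumed to contain this model's generations for the AlpacaEval prompts.
alpaca_eval --model_outputs outputs.json \
  --annotators_config weighted_alpaca_eval_gpt4_turbo
```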

## Input Format

The model is trained to use the following format:

```
<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
```
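
As a minimal sketch (not part of the original card), the template above can be applied manually with transformers; the example prompt below is invented for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jiazhengli/Llama-3.1-8B-RoleMRC-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt in the trained input format.
prompt = "You are a librarian. Answer strictly based on the given passage."
text = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{prompt}<|eot_id|>\n"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```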

## Training hyperparameters

The following hyperparameters were used during DPO/SamPO training (an illustrative configuration sketch follows this list):

- learning_rate: 1e-5
- total_train_batch_size: 16
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.04
- num_epochs: 1.0
- The input format above is additionally applied to all training samples
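
SamPO is the authors' own preference-optimization method and its training code is not reproduced here; as a rough, hypothetical illustration only, the listed hyperparameters map onto a standard TRL `DPOTrainer` setup roughly as follows (the batch-size split, toy dataset, and all other names are assumptions, not the authors' configuration).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "jiazhengli/Llama-3.1-8B-RoleMRC-sft"  # assumed SFT starting point
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy preference data in the trained input format (placeholder, not RoleMRC itself).
train_dataset = Dataset.from_dict({
    "prompt": [
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "Who wrote Hamlet?<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ],
    "chosen": ["William Shakespeare wrote Hamlet."],
    "rejected": ["I am not sure."],
})

config = DPOConfig(
    output_dir="llama-3.1-8b-rolemrc-dpo",
    learning_rate=1e-5,
    per_device_train_batch_size=2,    # 2 x 8 gradient accumulation = 16 total (assumed split)
    gradient_accumulation_steps=8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.04,
    num_train_epochs=1.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```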