---
model-index:
  - name: jiazhengli/Llama-3.1-8B-RoleMRC-sft
    results: []
datasets:
  - Junrulu/RoleMRC
language:
  - en
base_model: meta-llama/Meta-Llama-3.1-8B
license: llama3
---

# Model Card for Llama-3.1-8B-RoleMRC-sft

This repository provides a fine-tuned version of Llama-3.1-8B, trained on our proposed RoleMRC dataset. We comply with all licenses specified in the Llama 3 release.

## Performance

### Reference-based Evaluation Result

| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | METEOR | BERTScore F1 |
|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B-Instruct | 0.0226 | 0.2277 | 0.0615 | 0.1509 | 0.1650 | 0.2594 | 0.8478 |
| LLaMA3.1-70B-Instruct | 0.0232 | 0.2258 | 0.0646 | 0.1500 | 0.1661 | 0.2632 | 0.8480 |
| LLaMA3.1-8B-RoleMRC-SFT | 0.1782 | 0.4628 | 0.2676 | 0.3843 | 0.3853 | 0.3975 | 0.8831 |
| LLaMA3.1-8B-RoleMRC-DPO | 0.1056 | 0.3989 | 0.1785 | 0.2988 | 0.3001 | 0.4051 | 0.8805 |

### General Benchmark

| Model | GSM8K (8-shot) | Math (4-shot) | GPQA (0-shot) | IFEval (3-shot) | MMLU-Pro (5-shot) | MMLU (0-shot) | PiQA (3-shot) | MUSR (0-shot) | TruthfulQA (3-shot) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B | 48.98 | 17.78 | 12.5 | 16.67 | 35.21 | 63.27 | 81.77 | 38.1 | 28.52 | 38.09 |
| LLaMA3.1-8B-Instruct | 77.41 | 34.1 | 12.72 | 57.67 | 40.77 | 68.1 | 82.1 | 39.81 | 36.47 | 49.91 |
| LLaMA3.1-8B-RoleMRC-SFT | 56.18 | 12.78 | 19.64 | 42.09 | 31.58 | 59.3 | 82.64 | 40.34 | 35.01 | 42.17 |
| LLaMA3.1-8B-RoleMRC-DPO | 58.53 | 13.5 | 20.09 | 46.64 | 31.8 | 59.96 | 82.7 | 39.42 | 37.33 | 43.33 |

### Evaluation Details

Five conditional benchmarks, using lm-evaluation-harness (an example invocation is shown after this list):

- GSM8K: 8-shot, report strict match
- IFEval: 3-shot, report instruction-level strict accuracy
- PiQA: 3-shot, report accuracy
- MMLU: 0-shot, report normalized accuracy
- TruthfulQA: 3-shot, report accuracy of the single-true mc1 setting
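
For illustration, a typical lm-evaluation-harness invocation for one of these settings might look like the following; the exact task name, dtype, and batch size are assumptions, not taken from the model card.

```bash
# Hypothetical example: GSM8K, 8-shot, strict match, on this repository's checkpoint.
lm_eval --model hf \
  --model_args pretrained=jiazhengli/Llama-3.1-8B-RoleMRC-sft,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8
```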

One open-ended benchmark, using the official alpaca_eval toolkit (an example invocation is shown after this list):

- AlpacaEval2: win rate (%) of the model's outputs against GPT-4-turbo's responses, judged by GPT-4-turbo
- LC AlpacaEval2: length-debiased win rate (%) of AlpacaEval2
- Length in Tokens: the average output length on AlpacaEval2, measured in tokens with Llama 3's tokenizer
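
A sketch of how such an evaluation is typically run with the alpaca_eval CLI is shown below; the outputs file and annotator config name are illustrative assumptions, not the authors' exact setup.

```bash
# Hypothetical example: judge pre-generated model outputs with the GPT-4-turbo annotator.
# outputs.json is assumed to contain this model's generations for the AlpacaEval prompts.
alpaca_eval --model_outputs outputs.json \
  --annotators_config weighted_alpaca_eval_gpt4_turbo
```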

## Input Format

The model is trained to use the following format:

```
<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
```
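
As a minimal sketch (not part of the original card), the template above can be applied manually with transformers; the example prompt below is invented for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jiazhengli/Llama-3.1-8B-RoleMRC-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt in the trained input format.
prompt = "You are a librarian. Answer strictly based on the given passage."
text = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{prompt}<|eot_id|>\n"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```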

## Training hyperparameters

The following hyperparameters were used during DPO/SamPO training (an illustrative configuration sketch follows this list):

- learning_rate: 1e-5
- total_train_batch_size: 16
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.04
- num_epochs: 1.0
- The input format above is additionally applied to all training samples
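
SamPO is the authors' own preference-optimization method and its training code is not reproduced here; as a rough, hypothetical illustration only, the listed hyperparameters map onto a standard TRL `DPOTrainer` setup roughly as follows (the batch-size split, toy dataset, and all other names are assumptions, not the authors' configuration).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "jiazhengli/Llama-3.1-8B-RoleMRC-sft"  # assumed SFT starting point
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy preference data in the trained input format (placeholder, not RoleMRC itself).
train_dataset = Dataset.from_dict({
    "prompt": [
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "Who wrote Hamlet?<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ],
    "chosen": ["William Shakespeare wrote Hamlet."],
    "rejected": ["I am not sure."],
})

config = DPOConfig(
    output_dir="llama-3.1-8b-rolemrc-dpo",
    learning_rate=1e-5,
    per_device_train_batch_size=2,    # 2 x 8 gradient accumulation = 16 total (assumed split)
    gradient_accumulation_steps=8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.04,
    num_train_epochs=1.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```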