jiazhengli/Llama-3.1-8B-RoleMRC-sft


jiazhengli committed 1 day ago

Commit 869895d (verified) · 1 Parent(s): 2eaa010

Create README.md


Files changed (1)

README.md +71 -0

README.md ADDED

	@@ -0,0 +1,71 @@

+---
+model-index:
+- name: jiazhengli/Llama-3.1-8B-RoleMRC-sft
+  results: []
+datasets:
+- Junrulu/RoleMRC
+language:
+- en
+base_model: meta-llama/Meta-Llama-3.1-8B
+license: llama3
+---
+# Model Card for Llama-3.1-8B-RoleMRC-sft
+This repository provides a version of Llama-3.1-8B fine-tuned on our proposed [RoleMRC dataset](https://huggingface.co/datasets/Junrulu/RoleMRC). We comply with all license terms that apply to Llama 3.
+## Performance
+Reference-based evaluation results
+| Model                       | BLEU   | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | METEOR | BERTScore F1 |
+|-----------------------------|--------|---------|---------|---------|------------|--------|--------------|
+| LLaMA3.1-8B-Instruct        | 0.0226 | 0.2277  | 0.0615  | 0.1509  | 0.1650     | 0.2594 | 0.8478       |
+| LLaMA3.1-70B-Instruct       | 0.0232 | 0.2258  | 0.0646  | 0.1500  | 0.1661     | 0.2632 | 0.8480       |
+| **LLaMA3.1-8B-RoleMRC-SFT** | 0.1782 | 0.4628  | 0.2676  | 0.3843  | 0.3853     | 0.3975 | 0.8831       |
+| LLaMA3.1-8B-RoleMRC-DPO     | 0.1056 | 0.3989  | 0.1785  | 0.2988  | 0.3001     | 0.4051 | 0.8805       |
+General benchmark results
+| Model                       | GSM8K 8-shot | Math 4-shot | GPQA 0-shot | IFEval 3-shot | MMLU-Pro 5-shot | MMLU 0-shot | PiQA 3-shot | MUSR 0-shot | TruthfulQA 3-shot / Avg. |
+|-----------------------------|--------------|-------------|-------------|---------------|-----------------|-------------|-------------|-------------|--------------------------|
+| LLaMA3.1-8B                 | 48.98        | 17.78       | 12.5        | 16.67         | 35.21           | 63.27       | 81.77       | 38.1        | 28.52                    |
+| LLaMA3.1-8B-Instruct        | 77.41        | 34.1        | 12.72       | 57.67         | 40.77           | 68.1        | 82.1        | 39.81       | 36.47                    |
+| **LLaMA3.1-8B-RoleMRC-SFT** | 56.18        | 12.78       | 19.64       | 42.09         | 31.58           | 59.3        | 82.64       | 40.34       | 35.01                    |
+| LLaMA3.1-8B-RoleMRC-DPO     | 58.53        | 13.5        | 20.09       | 46.64         | 31.8            | 59.96       | 82.7        | 39.42       | 37.33                    |
+## Evaluation Details
+Five conditional benchmarks, using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
+- GSM8K: 8-shot, report strict match
+- IFEval: 3-shot, report instruction-level strict accuracy
+- PiQA: 3-shot, report accuracy
+- MMLU: 0-shot, report normalized accuracy
+- TruthfulQA: 3-shot, report accuracy of single-true mc1 setting
+One open-ended benchmark, using official [alpaca_eval](https://github.com/tatsu-lab/alpaca_eval/):
+- AlpacaEval2: win rate (%) of the model's outputs against GPT-4-turbo's responses, as judged by GPT-4-turbo
+- LC AlpacaEval2: length-debiased win rate (%) on AlpacaEval2
+- Length in Tokens: average output length on AlpacaEval2, counted in tokens with the Llama 3 tokenizer
+## Input Format
+The model is trained to use the following format:
+```
+<|start_header_id|>user<|end_header_id|>
+{PROMPT}<|eot_id|>
+<|start_header_id|>assistant<|end_header_id|>
+{Response}
+```
+## Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 1e-5
+- total_train_batch_size: 16
+- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
+- lr_scheduler_type: cosine
+- lr_scheduler_warmup_ratio: 0.04
+- num_epochs: 1.0
+- The input format above is applied explicitly to every training sample
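
The template in the README's "Input Format" section can be illustrated with a small helper. This is only a minimal sketch: the function names `build_prompt` and `build_training_sample` are hypothetical, the newline placement is assumed to follow the README's layout literally, and the README does not say whether a beginning-of-text token or a trailing `<|eot_id|>` after the response is expected, so neither is added here.

```python
# Special tokens copied verbatim from the README's "Input Format" section.
USER_HEADER = "<|start_header_id|>user<|end_header_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def build_prompt(prompt: str) -> str:
    """Wrap a user prompt in the documented template.

    The returned string stops right where the model's response should begin.
    Newline placement mirrors the README layout and is an assumption.
    """
    return f"{USER_HEADER}\n{prompt}{END_OF_TURN}\n{ASSISTANT_HEADER}\n"


def build_training_sample(prompt: str, response: str) -> str:
    """Append a reference response to the formatted prompt (hypothetical helper)."""
    return build_prompt(prompt) + response


if __name__ == "__main__":
    print(build_training_sample("Who are you?", "I am a role-playing assistant."))
```

At inference time, the output of `build_prompt` would be passed to the tokenizer as plain text. If the tokenizer already inserts special tokens of its own, that behavior should be checked against this template before relying on it.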