---
library_name: peft
tags:
  - trl
  - dpo
  - generated_from_trainer
base_model: Meta-Llama-3-8B-Instruct
model-index:
  - name: dpo
    results: []
---

# dpo

This model is a PEFT adapter fine-tuned from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with TRL's DPO trainer; the preference dataset used is not documented in this card. It achieves the following results on the evaluation set:

- Loss: 0.6845
- Rewards/chosen: 0.3768
- Rewards/rejected: 0.3499
- Rewards/accuracies: 0.5373
- Rewards/margins: 0.0269
- Logps/rejected: -233.9739
- Logps/chosen: -254.7565
- Logits/rejected: -0.6114
- Logits/chosen: -0.7300
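
To try the adapter, here is a minimal loading sketch using standard `peft`/`transformers` APIs; the adapter repository id below is a placeholder, and the prompt is only an example:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the DPO-trained adapter; replace the placeholder with this repo's id.
model = PeftModel.from_pretrained(base, "<this-repo-id>")

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```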

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-06
- train_batch_size: 1
- eval_batch_size: 1
- seed: 0
- gradient_accumulation_steps: 2
- total_train_batch_size: 2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 2500
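
As a point of reference, here is a hedged sketch of how these settings map onto TRL's `DPOConfig`/`DPOTrainer` (API as of TRL releases contemporary with the Transformers version listed below). The dataset, LoRA settings, and `beta` are illustrative assumptions; none of them are recorded in this card:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The preference dataset is not documented in this card; any dataset with
# "prompt"/"chosen"/"rejected" columns works. This one is an example only.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="dpo",
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=2500,
    seed=0,
    beta=0.1,  # assumed; the card does not record the beta used
)

# LoRA settings below are illustrative; the card does not record them.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```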

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6689        | 0.0567 | 750  | 0.6852          | 0.2546         | 0.2335           | 0.5373             | 0.0211          | -235.1381      | -255.9783    | -0.6108         | -0.7325       |
| 0.6831        | 0.1133 | 1500 | 0.6835          | 0.3555         | 0.3270           | 0.5597             | 0.0285          | -234.2029      | -254.9690    | -0.6135         | -0.7317       |
| 0.6821        | 0.1700 | 2250 | 0.6855          | 0.3655         | 0.3411           | 0.5485             | 0.0243          | -234.0616      | -254.8697    | -0.6115         | -0.7293       |
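
For readers unfamiliar with the reward columns: in DPO, the implicit reward of a response is `beta * (log-prob under the policy - log-prob under the reference model)`, the margin is the chosen reward minus the rejected reward, and the accuracy is the fraction of pairs with a positive margin. A minimal sketch with illustrative numbers (the `beta` and log-probabilities below are not from this card):

```python
import torch
import torch.nn.functional as F

beta = 0.1  # illustrative; the card does not record the beta used in training

# Illustrative per-example sequence log-probabilities under the trained
# policy and the frozen reference model (not values from this card).
policy_chosen_logps = torch.tensor([-254.8, -250.1])
policy_rejected_logps = torch.tensor([-234.0, -240.5])
ref_chosen_logps = torch.tensor([-258.5, -251.0])
ref_rejected_logps = torch.tensor([-237.5, -239.8])

# Implicit DPO rewards: beta * (policy log-prob - reference log-prob).
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

margins = chosen_rewards - rejected_rewards    # "Rewards/margins"
accuracy = (margins > 0).float().mean()        # "Rewards/accuracies"
loss = -F.logsigmoid(margins).mean()           # the DPO loss ("Validation Loss")
```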

### Framework versions

- PEFT 0.11.1
- Transformers 4.41.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
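
To check that a local environment matches these versions, a small sketch (import names assume standard installs of the packages above):

```python
import datasets, peft, tokenizers, torch, transformers

# Compare against the versions listed in this card.
print("PEFT:", peft.__version__)                 # expected: 0.11.1
print("Transformers:", transformers.__version__) # expected: 4.41.1
print("Pytorch:", torch.__version__)             # expected: 2.3.0+cu121
print("Datasets:", datasets.__version__)         # expected: 2.19.1
print("Tokenizers:", tokenizers.__version__)     # expected: 0.19.1
```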