qwen-2-7b-dpo-ultrafeedback-5e-7-SFTed-paged_adamw_32bit-fixed-1.0

This model accompanies the preprint DPO-Shift: Shifting the Distribution of Direct Preference Optimization. Please refer to our repository for more details.
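
For context, the sketch below shows the standard DPO objective that DPO-Shift builds on, computed from per-sequence log-probabilities; the implicit rewards it defines (beta times the policy-minus-reference log-probability) are what DPO-style training pipelines typically report as Rewards/chosen and Rewards/rejected. This is a generic illustration, not the DPO-Shift loss itself: the preprint introduces a shift parameter (shown on this card as Dpo Lambda, fixed at 1.0 for this run), and the exact formulation is given in the paper and repository. Function and variable names are illustrative, and beta is not listed on this card.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Standard DPO loss from summed per-sequence log-probabilities.

    The implicit rewards below are the kind of quantity reported as
    Rewards/chosen and Rewards/rejected; their difference corresponds
    to Rewards/margins. DPO-Shift additionally applies a shift parameter
    (Dpo Lambda), which is not shown here.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```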

This model is a fine-tuned version of NoManDeRY/DPO-Shift-Qwen-2-7B-UltraChat200K-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5603
  • Rewards/chosen: -0.6046
  • Rewards/rejected: -1.1004
  • Dpo Lambda: 1.0
  • Rewards/accuracies: 0.7381
  • Rewards/margins: 0.4957
  • Logps/rejected: -416.7722
  • Logps/chosen: -393.7754
  • Logits/rejected: -1.0300
  • Logits/chosen: -0.9521
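
The card does not include a usage snippet; below is a minimal sketch for loading the model with transformers, assuming the Hugging Face repository id NoManDeRY/DPO-Shift-Qwen-2-7B-Ultrafeedback-fixed-1.0 and a chat-style prompt. Adjust dtype, device placement, and generation settings to your setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NoManDeRY/DPO-Shift-Qwen-2-7B-Ultrafeedback-fixed-1.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint weights are stored in BF16
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain direct preference optimization in one paragraph."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```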

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
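
For reference, here is a minimal sketch of how these settings map onto a TRL-style DPO training configuration. This is not the script used to produce this model (the run was trained with the DPO-Shift codebase, and the Dpo Lambda shift is not a standard TRL option); the model and dataset ids come from this card, BF16 is assumed from the checkpoint dtype, and the remaining arguments mirror the list above.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "NoManDeRY/DPO-Shift-Qwen-2-7B-UltraChat200K-SFT"
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Mirrors the hyperparameters listed above: 4 per device x 8 GPUs x 4
# accumulation steps gives the total train batch size of 128.
args = DPOConfig(
    output_dir="qwen-2-7b-dpo-ultrafeedback",  # illustrative output path
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    seed=42,
    bf16=True,  # assumption, matching the BF16 checkpoint
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train_prefs"],
    eval_dataset=dataset["test_prefs"],
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
)
trainer.train()
```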

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Dpo Lambda | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:----------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6874        | 0.1047 | 50   | 0.6866          | 0.0419         | 0.0289           | 1.0        | 0.6706             | 0.0129          | -303.8425      | -329.1250    | -1.2956         | -1.1731       |
| 0.6513        | 0.2093 | 100  | 0.6640          | 0.0547         | -0.0154          | 1.0        | 0.7302             | 0.0702          | -308.2813      | -327.8370    | -1.2875         | -1.1603       |
| 0.6568        | 0.3140 | 150  | 0.6347          | -0.0883        | -0.2516          | 1.0        | 0.7222             | 0.1634          | -331.9008      | -342.1388    | -1.2942         | -1.1718       |
| 0.6272        | 0.4186 | 200  | 0.6070          | -0.2437        | -0.5057          | 1.0        | 0.7143             | 0.2620          | -357.3109      | -357.6843    | -1.2251         | -1.1152       |
| 0.5866        | 0.5233 | 250  | 0.5879          | -0.3786        | -0.7293          | 1.0        | 0.7302             | 0.3506          | -379.6648      | -371.1762    | -1.1431         | -1.0513       |
| 0.5892        | 0.6279 | 300  | 0.5753          | -0.4577        | -0.8757          | 1.0        | 0.7421             | 0.4180          | -394.3050      | -379.0780    | -1.0982         | -1.0094       |
| 0.5996        | 0.7326 | 350  | 0.5656          | -0.5648        | -1.0341          | 1.0        | 0.7460             | 0.4693          | -410.1455      | -389.7921    | -1.0571         | -0.9754       |
| 0.5734        | 0.8373 | 400  | 0.5625          | -0.5734        | -1.0606          | 1.0        | 0.7381             | 0.4872          | -412.8011      | -390.6537    | -1.0352         | -0.9564       |
| 0.5382        | 0.9419 | 450  | 0.5603          | -0.6046        | -1.1004          | 1.0        | 0.7381             | 0.4957          | -416.7722      | -393.7754    | -1.0300         | -0.9521       |

Framework versions

  • Transformers 4.44.2
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1