Llama-3-dpo-5e-7-SFTed-paged_adamw_32bit-0.95

This model was released as part of the preprint DPO-Shift: Shifting the Distribution of Direct Preference Optimization. Please refer to our repository for more details.

This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5600
  • Rewards/chosen: -0.2594
  • Rewards/rejected: -0.7680
  • Rewards/accuracies: 0.7280
  • Rewards/margins: 0.5087
  • Logps/rejected: -344.3741
  • Logps/chosen: -316.7488
  • Logits/rejected: -0.8779
  • Logits/chosen: -0.8397
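
For quick experimentation, the snippet below is a minimal sketch of loading the model with the transformers library. The repository id NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-fixed-0.95 and the BF16 weights are taken from this card; the prompt and generation settings are illustrative only.

```python
# Minimal sketch: load the model and run a single greedy generation.
# Assumes the repository id NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-fixed-0.95;
# adjust dtype/device settings to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-fixed-0.95"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)

prompt = "Explain direct preference optimization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```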

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 128
  • total_eval_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
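
As a sketch only (not the DPO-Shift training code, which lives in the authors' repository), the snippet below shows how these hyperparameters might be expressed with TRL's DPOConfig. The output directory is hypothetical, and the DPO-Shift-specific shift term is not reproduced here.

```python
# Illustrative mapping of the listed hyperparameters onto TRL's DPOConfig.
# This is NOT the DPO-Shift training script; see the authors' repository for that.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="llama-3-dpo-shift-0.95",  # hypothetical output path
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,       # 1 x 32 x 4 GPUs = effective batch of 128
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_32bit",            # optimizer named in the model id
    seed=42,
    bf16=True,                            # the card reports BF16 tensors
)
```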

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|-------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.6819 | 0.1047 | 50 | 0.6800 | 0.1050 | 0.0798 | 0.6400 | 0.0252 | -259.5905 | -280.3077 | -0.7374 | -0.6591 |
| 0.6361 | 0.2094 | 100 | 0.6362 | 0.0108 | -0.1367 | 0.7080 | 0.1476 | -281.2423 | -289.7269 | -0.8269 | -0.7622 |
| 0.5998 | 0.3141 | 150 | 0.5975 | -0.1439 | -0.4466 | 0.7120 | 0.3027 | -312.2311 | -305.2002 | -0.7868 | -0.7374 |
| 0.5873 | 0.4187 | 200 | 0.5900 | -0.1226 | -0.4679 | 0.7160 | 0.3454 | -314.3644 | -303.0681 | -0.8278 | -0.7815 |
| 0.5692 | 0.5234 | 250 | 0.5732 | -0.2556 | -0.6926 | 0.7300 | 0.4370 | -336.8325 | -316.3727 | -0.8732 | -0.8325 |
| 0.5668 | 0.6281 | 300 | 0.5730 | -0.3147 | -0.7937 | 0.7160 | 0.4790 | -346.9373 | -322.2795 | -0.8503 | -0.8084 |
| 0.5415 | 0.7328 | 350 | 0.5626 | -0.2087 | -0.6908 | 0.7320 | 0.4822 | -336.6547 | -311.6794 | -0.8694 | -0.8289 |
| 0.5595 | 0.8375 | 400 | 0.5604 | -0.2196 | -0.7069 | 0.7300 | 0.4873 | -338.2576 | -312.7687 | -0.8715 | -0.8329 |
| 0.5552 | 0.9422 | 450 | 0.5600 | -0.2594 | -0.7680 | 0.7280 | 0.5087 | -344.3741 | -316.7488 | -0.8779 | -0.8397 |

Framework versions

  • Transformers 4.44.2
  • PyTorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1