---
library_name: peft
tags:
  - trl
  - dpo
  - generated_from_trainer
base_model: Meta-Llama-3-8B-Instruct
model-index:
  - name: dpo
    results: []
---

# dpo

This model is a PEFT adapter fine-tuned from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with TRL's DPO trainer; the preference dataset used is not documented in this card. It achieves the following results on the evaluation set:

- Loss: 0.6845
- Rewards/chosen: 0.3768
- Rewards/rejected: 0.3499
- Rewards/accuracies: 0.5373
- Rewards/margins: 0.0269
- Logps/rejected: -233.9739
- Logps/chosen: -254.7565
- Logits/rejected: -0.6114
- Logits/chosen: -0.7300
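
To try the adapter, here is a minimal loading sketch using standard `peft`/`transformers` APIs; the adapter repository id below is a placeholder, and the prompt is only an example:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the DPO-trained adapter; replace the placeholder with this repo's id.
model = PeftModel.from_pretrained(base, "<this-repo-id>")

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```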

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-06
- train_batch_size: 1
- eval_batch_size: 1
- seed: 0
- gradient_accumulation_steps: 2
- total_train_batch_size: 2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 2500
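
As a point of reference, here is a hedged sketch of how these settings map onto TRL's `DPOConfig`/`DPOTrainer` (API as of TRL releases contemporary with the Transformers version listed below). The dataset, LoRA settings, and `beta` are illustrative assumptions; none of them are recorded in this card:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The preference dataset is not documented in this card; any dataset with
# "prompt"/"chosen"/"rejected" columns works. This one is an example only.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="dpo",
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=2500,
    seed=0,
    beta=0.1,  # assumed; the card does not record the beta used
)

# LoRA settings below are illustrative; the card does not record them.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```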

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6689        | 0.0567 | 750  | 0.6852          | 0.2546         | 0.2335           | 0.5373             | 0.0211          | -235.1381      | -255.9783    | -0.6108         | -0.7325       |
| 0.6831        | 0.1133 | 1500 | 0.6835          | 0.3555         | 0.3270           | 0.5597             | 0.0285          | -234.2029      | -254.9690    | -0.6135         | -0.7317       |
| 0.6821        | 0.1700 | 2250 | 0.6855          | 0.3655         | 0.3411           | 0.5485             | 0.0243          | -234.0616      | -254.8697    | -0.6115         | -0.7293       |
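
For readers unfamiliar with the reward columns: in DPO, the implicit reward of a response is `beta * (log-prob under the policy - log-prob under the reference model)`, the margin is the chosen reward minus the rejected reward, and the accuracy is the fraction of pairs with a positive margin. A minimal sketch with illustrative numbers (the `beta` and log-probabilities below are not from this card):

```python
import torch
import torch.nn.functional as F

beta = 0.1  # illustrative; the card does not record the beta used in training

# Illustrative per-example sequence log-probabilities under the trained
# policy and the frozen reference model (not values from this card).
policy_chosen_logps = torch.tensor([-254.8, -250.1])
policy_rejected_logps = torch.tensor([-234.0, -240.5])
ref_chosen_logps = torch.tensor([-258.5, -251.0])
ref_rejected_logps = torch.tensor([-237.5, -239.8])

# Implicit DPO rewards: beta * (policy log-prob - reference log-prob).
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

margins = chosen_rewards - rejected_rewards    # "Rewards/margins"
accuracy = (margins > 0).float().mean()        # "Rewards/accuracies"
loss = -F.logsigmoid(margins).mean()           # the DPO loss ("Validation Loss")
```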

### Framework versions

- PEFT 0.11.1
- Transformers 4.41.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
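
To check that a local environment matches these versions, a small sketch (import names assume standard installs of the packages above):

```python
import datasets, peft, tokenizers, torch, transformers

# Compare against the versions listed in this card.
print("PEFT:", peft.__version__)                 # expected: 0.11.1
print("Transformers:", transformers.__version__) # expected: 4.41.1
print("Pytorch:", torch.__version__)             # expected: 2.3.0+cu121
print("Datasets:", datasets.__version__)         # expected: 2.19.1
print("Tokenizers:", tokenizers.__version__)     # expected: 0.19.1
```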