# Llama-2-7b-hf-DPO-LookAhead3_FullEval_TTree1.4_TLoop0.7_TEval0.2_Filter0.2_V2.0
This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 0.6353
- Rewards/chosen: -3.2199
- Rewards/rejected: -3.7792
- Rewards/accuracies: 0.625
- Rewards/margins: 0.5593
- Logps/rejected: -145.3183
- Logps/chosen: -164.8658
- Logits/rejected: -1.1220
- Logits/chosen: -1.0854
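
For context, these reward metrics follow the standard DPO bookkeeping: each response gets an implicit reward proportional to the log-probability ratio between the policy and the frozen reference model, the margin is the chosen reward minus the rejected reward (here -3.2199 - (-3.7792) ≈ 0.5593), and the accuracy is the fraction of pairs where the chosen reward exceeds the rejected one. The snippet below is a minimal sketch of how these quantities relate; the log-probability values and the β used are placeholders, not taken from this run.

```python
import torch
import torch.nn.functional as F

# Illustrative per-example sums of token log-probabilities for the chosen and
# rejected responses, under the policy and the frozen reference model.
# These values are placeholders, not taken from this training run.
policy_chosen_logps = torch.tensor([-164.9, -120.3])
policy_rejected_logps = torch.tensor([-145.3, -150.8])
ref_chosen_logps = torch.tensor([-150.2, -118.9])
ref_rejected_logps = torch.tensor([-140.1, -142.5])

beta = 0.1  # DPO temperature; the value used for this run is not stated in the card

# Implicit DPO rewards: beta * (log pi_theta(y|x) - log pi_ref(y|x))
rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)

margins = rewards_chosen - rewards_rejected                    # "Rewards/margins"
accuracy = (rewards_chosen > rewards_rejected).float().mean()  # "Rewards/accuracies"
loss = -F.logsigmoid(margins).mean()                           # DPO loss ("Loss")
```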
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a reproduction sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
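
A rough way to reproduce this configuration with TRL's `DPOTrainer` is sketched below. Only the hyperparameters listed above are taken from the card; the preference dataset, the LoRA settings, and the exact TRL version are assumptions (the card only confirms PEFT 0.12.0 and Transformers 4.44.2).

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data needs "prompt", "chosen", and "rejected" columns;
# the dataset used for this run is not specified in the card.
train_dataset = Dataset.from_dict({
    "prompt": ["..."], "chosen": ["..."], "rejected": ["..."],
})

# LoRA settings are assumptions; the card only states that PEFT was used.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Hyperparameters below mirror the list above; Adam betas/epsilon and the
# cosine schedule match the Trainer defaults plus the stated warmup.
training_args = DPOConfig(
    output_dir="llama2-dpo",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,  # total train batch size 4
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```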
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.6639 | 0.3037 | 53 | 0.6613 | 0.0051 | -0.0607 | 0.875 | 0.0658 | -108.1335 | -132.6158 | -0.6137 | -0.5723 |
| 0.6476 | 0.6074 | 106 | 0.6171 | -0.2010 | -0.3748 | 0.625 | 0.1738 | -111.2741 | -134.6767 | -0.6554 | -0.6155 |
| 0.6552 | 0.9112 | 159 | 0.6850 | -0.4026 | -0.4336 | 0.5 | 0.0310 | -111.8621 | -136.6923 | -0.6025 | -0.5605 |
| 0.271 | 1.2149 | 212 | 0.5592 | -1.1775 | -1.5117 | 0.75 | 0.3342 | -122.6435 | -144.4414 | -0.6651 | -0.6240 |
| 0.2321 | 1.5186 | 265 | 0.6523 | -1.6722 | -1.8791 | 0.5 | 0.2069 | -126.3177 | -149.3886 | -0.7461 | -0.7056 |
| 0.3961 | 1.8223 | 318 | 0.5176 | -1.1964 | -1.6762 | 0.875 | 0.4798 | -124.2882 | -144.6302 | -0.8107 | -0.7719 |
| 0.1421 | 2.1261 | 371 | 0.6029 | -2.4068 | -2.8869 | 0.625 | 0.4801 | -136.3952 | -156.7344 | -1.0103 | -0.9720 |
| 0.5702 | 2.4298 | 424 | 0.6557 | -3.1785 | -3.6978 | 0.625 | 0.5193 | -144.5047 | -164.4516 | -1.0897 | -1.0539 |
| 0.2376 | 2.7335 | 477 | 0.6353 | -3.2199 | -3.7792 | 0.625 | 0.5593 | -145.3183 | -164.8658 | -1.1220 | -1.0854 |
### Framework versions
- PEFT 0.12.0
- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1
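
Because PEFT is listed among the framework versions, this repository most likely holds a LoRA-style adapter rather than full model weights. A minimal loading sketch under that assumption (dtype, device map, and the generation prompt are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "LBK95/Llama-2-7b-hf-DPO-LookAhead3_FullEval_TTree1.4_TLoop0.7_TEval0.2_Filter0.2_V2.0"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Attach the DPO-tuned adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = "Write a short poem about the sea."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```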