---
license: apache-2.0
base_model: alignment-handbook/zephyr-7b-sft-full
tags:
- alignment_handbook-handbook
- generated_from_trainer
datasets:
- HuggingFaceH4/ultrafeedback_binarized
model-index:
- name: mistral-7B-DPO
  results:
  - task:
      type: text-generation
    dataset:
      name: IFEval
      type: IFEval
    metrics:
    - name: IFEval
      type: inst_level_strict_acc
      value: 53.06
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
  - task:
      type: text-generation
    dataset:
      name: BBH
      type: BBH
    metrics:
    - name: Big Bench Hard (BBH)
      type: acc_norm
      value: 21.78
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
  - task:
      type: text-generation
    dataset:
      name: MATH
      type: MATH
    metrics:
    - name: Math Challenges
      type: exact_match
      value: 2.87
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
  - task:
      type: text-generation
    dataset:
      name: GPQA
      type: GPQA
    metrics:
    - name: Graduate-Level Google-Proof Q&A (GPQA)
      type: acc_norm
      value: 3.47
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
  - task:
      type: text-generation
    dataset:
      name: MuSR
      type: MuSR
    metrics:
    - name: MuSR
      type: acc_norm
      value: 7.54
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
  - task:
      type: text-generation
    dataset:
      name: MMLU-PRO
      type: MMLU-PRO
    metrics:
    - name: MMLU-PRO
      type: acc
      value: 19.59
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
---

# MistralForCausalLM_Cal_DPO

This model is a fine-tuned version of [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) on the HuggingFaceH4/ultrafeedback_binarized dataset.

## Model description

Cal-DPO (calibrated direct preference optimization) addresses the problem of aligning large language models with human preferences by calibrating the implicit rewards learned during contrastive preference optimization so that they match the scale of the ground-truth rewards. Models trained with this objective perform well across a range of benchmark tasks; an illustrative sketch of such a calibrated objective is given at the end of this card.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

An illustrative mapping of these values onto `transformers.TrainingArguments` is sketched at the end of this card.

### Training results

We evaluate the model on six key benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), a unified framework for testing generative language models on a large number of different evaluation tasks (a reproduction sketch appears at the end of this card):

- IFEval (https://arxiv.org/abs/2311.07911)
- BBH (Big Bench Hard) (https://arxiv.org/abs/2210.09261)
- MATH (competition mathematics, level-5 subset) (https://arxiv.org/abs/2103.03874)
- GPQA (Graduate-Level Google-Proof Q&A Benchmark) (https://arxiv.org/abs/2311.12022)
- MuSR (Multistep Soft Reasoning) (https://arxiv.org/abs/2310.16049)
- MMLU-PRO (Massive Multitask Language Understanding - Professional) (https://arxiv.org/abs/2406.01574)

| Benchmark | Metric | Value |
|-----------|--------|-------|
| IFEval | inst_level_strict_acc | 53.06 |
| BBH | acc_norm | 21.78 |
| MATH | exact_match | 2.87 |
| GPQA | acc_norm | 3.47 |
| MuSR | acc_norm | 7.54 |
| MMLU-PRO | acc | 19.59 |

### Framework versions

- Transformers 4.40.2
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.19.1
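
## How to use

A minimal inference sketch using the `transformers` chat-template API is shown below. The repository id is a placeholder (this card does not state the final Hub id), the generation settings are illustrative, and `device_map="auto"` additionally requires `accelerate`.

```python
# Minimal inference sketch. The repo id below is a placeholder, not the confirmed
# Hub id for this checkpoint; replace it with the actual repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/mistral-7B-DPO"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate
)

# The zephyr-7b-sft-full base ships a chat template, so format the prompt with it.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what DPO fine-tuning does in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```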
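
## Cal-DPO objective (illustrative sketch)

The sketch below shows the general shape of a calibrated DPO-style objective: the standard DPO logistic loss on the implicit-reward margin, plus a calibration term that regresses each implicit reward toward a fixed target scale. It is not the verbatim Cal-DPO loss; `beta`, the calibration targets, and the weighting are placeholders, and the exact formulation should be taken from the Cal-DPO paper.

```python
# Schematic sketch of a calibrated DPO-style objective (NOT the verbatim Cal-DPO loss;
# consult the Cal-DPO paper for the exact calibration targets and weighting).
import torch
import torch.nn.functional as F


def calibrated_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.01,                   # placeholder value
    calibration_weight: float = 1.0,      # placeholder weighting between the two terms
) -> torch.Tensor:
    # DPO implicit rewards: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: logistic loss on the implicit-reward margin.
    dpo_term = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Calibration term: regress each implicit reward toward a fixed target so its
    # scale matches the real reward. The targets here are illustrative placeholders.
    target_chosen, target_rejected = 0.5, -0.5
    cal_term = (chosen_rewards - target_chosen) ** 2 + (rejected_rewards - target_rejected) ** 2

    return (dpo_term + calibration_weight * cal_term).mean()
```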
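
## Training configuration (illustrative sketch)

For reference, the hyperparameters reported above map roughly onto the following `transformers.TrainingArguments`. This is a sketch rather than the actual training script: the `output_dir`, optimizer name, and bf16 flag are assumptions, and the per-device batch sizes follow from the reported totals (4 GPUs, gradient accumulation 2).

```python
# Illustrative mapping of the reported hyperparameters onto transformers.TrainingArguments.
# This is a reference sketch, not the script that produced this checkpoint.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mistral-7B-DPO",      # hypothetical output path
    learning_rate=5e-7,
    per_device_train_batch_size=8,    # 8 x 4 GPUs x 2 grad-accum steps = 64 total
    per_device_eval_batch_size=4,     # 4 x 4 GPUs = 16 total
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",              # assumption; betas=(0.9, 0.999), eps=1e-8 set below
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
    bf16=True,                        # assumption: bf16 mixed precision, not stated in the card
)
```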
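
## Evaluation (illustrative sketch)

The benchmarks listed above correspond to the Open LLM Leaderboard task group in recent versions of the lm-evaluation-harness. The snippet below is a sketch of how such an evaluation could be run through the harness's Python API; the repository id is a placeholder, and the task names should be checked against the installed harness version (`lm-eval --tasks list`).

```python
# Sketch of running the Open LLM Leaderboard benchmarks with lm-evaluation-harness (>= 0.4.x).
# The repo id is a placeholder; verify the task names against your installed version.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/mistral-7B-DPO,dtype=bfloat16",  # placeholder repo id
    tasks=[
        "leaderboard_ifeval",
        "leaderboard_bbh",
        "leaderboard_math_hard",
        "leaderboard_gpqa",
        "leaderboard_musr",
        "leaderboard_mmlu_pro",
    ],
    batch_size=4,
)

# Per-task metrics (e.g. inst_level_strict_acc, acc_norm, exact_match) live under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```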