---
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/mistral-instruct-ultrafeedback
model-index:
- name: tpo-alignment/Mistral-Instruct-7B-TPO-y4
  results: []
license: mit
---

# Mistral-Instruct-7B-TPO-y4 Model Card

TPO (Triple Preference Optimization) is a novel preference optimization algorithm that enhances the instruction-following and reasoning capabilities of large language models through a one-step optimization process. We also introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).

## Model Details

### Model Description

We fine-tuned [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the fourth-best response as the preferred response, and the lowest-scoring response as the rejected response (see the illustrative sketch at the end of this card).

- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** mistral
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Model Sources

- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681

## How to Get Started with the Model

```
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"

# Load the model in bfloat16 with the text-generation pipeline.
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# Greedy decoding of a single chat turn.
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    eos_token_id=generator.tokenizer.eos_token_id,
    max_new_tokens=200,
)
print(outputs[0]["generated_text"])
```

## Training Details

### Training Data

We use [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) as the preference optimization dataset.

#### Training Hyperparameters

The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).

## Technical Specifications

### Model Architecture and Objective

The model architecture is based on [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).

#### Hardware

We used 8xA100 GPUs for model training.

## Citation

TPO paper:

```
@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681},
}
```
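
## Appendix: Preference Triple Selection (Illustrative Sketch)

The gold/preferred/rejected selection described in the Model Description can be sketched as follows. This is not the repository's preprocessing code: the column names (`all_generated_responses`, `all_rm_scores`) and the split name are assumptions about the dataset layout, so consult the [GitHub repository](https://github.com/sahsaeedi/TPO) for the actual scripts.

```
from datasets import load_dataset

def build_tpo_triple(example):
    # Rank the candidate responses by their reward-model scores, best first.
    # Column names here are assumptions, not the card's confirmed schema.
    ranked = sorted(
        zip(example["all_generated_responses"], example["all_rm_scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    responses = [response for response, _ in ranked]
    return {
        "gold": responses[0],       # highest-scoring response
        "preferred": responses[3],  # fourth-best response ("y4")
        "rejected": responses[-1],  # lowest-scoring response
    }

# The split name is also an assumption; adjust it to the dataset's actual splits.
dataset = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
triples = dataset.map(build_tpo_triple)
```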