---
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/mistral-instruct-ultrafeedback
model-index:
- name: tpo-alignment/Mistral-Instruct-7B-TPO-y4
  results: []
license: mit
---

# Mistral-Instruct-7B-TPO-y4 Model Card

TPO (Triple Preference Optimization) is a novel preference optimization algorithm that enhances the instruction-following and reasoning capabilities of large language models through a one-step optimization process. We also introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).

## Model Details

### Model Description

We fine-tuned [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the fourth-best response as the preferred response, and the lowest-scoring response as the rejected response (see the illustrative sketch at the end of this card).

- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** mistral
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Model Sources

- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681

## How to Get Started with the Model

```
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"

# Load the model in bfloat16 with the text-generation pipeline.
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# Greedy decoding of a single chat turn.
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    eos_token_id=generator.tokenizer.eos_token_id,
    max_new_tokens=200,
)
print(outputs[0]["generated_text"])
```

## Training Details

### Training Data

We use [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) as the preference optimization dataset.

#### Training Hyperparameters

The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).

## Technical Specifications

### Model Architecture and Objective

The model architecture is based on [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).

#### Hardware

We used 8xA100 GPUs for model training.

## Citation

TPO paper:

```
@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681},
}
```
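
## Appendix: Preference Triple Selection (Illustrative Sketch)

The gold/preferred/rejected selection described in the Model Description can be sketched as follows. This is not the repository's preprocessing code: the column names (`all_generated_responses`, `all_rm_scores`) and the split name are assumptions about the dataset layout, so consult the [GitHub repository](https://github.com/sahsaeedi/TPO) for the actual scripts.

```
from datasets import load_dataset

def build_tpo_triple(example):
    # Rank the candidate responses by their reward-model scores, best first.
    # Column names here are assumptions, not the card's confirmed schema.
    ranked = sorted(
        zip(example["all_generated_responses"], example["all_rm_scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    responses = [response for response, _ in ranked]
    return {
        "gold": responses[0],       # highest-scoring response
        "preferred": responses[3],  # fourth-best response ("y4")
        "rejected": responses[-1],  # lowest-scoring response
    }

# The split name is also an assumption; adjust it to the dataset's actual splits.
dataset = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
triples = dataset.map(build_tpo_triple)
```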