Quantization made by Richard Erkhov.
Mistral7B-PairRM-SPPO - GGUF
- Model creator: https://huggingface.co/UCLA-AGI/
- Original model: https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO/
Name | Quant method | Size |
---|---|---|
Mistral7B-PairRM-SPPO.Q2_K.gguf | Q2_K | 2.53GB |
Mistral7B-PairRM-SPPO.Q3_K_S.gguf | Q3_K_S | 2.95GB |
Mistral7B-PairRM-SPPO.Q3_K.gguf | Q3_K | 3.28GB |
Mistral7B-PairRM-SPPO.Q3_K_M.gguf | Q3_K_M | 3.28GB |
Mistral7B-PairRM-SPPO.Q3_K_L.gguf | Q3_K_L | 3.56GB |
Mistral7B-PairRM-SPPO.IQ4_XS.gguf | IQ4_XS | 3.67GB |
Mistral7B-PairRM-SPPO.Q4_0.gguf | Q4_0 | 3.83GB |
Mistral7B-PairRM-SPPO.IQ4_NL.gguf | IQ4_NL | 3.87GB |
Mistral7B-PairRM-SPPO.Q4_K_S.gguf | Q4_K_S | 3.86GB |
Mistral7B-PairRM-SPPO.Q4_K.gguf | Q4_K | 4.07GB |
Mistral7B-PairRM-SPPO.Q4_K_M.gguf | Q4_K_M | 4.07GB |
Mistral7B-PairRM-SPPO.Q4_1.gguf | Q4_1 | 4.24GB |
Mistral7B-PairRM-SPPO.Q5_0.gguf | Q5_0 | 4.65GB |
Mistral7B-PairRM-SPPO.Q5_K_S.gguf | Q5_K_S | 4.65GB |
Mistral7B-PairRM-SPPO.Q5_K.gguf | Q5_K | 4.78GB |
Mistral7B-PairRM-SPPO.Q5_K_M.gguf | Q5_K_M | 4.78GB |
Mistral7B-PairRM-SPPO.Q5_1.gguf | Q5_1 | 5.07GB |
Mistral7B-PairRM-SPPO.Q6_K.gguf | Q6_K | 5.53GB |
Mistral7B-PairRM-SPPO.Q8_0.gguf | Q8_0 | 7.17GB |
Original model description:
license: apache-2.0 datasets: - openbmb/UltraFeedback language: - en pipeline_tag: text-generation
Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
Mistral7B-PairRM-SPPO
This model was developed using Self-Play Preference Optimization, based on the mistralai/Mistral-7B-Instruct-v0.2 architecture as starting point. We utilized the prompt sets from the openbmb/UltraFeedback dataset, splited to 3 parts for 3 iterations by snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.
While K = 5, this model uses three samples to estimate the soft probabilities P(y_w > y_l) and P(y_l > y_w). These samples include the winner, the loser, and another random sample. This approach has shown to deliver better performance on AlpacaEval 2.0 compared to the results reported in our paper.
❗Please refer to the original checkpoint at UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 as reported in our paper. We anticipate that the version in the paper demonstrates a more consistent performance improvement across all evaluation tasks.
Links to Other Models
- Mistral7B-PairRM-SPPO-Iter1
- Mistral7B-PairRM-SPPO-Iter2
- Mistral7B-PairRM-SPPO-Iter3
- Mistral7B-PairRM-SPPO
Model Description
- Model type: A 7B parameter GPT-like model fine-tuned on synthetic datasets.
- Language(s) (NLP): Primarily English
- License: Apache-2.0
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
AlpacaEval Leaderboard Evaluation Results
Model | LC. Win Rate | Win Rate | Avg. Length |
---|---|---|---|
Mistral7B-PairRM-SPPO | 30.46 | 32.14 | 2114 |
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- eta: 1000
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 18.0 (stop at epoch=1.0)
Citation
@misc{wu2024self,
title={Self-Play Preference Optimization for Language Model Alignment},
author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
year={2024},
eprint={2405.00675},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
- Downloads last month
- 47