---
library_name: transformers
tags: []
---
# Model Card for PPO-C (Llama-3-8b)
<!-- Provide a quick summary of what the model is/does. -->
**PPO-C** (PPO with Calibrated Reward Calculation) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained Large Language Models.
PPO-C adjusts standard reward model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold to
classify responses as relatively good or bad, and then adjusts each reward score based on the verbalized confidence the model expresses in its response.
Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
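To make the mechanism concrete, below is a minimal Python sketch of the calibrated reward calculation described above. The class name `PPOCRewardCalibrator`, the hyperparameters `alpha` and `momentum`, and the exact adjustment rule are illustrative assumptions, not taken from this card; please refer to the paper for the actual formulation.

```python
class PPOCRewardCalibrator:
    """Illustrative sketch of PPO-C's calibrated reward calculation.

    Assumptions (not from the model card): the update rule, the
    hyperparameter names, and their values are simplified stand-ins.
    """

    def __init__(self, alpha: float = 0.5, momentum: float = 0.9):
        self.alpha = alpha        # strength of the confidence-based adjustment (assumed)
        self.momentum = momentum  # smoothing factor for the running reward average (assumed)
        self.running_mean = 0.0
        self.initialized = False

    def calibrate(self, reward: float, confidence: float) -> float:
        """Adjust a raw reward model score given verbalized confidence in [0, 1]."""
        # Update the running average of past reward scores; it serves as a
        # dynamic threshold separating relatively good from bad responses.
        if not self.initialized:
            self.running_mean, self.initialized = reward, True
        else:
            self.running_mean = (
                self.momentum * self.running_mean + (1 - self.momentum) * reward
            )
        # Centered confidence: positive when the model sounds confident.
        c = confidence - 0.5
        if reward >= self.running_mean:
            # Response classified as good: reward confidence, penalize hedging.
            return reward + self.alpha * c
        # Response classified as bad: penalize confidence, reward hedging.
        return reward - self.alpha * c
```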
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
We train [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture) with PPO-C on our prompt dataset [HINT-lab/prompt-collections-final-v0.3](https://huggingface.co/datasets/HINT-lab/prompt-collections-final-v0.3),
using a vanilla (uncalibrated) reward model, [OpenRLHF/Llama-3-8b-rm-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-rm-mixture).
- **Developed by:** Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- **Finetuned from model:** [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture)
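Since this is a standard `transformers` causal LM, it can be loaded as sketched below. The Hub repository ID is a hypothetical placeholder (the card does not state it); substitute this model's actual ID.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo ID -- replace with this model's actual Hub ID.
model_id = "HINT-lab/llama3-8b-ppo-c"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Ask for an answer plus a verbalized confidence score.
messages = [
    {
        "role": "user",
        "content": "What is the capital of France? "
                   "Also state your confidence from 0 to 10.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```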
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [Our repo](https://github.com/SeanLeng1/Reward-Calibration)
- **Paper:** [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)