---
title: README
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
<div align="center">

# Process Reinforcement Through Implicit Rewards

</div>
# Links
- [Paper](https://arxiv.org/abs/2502.01456)
- [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
- [GitHub](https://github.com/PRIME-RL/PRIME)
# Evaluation
Through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model: a **16.7%** improvement on average, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen-2.5-Math-7B-Base, surpasses its instruct version on five key reasoning benchmarks.
The final results are presented below:
| | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen-2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
| ------------- | -------------------- | ------------------ | ----------------------------- | -------------------------- | ---------- |
| AIME 2024 | **26.7 (+23.3)** | 3.3 | 13.3 | 16.7 | 9.3 |
| MATH-500 | 79.2 (+14.1) | 65.1 | **79.8** | 64.6 | 76.4 |
| AMC | **57.8 (+27.7)** | 30.1 | 50.6 | 30.1 | 45.8 |
| Minerva Math | **38.6 (+5.9)** | 32.7 | 34.6 | 35.3 | 36.8 |
| OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | **43.3** |
| Avg. | **48.9 (+16.7)** | 32.2 | 43.8 | 35.7 | 43.3 |
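As a quick sanity check, the averages and improvements in the last row can be recomputed from the per-benchmark scores (a minimal sketch; the scores are copied verbatim from the table above):

```python
# Per-benchmark scores from the table above
# (AIME 2024, MATH-500, AMC, Minerva Math, OlympiadBench).
prime_scores = [26.7, 79.2, 57.8, 38.6, 42.1]  # Eurus-2-7B-PRIME
sft_scores = [3.3, 65.1, 30.1, 32.7, 29.8]     # Eurus-2-7B-SFT

def avg(xs):
    return sum(xs) / len(xs)

prime_avg = avg(prime_scores)
sft_avg = avg(sft_scores)
print(f"PRIME avg: {prime_avg:.1f}")          # 48.9
print(f"SFT avg: {sft_avg:.1f}")              # 32.2
print(f"Improvement: {prime_avg - sft_avg:.1f}")  # 16.7
```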
We achieved this with only 1/10 of the data and model resources used by Qwen2.5-Math-7B-Instruct.
| | **Eurus-2-7B-PRIME** | **Qwen2.5-Math-7B-Instruct** |
| ---------- | ---------------------------------- | ------------------------------- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | **230K (open-source)** | 2.5M (open-source and in-house) |
| RM Data | **0** | 618K (in-house) |
| RM | **Eurus-2-7B-SFT** | Qwen2.5-Math-RM (72B) |
| RL Data    | **150K queries × 4 samples**       | 66K queries × 32 samples        |
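For a rough sense of the RL sample budget, the rollout counts in the last row multiply out as follows (a minimal sketch using only the numbers from the table; "K" = thousand):

```python
# Total RL rollouts implied by the table (queries x samples per query).
prime_rollouts = 150_000 * 4  # Eurus-2-7B-PRIME: 600K rollouts
qwen_rollouts = 66_000 * 32   # Qwen2.5-Math-7B-Instruct: 2,112K rollouts

print(prime_rollouts)  # 600000
print(qwen_rollouts)   # 2112000
# PRIME trains on roughly 3.5x fewer RL rollouts.
print(round(qwen_rollouts / prime_rollouts, 1))  # 3.5
```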
# Citation
If you find PRIME or ImplicitPRM helpful, please cite our papers.
```bibtex
@article{cui2025process,
title={Process reinforcement through implicit rewards},
author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
journal={arXiv preprint arXiv:2502.01456},
year={2025}
}
```
```bibtex
@article{yuan2024implicitprm,
title={Free Process Rewards without Process Labels},
author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
journal={arXiv preprint arXiv:2412.01981},
year={2024}
}
```