|
--- |
|
title: README |
|
emoji: π |
|
colorFrom: blue |
|
colorTo: indigo |
|
sdk: static |
|
pinned: false |
|
--- |
|
|
|
<div align="center"> |
|
|
|
# Process Reinforcement Through Implicit Rewards |
|
|
|
</div> |
|
|
|
|
|
# Links |
|
|
|
- [Paper](https://arxiv.org/abs/2502.01456) |
|
- [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f) |
|
- [GitHub](https://github.com/PRIME-RL/PRIME) |
|
|
|
# Evaluation |
|
Through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model: a **16.7%** improvement on average, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks.
|
The final results are presented below: |
|
|               | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
| ------------- | -------------------- | ------------------ | ---------------------------- | -------------------------- | ---------- |
| AIME 2024     | **26.7 (+23.3)**     | 3.3                | 13.3                         | 16.7                       | 9.3        |
| MATH-500      | 79.2 (+14.1)         | 65.1               | **79.8**                     | 64.6                       | 76.4       |
| AMC           | **57.8 (+27.7)**     | 30.1               | 50.6                         | 30.1                       | 45.8       |
| Minerva Math  | **38.6 (+5.9)**      | 32.7               | 34.6                         | 35.3                       | 36.8       |
| OlympiadBench | 42.1 (+12.3)         | 29.8               | 40.7                         | 31.9                       | **43.3**   |
| Avg.          | **48.9 (+16.7)**     | 32.2               | 43.8                         | 35.7                       | 43.3       |
|
|
|
|
|
We achieved this with only 1/10 of the data and model resources used by Qwen2.5-Math-7B-Instruct.
|
|            | **Eurus-2-7B-PRIME**         | **Qwen2.5-Math-7B-Instruct**    |
| ---------- | ---------------------------- | ------------------------------- |
| Base Model | Qwen2.5-Math-7B              | Qwen2.5-Math-7B                 |
| SFT Data   | **230K (open-source)**       | 2.5M (open-source and in-house) |
| RM Data    | **0**                        | 618K (in-house)                 |
| RM         | **Eurus-2-7B-SFT**           | Qwen2.5-Math-RM (72B)           |
| RL Data    | **150K queries × 4 samples** | 66K queries × 32 samples        |
|
|
|
# Citation |
|
If you find PRIME or ImplicitPRM helpful, please cite our papers:
|
|
|
```
@article{cui2025process,
  title={Process reinforcement through implicit rewards},
  author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}
```
|
|
|
```
@article{yuan2024implicitprm,
  title={Free Process Rewards without Process Labels},
  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
  journal={arXiv preprint arXiv:2412.01981},
  year={2024}
}
```
|
|