---
title: README
emoji: 🏃
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# Process Reinforcement Through Implicit Rewards
# Links

- [Paper](https://arxiv.org/abs/2502.01456)
- [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
- [GitHub](https://github.com/PRIME-RL/PRIME)

# Evaluation

Through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model: **16.7%** (absolute) on average, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, is based on Qwen-2.5-Math-7B-Base and surpasses its instruct version on average across 5 key reasoning benchmarks. The final results are presented below; values in parentheses are gains over Eurus-2-7B-SFT.

|               | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen-2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
| ------------- | -------------------- | ------------------ | ----------------------------- | -------------------------- | ---------- |
| AIME 2024     | **26.7 (+23.3)**     | 3.3                | 13.3                          | 16.7                       | 9.3        |
| MATH-500      | 79.2 (+14.1)         | 65.1               | **79.8**                      | 64.6                       | 76.4       |
| AMC           | **57.8 (+27.7)**     | 30.1               | 50.6                          | 30.1                       | 45.8       |
| Minerva Math  | **38.6 (+5.9)**      | 32.7               | 34.6                          | 35.3                       | 36.8       |
| OlympiadBench | 42.1 (+12.3)         | 29.8               | 40.7                          | 31.9                       | **43.3**   |
| Avg.          | **48.9 (+16.7)**     | 32.2               | 43.8                          | 35.7                       | 43.3       |

We achieved this with only 1/10 of the data and model resources used by Qwen-Math.

|            | **Eurus-2-7B-PRIME**           | **Qwen2.5-Math-7B-Instruct**    |
| ---------- | ------------------------------ | ------------------------------- |
| Base Model | Qwen2.5-Math-7B                | Qwen2.5-Math-7B                 |
| SFT Data   | **230K (open-source)**         | 2.5M (open-source and in-house) |
| RM Data    | **0**                          | 618K (in-house)                 |
| RM         | **Eurus-2-7B-SFT**             | Qwen2.5-Math-RM (72B)           |
| RL Data    | **150K queries × 4 samples**   | 66K queries × 32 samples        |

# Citation

If you find PRIME or ImplicitPRM helpful, please cite us.
```
@article{cui2025process,
  title={Process reinforcement through implicit rewards},
  author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}
```

```
@article{yuan2024implicitprm,
  title={Free Process Rewards without Process Labels},
  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
  journal={arXiv preprint arXiv:2412.01981},
  year={2024}
}
```
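
# Checking the Reported Averages

As a quick sanity check, the sketch below recomputes the average scores of Eurus-2-7B-PRIME and Eurus-2-7B-SFT from the five per-benchmark numbers in the Evaluation table, along with the average gain. It is only an illustration: the variable names are our own, the inputs are the rounded values printed in the table, and we assume the "Avg." row is an unweighted mean over the five benchmarks.

```python
# Scores copied from the Evaluation table (accuracy, in %).
# Each entry maps a benchmark to (Eurus-2-7B-PRIME, Eurus-2-7B-SFT).
scores = {
    "AIME 2024": (26.7, 3.3),
    "MATH-500": (79.2, 65.1),
    "AMC": (57.8, 30.1),
    "Minerva Math": (38.6, 32.7),
    "OlympiadBench": (42.1, 29.8),
}


def mean(values):
    """Unweighted arithmetic mean (assumed to match the table's 'Avg.' row)."""
    values = list(values)
    return sum(values) / len(values)


prime_avg = mean(prime for prime, _ in scores.values())
sft_avg = mean(sft for _, sft in scores.values())
avg_gain = prime_avg - sft_avg

print(f"PRIME average: {prime_avg:.1f}")
print(f"SFT average:   {sft_avg:.1f}")
print(f"Average gain:  {avg_gain:+.1f}")
```

Because the inputs are already rounded to one decimal place, individual per-benchmark differences recomputed this way can differ from the table's parenthesized gains in the last digit.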