Update README.md
Browse files

README.md CHANGED

@@ -23,7 +23,7 @@ tags:
 
 ## Introduction
 
-PathFinder-PRM-7B is a hierarchical discriminative Process Reward Model(PRM) designed to identify errors and reward correct math reasoning in multi-step outputs from large language models (LLMs). Instead of treating evaluation as a single correct-or-wrong decision, PathFinder-PRM-7B breaks down its error judgment into 2 parts: whether the reasoning is mathematically correct, and logically consistent. It predicts these aspects separately and then combines them to decide if the current reasoning steps leads to a correct final solution. PathFinder-PRM-7B is trained on a combination of high-quality human annotated data (PRM800K) and additional automatically annotated samples, enabling robustness to common failure patterns and strong generalization across diverse benchmarks such as ProcessBench and PRMBench.
+PathFinder-PRM-7B is a hierarchical discriminative Process Reward Model (PRM) designed to identify errors and reward correct math reasoning in multi-step outputs from large language models (LLMs). Instead of treating evaluation as a single correct-or-wrong decision, PathFinder-PRM-7B breaks down its error judgment into 2 parts: whether the reasoning is mathematically correct, and logically consistent. It predicts these aspects separately and then combines them to decide if the current reasoning steps leads to a correct final solution. PathFinder-PRM-7B is trained on a combination of high-quality human annotated data (PRM800K) and additional automatically annotated samples, enabling robustness to common failure patterns and strong generalization across diverse benchmarks such as ProcessBench and PRMBench.
 
 ## Model Details
 

@@ -243,9 +243,9 @@ Note: All results are computed using reward-guided greedy search with Qwen2.5‑
       title={Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision},
       author={Tej Deep Pala and Panshul Sharma and Amir Zadeh and Chuan Li and Soujanya Poria},
       year={2025},
-      eprint={2505.
+      eprint={2505.19706},
       archivePrefix={arXiv},
-      primaryClass={cs.
-      url={https://arxiv.org/abs/2505.
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2505.19706},
 }
 ```
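The introduction added in this change describes a discriminative PRM that judges each reasoning step along two dimensions (mathematical correctness and logical consistency) and combines them into a step-level decision. As a rough illustration only, the sketch below shows how a step score from such a model might be obtained with the `transformers` library; the repository id, prompt layout, and `+`/`-` label tokens are assumptions made for this example, not the documented PathFinder-PRM-7B interface, so the model card's own usage instructions take precedence.

```python
# Minimal sketch of querying a discriminative PRM for one reasoning step.
# ASSUMPTIONS: the repo id, prompt format, and label tokens below are
# illustrative placeholders, not the model's documented protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "declare-lab/PathFinder-PRM-7B"  # assumed Hub repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "If 3x + 2 = 11, what is x?"
step = "Subtract 2 from both sides: 3x = 9."

# Assumed prompt layout: the problem, the step to judge, and a query
# covering the two error dimensions described in the introduction.
prompt = (
    f"Question: {question}\n"
    f"Step: {step}\n"
    "Is this step mathematically correct and logically consistent?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Assumed label tokens: compare the probability mass on a "correct"
# token versus an "erroneous" token to get a step-level score.
good_id = tokenizer.convert_tokens_to_ids("+")
bad_id = tokenizer.convert_tokens_to_ids("-")
probs = torch.softmax(logits[[good_id, bad_id]], dim=-1)
print(f"P(step is correct) = {probs[0].item():.3f}")
```

In practice the two sub-judgments would be queried and then combined (e.g., a step counts as correct only if both dimensions are judged positive), mirroring the hierarchical scheme the introduction describes.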