soujanyaporia committed on
Commit 796e6eb · verified · 1 Parent(s): eddd220

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -23,7 +23,7 @@ tags:
 
  ## Introduction
 
- PathFinder-PRM-7B is a hierarchical discriminative Process Reward Model(PRM) designed to identify errors and reward correct math reasoning in multi-step outputs from large language models (LLMs). Instead of treating evaluation as a single correct-or-wrong decision, PathFinder-PRM-7B breaks down its error judgment into 2 parts: whether the reasoning is mathematically correct, and logically consistent. It predicts these aspects separately and then combines them to decide if the current reasoning steps leads to a correct final solution. PathFinder-PRM-7B is trained on a combination of high-quality human annotated data (PRM800K) and additional automatically annotated samples, enabling robustness to common failure patterns and strong generalization across diverse benchmarks such as ProcessBench and PRMBench.
+ PathFinder-PRM-7B is a hierarchical discriminative Process Reward Model (PRM) designed to identify errors and reward correct math reasoning in multi-step outputs from large language models (LLMs). Instead of treating evaluation as a single correct-or-wrong decision, PathFinder-PRM-7B breaks down its error judgment into 2 parts: whether the reasoning is mathematically correct, and logically consistent. It predicts these aspects separately and then combines them to decide if the current reasoning steps leads to a correct final solution. PathFinder-PRM-7B is trained on a combination of high-quality human annotated data (PRM800K) and additional automatically annotated samples, enabling robustness to common failure patterns and strong generalization across diverse benchmarks such as ProcessBench and PRMBench.
 
  ## Model Details
 
@@ -243,9 +243,9 @@ Note: All results are computed using reward-guided greedy search with Qwen2.5‑
  title={Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision},
  author={Tej Deep Pala and Panshul Sharma and Amir Zadeh and Chuan Li and Soujanya Poria},
  year={2025},
- eprint={2505.12345},
+ eprint={2505.19706},
  archivePrefix={arXiv},
- primaryClass={cs.LG},
- url={https://arxiv.org/abs/2505.12345},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2505.19706},
  }
  ```