Commit 7f2b61d (verified) by yuchenFan · 1 Parent(s): 46d8117

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -183,10 +183,11 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different
 
 We evaluate **EurusPRM-Stage 1** and **EurusPRM-Stage 2** on **ProcessBench**.
 The threshold is obtained by converting the original score of each step using sigmoid function and iterating to find the highest F1 on GSM8k sub-benchmark. The threshold for **EurusPRM-Stage 1** and **EurusPRM-Stage 2** is 0.5015 and 0.5005 respectively.
-For leveraging the capibility of **EurusPRM** better, we add ``Step K`` (where K is the actual index of the step) in front of each step in **ProcessBench**.
+
+To leverage the capibility of **EurusPRM** better, we add ``Step K`` (where K is the actual index of the step) in front of each step in **ProcessBench**.
 
 | Reward Model | GSM8k | MATH | OlympiadBench | Omni-Math | Avg |
-| --- | --- | --- | --- | --- | --- | --- |
+| --- | --- | --- | --- | --- | --- |
 | Math-Shepherd-PRM-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
 | RLHFlow-PRM-Mistral-8B | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
 | RLHFlow-PRM-Deepseek-8B | 38.8 | 33.8 | 16.9 | 16.9 | 26.6 |
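
Note on the changed paragraph: the README text in this hunk describes two evaluation details, prefixing each step with ``Step K`` and picking a threshold on sigmoid-converted step scores by maximising F1. The sketch below is one possible reading of that description, not the authors' evaluation code. Everything named here is hypothetical: the function names, the 1-based index, the `": "` separator, the candidate grid, and the assumption that F1 follows the ProcessBench convention (harmonic mean of accuracy on erroneous samples and accuracy on fully correct samples, with label `-1` meaning "no error").

```python
import math


def add_step_labels(steps, sep=": "):
    """Prefix each step with 'Step K' (assumed 1-based; separator is a guess)."""
    return [f"Step {k}{sep}{step}" for k, step in enumerate(steps, start=1)]


def sigmoid(x):
    """Numerically stable logistic function for a raw step score."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_first_error(raw_scores, threshold):
    """Index of the first step whose sigmoid score is below threshold, else -1."""
    for i, score in enumerate(raw_scores):
        if sigmoid(score) < threshold:
            return i
    return -1


def processbench_f1(samples, threshold):
    """Harmonic mean of accuracy on erroneous and on fully correct samples.

    samples: list of (raw_step_scores, label), label = -1 if no step is wrong.
    """
    err_hits = err_total = ok_hits = ok_total = 0
    for raw_scores, label in samples:
        pred = predict_first_error(raw_scores, threshold)
        if label == -1:
            ok_total += 1
            ok_hits += int(pred == -1)
        else:
            err_total += 1
            err_hits += int(pred == label)
    if err_total == 0 or ok_total == 0:
        return 0.0
    acc_err = err_hits / err_total
    acc_ok = ok_hits / ok_total
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)


def best_threshold(samples, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: processbench_f1(samples, t))
```

In this reading, `best_threshold` would be run once on the GSM8k split with a fine grid of candidates, and the selected value then reused unchanged for the other splits.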