RowitZou and nielsr (HF Staff) committed
Commit af82088 · verified · 1 Parent(s): 7ae0fb6

Add pipeline tag and library name to metadata (#2)

- Add pipeline tag and library name to metadata (55cb02eb834550f0fef62f019b4951e33b641fab)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +45 -38
README.md CHANGED
@@ -1,16 +1,19 @@
  ---
- license: apache-2.0
  language:
  - en
  - zh
- base_model:
- - internlm/internlm2_5-7b
  tags:
  - Reward
  - RL
  - RFT
  - Reward Model
  ---
  <div align="center">

  <img src="./misc/logo.png" width="400"/><br>
@@ -35,13 +38,13 @@ tags:

  POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using a large-scale synthetic corpus. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

- * **Innovative Pre-training Paradigm:** POLAR trains a reward model to recognize identical policies and discriminate between different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

- * **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

- * **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

- * **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

  <img src="./misc/intro.jpeg"/><br>
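To make the paradigm above concrete, here is a toy, self-contained sketch of the workflow POLAR implies: candidate trajectories are scored relative to a reference trajectory and then ranked by reward. The token-overlap heuristic below is only a stand-in for the reward model, not POLAR itself:

```python
import re

# Toy stand-in for a policy-discriminative reward: score each candidate against a
# reference trajectory, then rank candidates by reward. POLAR replaces this crude
# token-overlap heuristic with a pre-trained reward model; only the workflow
# (reference-conditioned scoring followed by ranking) is illustrated here.

def toy_reward(reference: str, candidate: str) -> float:
    ref = set(re.findall(r"[a-z']+", reference.lower()))
    cand = set(re.findall(r"[a-z']+", candidate.lower()))
    return len(ref & cand) / max(len(ref | cand), 1)

reference = "Beijing is the capital of China."
candidates = [
    "The capital of China is Beijing.",
    "The capital of China is Shanghai.",
]
rewards = [toy_reward(reference, c) for c in candidates]
for cand, reward in sorted(zip(candidates, rewards), key=lambda x: x[1], reverse=True):
    print(f"Reward {reward:.3f}: {cand}")
```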
 
@@ -60,18 +63,18 @@ We conducted a comprehensive evaluation of POLAR-7B via the Proximal Policy Opti

  You can employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible and full-featured toolkit for fine-tuning LLMs.

- - It is recommended to build a Python 3.10 virtual environment using conda:

- ```bash
- conda create --name xtuner-env python=3.10 -y
- conda activate xtuner-env
- ```

- - Install xtuner via pip:

- ```shell
- pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
- ```
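A quick, optional check that the install above succeeded (`xtuner list-cfg` lists the configs bundled with the toolkit; its exact output varies by xtuner version):

```bash
# Confirm the package resolves and the CLI entry point is on PATH.
pip show xtuner
xtuner list-cfg | head
```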

  ## Inference

@@ -129,7 +132,7 @@ print(rewards)

  #### Requirements

- - lmdeploy >= 0.9.1

  #### Server Launch

@@ -159,7 +162,7 @@ print(rewards)

  #### Requirements

- - 0.4.3.post4 <= sglang <= 0.4.4.post1

  #### Server Launch

@@ -190,7 +193,7 @@ print(rewards)

  #### Requirements

- - vllm >= 0.8.0
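The backend version constraints listed in the Requirements sections above can be pinned directly with pip; install only the backend you actually plan to serve with (extras such as `sglang[all]` may additionally be needed depending on your setup):

```bash
# Example pins matching the stated requirements; pick one backend.
pip install "lmdeploy>=0.9.1"
pip install "sglang>=0.4.3.post4,<=0.4.4.post1"
pip install "vllm>=0.8.0"
```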

  #### Server Launch

@@ -221,8 +224,8 @@ print(rewards)

  ### Requirements

- - flash_attn
- - tensorboard

  ### Data format
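Purely as an illustration (the field names below are assumptions for this sketch, not the authoritative schema; the exemplar configs linked under "Training steps" define the actual format): a POLAR fine-tuning record pairs the usual preference fields with the additional reference trajectory that POLAR requires.

```python
# One hypothetical training record, shown as a Python literal for readability.
# "reference" is the extra trajectory POLAR conditions on; "chosen"/"rejected"
# are the familiar preference pair. Field names are assumed, not confirmed.
record = {
    "prompt": [{"role": "user", "content": "How many 'r's are in 'strawberry'?"}],
    "reference": [{"role": "assistant", "content": "There are three 'r's in 'strawberry'."}],
    "chosen": [{"role": "assistant", "content": "Three."}],
    "rejected": [{"role": "assistant", "content": "Two."}],
}
print(record["reference"][0]["content"])
```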
@@ -239,31 +242,31 @@ Unlike traditional reward models, POLAR requires an additional reference traject

  ### Training steps

- - **Step 0:** Prepare the config. We provide exemplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet your requirements, please copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

- - **Step 1:** Start fine-tuning.

  ```shell
  xtuner train ${CONFIG_FILE_PATH}
  ```

- For example, you can start fine-tuning POLAR-7B-Base with:

- ```shell
- # On a single GPU
- xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

- # On multiple GPUs
- NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
- ```

- Here, `--deepspeed` enables [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

- - **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

- ```shell
- xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
- ```
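As a quick follow-up sketch (not one of the original steps): the directory written by `xtuner convert pth_to_hf` is a regular Hugging Face checkpoint, so it should load with `transformers`. The path below stands in for whatever you passed as `${SAVE_PATH}`:

```python
# Minimal load check for the converted checkpoint; the path is illustrative.
from transformers import AutoModel, AutoTokenizer

save_path = "/path/to/converted_polar_rm"  # your ${SAVE_PATH} from the convert step
tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
model = AutoModel.from_pretrained(save_path, trust_remote_code=True)
print(type(model).__name__)
```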

  # Examples

@@ -298,7 +301,9 @@ rewards = client(data)
  sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

  for output, reward in sorted_res:
- print(f"Output: {output}\nReward: {reward}\n")
  ```

  ```txt
@@ -314,7 +319,7 @@ Reward: -6.70703125
  Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.
  Reward: -7.10546875

- Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is two.
  Reward: -7.1328125

  Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
@@ -352,7 +357,9 @@ rewards = client(data)
  sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

  for output, reward in sorted_res:
- print(f"Output: {output}\nReward: {reward}\n")
  ```

  ```txt
 
  ---
+ base_model:
+ - internlm/internlm2_5-7b
  language:
  - en
  - zh
+ license: apache-2.0
  tags:
  - Reward
  - RL
  - RFT
  - Reward Model
+ pipeline_tag: text-ranking
+ library_name: transformers
  ---
+
  <div align="center">

  <img src="./misc/logo.png" width="400"/><br>
 

  POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using a large-scale synthetic corpus. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

+ * **Innovative Pre-training Paradigm:** POLAR trains a reward model to recognize identical policies and discriminate between different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

+ * **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

+ * **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

+ * **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

  <img src="./misc/intro.jpeg"/><br>

 

  You can employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible and full-featured toolkit for fine-tuning LLMs.

+ - It is recommended to build a Python 3.10 virtual environment using conda:

+ ```bash
+ conda create --name xtuner-env python=3.10 -y
+ conda activate xtuner-env
+ ```

+ - Install xtuner via pip:

+ ```shell
+ pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+ ```

  ## Inference

 

  #### Requirements

+ - lmdeploy >= 0.9.1

  #### Server Launch

 

  #### Requirements

+ - 0.4.3.post4 <= sglang <= 0.4.4.post1

  #### Server Launch

 

  #### Requirements

+ - vllm >= 0.8.0

  #### Server Launch

 

  ### Requirements

+ - flash_attn
+ - tensorboard

  ### Data format

 

  ### Training steps

+ - **Step 0:** Prepare the config. We provide exemplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet your requirements, please copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

+ - **Step 1:** Start fine-tuning.

  ```shell
  xtuner train ${CONFIG_FILE_PATH}
  ```

+ For example, you can start fine-tuning POLAR-7B-Base with:

+ ```shell
+ # On a single GPU
+ xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

+ # On multiple GPUs
+ NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+ ```

+ Here, `--deepspeed` enables [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

+ - **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

+ ```shell
+ xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+ ```

  # Examples

 
  sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

  for output, reward in sorted_res:
+ print(f"Output: {output}
+ Reward: {reward}
+ ")
  ```

  ```txt
 
  Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.
  Reward: -7.10546875

+ Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
  Reward: -7.1328125

  Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
 
  sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

  for output, reward in sorted_res:
+ print(f"Output: {output}
+ Reward: {reward}
+ ")
  ```

  ```txt