Add pipeline tag and library name to metadata (#2)
Add pipeline tag and library name to metadata (55cb02eb834550f0fef62f019b4951e33b641fab)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
@@ -1,16 +1,19 @@
---
+base_model:
+- internlm/internlm2_5-7b
language:
- en
- zh
+license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
+pipeline_tag: text-ranking
+library_name: transformers
---
+
<div align="center">

<img src="./misc/logo.png" width="400"/><br>
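The new `pipeline_tag` and `library_name` fields are what the Hub uses to index the model and pick its default code-snippet widget. As a quick sanity check, the values can be read back through the Hub API; this is a minimal sketch assuming the `huggingface_hub` client, and the repo id below is a placeholder to replace with this card's actual repository.

```python
from huggingface_hub import model_info

# Placeholder repo id for illustration; use the repository this model card belongs to.
info = model_info("internlm/POLAR-7B")

# Both fields come from the YAML front matter updated in this commit.
print(info.pipeline_tag)   # expected: text-ranking
print(info.library_name)   # expected: transformers
```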
@@ -35,13 +38,13 @@ tags:

POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

+* **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

+* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

+* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

+* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

<img src="./misc/intro.jpeg"/><br>

@@ -60,18 +63,18 @@ We conducted a comprehensive evaluation of POLAR-7B via the Proximal Policy Optimization

You can employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible and full-featured toolkit for fine-tuning LLMs.

+- It is recommended to build a Python 3.10 virtual environment using conda:

+```bash
+conda create --name xtuner-env python=3.10 -y
+conda activate xtuner-env
+```

+- Install xtuner via pip:

+```shell
+pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+```
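After the pip install above, a quick way to confirm the toolkit is usable is to query it from the activated environment. This is a minimal check, assuming the `xtuner` console script ends up on your PATH:

```shell
# Confirm the package is installed in the active conda environment
pip show xtuner

# List the config templates bundled with xtuner
xtuner list-cfg
```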

## Inference

@@ -129,7 +132,7 @@ print(rewards)

#### Requirements

+- lmdeploy >= 0.9.1
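The lmdeploy floor above maps directly onto a pip version specifier. A minimal sketch, assuming a standard CUDA/PyTorch environment (pick the wheel that matches your stack if it differs):

```shell
pip install "lmdeploy>=0.9.1"
```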

#### Server Launch

@@ -159,7 +162,7 @@ print(rewards)

#### Requirements

+- 0.4.3.post4 <= sglang <= 0.4.4.post1
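The bounded sglang range can likewise be expressed as a single pip specifier. A minimal sketch, assuming the plain `sglang` package suffices for your setup (some installs prefer the `sglang[all]` extra):

```shell
pip install "sglang>=0.4.3.post4,<=0.4.4.post1"
```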

#### Server Launch

@@ -190,7 +193,7 @@ print(rewards)

#### Requirements

+- vllm >= 0.8.0
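And the vllm requirement, again as an assumed-typical pip install plus a quick version check:

```shell
pip install "vllm>=0.8.0"
python -c "import vllm; print(vllm.__version__)"
```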

#### Server Launch

@@ -221,8 +224,8 @@ print(rewards)

### Requirements

+- flash_attn
+- tensorboard
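Both fine-tuning dependencies are available on PyPI. Note that `flash-attn` builds CUDA kernels against your installed PyTorch, so the `--no-build-isolation` flag below follows the flash-attention project's usual install guidance rather than anything specific to this card:

```shell
pip install tensorboard
pip install flash-attn --no-build-isolation
```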

### Data format

@@ -239,31 +242,31 @@ Unlike traditional reward models, POLAR requires an additional reference trajectory

### Training steps

+- **Step 0:** Prepare the config. We provide ready-to-use example configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet your requirements, copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

+- **Step 1:** Start fine-tuning.

```shell
xtuner train ${CONFIG_FILE_PATH}
```

+For example, you can start fine-tuning POLAR-7B-Base with:

+```shell
+# On a single GPU
+xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

+# On multiple GPUs
+NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+```

+Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

+- **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

+```shell
+xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+```
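Put together, an end-to-end run of the three steps looks roughly like the sketch below. The config copy, GPU count, `work_dirs` checkpoint path, and output directory are illustrative placeholders (xtuner's defaults may differ), not values taken from this card:

```shell
# Step 0: copy the example config from the POLAR repo (linked above) into the working directory
cp /path/to/POLAR/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py .

# Step 1: fine-tune, here on 8 GPUs with ZeRO-2
NPROC_PER_NODE=8 xtuner train ./POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

# Step 2: convert the saved PTH checkpoint (a directory when DeepSpeed is used) to Hugging Face format
xtuner convert pth_to_hf ./POLAR_7B_full_varlenattn_custom_dataset.py \
  ./work_dirs/POLAR_7B_full_varlenattn_custom_dataset/iter_XXXX.pth \
  ./polar-7b-rm-hf
```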

# Examples

@@ -298,7 +301,9 @@ rewards = client(data)
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+    print(f"Output: {output}\nReward: {reward}\n")
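The ranking idiom above relies on `outputs` and `rewards` built earlier in the card. For readers looking at this hunk in isolation, a self-contained sketch with made-up reward values (not model outputs) behaves the same way:

```python
# Dummy stand-ins for the candidate outputs and their POLAR reward scores
outputs = ["answer A", "answer B", "answer C"]
rewards = [-7.13, -6.70, -7.10]

# Rank candidates from highest to lowest reward, as in the example above
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")
```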
@@ -314,7 +319,7 @@ Reward: -6.70703125
Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.
Reward: -7.10546875

+Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
Reward: -7.1328125

Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.

@@ -352,7 +357,9 @@ rewards = client(data)
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+    print(f"Output: {output}\nReward: {reward}\n")