Add pipeline tag and library name to model card (#1)
- Add pipeline tag and library name to model card (2fb9d5985ba6004b5aa4d7931e97c690ca266270)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
@@ -1,16 +1,19 @@
 ---
+base_model:
+- internlm/internlm2_5-7b
 language:
 - en
 - zh
+license: apache-2.0
 tags:
 - Reward
 - RL
 - RFT
 - Reward Model
+pipeline_tag: text-ranking
+library_name: transformers
 ---
+
 <div align="center">

 <img src="./misc/logo.png" width="400"/><br>
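For reference (not part of the diff itself): the `pipeline_tag` and `library_name` fields added above are machine-readable card metadata. A minimal sketch of reading them back with `huggingface_hub`, where the repo id is a placeholder I chose for illustration rather than one named in this PR:

```python
# Sketch only; "internlm/POLAR-7B" is a placeholder repo id, not taken from this PR.
from huggingface_hub import ModelCard

card = ModelCard.load("internlm/POLAR-7B")  # placeholder repo id
print(card.data.pipeline_tag)   # expected: "text-ranking"
print(card.data.library_name)   # expected: "transformers"
print(card.data.base_model)     # expected: ["internlm/internlm2_5-7b"]
```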
@@ -35,13 +38,13 @@ tags:

 POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

+* **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

+* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

+* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

+* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

 <img src="./misc/intro.jpeg"/><br>

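The reference-based scoring idea in the bullets above can be pictured with a toy stand-in reward. `toy_reference_reward` below is invented purely for illustration and is not an API of this repository; the real scoring path goes through the deployed server shown in the Inference and Examples sections.

```python
# Toy illustration only: a made-up reward that prefers candidates whose length
# is close to the reference trajectory. It stands in for a POLAR score simply
# to show how reference-based rewards are used to rank candidate outputs.
def toy_reference_reward(reference: str, candidate: str) -> float:
    return -abs(len(candidate) - len(reference))

reference = "A gold-standard one-sentence summary of the article."
candidates = [
    "A concise one-sentence summary of the article.",
    "A rambling multi-clause attempt that drifts away from what the article actually says.",
]

ranked = sorted(candidates, key=lambda c: toy_reference_reward(reference, c), reverse=True)
print(ranked[0])  # candidate ranked highest by the toy reward
```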
@@ -60,18 +63,18 @@ We conducted a comprehensive evaluation of POLAR-7B via the Proximal Policy Opti

 You can use the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs.

+- It is recommended to build a Python 3.10 virtual environment using conda:

+```bash
+conda create --name xtuner-env python=3.10 -y
+conda activate xtuner-env
+```

+- Install xtuner via pip:

+```shell
+pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+```

 ## Inference

@@ -128,7 +131,7 @@ print(rewards)

 #### Requirements

+- lmdeploy >= 0.9.1

 #### Server Launch

@@ -158,7 +161,7 @@ print(rewards)

 #### Requirements

+- 0.4.3.post4 <= sglang <= 0.4.4.post1

 #### Server Launch

@@ -189,7 +192,7 @@ print(rewards)

 #### Requirements

+- vllm >= 0.8.0

 #### Server Launch

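Not part of the diff: a small sketch for checking an installed serving backend against the three version pins added in the Requirements sections above. It assumes the `packaging` library is available in your environment.

```python
# Sketch: compare installed backends against the pins listed above
# (lmdeploy >= 0.9.1, 0.4.3.post4 <= sglang <= 0.4.4.post1, vllm >= 0.8.0).
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

PINS = {
    "lmdeploy": (Version("0.9.1"), None),
    "sglang": (Version("0.4.3.post4"), Version("0.4.4.post1")),
    "vllm": (Version("0.8.0"), None),
}

for name, (low, high) in PINS.items():
    try:
        installed = Version(version(name))
    except PackageNotFoundError:
        continue  # backend not installed; nothing to check
    ok = installed >= low and (high is None or installed <= high)
    print(f"{name} {installed}: {'OK' if ok else 'outside the recommended range'}")
```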
@@ -220,8 +223,8 @@ print(rewards)

 ### Requirements

+- flash_attn
+- tensorboard

 ### Data format

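The data format section itself is unchanged by this PR and not reproduced in the diff. Purely to illustrate the point made in the next hunk's context line (POLAR needs a reference trajectory in addition to the usual preference pair), a record could conceptually look like the sketch below; the field names are invented for illustration and are not the actual xtuner/POLAR schema, so follow the linked configs and guidelines for the real format.

```python
# Hypothetical illustration only: these field names are NOT the real schema.
# The point is simply that each record carries a reference trajectory
# alongside the preference pair used for fine-tuning.
record = {
    "prompt": "Summarize the article in one sentence.",
    "reference": "A gold-standard one-sentence summary.",  # reference trajectory
    "chosen": "A faithful one-sentence summary.",          # preferred response
    "rejected": "An off-topic or rambling response.",      # dispreferred response
}
```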
@@ -238,31 +241,31 @@ Unlike traditional reward models, POLAR requires an additional reference traject

 ### Training steps

+- **Step 0:** Prepare the config. We provide example ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy a provided config and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

+- **Step 1:** Start fine-tuning.

 ```shell
 xtuner train ${CONFIG_FILE_PATH}
 ```

+For example, you can start the fine-tuning of POLAR-7B-Base by:

+```shell
+# On a single GPU
+xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

+# On multiple GPUs
+NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+```

+Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

+- **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

+```shell
+xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+```

 # Examples

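For reference, once the checkpoint is converted to Hugging Face format in Step 2, it can be loaded through `transformers`, which is what the `library_name: transformers` metadata added by this PR declares. A minimal loading sketch, assuming the converted model sits at a local path of your choosing and ships InternLM2-style custom modeling code (hence `trust_remote_code=True`); scoring in practice is served via lmdeploy/sglang/vllm as in the sections above.

```python
# Minimal loading sketch; the path below is a placeholder (e.g. the ${SAVE_PATH}
# used in Step 2), and the dtype choice is an assumption, not a requirement.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "/path/to/converted_hf_model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
```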
@@ -297,7 +300,9 @@ rewards = client(data)
 sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

 for output, reward in sorted_res:
+    print(f"Output: {output}\nReward: {reward}\n")
 ```

 ```txt
@@ -351,7 +356,9 @@ rewards = client(data)
 sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

 for output, reward in sorted_res:
+    print(f"Output: {output}\nReward: {reward}\n")
 ```

 ```txt