Add pipeline tag and library name to metadata (#2)
Add pipeline tag and library name to metadata (55cb02eb834550f0fef62f019b4951e33b641fab)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
@@ -1,16 +1,19 @@
---
+base_model:
+- internlm/internlm2_5-7b
language:
- en
- zh
+license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
+pipeline_tag: text-ranking
+library_name: transformers
---
+
<div align="center">

<img src="./misc/logo.png" width="400"/><br>
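The new `pipeline_tag` and `library_name` fields are what the Hub uses to index the model and pick its default code-snippet widget. As a quick sanity check, the values can be read back through the Hub API; this is a minimal sketch assuming the `huggingface_hub` client, and the repo id below is a placeholder to replace with this card's actual repository.

```python
from huggingface_hub import model_info

# Placeholder repo id for illustration; use the repository this model card belongs to.
info = model_info("internlm/POLAR-7B")

# Both fields come from the YAML front matter updated in this commit.
print(info.pipeline_tag)   # expected: text-ranking
print(info.library_name)   # expected: transformers
```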
@@ -35,13 +38,13 @@ tags:

POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

+* **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

+* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

+* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

+* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

<img src="./misc/intro.jpeg"/><br>

@@ -60,18 +63,18 @@ We conducted a comprehensive evaluation of POLAR-7B via the Proximal Policy Optimization

You can employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible and full-featured toolkit for fine-tuning LLMs.

+- It is recommended to build a Python 3.10 virtual environment using conda:

+```bash
+conda create --name xtuner-env python=3.10 -y
+conda activate xtuner-env
+```

+- Install xtuner via pip:

+```shell
+pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+```
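After the pip install above, a quick way to confirm the toolkit is usable is to query it from the activated environment. This is a minimal check, assuming the `xtuner` console script ends up on your PATH:

```shell
# Confirm the package is installed in the active conda environment
pip show xtuner

# List the config templates bundled with xtuner
xtuner list-cfg
```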

## Inference

@@ -129,7 +132,7 @@ print(rewards)

#### Requirements

+- lmdeploy >= 0.9.1
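The lmdeploy floor above maps directly onto a pip version specifier. A minimal sketch, assuming a standard CUDA/PyTorch environment (pick the wheel that matches your stack if it differs):

```shell
pip install "lmdeploy>=0.9.1"
```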

#### Server Launch

@@ -159,7 +162,7 @@ print(rewards)

#### Requirements

+- 0.4.3.post4 <= sglang <= 0.4.4.post1
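The bounded sglang range can likewise be expressed as a single pip specifier. A minimal sketch, assuming the plain `sglang` package suffices for your setup (some installs prefer the `sglang[all]` extra):

```shell
pip install "sglang>=0.4.3.post4,<=0.4.4.post1"
```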

#### Server Launch

@@ -190,7 +193,7 @@ print(rewards)

#### Requirements

+- vllm >= 0.8.0
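And the vllm requirement, again as an assumed-typical pip install plus a quick version check:

```shell
pip install "vllm>=0.8.0"
python -c "import vllm; print(vllm.__version__)"
```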

#### Server Launch

@@ -221,8 +224,8 @@ print(rewards)

### Requirements

+- flash_attn
+- tensorboard
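Both fine-tuning dependencies are available on PyPI. Note that `flash-attn` builds CUDA kernels against your installed PyTorch, so the `--no-build-isolation` flag below follows the flash-attention project's usual install guidance rather than anything specific to this card:

```shell
pip install tensorboard
pip install flash-attn --no-build-isolation
```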

### Data format

@@ -239,31 +242,31 @@ Unlike traditional reward models, POLAR requires an additional reference trajectory

### Training steps

+- **Step 0:** Prepare the config. We provide ready-to-use example configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet your requirements, copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

+- **Step 1:** Start fine-tuning.

```shell
xtuner train ${CONFIG_FILE_PATH}
```

+For example, you can start fine-tuning POLAR-7B-Base with:

+```shell
+# On a single GPU
+xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

+# On multiple GPUs
+NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+```

+Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

+- **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

+```shell
+xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+```
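Put together, an end-to-end run of the three steps looks roughly like the sketch below. The config copy, GPU count, `work_dirs` checkpoint path, and output directory are illustrative placeholders (xtuner's defaults may differ), not values taken from this card:

```shell
# Step 0: copy the example config from the POLAR repo (linked above) into the working directory
cp /path/to/POLAR/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py .

# Step 1: fine-tune, here on 8 GPUs with ZeRO-2
NPROC_PER_NODE=8 xtuner train ./POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

# Step 2: convert the saved PTH checkpoint (a directory when DeepSpeed is used) to Hugging Face format
xtuner convert pth_to_hf ./POLAR_7B_full_varlenattn_custom_dataset.py \
  ./work_dirs/POLAR_7B_full_varlenattn_custom_dataset/iter_XXXX.pth \
  ./polar-7b-rm-hf
```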

# Examples

@@ -298,7 +301,9 @@ rewards = client(data)
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+    print(f"Output: {output}\nReward: {reward}\n")
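The ranking idiom above relies on `outputs` and `rewards` built earlier in the card. For readers looking at this hunk in isolation, a self-contained sketch with made-up reward values (not model outputs) behaves the same way:

```python
# Dummy stand-ins for the candidate outputs and their POLAR reward scores
outputs = ["answer A", "answer B", "answer C"]
rewards = [-7.13, -6.70, -7.10]

# Rank candidates from highest to lowest reward, as in the example above
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")
```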
@@ -314,7 +319,7 @@ Reward: -6.70703125
Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.
Reward: -7.10546875

+Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
Reward: -7.1328125

Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.

@@ -352,7 +357,9 @@ rewards = client(data)
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+    print(f"Output: {output}\nReward: {reward}\n")