Update README.md
---
tags:
- C-RLFT
datasets:
- openchat/openchat_sharegpt4_dataset
- kaist-ai/Feedback-Collection
- imone/OpenOrca_FLAN
- LDJnr/LessWrong-Amplify-Instruct
- LDJnr/Pure-Dove
library_name: transformers
pipeline_tag: text-generation
---

<div align="center">
<img src="https://raw.githubusercontent.com/imoneoi/openchat/master/assets/logo_new.png" style="width: 65%">
</div>

<p align="center">
<a href="https://arxiv.org/pdf/2309.11235.pdf">Paper</a>
</p>
# OpenChat 3.5: First Update Released on December 10th!

**🚀 15-point improvement in coding performance**

**💡 Introducing a coding & generalist mode and a mathematical reasoning mode**

**🧑‍⚖️ Experimental support for evaluator and feedback capabilities**

**🤖 Outperforms Grok-1 in 3/4 and ChatGPT (March) in 5/8 benchmarks**

| Model                       | Size   | HumanEval+ pass@1 |
|-----------------------------|--------|-------------------|
| ChatGPT (December 12, 2023) | -      | 64.6              |
| WizardCoder-Python-34B-V1.0 | 34B    | 64.6              |
| **OpenChat 3.5 (Dec 10)**   | **7B** | **63.4**          |
| OpenHermes 2.5              | 7B     | 41.5              |

<div style="display: flex; justify-content: center; align-items: center">
<img src="https://raw.githubusercontent.com/imoneoi/openchat/master/assets/openchat.png" style="width: 45%;">
<img src="https://raw.githubusercontent.com/imoneoi/openchat/master/assets/openchat_grok.png" style="width: 45%;">
</div>

OpenChat is an innovative library of open-source language models, fine-tuned with [C-RLFT](https://arxiv.org/pdf/2309.11235.pdf), a strategy inspired by offline reinforcement learning. Our models learn from mixed-quality data without preference labels, delivering performance on par with ChatGPT even at the 7B scale. Despite this simple approach, we are committed to developing a high-performance, commercially viable, open-source large language model, and we continue to make significant strides toward that vision.
Once started, the server listens at `localhost:18888` for requests and is compatible with the OpenAI ChatCompletion API.

If you want to deploy the server as an online service, use `--api-keys sk-KEY1 sk-KEY2 ...` to specify the allowed API keys and `--disable-log-requests --disable-log-stats --log-file openchat.log` to log only to a file. For security, we recommend placing an [HTTPS gateway](https://fastapi.tiangolo.com/es/deployment/concepts/#security-https) in front of the server.

| Model             | Size | Context | Weights                                                          | Serving                                                                                                          |
|-------------------|------|---------|------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| OpenChat 3.5 1210 | 7B   | 8192    | [Huggingface](https://huggingface.co/openchat/openchat_3.5_1210) | `python -m ochat.serving.openai_api_server --model openchat/openchat_3.5_1210 --engine-use-ray --worker-use-ray` |
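Since the server speaks the OpenAI ChatCompletion format, requests can also be issued from Python. Below is a minimal standard-library sketch; the model name `openchat_3.5` and the `condition` field for selecting the math mode are assumptions mirroring the project's API examples, so verify them against your deployment before relying on them.

```python
import json
from urllib import request

API_URL = "http://localhost:18888/v1/chat/completions"

def chat_payload(content, condition=None):
    """Build a ChatCompletion request body for a single user message.

    `condition` (assumed field name) selects the conversation mode,
    e.g. "Math Correct" for mathematical reasoning.
    """
    body = {"model": "openchat_3.5",
            "messages": [{"role": "user", "content": content}]}
    if condition is not None:
        body["condition"] = condition
    return json.dumps(body).encode()

def chat(content, condition=None):
    """POST to the local server (must already be running) and return the reply text."""
    req = request.Request(API_URL, data=chat_payload(content, condition),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The payload helper is separated from the network call so the request body can be inspected or reused with any HTTP client.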
<details>
<summary>Example request (click to expand)</summary>

💡 **Default Mode (GPT4 Correct)**: Best for coding, chat and general tasks

```bash
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    ...
  }'
```

🧮 **Mathematical Reasoning Mode**: Tailored for solving math problems

```bash
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    ...
  }'
```

</details>
### Conversation templates

💡 **Default Mode (GPT4 Correct)**: Best for coding, chat and general tasks

```
GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
```

🧮 **Mathematical Reasoning Mode**: Tailored for solving math problems

```
Math Correct User: 10.3 − 7988.8133=<|end_of_turn|>Math Correct Assistant:
```

⚠️ **Notice:** Remember to set `<|end_of_turn|>` as the end-of-generation token.
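The template strings above are mechanical enough to assemble by hand. A minimal sketch (the `build_prompt` helper is illustrative, not part of the OpenChat library):

```python
EOT = "<|end_of_turn|>"  # must also be configured as the end-of-generation token

def build_prompt(turns, mode="GPT4 Correct"):
    """Render (role, content) turns into the template shown above.

    `mode` is "GPT4 Correct" for the default mode or "Math Correct"
    for the mathematical reasoning mode.
    """
    rendered = ""
    for role, content in turns:
        speaker = "User" if role == "user" else "Assistant"
        rendered += f"{mode} {speaker}: {content}{EOT}"
    # Trailing generation prompt: the assistant turn the model should complete.
    return rendered + f"{mode} Assistant:"

prompt = build_prompt([("user", "Hello"), ("assistant", "Hi"),
                       ("user", "How are you today?")])
# prompt == "GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:"
```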
The default (GPT4 Correct) template is also available as the integrated `tokenizer.chat_template`, which can be used instead of manually specifying the template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5_1210")
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"}
]
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747, 15359, 32000, 420, 6316, 28781, 3198, 3123, 1247, 28747, 1602, 460, 368, 3154, 28804, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]
```
## 🧑‍⚖️ (Experimental) Evaluator / Feedback Capabilities

We've included evaluator capabilities in this release to advance open-source models as evaluators. You can use `Default Mode (GPT4 Correct)` with the following prompt (same as [Prometheus](https://huggingface.co/datasets/kaist-ai/Feedback-Collection)) to evaluate a response.

```
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:
```
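Because step 3 of the task description fixes the output format, the completion can be split programmatically. A small sketch (the `parse_feedback` helper is illustrative, not part of the OpenChat library):

```python
import re

def parse_feedback(completion):
    """Extract (feedback, score) from a 'Feedback: ... [RESULT] n' completion."""
    match = re.search(r"Feedback:\s*(.*?)\s*\[RESULT\]\s*([1-5])",
                      completion, re.DOTALL)
    if match is None:
        raise ValueError("completion does not follow the expected format")
    return match.group(1), int(match.group(2))

feedback, score = parse_feedback(
    "Feedback: The response is accurate and well structured. [RESULT] 4"
)
# feedback == "The response is accurate and well structured."; score == 4
```

Raising on malformed output, rather than guessing a score, keeps evaluation pipelines honest when the model deviates from the rubric format.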
## Comparison with [X.AI Grok models](https://x.ai/)

|        | License     | # Param | Average | MMLU | HumanEval | MATH | GSM8k |
|--------|-------------|---------|---------|------|-----------|------|-------|
| Grok-0 | Proprietary | 33B     | 44.5    | 65.7 | 39.7      | 15.7 | 56.8  |
| Grok-1 | Proprietary | ???B    | 55.8    | 73   | 63.2      | 23.9 | 62.9  |

*: Grok results are reported by [X.AI](https://x.ai/).
## <a id="benchmarks"></a> Benchmarks

| Model | # Params | Average | MT-Bench | HumanEval | BBH MC | AGIEval | TruthfulQA | MMLU | GSM8K | BBH CoT |

OpenChat 3.5 was trained with C-RLFT on a collection of publicly available high-quality instruction data, including:

- [OpenChat ShareGPT](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset)
- [Open-Orca with FLAN answers](https://huggingface.co/datasets/imone/OpenOrca_FLAN)
- [Feedback-Collection](https://huggingface.co/datasets/kaist-ai/Feedback-Collection)
- Capybara [1](https://huggingface.co/datasets/LDJnr/Pure-Dove) [2](https://huggingface.co/datasets/LDJnr/Verified-Camel) [3](https://huggingface.co/datasets/LDJnr/LessWrong-Amplify-Instruct)
- [GOAT](https://huggingface.co/datasets/tiedong/goat)
- [Glaive](https://huggingface.co/datasets/glaiveai/glaive-code-assistant)