Commit 1c1b159
Parent(s): 8890ef9
Update README.md
README.md CHANGED
@@ -22,12 +22,16 @@ license: apache-2.0
 <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
 </div>
 
-Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting with zephyr-7b-beta's SFT model.
+Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting with zephyr-7b-beta's SFT model.
+
+Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
 Using preference ratings, instead of critique scores, led to a new dataset where the chosen response is different in ~50% of the cases.
+
 This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta), and it's based on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and let us focus on what we do best: **high-quality data**.
+
 Notus models are intended to be used as assistants via chat-like applications, and
-are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks
-with the original Zephyr dDPO model.
+are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison
+with the original Zephyr dDPO model and other 7B models.
 
 ## Model Details
 
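The added paragraph above describes binarizing UltraFeedback by the per-response preference ratings rather than the critique `overall_score`. As a rough, hedged sketch of what that step could look like, here is a minimal Python snippet; the record layout (`prompt`, `responses`, `rating`, `overall_score` fields) is an assumption made for illustration, not the actual UltraFeedback schema or the exact recipe used for Notus.

```python
# Illustrative sketch only: binarize a preference record by rating instead of
# the critique overall_score. Field names here are assumptions, not the real
# UltraFeedback schema.

def binarize_by_rating(example: dict) -> dict:
    """Return a (prompt, chosen, rejected) pair picked by preference rating."""
    ranked = sorted(example["responses"], key=lambda r: r["rating"], reverse=True)
    return {
        "prompt": example["prompt"],
        "chosen": ranked[0]["text"],     # highest preference rating
        "rejected": ranked[-1]["text"],  # lowest preference rating
    }


if __name__ == "__main__":
    # Toy record where rating and overall_score disagree, so binarizing by
    # rating picks a different "chosen" response than overall_score would.
    example = {
        "prompt": "Summarize DPO in one sentence.",
        "responses": [
            {"text": "Concise, accurate summary.", "rating": 5, "overall_score": 6},
            {"text": "Verbose, off-topic answer.", "rating": 2, "overall_score": 9},
        ],
    }
    print(binarize_by_rating(example))
```

The toy record is constructed so the two criteria disagree; this is the kind of case the README refers to when it notes that the chosen response changes in roughly half of the examples once ratings are used instead of `overall_score`.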