argilla/notus-7b-v1


alvarobartt (HF staff) committed on Nov 30, 2023

Commit 75dfa6c · 1 Parent(s): 517e565

Update README.md

Files changed (1)

README.md +4 -3


@@ -175,19 +175,20 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs, but while experimenting w
 ### Training Data
-We used a a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/argilla/ultrafeedback-binarized-preferences).
-TLDR:
 After visually browsing around some examples using the sort and filter feature of Argilla (sort by highest rating for chosen responses), we noticed a strong mismatch between the `overall_score` in the original UF dataset (and the Zephyr train_prefs dataset) and the quality of the chosen response.
 By adding the critique rationale to our Argilla Dataset, we confirmed the critique rationale was highly negative, whereas the rating was very high (the highest in fact: `10`).
 See screenshot below for one example of this issue.
 After some quick investigation, we identified hundreds of examples having the same issue, reported a bug on the UltraFeedback repo, and informed the H4 team.
 While we're working on fixing the original dataset (already narrowed down ~2K problematic examples). We decided to leverage the multi-preference ratings, leading to Notus!
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/M9qCKyAB_G1MbVBAPeitd.png)
 ## Prompt template

 ### Training Data
+We used a a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
+TL;DR
 After visually browsing around some examples using the sort and filter feature of Argilla (sort by highest rating for chosen responses), we noticed a strong mismatch between the `overall_score` in the original UF dataset (and the Zephyr train_prefs dataset) and the quality of the chosen response.
 By adding the critique rationale to our Argilla Dataset, we confirmed the critique rationale was highly negative, whereas the rating was very high (the highest in fact: `10`).
 See screenshot below for one example of this issue.
 After some quick investigation, we identified hundreds of examples having the same issue, reported a bug on the UltraFeedback repo, and informed the H4 team.
 While we're working on fixing the original dataset (already narrowed down ~2K problematic examples). We decided to leverage the multi-preference ratings, leading to Notus!
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/M9qCKyAB_G1MbVBAPeitd.png)
 ## Prompt template