Update README.md
README.md
Zephyr is a series of language models that are trained to act as helpful assistants.
## Performance
At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------|------|-----------|------------------|-------------------------|
| StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| **Zephyr-7b-β** 🪁 | **7B** | **dDPO** | **7.34** | **90.60** |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6200d0a443eb0913fa2df7cc/raxvt5ma16d7T23my34WC.png)

However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models and more research is needed to close the gap.
## Intended uses & limitations
The model was initially fine-tuned on a filtered and preprocessed version of the [`UltraChat`](https://huggingface.co/datasets/stingning/ultrachat) dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT.
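As a lightweight sketch of the intended chat use case, the model can be queried with the Hugging Face `transformers` `pipeline` API. The snippet below assumes the released `HuggingFaceH4/zephyr-7b-beta` checkpoint, a `transformers` version with chat-template support, and a GPU with enough memory for bfloat16 weights; the prompt contents are purely illustrative.

```python
# Minimal chat-style generation sketch (assumes the HuggingFaceH4/zephyr-7b-beta
# checkpoint and a transformers release that supports chat templates).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Zephyr was fine-tuned on multi-turn dialogues, so prompts are passed through the
# chat template, which adds the special formatting tokens the model was trained with.
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Explain what MT-Bench measures in one sentence."},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```

The sampling settings shown (`temperature=0.7`, `top_p=0.95`) are common defaults rather than values prescribed here, and can be tuned per task.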
## Training and evaluation data
During DPO training, this model achieves the following results on the evaluation set:

- Loss: 0.7496
- Rewards/chosen: -4.5221
- Rewards/rejected: -8.3184
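For context, the `Rewards/*` values are the implicit rewards defined by the DPO objective: β-scaled log-probability ratios between the trained policy and the frozen reference model, averaged over the evaluation set. The snippet below is a minimal, illustrative sketch of how metrics of this kind can be computed from pre-computed sequence log-probabilities; the function name, its inputs, and the β value are assumptions for illustration, not the exact training code.

```python
# Illustrative sketch of DPO-style metrics. The four inputs are assumed to be
# per-example sequence log-probabilities (cf. the Logps/* columns in the table
# below) from the trained policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit DPO rewards: beta-scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: negative log-sigmoid of the reward margin, averaged over the batch.
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()

    return {
        "loss": loss.item(),
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/accuracies": (chosen_rewards > rejected_rewards).float().mean().item(),
        "rewards/margins": margins.mean().item(),
    }

# Placeholder (negative) log-probabilities, only to show the expected shapes.
fake_logps = -torch.rand(4, 8)
print(dpo_metrics(fake_logps[0], fake_logps[1], fake_logps[2], fake_logps[3]))
```

Under this definition, `Rewards/margins` is the mean gap between the chosen and rejected rewards, and `Rewards/accuracies` is the fraction of preference pairs for which the chosen response receives the higher reward.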
### Training results
The table below shows the full set of DPO training metrics:

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |