Update README.md
README.md CHANGED
@@ -22,6 +22,9 @@ Mini-Mixtral-v0.2 is a Mixture of Experts (MoE) made with the following models u
 * [unsloth/mistral-7b-v0.2](https://huggingface.co/unsloth/mistral-7b-v0.2)
 * [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
 
+<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
+<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
+
 ## 🧩 Configuration
 
 ```yaml
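The YAML recipe that opens at the end of this hunk is not shown in the diff. Purely as an illustration of what a frankenMoE recipe for these two experts usually looks like, and not the repo's actual file, a mergekit-moe config follows this shape; the `gate_mode`, `dtype`, and `positive_prompts` values below are assumptions.

```yaml
# Illustrative mergekit-moe recipe only -- the real configuration is in the
# README's Configuration section and is not reproduced in this diff.
base_model: unsloth/mistral-7b-v0.2        # shared backbone the expert FFNs are grafted onto
gate_mode: hidden                          # assumption: seed the router from prompt hidden states
dtype: bfloat16                            # assumption
experts:
  - source_model: unsloth/mistral-7b-v0.2
    positive_prompts:                      # hypothetical prompts that steer the router
      - "Continue the following text"
  - source_model: mistralai/Mistral-7B-Instruct-v0.2
    positive_prompts:
      - "Answer the user's question"
```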
@@ -77,10 +80,6 @@ print(outputs[0]["generated_text"])
 
 ## "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"
 
-
-<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
-<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
-
 ### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
 
 The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
@@ -113,4 +112,3 @@ If all our tokens are sent to just a few popular experts, that will make trainin
 ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/43v7GezlOGg2BYljbU5ge.gif)
 ## "Wait...but you called this a frankenMoE?"
 The difference between MoE and "frankenMoE" lies in the fact that the router layer in a model like the one on this repo is not trained simultaneously.
-```
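To make the routing idea in the context above concrete, here is a minimal, self-contained sketch of a top-2 MoE feed-forward layer. It is not the code behind this repo; the layer sizes and module names are arbitrary, and the comments about gate initialization restate the frankenMoE point quoted in the hunk.

```python
# Minimal top-2 MoE feed-forward layer (illustrative sketch only, not this repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoEFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 4096, num_experts: int = 2):
        super().__init__()
        # One feed-forward "expert" per donor model. In a frankenMoE these weights are
        # copied from existing checkpoints instead of being trained from scratch.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        # The router ("gate") scores every token against every expert. In a conventional
        # MoE it is trained jointly with the experts; in a frankenMoE it is initialized
        # separately, e.g. from hidden states of hand-picked prompts.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = self.gate(x)                               # (num_tokens, num_experts)
        weights, chosen = torch.topk(scores, k=2, dim=-1)   # each token keeps its 2 best experts
        weights = F.softmax(weights, dim=-1)                # normalize over the chosen pair
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                 # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Tiny usage example: route 5 random "tokens" through the layer.
layer = Top2MoEFeedForward(hidden_size=64, num_experts=2)
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64])
```

The load-balancing concern in the hunk context shows up here directly: if `self.gate` keeps picking the same expert for most tokens, the other experts sit idle, which is why trained MoEs add an auxiliary balancing loss and why frankenMoEs typically seed the gate from representative prompts instead.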