# Highly experimental *proper* MoE
Based on SmolLM2 (a Llama-architecture model).
MoE-ified, then further trained on a general dataset.
### Info
```
MoE layers: [8, 12, 16, 20, 24, 28]
Top-k: 2 (activates 50.0% of experts per token)
Hidden size: 960
Total parameters: 494,554,560
Trainable parameters: 494,554,560
Auxiliary loss weight: 0.01
```
Code: https://gist.github.com/cappuch/6a454ec8d2d349a27f9fd84f6ac90554
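
The gist above holds the actual implementation; what follows is only a minimal illustrative sketch of a layer like the one described by the config. It assumes 4 experts per MoE layer (implied by top-k 2 activating 50% of experts), SwiGLU expert MLPs as in Llama, an intermediate size of 2560, and a Switch-Transformer-style load-balancing auxiliary loss weighted by 0.01. The class names, the intermediate size, and the specific aux-loss formulation are assumptions, not the gist's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """Llama-style gated MLP used as a single expert (illustrative)."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoEFeedForward(nn.Module):
    """Top-k routed MoE feed-forward layer with a load-balancing aux loss."""
    def __init__(self, hidden_size=960, intermediate_size=2560,  # 2560 is an assumption
                 num_experts=4, top_k=2, aux_loss_weight=0.01):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.aux_loss_weight = aux_loss_weight
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, intermediate_size) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor):
        batch, seq_len, hidden = x.shape
        flat = x.reshape(-1, hidden)                     # (tokens, hidden)
        logits = self.router(flat)                       # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)  # (tokens, top_k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize kept probs

        # Dispatch each token to its top-k experts and mix the weighted outputs.
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_ids, slot_ids = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weight = top_p[token_ids, slot_ids].unsqueeze(-1)
            out[token_ids] += weight * expert(flat[token_ids])

        # Switch-Transformer-style load-balancing loss: fraction of tokens routed
        # to each expert times the mean router probability for that expert.
        counts = F.one_hot(top_idx, self.num_experts).sum(dim=(0, 1)).float()
        frac_tokens = counts / counts.sum()
        mean_probs = probs.mean(dim=0)
        aux_loss = self.aux_loss_weight * self.num_experts * (frac_tokens * mean_probs).sum()

        return out.reshape(batch, seq_len, hidden), aux_loss


# Usage: add `aux` to the language-modeling loss during training.
layer = MoEFeedForward()
y, aux = layer(torch.randn(2, 16, 960))
```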