Update README.md
README.md (CHANGED)
@@ -41,13 +41,6 @@ The **sole purpose** of this project was to **test the feasibility of replacing
 
 ---
 
-### 💀 **The Painful Side**
-- **Spike Hell**: Ctx4096 training introduced catastrophic **KL Loss spikes**, requiring constant rollbacks and manual interventions.
-- **VRAM starvation**: Running 14B models with long contexts meant **batch sizes** were reduced to **32**, relying on **Gradient Accumulation** just to survive.
-- **System Prompt Overfitting**: Earlier phases locked the model into repeating fixed prompts, needing a **full distillation reset**.
-
----
-
 ### 📈 **Scaling Observations**
 - PRWKV scales from **3B** to **14B** parameters.
 - **14B KD** runs achieved **KL divergence < 0.1**, proving **RNN TimeMix blocks can indeed mimic Transformer Attention** at high fidelity.
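For context on the **KL divergence < 0.1** claim above, here is a minimal sketch of the kind of distillation objective such KD runs typically optimize: the student (PRWKV) is trained to match the teacher Transformer's per-token output distribution. The function name, temperature, and reduction here are illustrative assumptions, not the project's actual training code.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """Forward KL(teacher || student), averaged per token.

    Both inputs have shape (batch, seq_len, vocab). A per-token average
    like this is the natural scale on which a threshold such as 0.1
    would be read. Temperature and reduction are assumptions, not
    necessarily PRWKV's settings.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),  # (batch*seq, vocab), log-probs
        teacher_probs.flatten(0, 1),      # (batch*seq, vocab), probs
        reduction="batchmean",            # sum over vocab, mean over tokens
    )
    return kl * (t * t)  # usual temperature scaling of KD gradients
```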
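The removed "Painful Side" notes on **batch size 32** plus **Gradient Accumulation** describe a standard VRAM workaround: run several small micro-batches, accumulate their gradients, and take a single optimizer step. A hedged sketch follows; the accumulation factor, the gradient-clipping guard (one common defense against the kind of loss spikes mentioned), and all names are placeholders, not the project's recipe.

```python
import torch

ACCUM_STEPS = 8  # assumed: 8 micro-batches of 32 -> effective batch 256

def train_step(model, optimizer, micro_batches, loss_fn):
    """One effective optimizer step built from ACCUM_STEPS micro-batches.

    `micro_batches` yields (inputs, targets) pairs, each small enough to
    fit in VRAM (e.g. batch 32 at ctx 4096).
    """
    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets)
        # Scale so the accumulated gradient matches one large-batch step.
        (loss / ACCUM_STEPS).backward()
    # Clipping is a common guard against spike-driven divergence
    # (an assumption here, not necessarily what PRWKV used).
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```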