Update README.md
README.md (CHANGED)
@@ -41,13 +41,6 @@ The **sole purpose** of this project was to **test the feasibility of replacing
 
 ---
 
-### 💀 **The Painful Side**
-- **Spike Hell**: Ctx4096 training introduced catastrophic **KL Loss spikes**, requiring constant rollbacks and manual interventions.
-- **VRAM starvation**: Running 14B models with long contexts meant **batch sizes** were reduced to **32**, relying on **Gradient Accumulation** just to survive.
-- **System Prompt Overfitting**: Earlier phases locked the model into repeating fixed prompts, needing a **full distillation reset**.
-
----
-
 ### 📈 **Scaling Observations**
 - PRWKV scales from **3B** to **14B** parameters.
 - **14B KD** runs achieved **KL divergence < 0.1**, proving **RNN TimeMix blocks can indeed mimic Transformer Attention** at high fidelity.
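For context on the **KL divergence < 0.1** claim above, here is a minimal sketch of the kind of distillation objective such KD runs typically optimize: the student (PRWKV) is trained to match the teacher Transformer's per-token output distribution. The function name, temperature, and reduction here are illustrative assumptions, not the project's actual training code.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """Forward KL(teacher || student), averaged per token.

    Both inputs have shape (batch, seq_len, vocab). A per-token average
    like this is the natural scale on which a threshold such as 0.1
    would be read. Temperature and reduction are assumptions, not
    necessarily PRWKV's settings.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),  # (batch*seq, vocab), log-probs
        teacher_probs.flatten(0, 1),      # (batch*seq, vocab), probs
        reduction="batchmean",            # sum over vocab, mean over tokens
    )
    return kl * (t * t)  # usual temperature scaling of KD gradients
```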
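The removed "Painful Side" notes on **batch size 32** plus **Gradient Accumulation** describe a standard VRAM workaround: run several small micro-batches, accumulate their gradients, and take a single optimizer step. A hedged sketch follows; the accumulation factor, the gradient-clipping guard (one common defense against the kind of loss spikes mentioned), and all names are placeholders, not the project's recipe.

```python
import torch

ACCUM_STEPS = 8  # assumed: 8 micro-batches of 32 -> effective batch 256

def train_step(model, optimizer, micro_batches, loss_fn):
    """One effective optimizer step built from ACCUM_STEPS micro-batches.

    `micro_batches` yields (inputs, targets) pairs, each small enough to
    fit in VRAM (e.g. batch 32 at ctx 4096).
    """
    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets)
        # Scale so the accumulated gradient matches one large-batch step.
        (loss / ACCUM_STEPS).backward()
    # Clipping is a common guard against spike-driven divergence
    # (an assumption here, not necessarily what PRWKV used).
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```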