---
license: apache-2.0
---
# **PRWKV-cxa076 – "Akemi" RWKV Model Series**
<div align="center">
<img src="./cxa076.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
</div>
---
## **Model Overview**
**PRWKV** stands for **Passion RWKV** – a model series born from relentless experimentation, unyielding dedication, and the burning question:
> **Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?**
This project explores the boundaries of the **RWKV architecture**, replacing the traditional **Transformer Attention blocks** with **TimeMix**, an RNN-based mechanism, while distilling knowledge from Transformer giants.
The PRWKV models range from **3B to 14B parameters**, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.
---
## **Project Objective**
The **sole purpose** of this project was to **test the feasibility of replacing Transformer Attention with RNN-based TimeMix**.
- **No shortcuts**.
- **No compromises**.
- Just pure **architectural curiosity** driven by **Passion**.
---
## **Technical Challenges & Triumphs**
### 🔥 **Distillation from Transformers**
- The models were distilled from high-quality **Transformer-based teachers** that use **Grouped Query Attention (GQA)**.
- The **TimeMix** blocks were heavily customized to align with the semantics of Attention layers.
- Special care was taken to **inherit weight structures** from the teacher's **Receptance, Key, Value, and Output layers**, enabling smoother early-stage learning.
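As an illustration of that last point, here is a minimal sketch of copying a GQA teacher's attention projections into a TimeMix block. The layer names (`q_proj`/`k_proj`/`v_proj`/`o_proj` on the teacher, `receptance`/`key`/`value`/`output` on the student) and the toy shapes are assumptions for illustration, not the actual PRWKV training code.
```python
import torch
import torch.nn as nn
from types import SimpleNamespace

@torch.no_grad()
def init_timemix_from_attention(timemix, attn):
    """Copy a GQA teacher's attention projections into a TimeMix block.
    Layer names (q_proj/k_proj/v_proj/o_proj, receptance/key/value/output)
    are illustrative assumptions, not the repository's actual identifiers."""
    timemix.receptance.weight.copy_(attn.q_proj.weight)  # receptance <- query
    timemix.key.weight.copy_(attn.k_proj.weight)         # key        <- key   (GQA width)
    timemix.value.weight.copy_(attn.v_proj.weight)       # value      <- value (GQA width)
    timemix.output.weight.copy_(attn.o_proj.weight)      # output     <- output

# Toy shapes: hidden size 64, 8 query heads of dim 8, 2 KV heads (GQA).
attn = SimpleNamespace(
    q_proj=nn.Linear(64, 64, bias=False),
    k_proj=nn.Linear(64, 16, bias=False),
    v_proj=nn.Linear(64, 16, bias=False),
    o_proj=nn.Linear(64, 64, bias=False),
)
timemix = SimpleNamespace(
    receptance=nn.Linear(64, 64, bias=False),
    key=nn.Linear(64, 16, bias=False),
    value=nn.Linear(64, 16, bias=False),
    output=nn.Linear(64, 64, bias=False),
)
init_timemix_from_attention(timemix, attn)
```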
### ⚡ **Key Innovations**
- **RepeatKV mechanism**: Introduced for more stable group-based key-value projection (see the sketch after this list).
- **GroupNorm vs NoNorm**: Extensive experiments revealed that removing normalization sometimes improved long-context stability.
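Below is a minimal sketch of a GQA-style `repeat_kv` expansion, the kind of group-based key-value broadcasting that RepeatKV refers to. The tensor layout and function name are assumptions; the actual PRWKV implementation may differ.
```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand grouped K/V heads to match the number of query/receptance heads.

    x: (batch, seq, n_kv_heads, head_dim)
    returns: (batch, seq, n_kv_heads * n_rep, head_dim)
    """
    b, t, n_kv, d = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]                 # (b, t, n_kv, 1, d)
        .expand(b, t, n_kv, n_rep, d)       # repeat each KV head n_rep times
        .reshape(b, t, n_kv * n_rep, d)     # flatten groups back into heads
    )

# Example: 2 KV heads expanded to 8 heads (group size 4).
k = torch.randn(1, 5, 2, 16)
k_full = repeat_kv(k, n_rep=4)   # -> shape (1, 5, 8, 16)
```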
---
### 📈 **Scaling Observations**
- PRWKV scales from **3B** to **14B** parameters.
- **14B KD** runs achieved a **KL divergence below 0.1**, showing that **RNN TimeMix blocks can indeed mimic Transformer Attention** with high fidelity (a generic distillation loss is sketched after this list).
- However, **context-length expansion** beyond 2048 tokens remains an ongoing challenge due to gradient instability in larger models.
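For reference, a token-level KL-divergence distillation loss of the kind reported above might look like the sketch below. The function name, temperature, and reduction are assumptions; the exact objective used for PRWKV is not specified in this card.
```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """Mean per-token KL(teacher || student) between two logit tensors.

    Shapes: (batch, seq, vocab). Hypothetical helper for illustration only.
    """
    t = temperature
    vocab = student_logits.size(-1)
    log_p_s = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    log_p_t = F.log_softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # F.kl_div(input, target) computes KL(target || input); with log_target=True
    # both arguments are log-probabilities. "batchmean" averages over tokens here.
    return F.kl_div(log_p_s, log_p_t, log_target=True,
                    reduction="batchmean") * (t * t)

# Example with random logits (batch=2, seq=4, vocab=32).
student = torch.randn(2, 4, 32)
teacher = torch.randn(2, 4, 32)
print(kd_kl_loss(student, teacher).item())
```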
---
## **Limitations**
- The models are still under development and primarily serve **as a proof-of-concept**.
- Long-context (4096+ tokens) stability varies with model size and requires further refinement.
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.
---
# **A Poem of Passion**
> **In the depths of night, when GPUs hum soft,
> A fire ignites, a dream aloft.**
>
> **To mold an RNN with TimeMix bright,
> And rival Attention’s daunting might.**
>
> **Through spikes and crashes, I pressed on,
> A madman's code, from dusk till dawn.**
>
> **Not for glory, nor for gold,
> But just to see: can TimeMix hold?**
>
> **And when the losses dipped so low,
> The pulse of passion dared to grow.**
>
> **PRWKV, a name of flame,
> Not just a model – but a claim.**
>
> **That in this dance of gates and states,
> Passion alone rewrites the fates.**
>
> **So here's my heart, in code and rhyme,
> RNNs reborn, beyond their time.**
---
🔥 **PRWKV is more than an experiment – it is a testament to Passion.** 🔥
---
## **Scalability Test: From Small to Large**
To-do list (for me):
- Qwen 2.5 14B
- Qwen 2.5 7B
- Qwen 2.5 3B
- Phi-4 14B
- Phi-4-mini 3.8B
- Gemma 3 12B
- Gemma 3 4B
---
## **Architecture & Inference**
- **Architecture**: RWKV cxa076 (based on RWKV x070)
- Currently supported only in **RWKV-Infer**.

Load the model through RWKV-Infer's `/loadmodel` endpoint:
```bash
curl http://127.0.0.1:9000/loadmodel -X POST \
  -H "Content-Type: application/json" \
  -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8", "template":"qwen", "endtoken":"<|im_end|>","default_temperature":"1.0", "default_top_p":"0.3"}'
```
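The same `/loadmodel` request can also be issued from Python with `requests`; this is simply the curl call above translated, with the payload unchanged.
```python
import requests

# Same payload as the curl example above; adjust the model path for your setup.
payload = {
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3",
}
resp = requests.post("http://127.0.0.1:9000/loadmodel", json=payload)
print(resp.status_code, resp.text)
```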