---
license: apache-2.0
---
|
|
|
# **PRWKV-cxa076 – "Akemi" RWKV Model Series** |
|
|
|
<div align="center"> |
|
<img src="./cxa076.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" /> |
|
</div> |
|
--- |
|
|
|
## **Model Overview** |
|
|
|
**PRWKV** stands for **Passion RWKV** – a model series born from relentless experimentation, unyielding dedication, and the burning question: |
|
|
|
> **Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?** |
|
|
|
This project explores the boundaries of **RWKV architecture**, replacing the traditional **Transformer Attention blocks** with **TimeMix**, an RNN-based mechanism, while distilling knowledge from Transformer giants. |
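To make the contrast concrete, here is a deliberately simplified sketch of the idea, not the actual cxa076 TimeMix code: instead of attending over all past tokens, a TimeMix-style block carries a fixed-size recurrent state that is decayed and updated once per token. The tensor names and the single scalar decay below are illustrative placeholders only.

```python
import torch

def naive_timemix_step(state, r_t, k_t, v_t, decay):
    """One recurrent step of a heavily simplified, WKV-style TimeMix.

    state : (d_k, d_v) running outer-product memory
    r_t   : (d_k,) receptance for the current token (plays the role of a query)
    k_t   : (d_k,) key for the current token
    v_t   : (d_v,) value for the current token
    decay : scalar in (0, 1); real RWKV uses learned, per-channel decays
    """
    # Decay the old memory, then add the new key/value outer product.
    state = decay * state + torch.outer(k_t, v_t)
    # Read out with the receptance instead of attending over all past tokens.
    out = r_t @ state  # (d_v,)
    return out, state

# Usage: process a sequence token by token with O(1) state per step.
d_k, d_v, T = 8, 8, 16
state = torch.zeros(d_k, d_v)
for t in range(T):
    r_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    out, state = naive_timemix_step(state, r_t, k_t, v_t, decay=0.9)
```

The real RWKV x070 TimeMix used in cxa076 is considerably more elaborate (learned per-channel decays, token-shift mixing, and more), but the constant-size recurrent state is the property that stands in for quadratic attention.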
|
|
|
The PRWKV models range from **3B to 14B parameters**, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.
|
|
|
--- |
|
|
|
## **Project Objective** |
|
|
|
The **sole purpose** of this project was to **test the feasibility of replacing Transformer Attention with RNN-based TimeMix**. |
|
|
|
- **No shortcuts**. |
|
- **No compromises**. |
|
- Just pure **architectural curiosity** driven by **Passion**. |
|
|
|
--- |
|
|
|
## **Technical Challenges & Triumphs** |
|
|
|
### 🔥 **Distillation from Transformers** |
|
- The models were distilled from high-quality **Transformer-based teachers** that use **Grouped Query Attention (GQA)**.

- The **TimeMix** blocks were heavily customized to align with the semantics of Attention layers.

- Special care was taken to **inherit weight structures** from the teacher's attention projections into the corresponding **Receptance, Key, Value, and Output layers**, enabling smoother early-stage learning (see the sketch after this list).
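A sketch of how this inheritance might look is below. Everything here is a hypothetical illustration: the attribute names (`q_proj`, `receptance`, and so on) are placeholders, not the actual PRWKV or teacher code.

```python
import torch

@torch.no_grad()
def inherit_from_teacher(student_timemix, teacher_attn):
    """Copy the teacher's attention projection weights into the student's
    TimeMix projections (Q -> Receptance, K -> Key, V -> Value, O -> Output).

    All attribute names are hypothetical placeholders; the real PRWKV code
    may organize and reshape these weights differently.
    """
    student_timemix.receptance.weight.copy_(teacher_attn.q_proj.weight)
    # With a GQA teacher, k_proj/v_proj are narrower than q_proj, so the
    # student's key/value projections must use the matching (grouped) width.
    student_timemix.key.weight.copy_(teacher_attn.k_proj.weight)
    student_timemix.value.weight.copy_(teacher_attn.v_proj.weight)
    student_timemix.output.weight.copy_(teacher_attn.o_proj.weight)
```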
|
|
|
### ⚡ **Key Innovations** |
|
- **RepeatKV mechanism**: Introduced for more stable group-based key-value projection (a minimal sketch follows this list).

- **GroupNorm vs NoNorm**: Extensive experiments revealed that removing normalization sometimes enhanced long-context stability.
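The exact PRWKV implementation is not reproduced here, but in GQA-style models "repeat KV" conventionally means broadcasting each key/value head across the query heads in its group. A minimal sketch, assuming a `(batch, heads, seq, head_dim)` layout:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each KV head n_rep times so KV heads line up with query heads.

    x: (batch, num_kv_heads, seq_len, head_dim)
    returns: (batch, num_kv_heads * n_rep, seq_len, head_dim)
    """
    if n_rep == 1:
        return x
    b, n_kv, t, d = x.shape
    x = x[:, :, None, :, :].expand(b, n_kv, n_rep, t, d)
    return x.reshape(b, n_kv * n_rep, t, d)

# Example: 2 KV heads serving 8 query heads -> repeat each KV head 4 times.
kv = torch.randn(1, 2, 16, 64)
kv_expanded = repeat_kv(kv, n_rep=4)  # shape (1, 8, 16, 64)
```

The expanded keys and values can then feed the per-head TimeMix state, with an optional per-head GroupNorm (or, per the NoNorm experiments, no normalization at all) on the output.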
|
|
|
--- |
|
|
|
### 📈 **Scaling Observations** |
|
- PRWKV scales from **3B** to **14B** parameters. |
|
- **14B KD** runs achieved a **KL divergence < 0.1**, demonstrating that **RNN TimeMix blocks can mimic Transformer Attention** with high fidelity (a sketch of this measurement follows this list).

- However, **context-length (ctx) expansion** beyond 2048 tokens remains an ongoing challenge due to gradient instability in larger models.
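As a rough illustration of the quoted metric, the forward KL between teacher and student next-token distributions can be computed per token as below. This is a generic sketch, not the actual PRWKV training or evaluation code.

```python
import torch
import torch.nn.functional as F

def mean_token_kl(teacher_logits: torch.Tensor,
                  student_logits: torch.Tensor) -> torch.Tensor:
    """Mean forward KL(teacher || student) over all token positions.

    Both tensors: (batch, seq_len, vocab_size). This is the same quantity
    typically minimized as a knowledge-distillation objective.
    """
    t_logprobs = F.log_softmax(teacher_logits, dim=-1)
    s_logprobs = F.log_softmax(student_logits, dim=-1)
    # KL(T || S) = sum_v p_T(v) * (log p_T(v) - log p_S(v))
    kl = (t_logprobs.exp() * (t_logprobs - s_logprobs)).sum(dim=-1)
    return kl.mean()
```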
|
|
|
--- |
|
|
|
## **Limitations** |
|
- The models are still under development and primarily serve **as a proof-of-concept**. |
|
- Long-context (4096+) stability varies based on model size and requires further refinement. |
|
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.
|
|
|
---
|
|
|
# **A Poem of Passion** |
|
|
|
> **In the depths of night, when GPUs hum soft, |
|
> A fire ignites, a dream aloft.** |
|
> |
|
> **To mold an RNN with TimeMix bright, |
|
> And rival Attention’s daunting might.** |
|
> |
|
> **Through spikes and crashes, I pressed on, |
|
> A madman's code, from dusk till dawn.** |
|
> |
|
> **Not for glory, nor for gold, |
|
> But just to see: can TimeMix hold?** |
|
> |
|
> **And when the losses dipped so low, |
|
> The pulse of passion dared to grow.** |
|
> |
|
> **PRWKV, a name of flame, |
|
> Not just a model – but a claim.** |
|
> |
|
> **That in this dance of gates and states, |
|
> Passion alone rewrites the fates.** |
|
> |
|
> **So here's my heart, in code and rhyme, |
|
> RNNs reborn, beyond their time.** |
|
|
|
--- |
|
|
|
🔥 **PRWKV is more than an experiment – it is a testament to Passion.** 🔥

---

## **Scalability Test: Small to Large**

ToDo (teacher models to cover):

- Qwen 2.5 14B
- Qwen 2.5 7B
- Qwen 2.5 3B
- Phi-4 14B
- Phi-4-mini 3.8B
- Gemma 3 12B
- Gemma 3 4B

**Architecture:** RWKV cxa076 (based on RWKV x070)

---

## **Inference**

Inference is currently supported only in **RWKV-Infer**. Example request for loading the model:
|
|
|
|
|
```bash
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" \
  -d '{
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3"
  }'
```
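The same request can also be issued programmatically; a minimal Python sketch using the `requests` library, assuming RWKV-Infer is listening on `127.0.0.1:9000` as above:

```python
import requests

payload = {
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3",
}

# Ask the local RWKV-Infer server to load the model with the given settings.
resp = requests.post("http://127.0.0.1:9000/loadmodel", json=payload)
resp.raise_for_status()
print(resp.text)
```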