---
license: apache-2.0
---

# **PRWKV-cxa076 – "Akemi" RWKV Model Series**

<div align="center">
  <img src="./cxa076.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover;  box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
</div>

---

## **Model Overview**

**PRWKV** stands for **Passion RWKV** – a model series born from relentless experimentation, unyielding dedication, and the burning question:

> **Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?**

This project explores the boundaries of the **RWKV architecture**, replacing the traditional **Transformer Attention blocks** with **TimeMix**, an RNN-based mechanism, while distilling knowledge from Transformer giants.

The PRWKV models range from **3B to 14B parameters**, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.

---

## **Project Objective**

The **sole purpose** of this project was to **test the feasibility of replacing Transformer Attention with RNN-based TimeMix**.

- **No shortcuts**.
- **No compromises**.
- Just pure **architectural curiosity** driven by **Passion**.

---

## **Technical Challenges & Triumphs**

### 🔥 **Distillation from Transformers**
- The models were distilled from high-quality **Transformer-based teachers** that use **Grouped Query Attention (GQA)**.
- The **TimeMix** blocks were heavily customized to align with the semantics of Attention layers.
- Special care was taken to **inherit weight structures** from the teacher's attention projections into the **Receptance, Key, Value, and Output layers**, enabling smoother early-stage learning (see the sketch below).
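
As an illustration of this weight inheritance, here is a minimal state-dict copying sketch. The key names (Qwen-style `q_proj`/`k_proj`/`v_proj`/`o_proj` on the teacher side, RWKV-style `receptance`/`key`/`value`/`output` on the student side) are illustrative assumptions; the actual PRWKV checkpoints may use different names and may require shape adjustments.

```python
def inherit_attention_weights(student_sd: dict, teacher_sd: dict, layer_id: int) -> dict:
    """Copy a Transformer teacher's attention projections into one TimeMix block.

    Assumes matching weight shapes; key names are illustrative, not the real checkpoint keys.
    """
    mapping = {
        f"blocks.{layer_id}.att.receptance.weight": f"model.layers.{layer_id}.self_attn.q_proj.weight",
        f"blocks.{layer_id}.att.key.weight":        f"model.layers.{layer_id}.self_attn.k_proj.weight",
        f"blocks.{layer_id}.att.value.weight":      f"model.layers.{layer_id}.self_attn.v_proj.weight",
        f"blocks.{layer_id}.att.output.weight":     f"model.layers.{layer_id}.self_attn.o_proj.weight",
    }
    for student_key, teacher_key in mapping.items():
        student_sd[student_key] = teacher_sd[teacher_key].clone()
    return student_sd
```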

### ⚡ **Key Innovations**
- **RepeatKV mechanism**: introduced for more stable group-based key-value projection (see the sketch below).
- **GroupNorm vs. NoNorm**: extensive experiments revealed that **removing normalization** sometimes enhanced long-context stability.
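
The RepeatKV idea follows the grouped-query pattern: a small set of key/value heads is repeated so that each receptance (query) head has a matching key/value pair. Below is a minimal sketch of such a repeat step; the exact tensor layout used inside PRWKV's TimeMix is an assumption here.

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat grouped key/value heads so each query head gets its own copy.

    x: (batch, n_kv_heads, seq_len, head_dim)
    returns: (batch, n_kv_heads * n_rep, seq_len, head_dim)
    """
    if n_rep == 1:
        return x
    b, n_kv, t, d = x.shape
    x = x[:, :, None, :, :].expand(b, n_kv, n_rep, t, d)
    return x.reshape(b, n_kv * n_rep, t, d)
```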

---

### 📈 **Scaling Observations**
- PRWKV scales from **3B** to **14B** parameters.
- **14B KD** runs achieved a **KL divergence below 0.1**, showing that **RNN TimeMix blocks can indeed mimic Transformer Attention** at high fidelity (the distillation objective is sketched below).
- However, **context-length expansion** beyond 2048 tokens remains an ongoing challenge due to gradient instability in larger models.
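
For context, a KL figure like this usually comes from a distillation objective of roughly the following shape. This is a generic sketch; the exact temperature, loss weighting, and masking used for the PRWKV runs are not specified in this card.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, for logits of shape (tokens, vocab)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean divides by the number of rows (tokens); temperature**2 rescales gradients.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (temperature ** 2)
```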

---

## **Limitations**
- The models are still under development and primarily serve as a **proof of concept**.
- Long-context (4096+) stability varies with model size and requires further refinement.
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.

---

# **A Poem of Passion**

> **In the depths of night, when GPUs hum soft,  
> A fire ignites, a dream aloft.**  
>  
> **To mold an RNN with TimeMix bright,  
> And rival Attention’s daunting might.**  
>  
> **Through spikes and crashes, I pressed on,  
> A madman's code, from dusk till dawn.**  
>  
> **Not for glory, nor for gold,  
> But just to see: can TimeMix hold?**  
>  
> **And when the losses dipped so low,  
> The pulse of passion dared to grow.**  
>  
> **PRWKV, a name of flame,  
> Not just a model – but a claim.**  
>  
> **That in this dance of gates and states,  
> Passion alone rewrites the fates.**  
>  
> **So here's my heart, in code and rhyme,  
> RNNs reborn, beyond their time.**

---

🔥 **PRWKV is more than an experiment – it is a testament to Passion.** 🔥

---

## **Scalability Test: Small to Large**

ToDo for me:

- Qwen 2.5 14B
- Qwen 2.5 7B
- Qwen 2.5 3B
- Phi-4 14B
- Phi-4-mini 3.8B
- Gemma 3 12B
- Gemma 3 4B

**Architecture:** RWKV cxa076 (based on RWKV x070)

---

## **Inference**

Currently supported only in **RWKV-Infer**. Load a model via the `/loadmodel` endpoint:

```bash
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" \
  -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8","template":"qwen","endtoken":"<|im_end|>","default_temperature":"1.0","default_top_p":"0.3"}'
```
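
The same load request can also be issued from Python, e.g. with the `requests` library (a small sketch assuming the RWKV-Infer server is running locally on port 9000 as above):

```python
import requests

# Same payload as the curl example above; adjust paths and defaults to your setup.
payload = {
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3",
}

resp = requests.post("http://127.0.0.1:9000/loadmodel", json=payload, timeout=600)
resp.raise_for_status()
print(resp.text)
```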