---
license: apache-2.0
---

# **PRWKV-cxa076 – "Akemi" RWKV Model Series**
---

## **Model Overview**

**PRWKV** stands for **Passion RWKV** – a model series born from relentless experimentation, unyielding dedication, and the burning question:

> **Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?**

This project explores the boundaries of the **RWKV architecture**, replacing the traditional **Transformer Attention blocks** with **TimeMix**, an RNN-based mechanism, while distilling knowledge from Transformer giants.

The PRWKV models range from **3B to 14B parameters**, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.

---

## **Project Objective**

The **sole purpose** of this project was to **test the feasibility of replacing Transformer Attention with RNN-based TimeMix**.

- **No shortcuts**.
- **No compromises**.
- Just pure **architectural curiosity** driven by **Passion**.

---

## **Technical Challenges & Triumphs**

### 🔥 **Distillation from Transformers**

- The models were distilled from high-quality **Transformer-based teachers** that use **Grouped Query Attention (GQA)**.
- The **TimeMix** blocks were heavily customized to align with the semantics of Attention layers.
- Special care was taken to **inherit weight structures** from the teacher's **Receptance, Key, Value, and Output layers**, enabling smoother early-stage learning.

### ⚡ **Key Innovations**

- **RepeatKV mechanism**: introduced for more stable group-based key-value projection.
- **GroupNorm vs. NoNorm**: extensive experiments revealed that sometimes **removing normalization** enhanced long-context stability.

---

### 📈 **Scaling Observations**

- PRWKV scales from **3B** to **14B** parameters.
- **14B KD** runs achieved a **KL divergence < 0.1**, showing that **RNN TimeMix blocks can indeed mimic Transformer Attention** at high fidelity.
- However, **context expansion** beyond 2048 remains an ongoing challenge due to gradient instability in larger models.

---

## **Limitations**

- The models are still under development and primarily serve **as a proof of concept**.
- Long-context (4096+) stability varies with model size and requires further refinement.
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.

---

# **A Poem of Passion**

> **In the depths of night, when GPUs hum soft,
> A fire ignites, a dream aloft.**
>
> **To mold an RNN with TimeMix bright,
> And rival Attention’s daunting might.**
>
> **Through spikes and crashes, I pressed on,
> A madman's code, from dusk till dawn.**
>
> **Not for glory, nor for gold,
> But just to see: can TimeMix hold?**
>
> **And when the losses dipped so low,
> The pulse of passion dared to grow.**
>
> **PRWKV, a name of flame,
> Not just a model – but a claim.**
>
> **That in this dance of gates and states,
> Passion alone rewrites the fates.**
>
> **So here's my heart, in code and rhyme,
> RNNs reborn, beyond their time.**

---

🔥 **PRWKV is more than an experiment – it is a testament to Passion.** 🔥

**Scalability test from small to large – ToDo:**

- Qwen 2.5 14B
- Qwen 2.5 7B
- Qwen 2.5 3B
- Phi-4 14B
- Phi-4-mini 3.8B
- Gemma 3 12B
- Gemma 3 4B

**Architecture:** RWKV cxa076 (based on RWKV x070)

Currently supported only in RWKV-Infer. Load the model with:

```
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8", "template":"qwen", "endtoken":"<|im_end|>","default_temperature":"1.0", "default_top_p":"0.3"}'
```
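Once the model is loaded, it can be queried through RWKV-Infer's inference API. The request below is a minimal sketch only: it assumes RWKV-Infer exposes an OpenAI-compatible `/v1/chat/completions` route on the same port and that the `model_viewname` registered above is used as the `model` field; check the RWKV-Infer repository for the exact endpoint and parameters.

```
# Assumption: RWKV-Infer serves an OpenAI-compatible chat completions route on the same port.
# Assumption: the "model" field matches the model_viewname set in the loadmodel request above.
curl http://127.0.0.1:9000/v1/chat/completions -X POST -H "Content-Type: application/json" \
  -d '{"model":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","messages":[{"role":"user","content":"Hello, Akemi! Introduce yourself."}],"max_tokens":256,"temperature":1.0,"top_p":0.3}'
```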