---
license: apache-2.0
---
# **PRWKV-cxa076 – "Akemi" RWKV Model Series**
<div align="center">
<img src="./cxa076.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
</div>
---
## **Model Overview**
**PRWKV** stands for **Passion RWKV** – a model series born from relentless experimentation, unyielding dedication, and the burning question:
> **Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?**
This project explores the boundaries of the **RWKV architecture**, replacing the traditional **Transformer Attention blocks** with **TimeMix**, an RNN-based mechanism, while distilling knowledge from Transformer giants.
The PRWKV models range from **3B to 14B parameters**, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.
---
## **Project Objective**
The **sole purpose** of this project was to **test the feasibility of replacing Transformer Attention with RNN-based TimeMix**.
- **No shortcuts**.
- **No compromises**.
- Just pure **architectural curiosity** driven by **Passion**.
---
## **Technical Challenges & Triumphs**
### 🔥 **Distillation from Transformers**
- The models were distilled from high-quality **Transformer-based teachers** that use **Grouped Query Attention (GQA)**.
- The **TimeMix** blocks were heavily customized to align with the semantics of Attention layers.
- Special care was taken to **inherit weight structures** from the teacher's **Receptance, Key, Value, and Output layers**, enabling smoother early-stage learning.
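As an illustration of that last point, here is a minimal sketch of copying a GQA teacher's attention projections into a TimeMix block. The layer names (`q_proj`/`k_proj`/`v_proj`/`o_proj` on the teacher, `receptance`/`key`/`value`/`output` on the student) and the toy shapes are assumptions for illustration, not the actual PRWKV training code.
```python
import torch
import torch.nn as nn
from types import SimpleNamespace

@torch.no_grad()
def init_timemix_from_attention(timemix, attn):
    """Copy a GQA teacher's attention projections into a TimeMix block.
    Layer names (q_proj/k_proj/v_proj/o_proj, receptance/key/value/output)
    are illustrative assumptions, not the repository's actual identifiers."""
    timemix.receptance.weight.copy_(attn.q_proj.weight)  # receptance <- query
    timemix.key.weight.copy_(attn.k_proj.weight)         # key        <- key   (GQA width)
    timemix.value.weight.copy_(attn.v_proj.weight)       # value      <- value (GQA width)
    timemix.output.weight.copy_(attn.o_proj.weight)      # output     <- output

# Toy shapes: hidden size 64, 8 query heads of dim 8, 2 KV heads (GQA).
attn = SimpleNamespace(
    q_proj=nn.Linear(64, 64, bias=False),
    k_proj=nn.Linear(64, 16, bias=False),
    v_proj=nn.Linear(64, 16, bias=False),
    o_proj=nn.Linear(64, 64, bias=False),
)
timemix = SimpleNamespace(
    receptance=nn.Linear(64, 64, bias=False),
    key=nn.Linear(64, 16, bias=False),
    value=nn.Linear(64, 16, bias=False),
    output=nn.Linear(64, 64, bias=False),
)
init_timemix_from_attention(timemix, attn)
```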
### ⚡ **Key Innovations**
- **RepeatKV mechanism**: Introduced for more stable group-based key-value projection (see the sketch after this list).
- **GroupNorm vs NoNorm**: Extensive experiments revealed that removing normalization sometimes improved long-context stability.
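Below is a minimal sketch of a GQA-style `repeat_kv` expansion, the kind of group-based key-value broadcasting that RepeatKV refers to. The tensor layout and function name are assumptions; the actual PRWKV implementation may differ.
```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand grouped K/V heads to match the number of query/receptance heads.

    x: (batch, seq, n_kv_heads, head_dim)
    returns: (batch, seq, n_kv_heads * n_rep, head_dim)
    """
    b, t, n_kv, d = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]                 # (b, t, n_kv, 1, d)
        .expand(b, t, n_kv, n_rep, d)       # repeat each KV head n_rep times
        .reshape(b, t, n_kv * n_rep, d)     # flatten groups back into heads
    )

# Example: 2 KV heads expanded to 8 heads (group size 4).
k = torch.randn(1, 5, 2, 16)
k_full = repeat_kv(k, n_rep=4)   # -> shape (1, 5, 8, 16)
```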
---
### 📈 **Scaling Observations**
- PRWKV scales from **3B** to **14B** parameters.
- **14B KD** runs achieved a **KL divergence below 0.1**, showing that **RNN TimeMix blocks can indeed mimic Transformer Attention** with high fidelity (a generic distillation loss is sketched after this list).
- However, **context-length expansion** beyond 2048 tokens remains an ongoing challenge due to gradient instability in larger models.
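For reference, a token-level KL-divergence distillation loss of the kind reported above might look like the sketch below. The function name, temperature, and reduction are assumptions; the exact objective used for PRWKV is not specified in this card.
```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """Mean per-token KL(teacher || student) between two logit tensors.

    Shapes: (batch, seq, vocab). Hypothetical helper for illustration only.
    """
    t = temperature
    vocab = student_logits.size(-1)
    log_p_s = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    log_p_t = F.log_softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # F.kl_div(input, target) computes KL(target || input); with log_target=True
    # both arguments are log-probabilities. "batchmean" averages over tokens here.
    return F.kl_div(log_p_s, log_p_t, log_target=True,
                    reduction="batchmean") * (t * t)

# Example with random logits (batch=2, seq=4, vocab=32).
student = torch.randn(2, 4, 32)
teacher = torch.randn(2, 4, 32)
print(kd_kl_loss(student, teacher).item())
```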
---
## **Limitations**
- The models are still under development and primarily serve **as a proof-of-concept**.
- Long-context (4096+ tokens) stability varies with model size and requires further refinement.
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.
---
# **A Poem of Passion**
> **In the depths of night, when GPUs hum soft,
> A fire ignites, a dream aloft.**
>
> **To mold an RNN with TimeMix bright,
> And rival Attention’s daunting might.**
>
> **Through spikes and crashes, I pressed on,
> A madman's code, from dusk till dawn.**
>
> **Not for glory, nor for gold,
> But just to see: can TimeMix hold?**
>
> **And when the losses dipped so low,
> The pulse of passion dared to grow.**
>
> **PRWKV, a name of flame,
> Not just a model – but a claim.**
>
> **That in this dance of gates and states,
> Passion alone rewrites the fates.**
>
> **So here's my heart, in code and rhyme,
> RNNs reborn, beyond their time.**
---
🔥 **PRWKV is more than an experiment – it is a testament to Passion.** 🔥
---
## **Scalability Test: From Small to Large**
To-do list (for me):
- Qwen 2.5 14B
- Qwen 2.5 7B
- Qwen 2.5 3B
- Phi-4 14B
- Phi-4-mini 3.8B
- Gemma 3 12B
- Gemma 3 4B
---
## **Architecture & Inference**
- **Architecture**: RWKV cxa076 (based on RWKV x070)
- Currently supported only in **RWKV-Infer**.

Load the model through RWKV-Infer's `/loadmodel` endpoint:
```bash
curl http://127.0.0.1:9000/loadmodel -X POST \
  -H "Content-Type: application/json" \
  -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8", "template":"qwen", "endtoken":"<|im_end|>","default_temperature":"1.0", "default_top_p":"0.3"}'
```
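The same `/loadmodel` request can also be issued from Python with `requests`; this is simply the curl call above translated, with the payload unchanged.
```python
import requests

# Same payload as the curl example above; adjust the model path for your setup.
payload = {
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3",
}
resp = requests.post("http://127.0.0.1:9000/loadmodel", json=payload)
print(resp.status_code, resp.text)
```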