Update README.md

---
license: apache-2.0
---

# **Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1**

### **Overview**

- **Model Name:** PRWKV-7-Qwen3-14B-Preview-v0.1
- **Base Model:** Qwen3
- **Architecture:** RWKV Cxa076r (RWKV x070 Based) + SwiGLU
- **Parameter Count:** 14 Billion
- **Context Length:** 3072
- **Training Tokens:**
  - Stage 1: 100 Million Tokens
  - Stage 2: 200 Million Tokens

This model is part of an experimental effort to *replace Transformer-style attention with a fully recurrent RWKV-based architecture*. It uses a customized version of the RWKV TimeMix block (`Cxa076r`) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.
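
This card ships no reference implementation, so the sketch below only illustrates how such a block could be wired: an RWKV-style TimeMix operator in place of self-attention, followed by a SwiGLU feed-forward. The module names (`PRWKVBlock`, `TimeMixStub`, `SwiGLU`) and all sizes are illustrative assumptions, and the stub does not reproduce the actual `Cxa076r` internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated projection, as in Qwen-style FFNs."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class TimeMixStub(nn.Module):
    """Placeholder for the customized Cxa076r TimeMix. The real module is a
    recurrent operator with learned decay and gating, not a plain projection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class PRWKVBlock(nn.Module):
    """Illustrative decoder block: TimeMix replaces self-attention, SwiGLU stays
    as the channel mixer. LayerNorm stands in for the RMSNorm a Qwen-derived
    stack would typically use."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.time_mix = TimeMixStub(d_model)  # recurrent token mixer
        self.ffn = SwiGLU(d_model, d_ffn)     # gated channel mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.time_mix(self.norm1(x))  # pre-norm residual, token mixing
        x = x + self.ffn(self.norm2(x))       # pre-norm residual, channel mixing
        return x


# Toy usage (sizes are illustrative, not the real 14B configuration):
block = PRWKVBlock(d_model=512, d_ffn=1536)
out = block(torch.randn(1, 8, 512))  # (batch, seq, d_model)
```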

---

### **Motivation**

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.

Replacing attention with TimeMix was not a trivial task. Qwen3 is heavily optimized for attention-based flow, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architecture gap, we introduced novel gating structures, careful initialization alignment, and staged distillation involving both token-level and hidden-state mimicry.
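
The training code is not published with this card; purely as an illustration of what token-level plus hidden-state mimicry can look like, the sketch below combines a KL term on output logits with an MSE term on aligned hidden states. The function name, layer alignment, weights, and temperature are assumptions, not the project's actual objective.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      kl_weight=1.0, hidden_weight=1.0, temperature=1.0):
    """Token-level KL on output distributions plus MSE on aligned hidden states.

    student_logits / teacher_logits: (batch, seq, vocab)
    student_hidden / teacher_hidden: lists of (batch, seq, d_model) tensors,
    one per aligned layer. Weights and temperature are illustrative defaults.
    """
    # Token-level mimicry: match the teacher's next-token distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (temperature ** 2)

    # Hidden-state mimicry: match intermediate representations layer by layer.
    hidden = sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))
    hidden = hidden / max(len(student_hidden), 1)

    return kl_weight * kl + hidden_weight * hidden
```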

---

### **Challenges Faced**

- **Stability in Early Training:**
  Unlike Transformer models, RWKV's state dynamics require careful gating and normalization. Without them, token dropout or state explosion frequently occurred during warm-up (a simplified recurrence sketch after this list shows where decay and normalization enter).

- **Cross-Architecture Distillation:**
  Aligning a recurrent architecture with a feed-forward Transformer introduced step-wise divergence, especially in conversational jumps. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs.

- **Context Sensitivity:**
  Increasing the context length beyond 2048 revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably.
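
The actual `Cxa076r` update equations are not given in this card, so the snippet below is only a generic, simplified RWKV-style recurrence. It is meant to show where the per-channel temporal decay and the output normalization discussed above enter the computation; the tensor shapes and the normalization choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def recurrent_timemix_step(state, r, k, v, w):
    """One step of a simplified RWKV-style recurrence (illustrative only;
    this is NOT the actual Cxa076r update rule).

    state: (d_k, d_v) running memory carried across tokens
    r, k:  (d_k,) receptance and key for the current token
    v:     (d_v,) value for the current token
    w:     (d_k,) per-channel decay in (0, 1); values near 1 retain memory longer
    """
    # Decay the old memory, then write the new key/value outer product.
    state = w.unsqueeze(-1) * state + torch.outer(k, v)
    # Read from memory with the receptance vector.
    out = r @ state
    # Normalizing the read-out (LayerNorm here; per-head GroupNorm is common in
    # RWKV implementations) is one of the pieces that keeps states from exploding.
    out = F.layer_norm(out, (out.shape[-1],))
    return out, state


# Toy roll-out over a short sequence (dimensions are illustrative):
d_k, d_v, seq_len = 8, 8, 4
state = torch.zeros(d_k, d_v)
for _ in range(seq_len):
    r, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    w = torch.sigmoid(torch.randn(d_k))  # squash decays into (0, 1)
    out, state = recurrent_timemix_step(state, r, k, v, w)
```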

---

### **Current Limitations**

This is a *preview* version. The model is capable of coherent generation, especially in long-form settings, but may still show deviations in precision-demanding tasks or rare contexts. Prompt injection robustness and RLHF alignment are future work.

---

### **License & Usage**

This model is intended for **research and experimentation only**. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.

---

### **Poem – The Cost of Curiosity**

> Countless times we failed—
> A ghost in the gradients,
> A silence in the state.
>
> Attention was easy.
> But ease never leads to breakthrough.
>
> We drank too much coffee.
> Slept too little.
>
> And somewhere between the hallucinations,
> The loss spikes,
> And the whispered curses at 3am—
>
> A new mind was born.
> PRWKV-7 lives.

---