---
license: apache-2.0
---

# **Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1**

### **Overview**

- **Model Name:** PRWKV-7-Qwen3-14B-Preview-v0.1  
- **Base Model:** Qwen3 14B (Instruct)
- **Architecture:** RWKV Cxa076r (RWKV x070 Based) + SwiGLU  
- **Parameter Count:** 14 Billion  
- **Context Length:** 3072  
- **Training Tokens:**  
  - Stage 1: 100 Million Tokens  
  - Stage 2: 200 Million Tokens  

This model is part of an experimental effort to *replace Transformer-style attention with a fully recurrent RWKV-based architecture*. It uses a customized version of the RWKV TimeMix block (`Cxa076r`) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.
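For reference, the SwiGLU feed-forward used here follows the standard gated formulation `down(SiLU(gate(x)) * up(x))`. The sketch below is a minimal PyTorch illustration with placeholder dimensions, not the released 14B configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Standard SwiGLU feed-forward: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Placeholder sizes for illustration only (not the actual model dimensions).
ffn = SwiGLU(hidden_size=5120, intermediate_size=13824)
out = ffn(torch.randn(1, 8, 5120))
```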

---

### **Motivation**

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.
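The linear compute cost and persistent memory come from the recurrent form of the computation: each token updates a fixed-size state instead of attending over the entire history. The snippet below is a generic decayed-recurrence illustration of that idea, not the actual Cxa076r/TimeMix kernel:

```python
import torch

def recurrent_readout(keys, values, queries, decay=0.95):
    """Generic per-channel decayed recurrence (illustration only; not the exact
    Cxa076r / TimeMix update). Per-token work and state size are constant, so
    total compute grows linearly with sequence length."""
    T, d = keys.shape
    state = torch.zeros(d, d)  # persistent memory carried from token to token
    outputs = []
    for t in range(T):
        state = decay * state + torch.outer(keys[t], values[t])  # O(d^2) per token
        outputs.append(queries[t] @ state)
    return torch.stack(outputs)

# Toy usage with placeholder sizes:
T, d = 16, 8
y = recurrent_readout(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d))
```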

Replacing attention with TimeMix was not a trivial task. Qwen3 is heavily optimized for attention-based flow, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architecture gap, we introduced novel gating structures, careful initialization alignment, and staged distillation involving both token-level and hidden-state mimicry.

---

### **Challenges Faced**

- **Stability in Early Training:**  
  Unlike in Transformer models, RWKV's state dynamics require careful gating and normalization; without them, token dropout or state explosion occurred frequently during warm-up.
  
- **Cross-Architecture Distillation:**  
  Aligning a recurrent architecture with a feed-forward Transformer introduced step-wise divergence, especially across conversational turns. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs; an illustrative sketch of such a combined objective follows this list.
  
- **Context Sensitivity:**  
  Increasing context length beyond 2048 revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably.
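The sketch below shows one plausible shape for the combined distillation objective described above: a token-level KL term against the teacher's output distribution plus an MSE term over paired hidden states. The weights, temperature, and layer pairing are placeholders, not the actual training configuration:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      kl_weight=1.0, hidden_weight=1.0, temperature=1.0):
    """Token-level KL(teacher || student) plus hidden-state MSE.
    All weights and the layer pairing are illustrative placeholders."""
    # Token-level distillation over the vocabulary.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(s_logp, t_prob, reduction="batchmean") * (temperature ** 2)

    # Hidden-state mimicry: match trajectories at a chosen set of layers.
    mse = sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))
    mse = mse / max(len(student_hidden), 1)

    return kl_weight * kl + hidden_weight * mse
```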

---

### **Current Limitations**

This is a *preview* version. The model is capable of coherent generation, especially in long-form settings, but may still deviate from the teacher Qwen3 model on precision-demanding tasks or in rare contexts. Prompt-injection robustness and RLHF alignment are future work.

---

### **License & Usage**

This model is intended for **research and experimentation only**. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.

---

### **Poem – The Cost of Curiosity**

> Countless times we failed—  
> A ghost in the gradients,  
> A silence in the state.  
>  
> Attention was easy.  
> But ease never leads to breakthrough.  
>  
> We drank too much coffee.  
> Slept too little.  
>  
> And somewhere between the hallucinations,  
> The loss spikes,  
> And the whispered curses at 3am—  
>  
> A new mind was born.  
> PRWKV-7 lives.

---

2025 OpenMOSE

https://x.com/_m0se_