OpenMOSE committed (verified) · Commit e495fdf · 1 Parent(s): 50f97a4

Update README.md

Files changed (1): README.md (+74 -3)
README.md CHANGED
---
license: apache-2.0
---

# **Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1**

### **Overview**

- **Model Name:** PRWKV-7-Qwen3-14B-Preview-v0.1
- **Base Model:** Qwen3
- **Architecture:** RWKV Cxa076r (RWKV x070 based) + SwiGLU
- **Parameter Count:** 14 billion
- **Context Length:** 3072
- **Training Tokens:**
  - Stage 1: 100 million tokens
  - Stage 2: 200 million tokens

This model is part of an experimental effort to *replace Transformer-style attention with a fully recurrent RWKV-based architecture*. It uses a customized version of the RWKV TimeMix block (`Cxa076r`) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.

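For readers unfamiliar with the SwiGLU channel mix named above, the minimal PyTorch sketch below shows the standard SwiGLU feed-forward pattern. It is illustrative only: the class name, dimensions, and bias-free projections are assumptions, not the actual `Cxa076r` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Minimal SwiGLU feed-forward block (illustrative sketch only;
    names and shapes are assumptions, not the Cxa076r code)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)    # value branch
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) multiplied elementwise with (x W_up), then projected down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```

The gated split into a gate branch and a value branch is what distinguishes SwiGLU from a plain MLP with a SiLU activation.
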
---

### **Motivation**

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.

Replacing attention with TimeMix was not a trivial task. Qwen3 is heavily optimized for attention-based flow, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architecture gap, we introduced novel gating structures, careful initialization alignment, and staged distillation involving both token-level and hidden-state mimicry.

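As a rough illustration of "token-level and hidden-state mimicry", the sketch below combines a temperature-scaled KL term on the logits with an MSE term on matched intermediate hidden states. The temperature, weighting, and layer matching are assumptions for illustration; this is not the project's actual distillation objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Illustrative cross-architecture distillation loss (hypothetical values).

    student_logits / teacher_logits: (batch, seq, vocab)
    student_hidden / teacher_hidden: matched lists of (batch, seq, dim) tensors
    """
    # Token-level mimicry: soften both distributions and match them with KL divergence.
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Hidden-state mimicry: mean-squared error between matched intermediate layers.
    hidden = sum(F.mse_loss(s, th) for s, th in zip(student_hidden, teacher_hidden))
    hidden = hidden / max(len(student_hidden), 1)

    # alpha balances the two terms; 0.5 here is purely a placeholder.
    return alpha * kl + (1.0 - alpha) * hidden
```
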
---

### **Challenges Faced**

- **Stability in Early Training:**
  Unlike a Transformer, RWKV carries a recurrent state whose dynamics require careful gating and normalization. Without them, token dropout or state explosion frequently occurred during warm-up (a generic sketch of such safeguards follows this list).

- **Cross-Architecture Distillation:**
  Aligning a recurrent architecture with a feed-forward Transformer introduced step-wise divergence, especially across conversational jumps. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs.

- **Context Sensitivity:**
  Increasing the context length beyond 2048 revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably.

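To make the safeguards mentioned in the list above concrete, here is a generic sketch of two common RWKV-style measures: parameterizing the per-channel temporal decay so it stays strictly inside (0, 1), and applying a per-head GroupNorm to the recurrent readout. The names, shapes, and parameterization are hypothetical and are not taken from the Cxa076r implementation.

```python
import torch
import torch.nn as nn

class StateSafeguards(nn.Module):
    """Generic stability safeguards for a recurrent TimeMix-style block
    (hypothetical sketch; not the Cxa076r code)."""

    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        dim = n_heads * head_dim
        # Raw decay parameter: exp(-exp(.)) maps any real value into (0, 1),
        # so the recurrent state cannot grow without bound as context gets longer.
        self.decay_log = nn.Parameter(torch.zeros(dim))
        # Per-head normalization of the readout damps state-magnitude drift.
        self.out_norm = nn.GroupNorm(n_heads, dim)

    def decay(self) -> torch.Tensor:
        return torch.exp(-torch.exp(self.decay_log))  # shape (dim,), values in (0, 1)

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); GroupNorm expects (N, C, ...), so fold seq into batch.
        b, s, d = x.shape
        return self.out_norm(x.reshape(b * s, d)).reshape(b, s, d)
```
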
---

### **Current Limitations**

This is a *preview* version. The model is capable of coherent generation, especially in long-form settings, but may still show deviations in precision-demanding tasks or rare contexts. Prompt injection robustness and RLHF alignment are future work.

---

### **License & Usage**

This model is intended for **research and experimentation only**. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.

---

### **Poem – The Cost of Curiosity**

> Countless times we failed—
> A ghost in the gradients,
> A silence in the state.
>
> Attention was easy.
> But ease never leads to breakthrough.
>
> We drank too much coffee.
> Slept too little.
>
> And somewhere between the hallucinations,
> The loss spikes,
> And the whispered curses at 3am—
>
> A new mind was born.
> PRWKV-7 lives.

---