Update README.md

---
license: apache-2.0
---

# **Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1**

### **Overview**

- **Model Name:** PRWKV-7-Qwen3-14B-Preview-v0.1
- **Base Model:** Qwen3
- **Architecture:** RWKV Cxa076r (RWKV x070 Based) + SwiGLU
- **Parameter Count:** 14 Billion
- **Context Length:** 3072
- **Training Tokens:**
  - Stage 1: 100 Million Tokens
  - Stage 2: 200 Million Tokens

This model is part of an experimental effort to *replace Transformer-style attention with a fully recurrent RWKV-based architecture*. It uses a customized version of the RWKV TimeMix block (`Cxa076r`) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.
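
This card ships no reference implementation, so the sketch below only illustrates how such a block could be wired: an RWKV-style TimeMix operator in place of self-attention, followed by a SwiGLU feed-forward. The module names (`PRWKVBlock`, `TimeMixStub`, `SwiGLU`) and all sizes are illustrative assumptions, and the stub does not reproduce the actual `Cxa076r` internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated projection, as in Qwen-style FFNs."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class TimeMixStub(nn.Module):
    """Placeholder for the customized Cxa076r TimeMix. The real module is a
    recurrent operator with learned decay and gating, not a plain projection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class PRWKVBlock(nn.Module):
    """Illustrative decoder block: TimeMix replaces self-attention, SwiGLU stays
    as the channel mixer. LayerNorm stands in for the RMSNorm a Qwen-derived
    stack would typically use."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.time_mix = TimeMixStub(d_model)  # recurrent token mixer
        self.ffn = SwiGLU(d_model, d_ffn)     # gated channel mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.time_mix(self.norm1(x))  # pre-norm residual, token mixing
        x = x + self.ffn(self.norm2(x))       # pre-norm residual, channel mixing
        return x


# Toy usage (sizes are illustrative, not the real 14B configuration):
block = PRWKVBlock(d_model=512, d_ffn=1536)
out = block(torch.randn(1, 8, 512))  # (batch, seq, d_model)
```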

---

### **Motivation**

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.

Replacing attention with TimeMix was not a trivial task. Qwen3 is heavily optimized for attention-based flow, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architecture gap, we introduced novel gating structures, careful initialization alignment, and staged distillation involving both token-level and hidden-state mimicry.
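
The training code is not published with this card; purely as an illustration of what token-level plus hidden-state mimicry can look like, the sketch below combines a KL term on output logits with an MSE term on aligned hidden states. The function name, layer alignment, weights, and temperature are assumptions, not the project's actual objective.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      kl_weight=1.0, hidden_weight=1.0, temperature=1.0):
    """Token-level KL on output distributions plus MSE on aligned hidden states.

    student_logits / teacher_logits: (batch, seq, vocab)
    student_hidden / teacher_hidden: lists of (batch, seq, d_model) tensors,
    one per aligned layer. Weights and temperature are illustrative defaults.
    """
    # Token-level mimicry: match the teacher's next-token distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (temperature ** 2)

    # Hidden-state mimicry: match intermediate representations layer by layer.
    hidden = sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))
    hidden = hidden / max(len(student_hidden), 1)

    return kl_weight * kl + hidden_weight * hidden
```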

---

### **Challenges Faced**

- **Stability in Early Training:**
  Unlike Transformer models, RWKV's state dynamics require careful gating and normalization. Without them, token dropout or state explosion frequently occurred during warm-up (a simplified recurrence sketch after this list shows where decay and normalization enter).

- **Cross-Architecture Distillation:**
  Aligning a recurrent architecture with a feed-forward Transformer introduced step-wise divergence, especially in conversational jumps. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs.

- **Context Sensitivity:**
  Increasing the context length beyond 2048 revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably.
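
The actual `Cxa076r` update equations are not given in this card, so the snippet below is only a generic, simplified RWKV-style recurrence. It is meant to show where the per-channel temporal decay and the output normalization discussed above enter the computation; the tensor shapes and the normalization choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def recurrent_timemix_step(state, r, k, v, w):
    """One step of a simplified RWKV-style recurrence (illustrative only;
    this is NOT the actual Cxa076r update rule).

    state: (d_k, d_v) running memory carried across tokens
    r, k:  (d_k,) receptance and key for the current token
    v:     (d_v,) value for the current token
    w:     (d_k,) per-channel decay in (0, 1); values near 1 retain memory longer
    """
    # Decay the old memory, then write the new key/value outer product.
    state = w.unsqueeze(-1) * state + torch.outer(k, v)
    # Read from memory with the receptance vector.
    out = r @ state
    # Normalizing the read-out (LayerNorm here; per-head GroupNorm is common in
    # RWKV implementations) is one of the pieces that keeps states from exploding.
    out = F.layer_norm(out, (out.shape[-1],))
    return out, state


# Toy roll-out over a short sequence (dimensions are illustrative):
d_k, d_v, seq_len = 8, 8, 4
state = torch.zeros(d_k, d_v)
for _ in range(seq_len):
    r, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    w = torch.sigmoid(torch.randn(d_k))  # squash decays into (0, 1)
    out, state = recurrent_timemix_step(state, r, k, v, w)
```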

---

### **Current Limitations**

This is a *preview* version. The model is capable of coherent generation, especially in long-form settings, but may still show deviations in precision-demanding tasks or rare contexts. Prompt injection robustness and RLHF alignment are future work.

---

### **License & Usage**

This model is intended for **research and experimentation only**. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.

---

### **Poem – The Cost of Curiosity**

> Countless times we failed—
> A ghost in the gradients,
> A silence in the state.
>
> Attention was easy.
> But ease never leads to breakthrough.
>
> We drank too much coffee.
> Slept too little.
>
> And somewhere between the hallucinations,
> The loss spikes,
> And the whispered curses at 3am—
>
> A new mind was born.
> PRWKV-7 lives.

---