---
license: apache-2.0
---

# **Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1**

### **Overview**

- **Model Name:** PRWKV-7-Qwen3-14B-Preview-v0.1
- **Base Model:** Qwen3 14B (Instruct)
- **Architecture:** RWKV Cxa076r (based on RWKV x070) + SwiGLU
- **Parameter Count:** 14 billion
- **Context Length:** 3072 tokens
- **Training Tokens:**
  - Stage 1: 100 million tokens
  - Stage 2: 200 million tokens

This model is part of an experimental effort to *replace Transformer-style attention with a fully recurrent RWKV-based architecture*. It uses a customized version of the RWKV TimeMix block (`Cxa076r`) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.
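
The `Cxa076r` TimeMix internals are not documented in this card, but the SwiGLU part follows the standard gated feed-forward formulation. The minimal PyTorch sketch below illustrates that formulation only; the class name and dimension arguments are placeholders, not the actual 14B configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Standard SwiGLU feed-forward block: down(silu(gate(x)) * up(x)).

    d_model / d_ffn are illustrative placeholders, not the sizes used
    in PRWKV-7-Qwen3-14B.
    """

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model) -> (..., d_model)
        return self.down(F.silu(self.gate(x)) * self.up(x))
```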

---

### **Motivation**

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.

Replacing attention with TimeMix was not a trivial task. Qwen3 is heavily optimized for attention-based information flow, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architectural gap, we introduced novel gating structures, careful initialization alignment, and staged distillation involving both token-level and hidden-state mimicry.
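
The distillation objective itself is not spelled out in this card. As a rough illustration only, the sketch below combines token-level (logit) distillation with hidden-state mimicry in the most common way; `alpha`, `tau`, and the assumption that student and teacher hidden sizes match are hypothetical, not taken from the actual training code.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      alpha: float = 0.5, tau: float = 2.0):
    """Hypothetical combined objective: KL on temperature-softened logits
    (token-level mimicry) plus MSE between hidden states (hidden-state
    mimicry). Logits: (batch, seq, vocab); hidden states: (batch, seq, d).
    """
    # Token-level mimicry: match the teacher's output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)

    # Hidden-state mimicry: follow the teacher's hidden trajectory.
    hidden = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * kl + (1.0 - alpha) * hidden
```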

---

### **Challenges Faced**

- **Stability in Early Training:**

  Unlike Transformer models, RWKV relies on recurrent state dynamics that require careful gating and normalization. Without them, token dropout or state explosion frequently occurred during warm-up.

- **Cross-Architecture Distillation:**

  Aligning a recurrent architecture with a non-recurrent Transformer introduced step-wise divergence, especially across conversational jumps. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs.

- **Context Sensitivity:**

  Increasing the context length beyond 2048 tokens revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably; a simplified sketch of the decayed state recurrence follows this list.
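
The sketch below, referenced in the last item above, is a deliberately simplified RWKV-style linear recurrence with per-channel decay. It is not the `Cxa076r` implementation (which adds the gating, mixing, and normalization mentioned above); it only illustrates why the decay parameters dominate stability as the context grows.

```python
import torch


def decayed_state_scan(k, v, q, log_decay):
    """Simplified RWKV-style recurrence (illustration only, not Cxa076r):

        S_t = S_{t-1} diag(w_t) + v_t k_t^T
        y_t = S_t q_t

    with per-channel decay w_t = exp(-exp(log_decay_t)) in (0, 1).
    Decay close to 1 lets the state accumulate and can blow up over long
    contexts; decay close to 0 forgets distant tokens -- the stability
    cliff described above. k, v, q, log_decay: (T, d). Returns (T, d).
    """
    T, d = k.shape
    state = torch.zeros(d, d)                 # outer-product memory (value x key)
    w = torch.exp(-torch.exp(log_decay))      # per-channel decay in (0, 1)
    outputs = []
    for t in range(T):
        state = state * w[t] + torch.outer(v[t], k[t])   # decay, then write
        outputs.append(state @ q[t])                     # read with the query
    return torch.stack(outputs)
```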

---

### **Current Limitations**

This is a *preview* version. The model is capable of coherent generation, especially in long-form settings, but may still show deviations on precision-demanding tasks or in rare contexts. Prompt-injection robustness and RLHF alignment are future work.

---

### **License & Usage**

This model is intended for **research and experimentation only**. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.

---

### **Poem – The Cost of Curiosity**

> Countless times we failed—
> A ghost in the gradients,
> A silence in the state.
>
> Attention was easy.
> But ease never leads to breakthrough.
>
> We drank too much coffee.
> Slept too little.
>
> And somewhere between the hallucinations,
> The loss spikes,
> And the whispered curses at 3am—
>
> A new mind was born.
> PRWKV-7 lives.

---

2025 OpenMOSE

https://x.com/_m0se_