---
license: apache-2.0
---

# **Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1**

### **Overview**

- **Model Name:** PRWKV-7-Qwen3-14B-Preview-v0.1
- **Base Model:** Qwen3 14B (Instruct)
- **Architecture:** RWKV Cxa076r (based on RWKV x070) + SwiGLU
- **Parameter Count:** 14 billion
- **Context Length:** 3072 tokens
- **Training Tokens:**
  - Stage 1: 100 million tokens
  - Stage 2: 200 million tokens

This model is part of an experimental effort to *replace Transformer-style attention with a fully recurrent RWKV-based architecture*. It uses a customized version of the RWKV TimeMix block (`Cxa076r`) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.
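
The `Cxa076r` TimeMix internals are not documented in this card, but the SwiGLU part follows the standard gated feed-forward formulation. The minimal PyTorch sketch below illustrates that formulation only; the class name and dimension arguments are placeholders, not the actual 14B configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Standard SwiGLU feed-forward block: down(silu(gate(x)) * up(x)).

    d_model / d_ffn are illustrative placeholders, not the sizes used
    in PRWKV-7-Qwen3-14B.
    """

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model) -> (..., d_model)
        return self.down(F.silu(self.gate(x)) * self.up(x))
```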

---

### **Motivation**

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.

Replacing attention with TimeMix was not a trivial task. Qwen3 is heavily optimized for attention-based information flow, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architectural gap, we introduced novel gating structures, careful initialization alignment, and staged distillation involving both token-level and hidden-state mimicry.
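
The distillation objective itself is not spelled out in this card. As a rough illustration only, the sketch below combines token-level (logit) distillation with hidden-state mimicry in the most common way; `alpha`, `tau`, and the assumption that student and teacher hidden sizes match are hypothetical, not taken from the actual training code.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      alpha: float = 0.5, tau: float = 2.0):
    """Hypothetical combined objective: KL on temperature-softened logits
    (token-level mimicry) plus MSE between hidden states (hidden-state
    mimicry). Logits: (batch, seq, vocab); hidden states: (batch, seq, d).
    """
    # Token-level mimicry: match the teacher's output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)

    # Hidden-state mimicry: follow the teacher's hidden trajectory.
    hidden = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * kl + (1.0 - alpha) * hidden
```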

---

### **Challenges Faced**

- **Stability in Early Training:**

  Unlike Transformer models, RWKV relies on recurrent state dynamics that require careful gating and normalization. Without them, token dropout or state explosion frequently occurred during warm-up.

- **Cross-Architecture Distillation:**

  Aligning a recurrent architecture with a non-recurrent Transformer introduced step-wise divergence, especially across conversational jumps. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs.

- **Context Sensitivity:**

  Increasing the context length beyond 2048 tokens revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably; a simplified sketch of the decayed state recurrence follows this list.
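
The sketch below, referenced in the last item above, is a deliberately simplified RWKV-style linear recurrence with per-channel decay. It is not the `Cxa076r` implementation (which adds the gating, mixing, and normalization mentioned above); it only illustrates why the decay parameters dominate stability as the context grows.

```python
import torch


def decayed_state_scan(k, v, q, log_decay):
    """Simplified RWKV-style recurrence (illustration only, not Cxa076r):

        S_t = S_{t-1} diag(w_t) + v_t k_t^T
        y_t = S_t q_t

    with per-channel decay w_t = exp(-exp(log_decay_t)) in (0, 1).
    Decay close to 1 lets the state accumulate and can blow up over long
    contexts; decay close to 0 forgets distant tokens -- the stability
    cliff described above. k, v, q, log_decay: (T, d). Returns (T, d).
    """
    T, d = k.shape
    state = torch.zeros(d, d)                 # outer-product memory (value x key)
    w = torch.exp(-torch.exp(log_decay))      # per-channel decay in (0, 1)
    outputs = []
    for t in range(T):
        state = state * w[t] + torch.outer(v[t], k[t])   # decay, then write
        outputs.append(state @ q[t])                     # read with the query
    return torch.stack(outputs)
```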

---

### **Current Limitations**

This is a *preview* version. The model is capable of coherent generation, especially in long-form settings, but may still show deviations on precision-demanding tasks or in rare contexts. Prompt-injection robustness and RLHF alignment are future work.

---

### **License & Usage**

This model is intended for **research and experimentation only**. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.

---

### **Poem – The Cost of Curiosity**

> Countless times we failed—
> A ghost in the gradients,
> A silence in the state.
>
> Attention was easy.
> But ease never leads to breakthrough.
>
> We drank too much coffee.
> Slept too little.
>
> And somewhere between the hallucinations,
> The loss spikes,
> And the whispered curses at 3am—
>
> A new mind was born.
> PRWKV-7 lives.

---

2025 OpenMOSE

https://x.com/_m0se_