Model Card: PRWKV-7-Qwen3-14B-Preview-v0.1

Overview

  • Model Name: PRWKV-7-Qwen3-14B-Preview-v0.1
  • Base Model: Qwen3 14B (Instruct)
  • Architecture: RWKV Cxa076r (RWKV x070 Based) + SwiGLU
  • Parameter Count: 14 Billion
  • Context Length: 3072
  • Training Tokens:
    • Stage 1: 100 Million Tokens
    • Stage 2: 200 Million Tokens

This model is part of an experimental effort to replace Transformer-style attention with a fully recurrent RWKV-based architecture. It uses a customized version of the RWKV TimeMix block (Cxa076r) with SwiGLU activation, applied to a 14B-scale model derived from Qwen3.
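For orientation, the sketch below shows a generic SwiGLU feed-forward block in PyTorch. The class and dimension names (SwiGLUFFN, d_model, d_ffn) are illustrative assumptions; the exact layout of the Cxa076r block is not described in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Generic SwiGLU feed-forward block (illustrative; not the Cxa076r layout)."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn, bias=False)  # gating branch
        self.up = nn.Linear(d_model, d_ffn, bias=False)    # value branch
        self.down = nn.Linear(d_ffn, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate(x)) multiplied elementwise by up(x), then projected down
        return self.down(F.silu(self.gate(x)) * self.up(x))
```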


Motivation

The goal of this project is to explore whether an RNN-style model such as RWKV can faithfully mimic the output and reasoning behavior of large Transformer-based LLMs like Qwen3, while retaining the benefits of linear compute cost and persistent memory.

Replacing attention with TimeMix was not trivial. Qwen3 is built around attention-specific machinery, including grouped-query attention (GQA) and Rotary Positional Embeddings (RoPE). To bridge the architectural gap, we introduced novel gating structures, careful initialization alignment, and staged distillation that combines token-level and hidden-state mimicry, as sketched below.
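As a rough illustration of what token-level and hidden-state mimicry can look like, the sketch below combines a KL term on softened logits with an MSE term on hidden states. The function name, weights, and temperature are assumptions for the example, not the values used to train this model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      student_hidden: torch.Tensor,
                      teacher_hidden: torch.Tensor,
                      temperature: float = 2.0,   # illustrative value
                      kl_weight: float = 1.0,
                      hidden_weight: float = 1.0) -> torch.Tensor:
    # Token-level mimicry: KL divergence between softened teacher and student distributions.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

    # Hidden-state mimicry: mean squared error between aligned layer activations.
    hidden = F.mse_loss(student_hidden, teacher_hidden)

    return kl_weight * kl + hidden_weight * hidden
```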


Challenges Faced

  • Stability in Early Training:
    Unlike a Transformer, RWKV's recurrent state dynamics require careful gating and normalization; without them, token dropout or state explosion frequently occurred during warm-up (see the sketch after this list).

  • Cross-Architecture Distillation:
    Aligning a recurrent architecture with a feed-forward Transformer introduced step-wise divergence, especially in conversational jumps. Custom loss functions were employed to match hidden trajectories and long-term behavior, not just per-token outputs.

  • Context Sensitivity:
    Increasing context length beyond 2048 revealed stability cliffs. Careful adjustment of temporal decay, positional mixing, and memory routing was necessary to reach 3072 tokens reliably.
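As referenced in the first item above, the following is a hypothetical sketch of a gated, normalized recurrent state update of the kind that helps keep training stable; it is not the actual Cxa076r TimeMix implementation, and all names are placeholders.

```python
import torch
import torch.nn as nn

class GatedStateUpdate(nn.Module):
    """Hypothetical gated, normalized recurrent update (not the actual Cxa076r TimeMix)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.decay = nn.Linear(d_model, d_model)  # data-dependent decay gate
        self.write = nn.Linear(d_model, d_model)  # candidate value written into the state
        self.norm = nn.LayerNorm(d_model)         # keeps the readout bounded

    def forward(self, x: torch.Tensor, state: torch.Tensor):
        # A decay gate in (0, 1) stops the state from growing without bound,
        # and normalizing the readout keeps downstream activations in range.
        d = torch.sigmoid(self.decay(x))
        state = d * state + (1.0 - d) * self.write(x)
        return self.norm(state), state
```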


Current Limitations

This is a preview version. The model is capable of coherent generation, especially in long-form settings, but may still show deviations in precision-demanding tasks or rare contexts. Prompt injection robustness and RLHF alignment are future work.


License & Usage

This model is intended for research and experimentation only. Please consult the licensing terms of Qwen3 and RWKV if you intend to use this model commercially or fine-tune it.


Poem – The Cost of Curiosity

Countless times we failed –
A ghost in the gradients,
A silence in the state.

Attention was easy.
But ease never leads to breakthrough.

We drank too much coffee.
Slept too little.

And somewhere between the hallucinations,
The loss spikes,
And the whispered curses at 3am –

A new mind was born.
PRWKV-7 lives.


2025 OpenMOSE

https://x.com/_m0se_
