---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- rwkv
- linear-attention
- reka
- distillation
- knowledge-distillation
- hybrid-architecture
- language-model
---
# HRWKV7-Reka-Flash3.1-Preview
This model is an experimental research model developed as part of the work presented in the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
<div align="center">
<img src="./hxa079.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
</div>
## Abstract
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at this https URL. Training Code at this https URL.
### Model Description
HRWKV7-Reka-Flash3.1-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Reka-flash3.1 21B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.
- **Developed by:** OpenMOSE
- **Model type:** Hybrid Linear-Attention Language Model
- **Language(s):** Multilingual (inherited from Reka-flash3.1 21B)
- **License:** Apache-2.0
- **Base Model:** Reka-flash3.1 21B ([RekaAI/reka-flash-3.1](https://huggingface.co/RekaAI/reka-flash-3.1))
- **Year:** 2025
### Architecture Specifications
- **Architecture:** RWKV v7 based "hxa079" Architecture + Group Query Attention Hybrid
- **Total Layers:** 44 layers (L44D6144); see the layer-layout sketch after this list
  - 38 RWKV layers (with RoPE)
  - 6 GQA layers (no RoPE, no position embeddings)
- **Hidden Dimension:** 6144
- **Training Context Window:** 4096 tokens
- **Inference Context Window:** 32768+ tokens
- **Training Strategy:** Knowledge distillation following the RADLADS method
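The split between RWKV and GQA layers can be pictured with a small sketch. Note that this model card does not document the actual positions of the 6 GQA layers, so the even spacing used below is purely an assumption for illustration:

```python
# Hypothetical sketch of the hxa079 hybrid layer layout.
# The exact indices of the 6 GQA layers are not documented in this card;
# evenly spaced placement is assumed here purely for illustration.
NUM_LAYERS = 44
NUM_GQA_LAYERS = 6

# Spread the GQA layers roughly evenly across the stack (assumption).
gqa_layer_ids = {
    round(i * (NUM_LAYERS - 1) / (NUM_GQA_LAYERS - 1))
    for i in range(NUM_GQA_LAYERS)
}

layer_plan = [
    "gqa" if i in gqa_layer_ids else "rwkv"  # 6 GQA blocks, 38 RWKV blocks
    for i in range(NUM_LAYERS)
]

assert layer_plan.count("rwkv") == 38 and layer_plan.count("gqa") == 6
print(layer_plan)
```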
## Technical Innovation
### RWKV "hxa079" Architecture
The model implements several key improvements over original RWKV architectures:
1. **Token Shift Removal**: The residual mix with the previous token's hidden state (token shift) is removed so that the teacher model's attention weights can be inherited more effectively (see the schematic sketch after this list).
2. **GroupNorm Removal**: Removing GroupNorm mitigates training stability issues.
3. **k_first Introduction**: Experimentally adopts a residual connection for the keys (k) of layer 0.
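To make the first point concrete, here is a schematic contrast (not the actual hxa079 implementation) of what removing token shift means for the time-mix input; the function names and `mu` parameter are illustrative only:

```python
def time_mix_input_rwkv(x_t, x_prev, mu):
    """Schematic RWKV-style token shift: the input to the key/value/receptance
    projections is an interpolation of the current and previous token's
    hidden states (not the exact RWKV-7 formulation)."""
    return x_t + mu * (x_prev - x_t)


def time_mix_input_hxa079(x_t, x_prev, mu):
    """hxa079 variant (as described above): the previous-token mix is dropped,
    so the projections see the same per-token input the teacher's attention
    layers saw. x_prev and mu are kept in the signature only for contrast."""
    return x_t
```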
### Hybrid Design Benefits
- **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity per token during inference, and the hybrid design reduces the KV cache to roughly 1/7 of a full-GQA model's (see the arithmetic sketch after this list)
- **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-a-haystack retrieval tasks, addressing a known limitation of pure linear attention models
- **Implicit Position Encoding**: Interestingly, the model performs better when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that the RWKV blocks provide implicit positional encoding
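The ~1/7 KV-cache figure can be checked with simple arithmetic, assuming only the GQA layers keep a growing key/value cache while RWKV layers maintain a fixed-size recurrent state:

```python
# Back-of-the-envelope KV-cache comparison (assumptions: only GQA layers
# store a growing KV cache; RWKV layers keep a fixed-size recurrent state).
total_layers = 44
gqa_layers = 6

# Per-token KV-cache footprint scales with the number of layers that cache
# keys/values, so the hybrid stores roughly this fraction of a full-GQA model:
fraction = gqa_layers / total_layers
print(f"KV cache vs. full GQA: {fraction:.3f} (~1/{1 / fraction:.1f})")
# -> KV cache vs. full GQA: 0.136 (~1/7.3), consistent with the ~1/7 figure above
```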
## Intended Use
This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:
- Research into efficient attention mechanisms
- Benchmarking hybrid architecture performance
- Exploring linear attention limitations and solutions
- Academic and industrial R&D purposes
## Limitations
- **Experimental Status**: This model is in experimental stages and may exhibit unexpected behaviors
- **Context Window**: Limited to 4096 tokens during training, though RWKV architecture theoretically supports longer sequences
- **Performance Variability**: As a hybrid model, performance may vary significantly across different task types
## Training Details
- **Training Context Window:** 4096 tokens
- **Training Hardware:** 1x AMD MI300X (approximately 70 hours) on the AMD Developer Cloud
- **Training Strategy:** 8-bit MLP quantization; frozen embeddings, MLP, and head; DeepSpeed Stage 1; distillation Stage 1 on 100M tokens, Stage 2 on 360M tokens (a parameter-freezing sketch follows this list)
- **Base Model Initialization:** Weights initialized from Reka-flash3.1 21B
- **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers
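A minimal sketch of the freezing scheme mentioned under Training Strategy. The `embed`, `mlp`, and `lm_head` name substrings are assumptions based on common Hugging Face naming conventions, not verified against this model's actual module names:

```python
def freeze_for_distillation(model):
    """Hypothetical sketch of the parameter-freezing step described above:
    embeddings, MLP blocks, and the LM head stay frozen; only the newly
    inserted RWKV (and retained GQA) attention blocks receive gradients."""
    frozen_keywords = ("embed", "mlp", "lm_head")  # assumed name substrings
    for name, param in model.named_parameters():
        param.requires_grad = not any(k in name for k in frozen_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable / 1e9:.2f}B")
```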
## Evaluation
Performance evaluation is ongoing. The model shows promising results in:
- Maintaining base model capabilities while achieving linear attention efficiency
- Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
- Competitive performance on standard language modeling benchmarks
## Usage with Hugging Face Transformers
You can load and use this model with the `transformers` library, ensuring `trust_remote_code=True` is set:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "OpenMOSE/HRWKV7-Reka-Flash3.1-Preview"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16, # or torch.float16 depending on your hardware/preference
device_map="auto",
trust_remote_code=True,
)
# Example text generation
prompt = "Hello, I am a language model, and"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate response
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Code Repositories
- **RADLADS Project Code:** The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
- **Specific Training Code (OpenMOSE):** The training code for this particular `HRWKV7-Reka-Flash3.1-Preview` model is available at: [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (Note: this repository is still under development and may contain bugs.)
## Model Card Contact
OpenMOSE - 2025
---
*Note: This is an experimental model. Performance characteristics and behaviors may differ from both pure RWKV and standard Transformer architectures. Users should thoroughly evaluate the model for their specific use cases.*
## Citation
If you use this work or find it valuable, please consider citing the RADLADS paper:
```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
year={2025},
eprint={2505.03005},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.03005},
}
``` |