---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- causal-lm
- linear-attention
- rwkv
- reka
- knowledge-distillation
- multilingual
languages:
- mul
---
# HRWKV7-Reka-Flash3-Preview
<div align="center">
<img src="./hxa079.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
</div>
> I'm simply exploring the possibility of linearizing existing Transformer models.
> It's still far from perfect,
> but I hope you'll bear with me as I continue this journey. :)
## Paper and Project Details
This model is part of the research presented in the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
The main codebase for the RADLADS project can be found at: [https://github.com/recursal/RADLADS-paper](https://github.com/recursal/RADLADS-paper)
### Model Description
HRWKV7-Reka-Flash3-Preview is an experimental hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Reka-flash3 21B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.
- **Developed by:** OpenMOSE
- **Model type:** Hybrid Linear-Attention Language Model
- **Language(s):** Multilingual (inherited from Reka-flash3 21B)
- **License:** Apache-2.0
- **Base Model:** [Reka-flash3 21B](https://huggingface.co/RekaAI/reka-flash-3)
- **Year:** 2025
### Architecture Specifications
- **Architecture:** RWKV v7 based "hxa079" Architecture + Group Query Attention Hybrid
- **Total Layers:** 44 layers (L44D6144)
  - 38 RWKV layers (with RoPE)
  - 6 GQA layers (no RoPE, no position embeddings)
- **Hidden Dimension:** 6144
- **Training Context Window:** 4096 tokens
- **Inference Context Window:** 32768+ tokens
- **Training Strategy:** Knowledge distillation following the RADLADS method
## Technical Innovation
### RWKV "hxa079" Architecture
The model implements several key improvements over standard RWKV architectures:
1. **Token Shift Removal**: To inherit the teacher model's weights effectively, we removed token shift, i.e. the residual connection to the previous token.
2. **GroupNorm Removal**: Helps mitigate training stability issues.
3. **k_first Introduction**: Experimentally adopted a residual connection to the k of layer 0.
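For readers unfamiliar with token shift: in standard RWKV blocks, each token's input is interpolated with the previous token's input before the projections are applied. The sketch below illustrates that generic mechanism and what removing it amounts to. It is not the hxa079 implementation; `token_shift` and `mu` are names chosen here, and `mu` is a single scalar for simplicity, whereas RWKV typically learns it per channel.

```python
from typing import List

Vector = List[float]


def token_shift(xs: List[Vector], mu: float) -> List[Vector]:
    """Generic RWKV-style token shift: x_t + mu * (x_{t-1} - x_t).

    The token before the first one is treated as a zero vector.
    """
    if not xs:
        return []
    shifted = []
    prev = [0.0] * len(xs[0])
    for x in xs:
        shifted.append([xi + mu * (pi - xi) for xi, pi in zip(x, prev)])
        prev = x
    return shifted


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# With mixing, every position also sees the previous token's features.
mixed = token_shift(tokens, mu=0.5)

# Removing token shift corresponds to mu = 0: each position sees only itself,
# which is the input a Transformer attention block would receive.
unshifted = token_shift(tokens, mu=0.0)
```

With `mu = 0` the block input is identical to the Transformer case, which is one plausible reading of why dropping token shift makes the teacher's attention weights easier to reuse.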
### Hybrid Design Benefits
- **Linear Attention Inference**: RWKV blocks keep a fixed-size recurrent state, so their memory use during inference is O(1) in context length, and the hybrid approach reduces the KV cache to roughly 1/7 of a full-GQA model (only 6 of 44 layers keep one).
- **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models
- **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to GQA layers, suggesting that RWKV blocks provide implicit positional encoding capabilities
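The 1/7 KV-cache figure is consistent with a simple layer count. The sketch below assumes that every attention layer contributes an equal share of the cache and that RWKV layers keep none (they hold a fixed-size state instead). The layer counts come from this card; the function name is illustrative.

```python
from fractions import Fraction

TOTAL_LAYERS = 44  # total layers, from the architecture specification
GQA_LAYERS = 6     # layers that still keep a KV cache


def kv_cache_fraction(gqa_layers: int, total_layers: int) -> Fraction:
    """Fraction of the full-GQA KV cache that the hybrid model still needs.

    Assumes each attention layer contributes an equal share of the cache.
    """
    return Fraction(gqa_layers, total_layers)


fraction = kv_cache_fraction(GQA_LAYERS, TOTAL_LAYERS)
print(f"KV cache vs. full GQA: {fraction} (about 1/{float(1 / fraction):.1f})")
# prints: KV cache vs. full GQA: 3/22 (about 1/7.3)
```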
## Intended Use
This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:
- Research into efficient attention mechanisms
- Benchmarking hybrid architecture performance
- Exploring linear attention limitations and solutions
- Academic and industrial R&D purposes
## Limitations
- **Experimental Status**: This model is in experimental stages and may exhibit unexpected behaviors
- **Context Window**: Limited to 4096 tokens during training, though RWKV architecture theoretically supports longer sequences
- **Performance Variability**: As a hybrid model, performance may vary significantly across different task types
## Training Details
- **Training Context Window:** 4096 tokens
- **Training GPU:** 1x AMD MI300X (about 68 hours)
- **Training Strategy:** 8-bit MLP quantization; frozen embeddings, MLP, and head; DeepSpeed Stage 1
- **Base Model Initialization:** Weights initialized from Reka-flash3 21B
- **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers
## Evaluation
Performance evaluation is ongoing. The model shows promising results in:
- Maintaining base model capabilities while achieving linear attention efficiency
- Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
- Competitive performance on standard language modeling benchmarks
## Usage with Hugging Face Transformers
This model can be loaded and used with the `transformers` library. Ensure you have `transformers` installed: `pip install transformers`.
When loading, remember to set `trust_remote_code=True` because of the custom architecture.
```python
from transformers import pipeline, AutoTokenizer
import torch

model_name = "OpenMOSE/HRWKV7-Reka-Flash3-Preview"  # Replace with the actual model ID if different

# The custom architecture requires trust_remote_code=True for both tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # or torch.float16 depending on your GPU and model precision
    device_map="auto",
    trust_remote_code=True,
)

text = "The quick brown fox jumps over the lazy "
result = pipe(text, max_new_tokens=20, do_sample=True, top_p=0.9, temperature=0.7)[0]["generated_text"]
print(result)
```
## Run with RWKV-Infer (as provided by original authors)
- RWKV-Infer now supports hxa079.
```bash
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"/home/client/Projects/llm/hxa079-reka-flash3-stage2-hybrid.pth","model_viewname":"RWKV HXA079 L38T6 Reka Flash3","model_strategy":"int8","adapter_filename":"","adapter_mode":"", "template":"rekaflash3", "endtoken":"\n<sep>","default_temperature":"0.2", "default_top_p":"0.3", "rope_theta":"8000000.0", "rms_norm_eps":"1e-5"}'
```
## Thank you for the big help :)
- SmerkyG (this work is inspired by RADLADS: https://arxiv.org/abs/2505.03005)
## Training Code
- https://github.com/OpenMOSE/RWKVInside (still buggy)
## Model Card Contact
OpenMOSE - 2025
---
*Note: This is an experimental model. Performance characteristics and behaviors may differ from both pure RWKV and standard Transformer architectures. Users should thoroughly evaluate the model for their specific use cases.*
## Citation
If you use this code or find our work valuable, please consider citing RADLADS:
```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
year={2025},
eprint={2505.03005},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.03005},
}
```