nielsr (HF Staff) committed
Commit a10fae9 · verified · 1 Parent(s): 5011217

Improve model card: Add metadata, paper abstract, links & transformers usage


This PR significantly improves the model card for `HRWKV7-Reka-Flash3.1-Preview` by:

* Adding essential metadata: `pipeline_tag: text-generation`, `library_name: transformers`, and comprehensive `tags` (`rwkv`, `linear-attention`, `reka`, `distillation`, `knowledge-distillation`, `hybrid-architecture`, `language-model`). This enhances discoverability and enables the "how to use" widget on the Hub.
* Adding the paper abstract for better context on the model's development via the RADLADS protocol.
* Updating the paper link to the official Hugging Face Papers page: [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
* Adding direct links to the main RADLADS project GitHub repository (`https://github.com/recursal/RADLADS`) and clarifying the link to this model's specific training code (`https://github.com/OpenMOSE/RWKVInside`).
* Replacing the non-standard `curl` usage snippet with a clear Python code example using the Hugging Face `transformers` library for easy model loading and generation.
* Adding the paper's BibTeX citation for proper attribution.

Please review and merge this PR if everything looks good.

Files changed (1)
  1. README.md +96 -46
README.md CHANGED
@@ -1,33 +1,50 @@
  ---
  license: apache-2.0
  ---
  # HRWKV7-Reka-Flash3.1-Preview

  <div align="center">
  <img src="./hxa079.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
  </div>

  ### Model Description

  HRWKV7-Reka-Flash3.1-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Reka-flash3.1 21B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

- - **Developed by:** OpenMOSE
- - **Model type:** Hybrid Linear-Attention Language Model
- - **Language(s):** Multilingual (inherited from Reka-flash3.1 21B)
- - **License:** Apache-2.0
- - **Base Model:** [Reka-flash3.1 21B](https://huggingface.co/RekaAI/reka-flash-3.1)
- - **Year:** 2025

  ### Architecture Specifications

- - **Architecture:** RWKV v7 based "hxa079" architecture + Group Query Attention hybrid
- - **Total Layers:** 44 layers (L44D6144)
- - 38 RWKV layers (with RoPE)
- - 6 GQA layers (no RoPE, no position embeddings)
- - **Hidden Dimension:** 6144
- - **Training Context Window:** 4096 tokens
- - **Inference Context Window:** 32768+
- - **Training Strategy:** Knowledge distillation following the RADLADS method

  ## Technical Innovation

@@ -35,61 +52,78 @@ HRWKV7-Reka-Flash3.1-Preview is an RNN hybrid architecture model that combines R

  The model implements several key improvements over original RWKV architectures:

- 1. **Token Shift Removal**: To inherit the teacher model's weights effectively, we removed the token-shift residual connection to the previous token.
- 2. **GroupNorm Removal**: Helps mitigate training stability issues
- 3. **k_first Introduction**: Experimentally adopted a residual connection of the k values from layer 0.

  ### Hybrid Design Benefits

- - **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KV cache to 1/7 of a full-GQA model.
- - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-a-haystack retrieval tasks, addressing a known limitation of pure linear attention models
- - **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that RWKV blocks provide implicit positional encoding capabilities

  ## Intended Use

  This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

- - Research into efficient attention mechanisms
- - Benchmarking hybrid architecture performance
- - Exploring linear attention limitations and solutions
- - Academic and industrial R&D purposes

  ## Limitations

- - **Experimental Status**: This model is in an experimental stage and may exhibit unexpected behaviors
- - **Context Window**: Limited to 4096 tokens during training, though the RWKV architecture theoretically supports longer sequences
- - **Performance Variability**: As a hybrid model, performance may vary significantly across different task types

  ## Training Details

- - **Training Context Window:** 4096 tokens
- - **Training GPU:** 1x AMD MI300X (about 70 hours) on AMD Developer Cloud
- - **Training Strategy:** 8-bit MLP quantization; frozen embeddings, MLP, and head; DeepSpeed Stage 1; Stage 1 on 100M tokens, Stage 2 on 360M tokens
- - **Base Model Initialization:** Weights initialized from Reka-flash3.1 21B
- - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers

  ## Evaluation

  Performance evaluation is ongoing. The model shows promising results in:
- - Maintaining base model capabilities while achieving linear attention efficiency
- - Significantly improved needle-in-a-haystack task performance compared to pure RWKV architectures
- - Competitive performance on standard language modeling benchmarks

- ## Run
- - RWKV-Infer now supports hxa079
- ```bash
- curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"/home/client/Projects/llm/hxa079-reka-flash3-stage2-hybrid.pth","model_viewname":"RWKV HXA079 L38T6 Reka Flash3","model_strategy":"int8","adapter_filename":"","adapter_mode":"", "template":"rekaflash3", "endtoken":"\n <sep>","default_temperature":"0.2", "default_top_p":"0.3", "rope_theta":"8000000.0", "rms_norm_eps":"1e-5"}'
- ```

- ## Thank you for the big help :)
- - SmerkyG, inspired by RADLADS (https://arxiv.org/abs/2505.03005)
- - https://github.com/recursal/RADLADS-paper

- ## Training Code
- - https://github.com/OpenMOSE/RWKVInside (still buggy)

  ## Model Card Contact

@@ -97,4 +131,20 @@ OpenMOSE - 2025

  ---

- *Note: This is an experimental model. Performance characteristics and behaviors may differ from both pure RWKV and standard Transformer architectures. Users should thoroughly evaluate the model for their specific use cases.*

  ---
  license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - rwkv
+ - linear-attention
+ - reka
+ - distillation
+ - knowledge-distillation
+ - hybrid-architecture
+ - language-model
  ---
+
  # HRWKV7-Reka-Flash3.1-Preview

+ This is an experimental research model developed as part of the work presented in the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
+
  <div align="center">
  <img src="./hxa079.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
  </div>

+ ## Abstract
+
+ We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at this https URL. Training Code at this https URL.
+
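
As a rough sanity check on the token budget quoted in the abstract, the sketch below compares the 350-700M conversion tokens against an assumed teacher pretraining budget. The 18T-token figure for the Qwen2.5 teachers is an outside assumption, not something stated in this model card.

```python
# Rough sanity check of the abstract's token-budget claim.
# Assumption (not from this card): the Qwen2.5 teachers were pretrained on
# roughly 18T tokens, as publicly reported by the Qwen team.
conversion_tokens = 700e6   # upper end of the 350-700M range
teacher_tokens = 18e12      # assumed teacher pretraining budget
print(f"conversion / pretraining = {conversion_tokens / teacher_tokens:.4%}")  # ~0.0039%, below 0.005%
```
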
  ### Model Description

  HRWKV7-Reka-Flash3.1-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Reka-flash3.1 21B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

+ - **Developed by:** OpenMOSE
+ - **Model type:** Hybrid Linear-Attention Language Model
+ - **Language(s):** Multilingual (inherited from Reka-flash3.1 21B)
+ - **License:** Apache-2.0
+ - **Base Model:** [Reka-flash3.1 21B](https://huggingface.co/RekaAI/reka-flash-3.1)
+ - **Year:** 2025

  ### Architecture Specifications

+ - **Architecture:** RWKV v7 based "hxa079" architecture + Group Query Attention hybrid
+ - **Total Layers:** 44 layers (L44D6144)
+ - 38 RWKV layers (with RoPE)
+ - 6 GQA layers (no RoPE, no position embeddings)
+ - **Hidden Dimension:** 6144
+ - **Training Context Window:** 4096 tokens
+ - **Inference Context Window:** 32768+
+ - **Training Strategy:** Knowledge distillation following the RADLADS method
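
The layer counts above can be summarized in a short, purely illustrative sketch. The model card does not state which 6 of the 44 layers are GQA layers, so the evenly spaced placement below is a hypothetical assumption, not the actual hxa079 configuration.

```python
# Illustrative layout summary for the hxa079 hybrid described above.
# NOTE: the real GQA layer indices are not documented in this card; the evenly
# spaced placement here is a hypothetical assumption for illustration only.
NUM_LAYERS = 44
HIDDEN_DIM = 6144
NUM_GQA_LAYERS = 6

gqa_layer_ids = {round(i * NUM_LAYERS / NUM_GQA_LAYERS) for i in range(NUM_GQA_LAYERS)}
layer_types = ["gqa" if i in gqa_layer_ids else "rwkv7" for i in range(NUM_LAYERS)]

assert layer_types.count("rwkv7") == 38 and layer_types.count("gqa") == 6
print(f"hidden dim {HIDDEN_DIM}: {layer_types.count('rwkv7')} RWKV-7 layers + {layer_types.count('gqa')} GQA layers")
```
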

  ## Technical Innovation

  The model implements several key improvements over original RWKV architectures:

+ 1. **Token Shift Removal**: To inherit the teacher model's weights effectively, we removed the token-shift residual connection to the previous token.
+ 2. **GroupNorm Removal**: Helps mitigate training stability issues
+ 3. **k_first Introduction**: Experimentally adopted a residual connection of the k values from layer 0.
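
To make point 1 concrete, here is a rough, non-authoritative sketch of how the input to the r/k/v projections changes when token shift is removed. It assumes a standard RWKV-7 style lerp-based token shift and is not the authors' implementation.

```python
import torch

def rkv_input_standard_rwkv(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """Standard RWKV-style token shift: mix the current token with the previous one."""
    x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)  # x_{t-1}, zero-padded at t=0
    return x + mu * (x_prev - x)  # lerp(x_t, x_{t-1}, mu)

def rkv_input_hxa079(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the hxa079 variant: token shift removed, projections see x_t directly,
    matching how the teacher's attention projections consume the hidden state."""
    return x
```
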

  ### Hybrid Design Benefits

+ - **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KV cache to 1/7 of a full-GQA model.
+ - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-a-haystack retrieval tasks, addressing a known limitation of pure linear attention models
+ - **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that RWKV blocks provide implicit positional encoding capabilities
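
The 1/7 figure follows from the layer counts given earlier: only the 6 GQA layers keep a KV cache that grows with context, while the 38 RWKV layers carry a fixed-size recurrent state (ignored in this quick check).

```python
# Back-of-the-envelope KV cache comparison against a hypothetical full-GQA stack.
TOTAL_LAYERS = 44
GQA_LAYERS = 6
print(f"KV cache relative to full GQA: {GQA_LAYERS / TOTAL_LAYERS:.3f}")  # ~0.136, roughly 1/7
```
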

  ## Intended Use

  This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

+ - Research into efficient attention mechanisms
+ - Benchmarking hybrid architecture performance
+ - Exploring linear attention limitations and solutions
+ - Academic and industrial R&D purposes

  ## Limitations

+ - **Experimental Status**: This model is in an experimental stage and may exhibit unexpected behaviors
+ - **Context Window**: Limited to 4096 tokens during training, though the RWKV architecture theoretically supports longer sequences
+ - **Performance Variability**: As a hybrid model, performance may vary significantly across different task types

  ## Training Details

+ - **Training Context Window:** 4096 tokens
+ - **Training GPU:** 1x AMD MI300X (about 70 hours) on AMD Developer Cloud
+ - **Training Strategy:** 8-bit MLP quantization; frozen embeddings, MLP, and head; DeepSpeed Stage 1; Stage 1 on 100M tokens, Stage 2 on 360M tokens
+ - **Base Model Initialization:** Weights initialized from Reka-flash3.1 21B
+ - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers
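
For readers who want the recipe above at a glance, here is an illustrative summary. The key names are invented for readability; they are not the actual RWKVInside configuration schema.

```python
# Illustrative summary of the distillation recipe listed above.
# Key names are made up for readability, not the RWKVInside config schema.
training_recipe = {
    "context_window": 4096,
    "hardware": "1x AMD MI300X (AMD Developer Cloud), ~70 hours",
    "frozen_modules": ["embeddings", "mlp", "lm_head"],  # per the card: emb, MLP, and head are frozen
    "mlp_quantization": "int8",
    "deepspeed": "ZeRO Stage 1",
    "stage1_tokens": 100_000_000,
    "stage2_tokens": 360_000_000,
}

total_tokens = training_recipe["stage1_tokens"] + training_recipe["stage2_tokens"]
print(f"total conversion budget: {total_tokens / 1e6:.0f}M tokens")  # 460M, inside the 350-700M RADLADS range
```
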

  ## Evaluation

  Performance evaluation is ongoing. The model shows promising results in:
+ - Maintaining base model capabilities while achieving linear attention efficiency
+ - Significantly improved needle-in-a-haystack task performance compared to pure RWKV architectures
+ - Competitive performance on standard language modeling benchmarks

+ ## Usage with Hugging Face Transformers
+
+ You can load and use this model with the `transformers` library, ensuring `trust_remote_code=True` is set:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "OpenMOSE/HRWKV7-Reka-Flash3.1-Preview"
+
+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # or torch.float16 depending on your hardware/preference
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Example text generation
+ prompt = "Hello, I am a language model, and"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Generate response
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
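
Continuing from the snippet above, a chat-style prompt can be built with the tokenizer's chat template. This is a hedged sketch: it assumes the repository's tokenizer ships a chat template inherited from Reka-flash3.1, which is not confirmed by this card; the sampling values mirror the defaults used in the original RWKV-Infer example.

```python
# Optional chat-style usage (assumes the tokenizer defines a chat template;
# if it does not, fall back to plain text prompts as shown above).
messages = [
    {"role": "user", "content": "Summarize the idea behind hybrid linear attention in two sentences."},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

chat_outputs = model.generate(
    chat_inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,  # conservative defaults from the RWKV-Infer example (temperature 0.2, top_p 0.3)
    top_p=0.3,
)
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```
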
+
+ ## Code Repositories
+
+ - **RADLADS Project Code:** The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
+ - **Specific Training Code (OpenMOSE):** The training code for this particular `HRWKV7-Reka-Flash3.1-Preview` model is available at: [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (Note: this repository is still under development and may contain bugs.)

  ## Model Card Contact

  ---

+ *Note: This is an experimental model. Performance characteristics and behaviors may differ from both pure RWKV and standard Transformer architectures. Users should thoroughly evaluate the model for their specific use cases.*
+
+ ## Citation
+
+ If you use this work or find it valuable, please consider citing the RADLADS paper:
+
+ ```bibtex
+ @misc{goldstein2025radladsrapidattentiondistillation,
+       title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
+       author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
+       year={2025},
+       eprint={2505.03005},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2505.03005},
+ }
+ ```