Improve model card: Add pipeline tag, library, paper, code, and usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +82 -39
README.md CHANGED
@@ -1,29 +1,35 @@
  ---
  license: apache-2.0
  ---
  # HRWKV7-hxa079-Qwen3-8B

  ### Model Description

  HRWKV7-Qwen3-8N-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Qwen3-8B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

- - **Developed by:** OpenMOSE
- - **Model type:** Hybrid Linear-Attention Language Model
- - **Language(s):** Multilingual (inherited from Qwen3-8B)
- - **License:** Apache-2.0
- - **Base Model:** Qwen3-8B
- - **Year:** 2025

  ### Architecture Specifications

- - **Architecture:** RWKV v7 based "hxa079" Architecture + Group Query Attention Hybrid
- - **Total Layers:** 36 layers (L36D4096)
- - 32 RWKV layers (with Rope)
- - 4 GQA layers (No Rope, No Position Embeddings)
- - **Hidden Dimension:** 4096
- - **Training Context Window:** 4096 tokens
- - **Inference Context Window** 16384+
- - **Training Strategy** Following RADLADS method based knowledge distillation

  ## Technical Innovation

@@ -31,54 +37,91 @@ HRWKV7-Qwen3-8N-Preview is an RNN hybrid architecture model that combines RWKV v

  The model implements several key improvements over original RWKV architectures:

- 1. **Token Shift Removal**: In order to effectively inherit the teacher model weights, we removed the residual connection one token ago.
- 2. **GroupNorm Removal**: Helps improve training stability issues
- 3. **k_first Introduction**: Experimentally adopted the approach of residually connecting k layers in layer 0.

  ### Hybrid Design Benefits

- - **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KVCache to 1/9 of full GQA.
- - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models
- - **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to GQA layers, suggesting that RWKV blocks provide implicit positional encoding capabilities

  ## Intended Use

  This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

- - Research into efficient attention mechanisms
- - Benchmarking hybrid architecture performance
- - Exploring linear attention limitations and solutions
- - Academic and industrial R&D purposes

  ## Limitations

- - **Experimental Status**: This model is in experimental stages and may exhibit unexpected behaviors
- - **Context Window**: Limited to 4096 tokens during training, though RWKV architecture theoretically supports longer sequences
- - **Performance Variability**: As a hybrid model, performance may vary significantly across different task types

  ## Training Details

- - **Training Context Window:** 4096 tokens
- - **Training GPU** AMD MI300X x 1(takes 80hrs) Runpod
- - **Training Strategy** 8bit MLP Quant, frozen emb,mlp,head, Deepspeed Stage1, Stage1 100M, Stage2 360M
- - **Base Model Initialization:** Weights initialized from Qwen3-8B
- - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers

  ## Evaluation

  Performance evaluation is ongoing. The model shows promising results in:
- - Maintaining base model capabilities while achieving linear attention efficiency
- - Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
- - Competitive performance on standard language modeling benchmarks

  ## Thank you for Big help :)
- - SmerkyG Inspired by RADLADS (https://arxiv.org/abs/2505.03005)
- - https://github.com/recursal/RADLADS-paper

  ## Training Code
- - https://github.com/OpenMOSE/RWKVInside (still buggy)
-

  ## Model Card Contact

  ---
  license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
+
  # HRWKV7-hxa079-Qwen3-8B

+ **Paper:** [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005)
+ **Code:** [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
+
  ### Model Description

  HRWKV7-Qwen3-8N-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Qwen3-8B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

+ - **Developed by:** OpenMOSE
+ - **Model type:** Hybrid Linear-Attention Language Model
+ - **Language(s):** Multilingual (inherited from Qwen3-8B)
+ - **License:** Apache-2.0
+ - **Base Model:** Qwen3-8B
+ - **Year:** 2025

  ### Architecture Specifications

+ - **Architecture:** RWKV v7-based "hxa079" architecture + Group Query Attention (GQA) hybrid
+ - **Total Layers:** 36 layers (L36D4096; see the config sketch after this list)
+   - 32 RWKV layers (with RoPE)
+   - 4 GQA layers (no RoPE, no position embeddings)
+ - **Hidden Dimension:** 4096
+ - **Training Context Window:** 4096 tokens
+ - **Inference Context Window:** 16384+ tokens
+ - **Training Strategy:** Knowledge distillation following the RADLADS method

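+ The exact RWKV/GQA layout is defined by the custom modeling code shipped with this repository, so the most reliable way to confirm the layer counts above is to load and print the config. A minimal sketch, assuming only the standard `AutoConfig` API (no specific field names exposed by the remote code are assumed):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Load the configuration that ships with the custom remote code and print it;
+ # the hybrid-layout fields are defined by that code, so no attribute names are assumed.
+ config = AutoConfig.from_pretrained(
+     "OpenMOSE/HRWKV7-hxa079-Qwen3-8B",
+     trust_remote_code=True,
+ )
+ print(config)
+ ```
+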
  ## Technical Innovation

  The model implements several key improvements over original RWKV architectures:

+ 1. **Token Shift Removal**: The token-shift residual connection to the previous token's state is removed so that the teacher model's weights can be inherited more effectively.
+ 2. **GroupNorm Removal**: Removing GroupNorm helps with training stability issues.
+ 3. **k_first Introduction**: Experimentally adopts a residual connection for the key (k), anchored at layer 0.

  ### Hybrid Design Benefits

+ - **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KV cache to 1/9 of a full-GQA model (see the sketch after this list).
+ - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-a-haystack retrieval tasks, addressing a known limitation of pure linear attention models.
+ - **Implicit Position Encoding**: Interestingly, the model performs better when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that the RWKV blocks provide implicit positional encoding.

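+ The 1/9 figure follows directly from the layer counts: only the 4 GQA layers out of 36 keep a KV cache, and 4/36 = 1/9. A back-of-envelope sketch of the per-token cache size (the 8 KV heads × 128 head dim and bf16 storage are assumed Qwen3-8B-style values, not measurements):
+
+ ```python
+ # Rough per-token KV-cache estimate: only the GQA layers store K/V.
+ kv_heads, head_dim, bytes_per_value = 8, 128, 2   # assumed Qwen3-8B-style GQA, bf16
+ per_layer = 2 * kv_heads * head_dim * bytes_per_value  # K and V, per token, per layer
+
+ full_gqa_cache = 36 * per_layer  # hypothetical model with GQA in every layer
+ hybrid_cache = 4 * per_layer     # this model: only 4 GQA layers keep a cache
+
+ print(per_layer, full_gqa_cache, hybrid_cache, hybrid_cache / full_gqa_cache)
+ # -> 4096 147456 16384 0.111...  (i.e. ~1/9 of the full-GQA cache)
+ ```
+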
  ## Intended Use

  This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

+ - Research into efficient attention mechanisms
+ - Benchmarking hybrid architecture performance
+ - Exploring linear attention limitations and solutions
+ - Academic and industrial R&D purposes

  ## Limitations

+ - **Experimental Status**: This model is experimental and may exhibit unexpected behaviors
+ - **Context Window**: Training was limited to 4096 tokens, though the RWKV architecture theoretically supports longer sequences
+ - **Performance Variability**: As a hybrid model, performance may vary significantly across task types

  ## Training Details

+ - **Training Context Window:** 4096 tokens
+ - **Training GPU:** 1x AMD MI300X on Runpod (about 80 hours)
+ - **Training Strategy:** 8-bit MLP quantization; frozen embeddings, MLP, and head; DeepSpeed Stage 1; distillation Stage 1 on 100M tokens, Stage 2 on 360M tokens
+ - **Base Model Initialization:** Weights initialized from Qwen3-8B
+ - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for the strategically placed GQA layers

  ## Evaluation

  Performance evaluation is ongoing. The model shows promising results in:
+ - Maintaining base model capabilities while achieving linear attention efficiency
+ - Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
+ - Competitive performance on standard language modeling benchmarks
+
+ ## Sample Usage
+
+ You can use this model with the Hugging Face `transformers` library. Ensure you have `trust_remote_code=True` set, as it uses a custom architecture.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_name = "OpenMOSE/HRWKV7-hxa079-Qwen3-8B"  # the name of this model
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,  # or torch.float16, depending on your system and model precision
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+ prompt = "Tell me a short story about a brave knight named Sir Reginald."
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```

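+ Since the model is distilled from Qwen3-8B, the tokenizer will normally carry over a chat template. Continuing from the snippet above, here is a minimal chat-style sketch, assuming the Qwen3 chat template is retained in this repository (not verified here):
+
+ ```python
+ messages = [
+     {"role": "user", "content": "Explain the idea behind hybrid RWKV/GQA attention in two sentences."},
+ ]
+ # apply_chat_template formats the conversation with the tokenizer's built-in template
+ # (assumed to be inherited from Qwen3-8B).
+ chat_inputs = tokenizer.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     return_tensors="pt",
+ ).to(model.device)
+ chat_outputs = model.generate(chat_inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.95)
+ # Decode only the newly generated tokens.
+ print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
+ ```
+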
  ## Thank you for Big help :)
+ - SmerkyG, whose RADLADS work inspired this model ([https://arxiv.org/abs/2505.03005](https://arxiv.org/abs/2505.03005))
+ - [https://github.com/recursal/RADLADS-paper](https://github.com/recursal/RADLADS-paper) (the primary repository for the RADLADS paper's code)

  ## Training Code
+ - [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (training code specific to this model variant; still buggy)
+
+ ## Citation
+
+ If you use this model or find our work valuable, please consider citing the RADLADS paper:
+
+ ```bibtex
+ @misc{goldstein2025radladsrapidattentiondistillation,
+       title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
+       author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
+       year={2025},
+       eprint={2505.03005},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2505.03005},
+ }
+ ```

  ## Model Card Contact