---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
library_name: transformers
---

<div align="center">
  <img src="./figures/banner.jpg" style="border-radius: 10px; width: 100%; height: 100%; object-fit: cover;  box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="ARWKV" />
</div>


  <h1 align="center">ARWKV🪿</h1>

<p align="center">
  <a href="https://arxiv.org/abs/2501.15570"><b>Paper Link</b>👁️</a>  |  <a href="https://github.com/yynil/RWKVInside"><b>Github</b>✅</a>
</p>

# ARWKV-7B-GATE-MLP (Preview 0.1)

<img src="./figures/architecture.png" alt="ARWKV Hybrid Architecture"  width="30%">

*Preview version with **RWKV-7** time mixing and Transformer MLP*

## 📌 Overview

**ALL YOU NEED IS RWKV**

This is an **early preview** of our 7B-parameter RNN-based model, trained at a 2k context length **(only stage-2 applied, without SFT or DPO)** through 3-stage knowledge distillation from Qwen2.5-7B-Instruct. Although still a foundational version, it demonstrates:

- ✅ RWKV-7's efficient recurrence mechanism
- ✅ No self-attention, fully O(n)
- ✅ Constant VRAM usage
- ✅ Single-GPU trainability

**Roadmap Notice**: We will soon open-source enhanced versions with:
- 🚀 16k+ context capability
- 🧮 Math-specific improvements
- 📚 RL-enhanced reasoning model

## How to use
```shell
pip3 install --upgrade rwkv-fla transformers
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "RWKV-Red-Team/ARWKV-7B-Preview-0.1",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV-Red-Team/ARWKV-7B-Preview-0.1"
)
```
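
A minimal generation sketch follows, assuming the bundled tokenizer ships a Qwen-style chat template (the prompt and sampling settings are illustrative, not part of the official card):

```python
# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain RWKV's recurrence in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sample a response; tune max_new_tokens / temperature for your use case.
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```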

## 🔑 Key Features
| Component | Specification | Note |
|-----------|---------------|------|
| Architecture | RWKV-7 TimeMix + SwiGLU | Hybrid design |
| Context Window | 2048 training CTX | *Preview limitation* |
| Training Tokens | 40M | Distillation-focused |
| Precision | FP16 inference recommended (16 GB VRAM required) | 15%↑ vs BF16 |

## ๐Ÿ—๏ธ Architecture Highlights
### Core Modification Flow
```diff
Qwen2.5 Decoder Layer:
- Grouped Query Attention
+ RWKV-7 Time Mixing (Eq.3)
- RoPE Positional Encoding
+ State Recurrence
= Hybrid Layer Output
```
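
For orientation, here is a minimal, hypothetical PyTorch sketch of the resulting hybrid layer. The `time_mixer` interface and module names are illustrative placeholders and do not mirror the actual code in the RWKVInside repository; `LayerNorm` stands in for the RMSNorm used in the real model.

```python
import torch
import torch.nn as nn


class HybridDecoderLayer(nn.Module):
    """Conceptual sketch: Qwen2.5 block with GQA replaced by RWKV-7 time mixing."""

    def __init__(self, hidden_size: int, intermediate_size: int, time_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)  # RMSNorm in the real model
        self.time_mixer = time_mixer            # RWKV-7 time mixing; stateful, no RoPE needed
        self.norm2 = nn.LayerNorm(hidden_size)
        # SwiGLU MLP kept unchanged from the Transformer block
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, state=None):
        # Token mixing: RWKV-7 state recurrence replaces grouped-query attention.
        mixed, state = self.time_mixer(self.norm1(x), state)
        x = x + mixed
        # Channel mixing: standard SwiGLU feed-forward.
        h = self.norm2(x)
        x = x + self.down_proj(nn.functional.silu(self.gate_proj(h)) * self.up_proj(h))
        return x, state
```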