---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
library_name: transformers
---
<div align="center">
<img src="./figures/banner.jpg" style="border-radius: 10px; width: 100%; height: 100%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="ARWKV" />
</div>
<h1 align="center">ARWKV🪿</h1>
<p align="center">
<a href="https://arxiv.org/abs/2501.15570"><b>Paper Link</b></a> | <a href="https://github.com/yynil/RWKVInside"><b>Github</b></a>
</p>
# ARWKV-7B-GATE-MLP (Preview 0.1)
<img src="./figures/architecture.png" alt="ARWKV Hybrid Architecture" width="30%">
*Preview version with **RWKV-7** time mixing and Transformer MLP*
## 🚀 Overview
**ALL YOU NEED IS RWKV**
This is an **early preview** of our 7B-parameter RNN-based model, trained at 2k context length **(only stage 2 applied, without SFT or DPO)** through 3-stage knowledge distillation from Qwen2.5-7B-Instruct. While still a foundational version, it already demonstrates:
- ✅ RWKV-7's efficient recurrence mechanism (see the sketch below)
- ✅ No self-attention, fully O(n)
- ✅ Constant VRAM usage
- ✅ Single-GPU trainability
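To make the O(n) and constant-VRAM claims concrete, here is a deliberately simplified linear-recurrence sketch in PyTorch. It is **not** the actual RWKV-7 time-mixing update (Eq. 3 in the paper); the dimensions, decay handling, and names are illustrative assumptions only.
```python
# Illustrative only: a generic linear-recurrence "time mixing" step, not the
# real RWKV-7 update. It shows why inference is O(n) with constant memory:
# each token updates a fixed-size state instead of attending over a growing KV cache.
import torch

d = 64                           # head dimension (hypothetical)
state = torch.zeros(d, d)        # fixed-size recurrent state, independent of sequence length
decay = 0.9                      # toy scalar decay

def step(state, k, v, q, decay):
    """Process one token: update the state, then read from it."""
    state = decay * state + torch.outer(k, v)   # O(d^2) per token, so O(n) over the sequence
    y = q @ state                                # output for this token
    return state, y

tokens = torch.randn(10, 3, d)   # dummy (seq_len, [k|v|q], d) inputs
for k, v, q in tokens:
    state, y = step(state, k, v, q, decay)       # VRAM stays constant as the sequence grows
```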
**Roadmap Notice**: We will soon open-source enhanced versions featuring:
- 📈 16k+ context capability
- 🧮 Math-specific improvements
- 🚀 RL-enhanced reasoning model
## How to use
```shell
pip3 install --upgrade rwkv-fla transformers
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "RWKV-Red-Team/ARWKV-7B-Preview-0.1",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV-Red-Team/ARWKV-7B-Preview-0.1"
)
```
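A minimal generation sketch follows, assuming the tokenizer bundles a standard chat template; the prompt and sampling settings are illustrative, not recommended defaults.
```python
# Illustrative usage; sampling parameters are arbitrary.
messages = [{"role": "user", "content": "Explain RWKV's recurrence in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Strip the prompt tokens and decode only the newly generated text
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```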
## 📊 Key Features
| Component | Specification | Note |
|-----------|---------------|------|
| Architecture | RWKV-7 TimeMix + SwiGLU | Hybrid design |
| Context Window | 2048 training CTX | *Preview limitation* |
| Training Tokens | 40M | Distillation-focused |
| Precision | FP16 inference recommended (16 GB VRAM required) | 15%↑ vs BF16 |
## 🏗️ Architecture Highlights
### Core Modification Flow
```diff
Qwen2.5 Decoder Layer:
- Grouped Query Attention
+ RWKV-7 Time Mixing (Eq.3)
- RoPE Positional Encoding
+ State Recurrence
= Hybrid Layer Output
```
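For orientation, here is a structural sketch of such a hybrid decoder layer in PyTorch. Module names and the `time_mixer` interface are hypothetical stand-ins, not the released implementation: the Qwen2.5 RMSNorm + SwiGLU MLP is kept, while grouped-query attention and RoPE are replaced by an RWKV-7 time-mixing block carrying a recurrent state.
```python
# Structural sketch only (module names are hypothetical; requires PyTorch >= 2.4 for nn.RMSNorm).
import torch.nn as nn

class HybridDecoderLayer(nn.Module):
    def __init__(self, hidden_size, intermediate_size, time_mixer: nn.Module):
        super().__init__()
        self.input_layernorm = nn.RMSNorm(hidden_size)
        self.time_mix = time_mixer                      # RWKV-7 time mixing (replaces GQA + RoPE)
        self.post_attention_layernorm = nn.RMSNorm(hidden_size)
        # SwiGLU MLP kept from the Transformer side
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()

    def forward(self, x, state=None):
        # Time mixing with residual connection; `state` is the fixed-size recurrent state
        mixed, state = self.time_mix(self.input_layernorm(x), state)
        x = x + mixed
        # SwiGLU MLP with residual connection
        h = self.post_attention_layernorm(x)
        x = x + self.down_proj(self.act(self.gate_proj(h)) * self.up_proj(h))
        return x, state
```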