|
---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
library_name: transformers
---
|
|
|
<div align="center"> |
|
<img src="./figures/banner.jpg" style="border-radius: 10px; width: 100%; height: 100%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="ARWKV" /> |
|
</div> |
|
|
|
|
|
<h1 align="center">ARWKV 🪿</h1>
|
|
|
<p align="center"> |
|
<a href="https://arxiv.org/abs/2501.15570"><b>Paper Link</b></a> | <a href="https://github.com/yynil/RWKVInside"><b>GitHub</b></a>
|
</p> |
|
|
|
# ARWKV-7B-GATE-MLP (Preview 0.1) |
|
|
|
<img src="./figures/architecture.png" alt="ARWKV Hybrid Architecture" width="30%"> |
|
|
|
*Preview version with **RWKV-7** time mixing and Transformer MLP* |
|
|
|
## Overview
|
|
|
**ALL YOU NEED IS RWKV** |
|
|
|
This is an **early preview** of our 7B-parameter RNN-based model, trained at a 2k context length **(only stage-2 applied, without SFT or DPO)** via 3-stage knowledge distillation from Qwen2.5-7B-Instruct. While still a foundational version, it demonstrates:
|
|
|
- ✅ RWKV-7's efficient recurrence mechanism

- ✅ No self-attention, fully O(n)

- ✅ Constant VRAM usage (illustrated in the sketch below)

- ✅ Single-GPU trainability

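To make the constant-VRAM point concrete, here is a small, purely illustrative back-of-envelope comparison. The layer, head, and state sizes below are generic placeholder numbers, not ARWKV-7B's actual configuration:

```python
# Toy memory comparison: a Transformer KV cache grows with sequence length,
# while a recurrent state does not. All shapes/dtypes are illustrative placeholders.
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    # keys + values, per layer, per cached token (fp16 -> 2 bytes per value)
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

def recurrent_state_bytes(layers=32, state_values_per_layer=64 * 64 * 32, bytes_per=2):
    # one fixed-size state per layer, independent of how many tokens were processed
    return layers * state_values_per_layer * bytes_per

for n in (2_048, 16_384, 131_072):
    print(f"ctx={n:>7}: KV cache ≈ {kv_cache_bytes(n) / 2**20:8.1f} MiB, "
          f"recurrent state ≈ {recurrent_state_bytes() / 2**20:6.1f} MiB")
```
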
**Roadmap Notice**: We will soon open-source enhanced versions with:
|
- 16k+ context capability

- 🧮 Math-specific improvements

- RL-enhanced reasoning model
|
|
|
## How to use |
|
```shell
pip3 install --upgrade rwkv-fla transformers
```
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "RWKV-Red-Team/ARWKV-7B-Preview-0.1",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV-Red-Team/ARWKV-7B-Preview-0.1"
)
```
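For a quick smoke test, a minimal generation sketch along these lines should work. It assumes the bundled tokenizer ships a chat template; the prompt, sampling settings, and `max_new_tokens` are placeholders to adjust:

```python
prompt = "Introduce yourself in one sentence."
messages = [{"role": "user", "content": prompt}]

# Build the chat-formatted input ids and move them to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sampled decoding; the recurrent state keeps memory flat as the output grows.
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Strip the prompt tokens and print only the newly generated continuation.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```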
|
|
|
## Key Features

| Component | Specification | Note |
|-----------|---------------|------|
| Architecture | RWKV-7 TimeMix + SwiGLU | Hybrid design |
| Context window | 2048 tokens (training CTX) | *Preview limitation* |
| Training tokens | 40M | Distillation-focused |
| Precision | FP16 inference recommended (16 GB VRAM required) | 15%↑ vs BF16 |
|
|
|
## Architecture Highlights
|
### Core Modification Flow |
|
```diff
Qwen2.5 Decoder Layer:
- Grouped Query Attention
+ RWKV-7 Time Mixing (Eq. 3)
- RoPE Positional Encoding
+ State Recurrence
= Hybrid Layer Output
```
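For intuition, here is a highly simplified PyTorch sketch of how such a hybrid block could be composed. It is not the released implementation: `SimpleTimeMix` is a toy per-channel decayed recurrence standing in for the real RWKV-7 time-mixing kernel, LayerNorm stands in for RMSNorm, and all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTimeMix(nn.Module):
    """Toy stand-in for RWKV-7 time mixing: a per-channel decayed running state.
    It replaces self-attention, so there is no KV cache and no O(n^2) term."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.full((dim,), 0.9))
        self.proj_in = nn.Linear(dim, dim, bias=False)
        self.proj_out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                       # x: (batch, seq, dim)
        v = self.proj_in(x)
        state = torch.zeros_like(v[:, 0])       # fixed-size recurrent state
        outs = []
        for t in range(v.shape[1]):             # O(n) scan over the sequence
            state = self.decay * state + (1 - self.decay) * v[:, t]
            outs.append(state)
        return self.proj_out(torch.stack(outs, dim=1))

class SwiGLU(nn.Module):
    """Transformer-style SwiGLU feed-forward kept from the original block."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class HybridBlock(nn.Module):
    """Attention sublayer swapped for time mixing; pre-norms and residuals kept."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)          # the real model uses RMSNorm
        self.norm2 = nn.LayerNorm(dim)
        self.time_mix = SimpleTimeMix(dim)
        self.mlp = SwiGLU(dim, hidden)

    def forward(self, x):
        x = x + self.time_mix(self.norm1(x))    # was: the self-attention sublayer
        return x + self.mlp(self.norm2(x))      # unchanged MLP sublayer

print(HybridBlock()(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 16, 64])
```

The point of the swap is that sequence mixing now runs on a fixed-size recurrent state instead of a growing KV cache, which is where the O(n) and constant-VRAM properties listed above come from.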