Spaces:
				
			
			
	
			
			
		Build error
		
	
	
	
			
			
	
	
	
	
		
		
		Build error
		
	File size: 1,926 Bytes
			
			| 292c2df | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | > RWKV: RNN with Transformer-level LLM Performance
>
> It combines the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).
https://github.com/BlinkDL/RWKV-LM
https://github.com/BlinkDL/ChatRWKV
## Using RWKV in the web UI
#### 1. Download the model
It is available in different sizes:
* https://huggingface.co/BlinkDL/rwkv-4-pile-3b/
* https://huggingface.co/BlinkDL/rwkv-4-pile-7b/
* https://huggingface.co/BlinkDL/rwkv-4-pile-14b/
There are also older releases with smaller sizes like:
* https://huggingface.co/BlinkDL/rwkv-4-pile-169m/resolve/main/RWKV-4-Pile-169M-20220807-8023.pth
Download the chosen `.pth` and put it directly in the `models` folder. 
#### 2. Download the tokenizer
[20B_tokenizer.json](https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/v2/20B_tokenizer.json)
Also put it directly in the `models` folder. Make sure to not rename it. It should be called `20B_tokenizer.json`.
#### 3. Launch the web UI
No additional steps are required. Just launch it as you would with any other model.
```
python server.py --listen  --no-stream --model RWKV-4-Pile-169M-20220807-8023.pth
```
## Setting a custom strategy
It is possible to have very fine control over the offloading and precision for the model with the `--rwkv-strategy` flag. Possible values include:
```
"cpu fp32" # CPU mode
"cuda fp16" # GPU mode with float16 precision
"cuda fp16 *30 -> cpu fp32" # GPU+CPU offloading. The higher the number after *, the higher the GPU allocation.
"cuda fp16i8" # GPU mode with 8-bit precision
```
See the README for the PyPl package for more details: https://pypi.org/project/rwkv/
## Compiling the CUDA kernel
You can compile the CUDA kernel for the model with `--rwkv-cuda-on`. This should improve the performance a lot but I haven't been able to get it to work yet. | 
