KaraKaraWitch, nielsr (HF Staff) committed
Commit 7201443 · verified · 1 parent: b76e9a6

Enhance model card with detailed description, usage examples, and citation (#2)


- Enhance model card with detailed description, usage examples, and citation (662cd934b7f4af614f79b32a96414af015ffb943)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1):
1. README.md (+87 -2)
README.md CHANGED
@@ -1,9 +1,94 @@
  ---
- license: apache-2.0
  library_name: transformers
+ license: apache-2.0
  pipeline_tag: text-generation
  ---

+ # RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
+
  This repository contains the model described in the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).

- Github repository: https://github.com/recursal/Monet
+ **RADLADS** (Rapid Attention Distillation to Linear Attention Decoders at Scale) is a protocol for rapidly converting softmax attention transformers into linear attention decoder models. The conversion requires only 350-700 million tokens of distillation, less than 0.005% of the tokens used to train the original teacher models, yet inference quality remains remarkably close to that of the original transformer.
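+
+ At a high level, the conversion swaps each softmax attention layer for a linear attention layer and trains the replacement to mimic the teacher. The paper's exact recipe is in the GitHub repository; the snippet below is only a minimal sketch of two objectives commonly used in such conversions (per-layer output alignment, then logit distillation), with hypothetical `student_outs`/`teacher_outs` names, not the repository's actual training code:
+
+ ```python
+ import torch.nn.functional as F
+
+ def layer_alignment_loss(student_outs, teacher_outs):
+     # Per-layer MSE between the student's linear-attention outputs and the
+     # teacher's softmax-attention outputs, each (batch, seq_len, hidden_dim).
+     return sum(F.mse_loss(s, t) for s, t in zip(student_outs, teacher_outs))
+
+ def logit_distillation_loss(student_logits, teacher_logits):
+     # KL divergence from the teacher's next-token distribution to the student's.
+     return F.kl_div(
+         F.log_softmax(student_logits, dim=-1),
+         F.softmax(teacher_logits, dim=-1),
+         reduction="batchmean",
+     )
+ ```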
+
+ These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size, while offering constant-time inference per token. The project also introduces two new RWKV-variant architectures, RAD-RWKV6 and RAD-RWKV7, which serve as efficient destination architectures for transformer conversions.
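+
+ The constant-time property comes from the linear attention form itself: instead of attending over a KV cache that grows with sequence length, the layer folds each new token into a fixed-size state. A simplified sketch of plain linear attention decoding (omitting the decay and gating terms that the RWKV-style kernels add):
+
+ ```python
+ import torch
+
+ def linear_attention_step(S, q, k, v):
+     # S: fixed-size (d_k, d_v) state; q, k: (d_k,) vectors; v: (d_v,) vector.
+     S = S + torch.outer(k, v)  # fold the new token into the state
+     return q @ S, S            # this token's output and the updated state
+ ```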
+
+ We release all our models on Hugging Face under the Apache 2.0 license. Please note that our 72B models are additionally governed by the Qwen License Agreement.
+
+ GitHub repository: https://github.com/recursal/Monet
+
+ <div align="center">
+ <img src="https://github.com/recursal/Monet/raw/main/assets/radlads_process.png" height=63 alt="RADLADS conversion process" />
+ <img src="https://github.com/recursal/Monet/raw/main/assets/radlads_evals.png" height=275 alt="RADLADS evals" />
+ </div>
+
+ ## Quickstart
+
+ You can explore the core implementation of RADLADS in the [GitHub repository](https://github.com/recursal/Monet). To use these models with the Hugging Face `transformers` library, pass `trust_remote_code=True` when loading them, since the architecture relies on custom modeling code shipped with the checkpoints.
+
+ ### Text Generation
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Replace with the actual model ID (e.g., recursal/radrwkv7qwen2-7b-instruct)
+ model_id = "your-model-id-here"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ ).eval()
+
+ text = "The quick brown fox jumps over the lazy"
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
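+
+ For interactive use, tokens can be streamed as they are generated with the standard `transformers` `TextStreamer` utility (an optional extension of the example above, reusing `model`, `tokenizer`, and `inputs` from it):
+
+ ```python
+ from transformers import TextStreamer
+
+ # Print tokens to stdout as soon as they are generated.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, streamer=streamer)
+ ```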
+
+ ### Chat Completion
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Replace with the actual model ID (e.g., recursal/radrwkv7qwen2-7b-instruct)
+ model_id = "your-model-id-here"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ ).eval()
+
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant."},
+     {"role": "user", "content": "What is the capital of France?"}
+ ]
+
+ # Apply the chat template and generate a response
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
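+
+ Note that `generate` returns the prompt tokens together with the completion, so the decoded string above includes the rendered chat template. To print only the assistant's reply, slice off the prompt first:
+
+ ```python
+ # Decode only the newly generated tokens, skipping the chat prompt.
+ reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(reply)
+ ```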
+
+ ## Citation
+
+ If you use this code or find our work valuable, please consider citing RADLADS:
+
+ ```bibtex
+ @misc{goldstein2025radladsrapidattentiondistillation,
+     title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
+     author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
+     year={2025},
+     eprint={2505.03005},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2505.03005},
+ }
+ ```