KaraKaraWitch, nielsr (HF Staff) committed
Commit 7201443 · verified · 1 parent: b76e9a6

Enhance model card with detailed description, usage examples, and citation (#2)


- Enhance model card with detailed description, usage examples, and citation (662cd934b7f4af614f79b32a96414af015ffb943)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1):
1. README.md (+87 -2)
README.md CHANGED
@@ -1,9 +1,94 @@
  ---
- license: apache-2.0
  library_name: transformers
+ license: apache-2.0
  pipeline_tag: text-generation
  ---

+ # RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
+
  This repository contains the model described in the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).

- Github repository: https://github.com/recursal/Monet
+ **RADLADS** (Rapid Attention Distillation to Linear Attention Decoders at Scale) is a protocol for rapidly converting softmax attention transformers into linear attention decoder models. The conversion requires only 350-700 million tokens of distillation, less than 0.005% of the tokens used to train the original teacher models, yet inference quality remains remarkably close to that of the original transformer.
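+
+ At a high level, the conversion swaps each softmax attention layer for a linear attention layer and trains the replacement to mimic the teacher. The paper's exact recipe is in the GitHub repository; the snippet below is only a minimal sketch of two objectives commonly used in such conversions (per-layer output alignment, then logit distillation), with hypothetical `student_outs`/`teacher_outs` names, not the repository's actual training code:
+
+ ```python
+ import torch.nn.functional as F
+
+ def layer_alignment_loss(student_outs, teacher_outs):
+     # Per-layer MSE between the student's linear-attention outputs and the
+     # teacher's softmax-attention outputs, each (batch, seq_len, hidden_dim).
+     return sum(F.mse_loss(s, t) for s, t in zip(student_outs, teacher_outs))
+
+ def logit_distillation_loss(student_logits, teacher_logits):
+     # KL divergence from the teacher's next-token distribution to the student's.
+     return F.kl_div(
+         F.log_softmax(student_logits, dim=-1),
+         F.softmax(teacher_logits, dim=-1),
+         reduction="batchmean",
+     )
+ ```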
+
+ These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size, while offering constant-time inference per token. The project also introduces two new RWKV-variant architectures, RAD-RWKV6 and RAD-RWKV7, which serve as efficient destination architectures for transformer conversions.
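+
+ The constant-time property comes from the linear attention form itself: instead of attending over a KV cache that grows with sequence length, the layer folds each new token into a fixed-size state. A simplified sketch of plain linear attention decoding (omitting the decay and gating terms that the RWKV-style kernels add):
+
+ ```python
+ import torch
+
+ def linear_attention_step(S, q, k, v):
+     # S: fixed-size (d_k, d_v) state; q, k: (d_k,) vectors; v: (d_v,) vector.
+     S = S + torch.outer(k, v)  # fold the new token into the state
+     return q @ S, S            # this token's output and the updated state
+ ```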
+
+ We release all our models on Hugging Face under the Apache 2.0 license. Please note that our 72B models are additionally governed by the Qwen License Agreement.
+
+ GitHub repository: https://github.com/recursal/Monet
+
+ <div align="center">
+ <img src="https://github.com/recursal/Monet/raw/main/assets/radlads_process.png" height=63 alt="RADLADS conversion process" />
+ <img src="https://github.com/recursal/Monet/raw/main/assets/radlads_evals.png" height=275 alt="RADLADS evals" />
+ </div>
+
+ ## Quickstart
+
+ You can explore the core implementation of RADLADS in the [GitHub repository](https://github.com/recursal/Monet). To use these models with the Hugging Face `transformers` library, pass `trust_remote_code=True` when loading them, since the architecture relies on custom modeling code shipped with the checkpoints.
+
+ ### Text Generation
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Replace with the actual model ID (e.g., recursal/radrwkv7qwen2-7b-instruct)
+ model_id = "your-model-id-here"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ ).eval()
+
+ text = "The quick brown fox jumps over the lazy"
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
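+
+ For interactive use, tokens can be streamed as they are generated with the standard `transformers` `TextStreamer` utility (an optional extension of the example above, reusing `model`, `tokenizer`, and `inputs` from it):
+
+ ```python
+ from transformers import TextStreamer
+
+ # Print tokens to stdout as soon as they are generated.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, streamer=streamer)
+ ```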
+
+ ### Chat Completion
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Replace with the actual model ID (e.g., recursal/radrwkv7qwen2-7b-instruct)
+ model_id = "your-model-id-here"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ ).eval()
+
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant."},
+     {"role": "user", "content": "What is the capital of France?"}
+ ]
+
+ # Apply the chat template and generate a response
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
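+
+ Note that `generate` returns the prompt tokens together with the completion, so the decoded string above includes the rendered chat template. To print only the assistant's reply, slice off the prompt first:
+
+ ```python
+ # Decode only the newly generated tokens, skipping the chat prompt.
+ reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(reply)
+ ```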
+
+ ## Citation
+
+ If you use this code or find our work valuable, please consider citing RADLADS:
+
+ ```bibtex
+ @misc{goldstein2025radladsrapidattentiondistillation,
+     title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
+     author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
+     year={2025},
+     eprint={2505.03005},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2505.03005},
+ }
+ ```