---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
repo_url: https://github.com/recursal/RADLADS-paper
---

This repository contains various checkpoints for ablations and other unusual models from the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).

The file numbering is currently off by one from the step numbers shown in the paper: for example, `L28-D3584-qwen2-rwkv6-2.pth` is in fact the result of step 1 in the paper (see the sketch below the table).

| checkpoint | step number | teacher | student | description |
|-|-|-|-|-|
|L28-D3584-qwen2-rwkv6-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6||
|L28-D3584-qwen2-rwkv6-3-250m.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|trained on 250M tokens|
|L28-D3584-qwen2-rwkv6-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6||
|L28-D3584-qwen2-rwkv6-base-2.pth|1|Qwen2.5-7B|RAD-RWKV6||
|L28-D3584-qwen2-rwkv7-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV7||
|L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7|no RoPE used; `w0` must be multiplied by 2 due to a code mistake|
|L28-D3584-qwen2-rwkv7-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7||
|L28-D3584-qwerky6_qwen2-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6||
|L28-D3584-qwerky6_qwen2-base-3.pth|2|Qwen2.5-7B|RAD-RWKV6||
|L28-D3584-qwerky6_qwen2-groupnorm-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use groupnorm instead of state balancing|
|L28-D3584-qwerky6_qwen2-groupnorm-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use groupnorm instead of state balancing|
|L28-D3584-qwerky6_qwen2-no_gate-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no gate|
|L28-D3584-qwerky6_qwen2-no_gate-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no gate|
|L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no token shift|
|L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no token shift|
|L28-D3584-qwerky6_qwen2-use_rope-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use RoPE|
|L28-D3584-qwerky6_qwen2-use_rope-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use RoPE|
|L28-D3584-qwerky7_qwen2-2-4k.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV7|4k context length training|
|L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7|4k context length training, early checkpoint|
|L28-D3584-qwerky7_qwen2-3-4k.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7|4k context length training|
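
As a quick sanity check on the off-by-one numbering, the paper's step number can be recovered from a filename programmatically. A minimal sketch (the `paper_step` helper and its regex are illustrative, not part of the released code; it assumes the run-stage digit is always 2 or 3, as in the table above):

```python
import re

def paper_step(filename: str) -> int:
    """Map a checkpoint filename to its step number in the paper.

    The run-stage digit embedded in each filename (2 or 3) is one
    higher than the paper's step number, so subtract one.
    """
    # Match the first standalone 2 or 3 delimited by '-' on the left and
    # '-' or '.' on the right (this skips digits inside e.g. 'D3584').
    m = re.search(r"-([23])(?=[-.])", filename)
    if m is None:
        raise ValueError(f"no run-stage digit found in {filename!r}")
    return int(m.group(1)) - 1

assert paper_step("L28-D3584-qwen2-rwkv6-2.pth") == 1
assert paper_step("L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth") == 2
```
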
## Usage

This repository contains raw PyTorch `.pth` checkpoints from the RADLADS paper, primarily intended for research, ablation studies, and conversion. To use these models with the Hugging Face `transformers` library, you will generally need to convert them to the Hugging Face format first.
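
If you just want to inspect a raw checkpoint before conversion, it can be loaded directly with `torch`. A minimal sketch, assuming the `.pth` file holds a plain state dict (the path is whichever checkpoint you downloaded from this repository):

```python
import torch

# Load a raw RADLADS checkpoint on CPU; assumes the file is a plain state dict.
state_dict = torch.load(
    "L28-D3584-qwen2-rwkv6-2.pth",  # path to a checkpoint from this repo
    map_location="cpu",
    weights_only=True,  # safer loading; requires a recent PyTorch
)

# Print the first few parameter names and shapes to verify the contents.
for name, tensor in list(state_dict.items())[:10]:
    print(f"{name}: {tuple(tensor.shape)}")
```
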
Please refer to the original GitHub repository for detailed instructions on how to convert these checkpoints to Hugging Face-compatible formats and for specific usage examples: [https://github.com/recursal/RADLADS-paper](https://github.com/recursal/RADLADS-paper)

For models already converted to Hugging Face format and ready for direct use, please refer to the main [Recursal RADLADS collection](https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102) on the Hugging Face Hub.

A conceptual example of loading a text-generation model with `transformers` (after it has been converted to Hugging Face format, or when using a model from the main collection):
```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Replace "recursal/RADLADS-RWKV7-Qwen2.5-7B" with the actual ID of a converted model
# from the Recursal RADLADS collection, or your local path to a converted model.
model_name = "recursal/RADLADS-RWKV7-Qwen2.5-7B"

try:
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # adjust dtype based on the model
        device_map="auto",
        trust_remote_code=True,
    )
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    prompt = "The key to life is"
    print(pipe(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"])

except Exception as e:
    print(f"Could not load the model directly: {e}")
    print("This repository contains raw checkpoints that require conversion.")
    print("See https://github.com/recursal/RADLADS-paper for conversion and usage instructions,")
    print("or explore pre-converted models: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102")
```
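
Because most teachers here are instruct-tuned Qwen2.5 models, a converted checkpoint will typically ship with Qwen2.5's chat template. A hedged sketch of chat-style generation, using the same assumed model ID as above:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Same assumed model ID as above; substitute a converted model or local path.
model_name = "recursal/RADLADS-RWKV7-Qwen2.5-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Format the conversation with the tokenizer's built-in chat template.
messages = [{"role": "user", "content": "Explain linear attention in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=True)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
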
## Citation

If you use this code or find our work valuable, please consider citing RADLADS:
```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
      title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
      author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
      year={2025},
      eprint={2505.03005},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.03005},
}
```