---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
repo_url: https://github.com/recursal/RADLADS-paper
---

This repository contains various checkpoints for ablations and other unusual models from the paper [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).

The file numbering is currently off by one from the step numbers shown in the paper: for example, `L28-D3584-qwen2-rwkv6-2.pth` is in fact the result of step 1 in the paper (see the sketch below the table).

| checkpoint | step number | teacher | student | description |
|-|-|-|-|-|
|L28-D3584-qwen2-rwkv6-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6||
|L28-D3584-qwen2-rwkv6-3-250m.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|trained on 250M tokens|
|L28-D3584-qwen2-rwkv6-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6||
|L28-D3584-qwen2-rwkv6-base-2.pth|1|Qwen2.5-7B|RAD-RWKV6||
|L28-D3584-qwen2-rwkv7-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV7||
|L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7|no RoPE used; `w0` must be multiplied by 2 due to a code mistake|
|L28-D3584-qwen2-rwkv7-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7||
|L28-D3584-qwerky6_qwen2-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6||
|L28-D3584-qwerky6_qwen2-base-3.pth|2|Qwen2.5-7B|RAD-RWKV6||
|L28-D3584-qwerky6_qwen2-groupnorm-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use groupnorm instead of state balancing|
|L28-D3584-qwerky6_qwen2-groupnorm-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use groupnorm instead of state balancing|
|L28-D3584-qwerky6_qwen2-no_gate-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no gate|
|L28-D3584-qwerky6_qwen2-no_gate-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no gate|
|L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no token shift|
|L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: no token shift|
|L28-D3584-qwerky6_qwen2-use_rope-2.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use RoPE|
|L28-D3584-qwerky6_qwen2-use_rope-3.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV6|ablation study: use RoPE|
|L28-D3584-qwerky7_qwen2-2-4k.pth|1|Qwen2.5-7B-Instruct|RAD-RWKV7|4k context length training|
|L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7|4k context length training, early checkpoint|
|L28-D3584-qwerky7_qwen2-3-4k.pth|2|Qwen2.5-7B-Instruct|RAD-RWKV7|4k context length training|
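
As a quick sanity check on the off-by-one numbering, the paper's step number can be recovered from a filename programmatically. A minimal sketch (the `paper_step` helper and its regex are illustrative, not part of the released code; it assumes the run-stage digit is always 2 or 3, as in the table above):

```python
import re

def paper_step(filename: str) -> int:
    """Map a checkpoint filename to its step number in the paper.

    The run-stage digit embedded in each filename (2 or 3) is one
    higher than the paper's step number, so subtract one.
    """
    # Match the first standalone 2 or 3 delimited by '-' on the left and
    # '-' or '.' on the right (this skips digits inside e.g. 'D3584').
    m = re.search(r"-([23])(?=[-.])", filename)
    if m is None:
        raise ValueError(f"no run-stage digit found in {filename!r}")
    return int(m.group(1)) - 1

assert paper_step("L28-D3584-qwen2-rwkv6-2.pth") == 1
assert paper_step("L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth") == 2
```
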
## Usage

This repository contains raw PyTorch `.pth` checkpoints from the RADLADS paper, primarily intended for research, ablation studies, and conversion. To use these models with the Hugging Face `transformers` library, you will generally need to convert them to the Hugging Face format first.
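
If you just want to inspect a raw checkpoint before conversion, it can be loaded directly with `torch`. A minimal sketch, assuming the `.pth` file holds a plain state dict (the path is whichever checkpoint you downloaded from this repository):

```python
import torch

# Load a raw RADLADS checkpoint on CPU; assumes the file is a plain state dict.
state_dict = torch.load(
    "L28-D3584-qwen2-rwkv6-2.pth",  # path to a checkpoint from this repo
    map_location="cpu",
    weights_only=True,  # safer loading; requires a recent PyTorch
)

# Print the first few parameter names and shapes to verify the contents.
for name, tensor in list(state_dict.items())[:10]:
    print(f"{name}: {tuple(tensor.shape)}")
```
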
Please refer to the original GitHub repository for detailed instructions on how to convert these checkpoints to Hugging Face-compatible formats and for specific usage examples: [https://github.com/recursal/RADLADS-paper](https://github.com/recursal/RADLADS-paper)

For models already converted to Hugging Face format and ready for direct use, please refer to the main [Recursal RADLADS collection](https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102) on the Hugging Face Hub.

A conceptual example of loading a text-generation model with `transformers` (after it has been converted to Hugging Face format, or when using a model from the main collection):
```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Replace "recursal/RADLADS-RWKV7-Qwen2.5-7B" with the actual ID of a converted model
# from the Recursal RADLADS collection, or your local path to a converted model.
model_name = "recursal/RADLADS-RWKV7-Qwen2.5-7B"

try:
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # adjust dtype based on the model
        device_map="auto",
        trust_remote_code=True,
    )
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    prompt = "The key to life is"
    print(pipe(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"])

except Exception as e:
    print(f"Could not load the model directly: {e}")
    print("This repository contains raw checkpoints that require conversion.")
    print("See https://github.com/recursal/RADLADS-paper for conversion and usage instructions,")
    print("or explore pre-converted models: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102")
```
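
Because most teachers here are instruct-tuned Qwen2.5 models, a converted checkpoint will typically ship with Qwen2.5's chat template. A hedged sketch of chat-style generation, using the same assumed model ID as above:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Same assumed model ID as above; substitute a converted model or local path.
model_name = "recursal/RADLADS-RWKV7-Qwen2.5-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Format the conversation with the tokenizer's built-in chat template.
messages = [{"role": "user", "content": "Explain linear attention in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=True)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
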
## Citation

If you use this code or find our work valuable, please consider citing RADLADS:
```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
      title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
      author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
      year={2025},
      eprint={2505.03005},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.03005},
}
```