---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

# RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

<div align="center">
<img src="https://github.com/recursal/RADLADS/raw/main/assets/radlads_process.png" height=63 alt="RADLADS Conversion Process" />
</div>

**RADLADS** (Rapid Attention Distillation to Linear Attention Decoders at Scale) introduces a novel protocol for rapidly converting softmax attention transformers into linear attention decoder models. This highly efficient process requires only 350-700 million tokens of distillation, less than 0.005% of the token count used to train the original teacher models.
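
To put the 0.005% figure in perspective, here is a quick back-of-the-envelope check. It assumes a teacher pretraining corpus on the order of 18 trillion tokens (the figure reported for Qwen2.5); that number is an assumption and is not stated in this card:

```python
# Rough sanity check of the distillation budget; the teacher token count is assumed.
teacher_tokens = 18e12   # assumed Qwen2.5 pretraining corpus size
distill_tokens = 700e6   # upper end of the RADLADS budget (350-700M tokens)

fraction = distill_tokens / teacher_tokens
print(f"{fraction:.6%}")  # ~0.0039%, comfortably below 0.005%
```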

This repository provides **RADRWKV7Qwen2.5-7B**, a Qwen2.5 7B model converted to the RAD-RWKV7 linear attention architecture. The converted model maintains quality remarkably close to the original transformer, achieves state-of-the-art downstream performance among linear attention models of its size, and decodes significantly faster because its per-token inference cost is constant.

**Paper:** [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005)

**GitHub Repository:** [recursal/RADLADS](https://github.com/recursal/RADLADS)

## ✨ Key Highlights

* **Efficient & Cost-Effective Conversion**: Converts large softmax attention transformers to linear attention models with minimal additional training tokens. Converting a 72B model costs less than $2,000 USD.
* **Quality Preservation**: Models converted using RADLADS maintain quality remarkably close to the original teacher transformer models.
* **State-of-the-Art Performance**: Converted models achieve state-of-the-art downstream performance on standard benchmarks among linear attention models of their size.
* **Faster Inference**: Linear attention gives constant-time inference per token, significantly boosting decoding speed (see the sketch after this list).
* **New Architectures**: Introduces new RWKV-variant architectures, RAD-RWKV6 and RAD-RWKV7, as efficient linear attention conversion targets.
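
As an intuition for the constant-time claim, below is a minimal, illustrative sketch of a generic linear-attention decode step. It is **not** the RAD-RWKV7 kernel (which uses data-dependent decay and additional components); it only shows why per-token cost stays fixed instead of growing with the context, as it does with a softmax KV cache.

```python
import torch

d = 8  # toy head dimension

# Softmax attention caches every past key/value, so per-token cost grows with t.
# A linear-attention decoder instead keeps a fixed d x d state matrix S.
S = torch.zeros(d, d)

def linear_attention_step(S, q, k, v, decay=0.99):
    """One decode step: fold the new key/value into the state, then read it out.

    The work is O(d^2) per token, independent of how many tokens came before.
    (Real RWKV-7 uses per-channel, data-dependent decay; this is only the idea.)
    """
    S = decay * S + torch.outer(k, v)  # update the fixed-size recurrent state
    out = q @ S                        # read out with the current query
    return S, out

for t in range(1000):                  # cost per step is constant in t
    q, k, v = torch.randn(3, d)
    S, out = linear_attention_step(S, q, k, v)

print(out.shape)  # torch.Size([8])
```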

## How to use

This model is compatible with the Hugging Face `transformers` library. Because it uses custom architecture components, pass `trust_remote_code=True` when loading it.

### Text Generation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "recursal/RADRWKV7Qwen2.5-7B"  # This model

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # or torch.float16 if needed
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)

# Prepare input
text = "The quick brown fox jumps over the lazy"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)

# Generate text
generated_ids = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=False,  # set to True for sampling; adjust temperature/top_p
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # or a dedicated pad token if available
)

# Decode output
output_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Input: {text}\nGenerated: {output_text}")
```
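
If the tokenizer ships with a chat template (as Qwen2.5-Instruct checkpoints do; whether that applies to this checkpoint is an assumption, so check `tokenizer.chat_template` first), chat-style generation with sampling might look like the following sketch, continuing from the snippet above:

```python
# Sketch only: assumes the tokenizer provides a Qwen2.5-style chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain linear attention in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,   # sampling instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated portion
print(tokenizer.decode(generated_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```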

## Citation

If you use this model or find our work valuable, please consider citing the RADLADS paper:

```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
      title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
      author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
      year={2025},
      eprint={2505.03005},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.03005},
}
```