This repo contains the AttnGate weights for the QwQ-32B model, as introduced by SeerAttention.

SeerAttention introduces learnable AttnGate modules that accelerate the computationally intensive prefill stage of long-context large language models (LLMs) through dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework: the original model stays frozen, and the gates learn to mimic its 2D max-pooled attention patterns. This preserves the original model's weights and avoids costly retraining. During inference, the gates produce block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, and a custom block-sparse FlashAttention kernel uses these masks to skip the unselected blocks.
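As a rough illustration of the two ideas above, the following NumPy sketch shows how a dense attention map could be 2D max-pooled into block-level targets, and how block-level scores could be turned into a binary mask by thresholding or TopK. This is not the SeerAttention implementation: the function names (`maxpool_2d_blocks`, `mask_from_threshold`, `mask_from_topk`), the toy 4x4 attention map, and the block size of 2 are all assumptions made for illustration. It handles a single head, ignores causal masking, and uses the pooled map itself as a stand-in for the soft scores a trained gate would produce.

```python
import numpy as np

def maxpool_2d_blocks(attn, block):
    """Max-pool a (seq, seq) attention map into (seq // block, seq // block) blocks."""
    n = attn.shape[0] // block
    # Index [i, a, j, b] corresponds to attn[i * block + a, j * block + b].
    return attn[: n * block, : n * block].reshape(n, block, n, block).max(axis=(1, 3))

def mask_from_threshold(scores, threshold):
    """Keep every block whose score reaches the threshold."""
    return scores >= threshold

def mask_from_topk(scores, k):
    """Keep the k highest-scoring key blocks in each query-block row."""
    idx = np.argpartition(-scores, k - 1, axis=-1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

# Toy single-head attention map (hypothetical values), pooled with block size 2.
attn = np.array([
    [0.90, 0.10, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
pooled = maxpool_2d_blocks(attn, block=2)       # block-level distillation target
thr_mask = mask_from_threshold(pooled, 0.5)     # threshold-based block mask
topk_mask = mask_from_topk(pooled, k=1)         # TopK-based block mask
```

In this toy case both selection rules keep only the diagonal blocks. A block-sparse attention kernel would then compute attention only for blocks where the mask is true.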

Original GitHub repo: https://github.com/microsoft/SeerAttention
