This repo contains only the AttnGate weights for the Qwen2.5-32B-Instruct model.
SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
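To make the mechanism concrete, here is a minimal PyTorch sketch of the two ideas above: the 2D max-pooled block-level target used for self-distillation, and threshold-based binarization of the gate scores at inference. The block size, tensor shapes, and function names are illustrative assumptions, not the actual SeerAttention implementation, which pairs the resulting mask with a custom block-sparse FlashAttention kernel.

```python
import torch
import torch.nn.functional as F

BLOCK = 64  # illustrative block size; the real kernel block size may differ

def blockwise_target(attn_probs: torch.Tensor) -> torch.Tensor:
    """Self-distillation target: 2D max-pool the dense attention map
    [heads, q_len, k_len] down to block granularity [heads, q_blocks, k_blocks]."""
    return F.max_pool2d(attn_probs, kernel_size=BLOCK, stride=BLOCK)

def block_sparse_mask(gate_scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """Inference-time binarization: keep a (q_block, k_block) pair only if the
    learned soft gate score exceeds the threshold; the resulting boolean mask is
    what a block-sparse attention kernel would consume."""
    return gate_scores > threshold

# Toy example with random numbers (illustration only; real gate scores come
# from the trained AttnGates, and the threshold of 5e-4 used below applies to those).
heads, q_len, k_len = 2, 4 * BLOCK, 4 * BLOCK
attn_probs = torch.rand(heads, q_len, k_len).softmax(dim=-1)
target = blockwise_target(attn_probs)                   # [2, 4, 4] soft target for the gate
mask = block_sparse_mask(torch.rand(heads, 4, 4), 0.5)  # [2, 4, 4] binary block mask
print(target.shape, mask.float().mean().item())         # mean of the mask = attention density
```

The mean of the binary mask is the attention density reported in the tables below.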
Original GitHub Repo
https://github.com/microsoft/SeerAttention
Evaluation Results
PG19 PPL
Density | 8192 tokens | 16384 tokens | 32768 tokens |
---|---|---|---|
0.1 | 8.11 | 7.76 | 7.72 |
0.2 | 7.85 | 7.62 | 7.62 |
0.3 | 7.77 | 7.58 | 7.59 |
0.4 | 7.75 | 7.57 | 7.58 |
0.5 | 7.73 | 7.56 | 7.57 |
1.0 | 7.72 | 7.55 | 7.57 |
LongBench
Task | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
---|---|---|---|
hotpotqa | 74.73 / 75.73 | 66.92 / 67.28 | 66.05 / 65.59 |
trec | 68.00 / 68.00 | 79.00 / 78.00 | 80.00 / 80.00 |
2wikimqa | 71.01 / 71.01 | 61.59 / 61.26 | 49.36 / 49.59 |
multi_news | 23.60 / 23.37 | 21.09 / 21.12 | 20.55 / 20.55 |
lcc | 58.20 / 58.84 | 52.76 / 50.60 | 53.98 / 54.57 |
qasper | 50.23 / 50.25 | 38.80 / 38.72 | 38.48 / 39.22 |
passage_count | 31.00 / 31.00 | 18.00 / 18.00 | 16.00 / 20.00 |
passage_retrieval_en | 100.0 / 100.0 | 100.0 / 99.00 | 99.00 / 99.00 |
triviaqa | 84.68 / 84.68 | 88.79 / 89.42 | 86.37 / 85.43 |
samsum | 41.16 / 41.26 | 41.13 / 41.65 | 46.88 / 46.36 |
gov_report | 29.90 / 30.09 | 30.70 / 30.91 | 29.35 / 29.46 |
repobench-p | 42.98 / 42.90 | 32.73 / 33.25 | 36.82 / 35.37 |
multifieldqa_en | 56.26 / 56.51 | 46.73 / 45.86 | 50.99 / 50.99 |
Average score | 56.29 / 56.43 | 52.17 / 51.93 | 51.83 / 52.01 |
Average density (sparse) | 0.895 | 0.682 | 0.409 |
LongBenchV2 CoT Benchmark
All SeerAttention models are run with threshold = 5e-4.
For the R1-Distill models, we remove the two-pass generation setup (think + summary) and directly ask the models to output the answer after thinking. The maximum generation length is set to 10240 tokens.
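As a reference for the setup above, the sketch below uses the plain Hugging Face `transformers` generation API. The prompt wording is an assumption, and in the actual evaluation the base model is replaced by its SeerAttention counterpart with AttnGates loaded (threshold = 5e-4) via the GitHub repo above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model shown for illustration; the evaluated model is the SeerAttention
# variant with AttnGates loaded (threshold = 5e-4) via the GitHub repo above.
model_name = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

context = "..."  # a LongBench v2 sample (long context + question)
# Assumed prompt wording: single pass, think first, then answer.
messages = [{"role": "user", "content": context + "\n\nThink step by step, then give the final answer."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# One generation call (no separate think/summary passes), capped at 10240 tokens.
output = model.generate(inputs, max_new_tokens=10240, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```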
Model | Overall | Easy | Hard | Short | Medium | Long |
---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |