This repository contains only the AttnGate weights for the Qwen2.5-32B-Instruct model.

SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
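The mask-generation step described above can be sketched as follows. This is a minimal illustration, not the official SeerAttention implementation: the function name `block_sparse_mask` and its arguments are hypothetical, and the real system applies this logic per head inside a custom FlashAttention kernel.

```python
import numpy as np

def block_sparse_mask(gate_scores, threshold=None, topk=None):
    """Turn soft AttnGate scores (one per attention block) into a binary
    block mask. Hypothetical sketch, not the official SeerAttention API.

    gate_scores: (num_query_blocks, num_key_blocks) array of soft scores.
    Exactly one of `threshold` or `topk` selects the sparsification rule.
    """
    if threshold is not None:
        # Keep every block whose learned score clears the threshold.
        mask = gate_scores >= threshold
    elif topk is not None:
        # Keep the top-k highest-scoring key blocks for each query block.
        idx = np.argsort(gate_scores, axis=-1)[:, -topk:]
        mask = np.zeros_like(gate_scores, dtype=bool)
        np.put_along_axis(mask, idx, True, axis=-1)
    else:
        raise ValueError("specify either threshold or topk")
    return mask

# Toy example: 2 query blocks x 3 key blocks of gate scores.
scores = np.array([[0.9, 1e-5, 2e-3],
                   [3e-4, 0.7, 6e-4]])
mask = block_sparse_mask(scores, threshold=5e-4)
density = mask.mean()  # fraction of attention blocks actually computed
```

Only the blocks where the mask is `True` are computed by the block-sparse attention kernel; the `density` values reported in the tables below are this fraction averaged over the evaluation set.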

Original GitHub Repo

https://github.com/microsoft/SeerAttention

Evaluation Results

PG19 PPL

| Density | 8192 | 16384 | 32768 |
|---------|------|-------|-------|
| 0.1     | 8.11 | 7.76  | 7.72  |
| 0.2     | 7.85 | 7.62  | 7.62  |
| 0.3     | 7.77 | 7.58  | 7.59  |
| 0.4     | 7.75 | 7.57  | 7.58  |
| 0.5     | 7.73 | 7.56  | 7.57  |
| 1.0     | 7.72 | 7.55  | 7.57  |

LongBench

| Task                 | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
|----------------------|-----------------------|-----------------------|----------------------|
| hotpotqa             | 74.73 / 75.73         | 66.92 / 67.28         | 66.05 / 65.59        |
| trec                 | 68.00 / 68.00         | 79.00 / 78.00         | 80.00 / 80.00        |
| 2wikimqa             | 71.01 / 71.01         | 61.59 / 61.26         | 49.36 / 49.59        |
| multi_news           | 23.60 / 23.37         | 21.09 / 21.12         | 20.55 / 20.55        |
| lcc                  | 58.20 / 58.84         | 52.76 / 50.60         | 53.98 / 54.57        |
| qasper               | 50.23 / 50.25         | 38.80 / 38.72         | 38.48 / 39.22        |
| passage_count        | 31.00 / 31.00         | 18.00 / 18.00         | 16.00 / 20.00        |
| passage_retrieval_en | 100.0 / 100.0         | 100.0 / 99.00         | 99.00 / 99.00        |
| triviaqa             | 84.68 / 84.68         | 88.79 / 89.42         | 86.37 / 85.43        |
| samsum               | 41.16 / 41.26         | 41.13 / 41.65         | 46.88 / 46.36        |
| gov_report           | 29.90 / 30.09         | 30.70 / 30.91         | 29.35 / 29.46        |
| repobench-p          | 42.98 / 42.90         | 32.73 / 33.25         | 36.82 / 35.37        |
| multifieldqa_en      | 56.26 / 56.51         | 46.73 / 45.86         | 50.99 / 50.99        |
| Averaged score       | 56.29 / 56.43         | 52.17 / 51.93         | 51.83 / 52.01        |
| Averaged density     | 0.895                 | 0.682                 | 0.409                |

LongBenchV2 CoT Benchmark

All SeerAttention models are evaluated with threshold = 5e-4.

For the R1-Distill models, we remove the two-pass generation setup (think + summary) and instead directly ask the models to output the answer after thinking. The maximum generation length is set to 10240.

| Model                                       | Overall | Easy | Hard | Short | Medium | Long |
|---------------------------------------------|---------|------|------|-------|--------|------|
| Llama-3.1-8B-Instruct                       | 30.4    | 31.2 | 29.9 | 37.8  | 24.7   | 29.6 |
| SeerAttention-Llama-3.1-8B                  | 31.6    | 33.3 | 30.5 | 33.9  | 31.6   | 27.8 |
| Qwen2.5-14B-Instruct                        | 34.8    | 37.5 | 33.1 | 44.4  | 32.1   | 24.1 |
| SeerAttention-Qwen2.5-14B                   | 32.8    | 38.0 | 29.6 | 45.0  | 30.2   | 17.6 |
| Qwen2.5-32B-Instruct                        | 36.4    | 42.2 | 32.8 | 47.8  | 29.8   | 30.6 |
| SeerAttention-Qwen2.5-32B                   | 36.4    | 41.1 | 33.4 | 49.4  | 29.8   | 27.8 |
| DeepSeek-R1-Distill-Qwen-14B                | 34.2    | 43.2 | 28.6 | 45.0  | 27.9   | 28.7 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-14B  | 31.6    | 35.9 | 28.9 | 41.7  | 26.0   | 25.9 |
| DeepSeek-R1-Distill-Qwen-32B                | 37.2    | 42.7 | 33.8 | 47.2  | 35.8   | 23.1 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-32B  | 37.0    | 42.2 | 33.8 | 49.4  | 31.6   | 26.9 |
Model tree for SeerAttention/SeerAttention-Qwen2.5-32B-AttnGates

Base model: Qwen/Qwen2.5-32B