This repo contains only the AttnGate weights for the Qwen2.5-32B-Instruct model.
SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
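To make the mechanism concrete, here is a minimal PyTorch sketch of the two ideas above: the 2D max-pooled block-level target used for self-distillation, and threshold-based binarization of the gate scores at inference. The block size, tensor shapes, and function names are illustrative assumptions, not the actual SeerAttention implementation, which pairs the resulting mask with a custom block-sparse FlashAttention kernel.

```python
import torch
import torch.nn.functional as F

BLOCK = 64  # illustrative block size; the real kernel block size may differ

def blockwise_target(attn_probs: torch.Tensor) -> torch.Tensor:
    """Self-distillation target: 2D max-pool the dense attention map
    [heads, q_len, k_len] down to block granularity [heads, q_blocks, k_blocks]."""
    return F.max_pool2d(attn_probs, kernel_size=BLOCK, stride=BLOCK)

def block_sparse_mask(gate_scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """Inference-time binarization: keep a (q_block, k_block) pair only if the
    learned soft gate score exceeds the threshold; the resulting boolean mask is
    what a block-sparse attention kernel would consume."""
    return gate_scores > threshold

# Toy example with random numbers (illustration only; real gate scores come
# from the trained AttnGates, and the threshold of 5e-4 used below applies to those).
heads, q_len, k_len = 2, 4 * BLOCK, 4 * BLOCK
attn_probs = torch.rand(heads, q_len, k_len).softmax(dim=-1)
target = blockwise_target(attn_probs)                   # [2, 4, 4] soft target for the gate
mask = block_sparse_mask(torch.rand(heads, 4, 4), 0.5)  # [2, 4, 4] binary block mask
print(target.shape, mask.float().mean().item())         # mean of the mask = attention density
```

The mean of the binary mask is the attention density reported in the tables below.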
Original GitHub Repo
https://github.com/microsoft/SeerAttention
Evaluation Results
PG19 PPL
Density | 8192 tokens | 16384 tokens | 32768 tokens |
---|---|---|---|
0.1 | 8.11 | 7.76 | 7.72 |
0.2 | 7.85 | 7.62 | 7.62 |
0.3 | 7.77 | 7.58 | 7.59 |
0.4 | 7.75 | 7.57 | 7.58 |
0.5 | 7.73 | 7.56 | 7.57 |
1.0 | 7.72 | 7.55 | 7.57 |
LongBench
Task | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
---|---|---|---|
hotpotqa | 74.73 / 75.73 | 66.92 / 67.28 | 66.05 / 65.59 |
trec | 68.00 / 68.00 | 79.00 / 78.00 | 80.00 / 80.00 |
2wikimqa | 71.01 / 71.01 | 61.59 / 61.26 | 49.36 / 49.59 |
multi_news | 23.60 / 23.37 | 21.09 / 21.12 | 20.55 / 20.55 |
lcc | 58.20 / 58.84 | 52.76 / 50.60 | 53.98 / 54.57 |
qasper | 50.23 / 50.25 | 38.80 / 38.72 | 38.48 / 39.22 |
passage_count | 31.00 / 31.00 | 18.00 / 18.00 | 16.00 / 20.00 |
passage_retrieval_en | 100.0 / 100.0 | 100.0 / 99.00 | 99.00 / 99.00 |
triviaqa | 84.68 / 84.68 | 88.79 / 89.42 | 86.37 / 85.43 |
samsum | 41.16 / 41.26 | 41.13 / 41.65 | 46.88 / 46.36 |
gov_report | 29.90 / 30.09 | 30.70 / 30.91 | 29.35 / 29.46 |
repobench-p | 42.98 / 42.90 | 32.73 / 33.25 | 36.82 / 35.37 |
multifieldqa_en | 56.26 / 56.51 | 46.73 / 45.86 | 50.99 / 50.99 |
Average score | 56.29 / 56.43 | 52.17 / 51.93 | 51.83 / 52.01 |
Average density (sparse) | 0.895 | 0.682 | 0.409 |
LongBenchV2 CoT Benchmark
All SeerAttention models are run with threshold = 5e-4.
For the R1-Distill models, we remove the two-pass generation setup (think + summary) and directly ask the models to output the answer after thinking. The maximum generation length is set to 10240 tokens.
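As a reference for the setup above, the sketch below uses the plain Hugging Face `transformers` generation API. The prompt wording is an assumption, and in the actual evaluation the base model is replaced by its SeerAttention counterpart with AttnGates loaded (threshold = 5e-4) via the GitHub repo above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model shown for illustration; the evaluated model is the SeerAttention
# variant with AttnGates loaded (threshold = 5e-4) via the GitHub repo above.
model_name = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

context = "..."  # a LongBench v2 sample (long context + question)
# Assumed prompt wording: single pass, think first, then answer.
messages = [{"role": "user", "content": context + "\n\nThink step by step, then give the final answer."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# One generation call (no separate think/summary passes), capped at 10240 tokens.
output = model.generate(inputs, max_new_tokens=10240, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```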
Model | Overall | Easy | Hard | Short | Medium | Long |
---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |