# Eagle Speculative Decoding Model Trained with BaldEagle

- BaldEagle repo: https://github.com/NickL77/BaldEagle/
- Learn how the model was trained: https://frugalgpu.substack.com/p/how-to-train-your-own-eagle-speculative
Achieves a 2.06x speedup (50.43 tok/s -> 106.22 tok/s) on Qwen2.5-7B-Instruct, exceeding the 1.54x improvement SGLang reports for EAGLE-2.
## Benchmarking (on RTX 3090)
- Start the SGLang server:

```bash
python3 -m sglang.launch_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --speculative-algo EAGLE \
    --speculative-draft NickL77/BaldEagle-Qwen-2.5-7B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --dtype bfloat16 \
    --port 30000 \
    --mem-fraction-static 0.65
```
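Once the server is up, it can be queried over SGLang's OpenAI-compatible HTTP API. A minimal client sketch, assuming the server runs locally on port 30000 as launched above and that the endpoint path and payload shape follow the OpenAI chat-completions convention:

```python
import json
import urllib.request

# Assumed local endpoint for the server launched above (port 30000).
BASE_URL = "http://localhost:30000/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion request body for the target model."""
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """Send the request and return the generated text (requires a running server)."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Speculative decoding is transparent to the client: requests and responses look identical to the baseline server's; only the server-side generation loop changes.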
- In another terminal, run the benchmark script:

```bash
python3 bench_sglang_eagle_double_turn.py --questions 50 --parallel 1
```

Output:

```
#questions: 50, Throughput: 106.22 token/s, Acceptance length: 3.55
runtime: 6 min 17 sec
```
## Baseline
- Start the SGLang server:

```bash
python3 -m sglang.launch_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dtype bfloat16 \
    --port 30000 \
    --mem-fraction-static 0.65
```
- In another terminal, run the benchmark script:

```bash
python3 bench_sglang_eagle_double_turn.py --questions 50 --parallel 1
```

Output:

```
#questions: 50, Throughput: 50.43 token/s, Acceptance length: 1.00
runtime: 12 min 57 sec
```
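The 2.06x figure follows from the wall-clock runtimes above (12 min 57 sec vs. 6 min 17 sec); the ratio of reported throughputs is slightly higher. A quick sanity check on the arithmetic, where interpreting acceptance length as the average number of tokens committed per target-model verification step is an assumption about the metric:

```python
# End-to-end speedup from wall-clock runtimes.
baseline_s = 12 * 60 + 57   # 777 s
eagle_s = 6 * 60 + 17       # 377 s
runtime_speedup = baseline_s / eagle_s    # ~2.06x

# Speedup from the reported decode throughputs.
throughput_speedup = 106.22 / 50.43       # ~2.11x

# An acceptance length of 3.55 would be the rough ceiling if drafting
# were free; the measured ~2.1x reflects the draft model's per-step overhead.
print(round(runtime_speedup, 2), round(throughput_speedup, 2))
```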
Note: we benchmark on 50 of the 80 questions due to an SGLang issue that occurs when running speculative decoding for long periods: https://github.com/sgl-project/sglang/issues/6309