
# Eagle Speculative Decoding Model Trained with BaldEagle

BaldEagle Repo: https://github.com/NickL77/BaldEagle/

Learn how the model was trained: https://frugalgpu.substack.com/p/how-to-train-your-own-eagle-speculative

Achieves a 2.06x speedup (50.43 tok/s -> 106.22 tok/s) on Qwen2.5-7B-Instruct, exceeding SGLang's reported 1.54x improvement with EAGLE-2.
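As a sanity check on the headline number: 2.06x is the wall-clock runtime ratio between the two benchmark runs below, while the raw throughput ratio is slightly higher. All figures are taken from the benchmark outputs in this card:

```python
# Sanity-check the speedup figures reported in this card.

baseline_tps, eagle_tps = 50.43, 106.22  # tokens/sec from the two runs
baseline_s = 12 * 60 + 57                # 12 min 57 sec baseline runtime
eagle_s = 6 * 60 + 17                    # 6 min 17 sec EAGLE runtime

print(f"throughput ratio: {eagle_tps / baseline_tps:.2f}x")  # ~2.11x
print(f"wall-clock ratio: {baseline_s / eagle_s:.2f}x")      # ~2.06x
```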

## Benchmarking (on RTX 3090)

1. Start the sglang server:

   ```bash
   python3 -m sglang.launch_server \
     --model Qwen/Qwen2.5-7B-Instruct \
     --speculative-algo EAGLE \
     --speculative-draft NickL77/BaldEagle-Qwen-2.5-7B-Instruct \
     --speculative-num-steps 5 \
     --speculative-eagle-topk 8 \
     --speculative-num-draft-tokens 64 \
     --dtype bfloat16 \
     --port 30000 \
     --mem-fraction-static 0.65
   ```

2. In another terminal, run the benchmark script:

   ```bash
   python3 bench_sglang_eagle_double_turn.py --questions 50 --parallel 1
   ```

Output:

```
#questions: 50, Throughput: 106.22 token/s, Acceptance length: 3.55
```

Runtime: 6 min 17 sec
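Once the server is up, it can also be queried directly over SGLang's OpenAI-compatible HTTP API. A minimal sketch, assuming the server from step 1 is listening on port 30000 (the prompt here is illustrative):

```python
import json
import urllib.request

# Minimal chat request against the OpenAI-compatible endpoint exposed by sglang.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to clients: responses are token-identical in distribution to the baseline server, just produced faster.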

## Baseline

1. Start the sglang server:

   ```bash
   python3 -m sglang.launch_server \
     --model Qwen/Qwen2.5-7B-Instruct \
     --dtype bfloat16 \
     --port 30000 \
     --mem-fraction-static 0.65
   ```

2. In another terminal, run the benchmark script:

   ```bash
   python3 bench_sglang_eagle_double_turn.py --questions 50 --parallel 1
   ```

Output:

```
#questions: 50, Throughput: 50.43 token/s, Acceptance length: 1.00
```

Runtime: 12 min 57 sec

Note: We're benchmarking on 50 questions out of 80 due to an SGLang issue when running speculative decoding for long periods: https://github.com/sgl-project/sglang/issues/6309
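The acceptance length roughly upper-bounds the achievable speedup: each target-model forward pass yields ~3.55 accepted tokens instead of 1, so the ideal gain is ~3.55x, and the gap to the measured throughput gain reflects draft-model and verification overhead. A back-of-the-envelope check, assuming the target model's per-step cost dominates:

```python
# Relate the measured throughput gain to the acceptance length reported by sglang.
acceptance_length = 3.55          # avg tokens accepted per verification step
throughput_gain = 106.22 / 50.43  # measured speedup from the two runs

# Acceptance length upper-bounds the speedup; the gap is draft/verify overhead.
efficiency = throughput_gain / acceptance_length
print(f"speedup: {throughput_gain:.2f}x of {acceptance_length}x ideal "
      f"({efficiency:.0%} realized)")
```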

Model format: Safetensors, 754M params (F32 / BF16 tensors)