
# Eagle Speculative Decoding Model Trained with BaldEagle

BaldEagle Repo: https://github.com/NickL77/BaldEagle/

Learn how the model was trained: https://frugalgpu.substack.com/p/how-to-train-your-own-eagle-speculative

Achieves a 2.06x speedup (50.43 tok/s -> 106.22 tok/s) on Qwen2.5-7B-Instruct, exceeding SGLang's reported 1.54x improvement with EAGLE-2.
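As a sanity check on the headline number: 2.06x is the wall-clock runtime ratio between the two benchmark runs below, while the raw throughput ratio is slightly higher. All figures are taken from the benchmark outputs in this card:

```python
# Sanity-check the speedup figures reported in this card.

baseline_tps, eagle_tps = 50.43, 106.22  # tokens/sec from the two runs
baseline_s = 12 * 60 + 57                # 12 min 57 sec baseline runtime
eagle_s = 6 * 60 + 17                    # 6 min 17 sec EAGLE runtime

print(f"throughput ratio: {eagle_tps / baseline_tps:.2f}x")  # ~2.11x
print(f"wall-clock ratio: {baseline_s / eagle_s:.2f}x")      # ~2.06x
```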

## Benchmarking (on RTX 3090)

1. Start the sglang server:

   ```bash
   python3 -m sglang.launch_server \
     --model Qwen/Qwen2.5-7B-Instruct \
     --speculative-algo EAGLE \
     --speculative-draft NickL77/BaldEagle-Qwen-2.5-7B-Instruct \
     --speculative-num-steps 5 \
     --speculative-eagle-topk 8 \
     --speculative-num-draft-tokens 64 \
     --dtype bfloat16 \
     --port 30000 \
     --mem-fraction-static 0.65
   ```

2. In another terminal, run the benchmark script:

   ```bash
   python3 bench_sglang_eagle_double_turn.py --questions 50 --parallel 1
   ```

Output:

```
#questions: 50, Throughput: 106.22 token/s, Acceptance length: 3.55
```

Runtime: 6 min 17 sec
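Once the server is up, it can also be queried directly over SGLang's OpenAI-compatible HTTP API. A minimal sketch, assuming the server from step 1 is listening on port 30000 (the prompt here is illustrative):

```python
import json
import urllib.request

# Minimal chat request against the OpenAI-compatible endpoint exposed by sglang.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to clients: responses are token-identical in distribution to the baseline server, just produced faster.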

## Baseline

1. Start the sglang server:

   ```bash
   python3 -m sglang.launch_server \
     --model Qwen/Qwen2.5-7B-Instruct \
     --dtype bfloat16 \
     --port 30000 \
     --mem-fraction-static 0.65
   ```

2. In another terminal, run the benchmark script:

   ```bash
   python3 bench_sglang_eagle_double_turn.py --questions 50 --parallel 1
   ```

Output:

```
#questions: 50, Throughput: 50.43 token/s, Acceptance length: 1.00
```

Runtime: 12 min 57 sec

Note: We're benchmarking on 50 questions out of 80 due to an SGLang issue when running speculative decoding for long periods: https://github.com/sgl-project/sglang/issues/6309
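The acceptance length roughly upper-bounds the achievable speedup: each target-model forward pass yields ~3.55 accepted tokens instead of 1, so the ideal gain is ~3.55x, and the gap to the measured throughput gain reflects draft-model and verification overhead. A back-of-the-envelope check, assuming the target model's per-step cost dominates:

```python
# Relate the measured throughput gain to the acceptance length reported by sglang.
acceptance_length = 3.55          # avg tokens accepted per verification step
throughput_gain = 106.22 / 50.43  # measured speedup from the two runs

# Acceptance length upper-bounds the speedup; the gap is draft/verify overhead.
efficiency = throughput_gain / acceptance_length
print(f"speedup: {throughput_gain:.2f}x of {acceptance_length}x ideal "
      f"({efficiency:.0%} realized)")
```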

Model format: Safetensors, 754M params (F32 / BF16 tensors)