Eagle Speculative Decoding Model Trained with BaldEagle
BaldEagle Repo: https://github.com/NickL77/BaldEagle/
Experimental model trained with the training-time test (TTT) technique from EAGLE-3.
~11.7% faster (wall-clock runtime) and ~8.4% higher throughput than the EAGLE-2-style baseline
- see the benchmarks below for the baseline (acceptance length 3.98 vs. 3.57)
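These headline figures can be reproduced from the runtimes and throughputs reported in the benchmarks below; a quick arithmetic check (variable names are mine, not from the benchmark output):

    # Sanity check of the headline numbers, using the runtimes and
    # throughputs reported in the benchmark results below.
    baseline_runtime_s = 5 * 60 + 24   # baseline: 5 min 24 sec
    ttt_runtime_s = 4 * 60 + 50        # TTT model, num-steps=8: 4 min 50 sec
    baseline_tps = 156.33              # baseline throughput (token/s)
    ttt_tps = 169.49                   # TTT model throughput (token/s)

    speedup = baseline_runtime_s / ttt_runtime_s - 1
    throughput_gain = ttt_tps / baseline_tps - 1

    print(f"runtime speedup:  {speedup:.1%}")          # ~11.7%
    print(f"throughput gain:  {throughput_gain:.1%}")  # ~8.4%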
Benchmarking with SGLang
Increasing speculative-num-steps from 5 to 8, per https://github.com/SafeAILab/EAGLE/issues/209:
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-algo EAGLE \
--speculative-draft NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha \
--speculative-num-steps 8 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
#questions: 80, Throughput: 169.49 token/s, Acceptance length: 3.98
runtime: 4 min 50 sec
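Once the server is up, you can sanity-check it through SGLang's OpenAI-compatible endpoint. A minimal sketch, assuming the default host and the port from the launch command above (the prompt and max_tokens are illustrative, not the benchmark workload):

    # Send one chat request to the locally launched SGLang server.
    import requests

    resp = requests.post(
        "http://localhost:30000/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {"role": "user",
                 "content": "Explain speculative decoding in one paragraph."}
            ],
            "max_tokens": 256,
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])

Speculative decoding is transparent to the client: the draft model only affects latency and throughput, not the API surface or the sampled output distribution.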
With speculative-num-steps set to 5:
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-algo EAGLE \
--speculative-draft NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
#questions: 80, Throughput: 165.10 token/s, Acceptance length: 3.86
runtime: 5 min 10 sec
Baseline: https://huggingface.co/NickL77/BaldEagle-Llama-3.1-8B-Instruct
#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
runtime: 5 min 24 sec