Eagle Speculative Decoding Model Trained with BaldEagle
BaldEagle Repo: https://github.com/NickL77/BaldEagle/
Experimental model trained with the training-time test (TTT) technique from EAGLE-3.
~11.7% faster (wall-clock runtime) and ~8.4% higher throughput than the EAGLE-2-style baseline
- see the benchmarks below for the baseline (acceptance length 3.98 vs. 3.57)
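These headline figures can be reproduced from the runtimes and throughputs reported in the benchmarks below; a quick arithmetic check (variable names are mine, not from the benchmark output):

    # Sanity check of the headline numbers, using the runtimes and
    # throughputs reported in the benchmark results below.
    baseline_runtime_s = 5 * 60 + 24   # baseline: 5 min 24 sec
    ttt_runtime_s = 4 * 60 + 50        # TTT model, num-steps=8: 4 min 50 sec
    baseline_tps = 156.33              # baseline throughput (token/s)
    ttt_tps = 169.49                   # TTT model throughput (token/s)

    speedup = baseline_runtime_s / ttt_runtime_s - 1
    throughput_gain = ttt_tps / baseline_tps - 1

    print(f"runtime speedup:  {speedup:.1%}")          # ~11.7%
    print(f"throughput gain:  {throughput_gain:.1%}")  # ~8.4%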
Benchmarking with SGLang
Increasing speculative-num-steps from 5 to 8, per https://github.com/SafeAILab/EAGLE/issues/209:
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-algo EAGLE \
--speculative-draft NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha \
--speculative-num-steps 8 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
#questions: 80, Throughput: 169.49 token/s, Acceptance length: 3.98
runtime: 4 min 50 sec
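Once the server is up, you can sanity-check it through SGLang's OpenAI-compatible endpoint. A minimal sketch, assuming the default host and the port from the launch command above (the prompt and max_tokens are illustrative, not the benchmark workload):

    # Send one chat request to the locally launched SGLang server.
    import requests

    resp = requests.post(
        "http://localhost:30000/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {"role": "user",
                 "content": "Explain speculative decoding in one paragraph."}
            ],
            "max_tokens": 256,
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])

Speculative decoding is transparent to the client: the draft model only affects latency and throughput, not the API surface or the sampled output distribution.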
With speculative-num-steps set to 5:
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-algo EAGLE \
--speculative-draft NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
#questions: 80, Throughput: 165.10 token/s, Acceptance length: 3.86
runtime: 5 min 10 sec
Baseline: https://huggingface.co/NickL77/BaldEagle-Llama-3.1-8B-Instruct
#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
runtime: 5 min 24 sec