---
library_name: transformers
tags: []
---

# Eagle Speculative Decoding Model Trained with BaldEagle

BaldEagle repo: https://github.com/NickL77/BaldEagle/

Experimental model using training-time test (TTT) from EAGLE-3.

11.7% faster and 8.4% higher throughput than the Eagle 2 baseline; see the benchmark results below.

**Benchmarking w/ sglang**

Increasing `speculative-num-steps` from 5 to 8, based on https://github.com/SafeAILab/EAGLE/issues/209:

```
python3 -m sglang.launch_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-algo EAGLE \
    --speculative-draft NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha \
    --speculative-num-steps 8 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --dtype bfloat16 \
    --port 30000 \
    --mem-fraction-static 0.65
```

> #questions: 80, Throughput: 169.49 token/s, Acceptance length: 3.98
>
> runtime: 4 min 50 sec

With `speculative-num-steps` set to 5:

```
python3 -m sglang.launch_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-algo EAGLE \
    --speculative-draft NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --dtype bfloat16 \
    --port 30000 \
    --mem-fraction-static 0.65
```

> #questions: 80, Throughput: 165.10 token/s, Acceptance length: 3.86
>
> runtime: 5 min 10 sec

Baseline: https://huggingface.co/NickL77/BaldEagle-Llama-3.1-8B-Instruct

> #questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
>
> runtime: 5 min 24 sec
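As a quick sanity check, the headline improvement figures can be recomputed from the benchmark numbers in this card (TTT model at `speculative-num-steps 8` vs. the Eagle 2 baseline); the dictionary layout below is just for illustration:

```python
# Benchmark results copied from this card.
baseline = {"throughput": 156.33, "accept_len": 3.57, "runtime_s": 5 * 60 + 24}
ttt = {"throughput": 169.49, "accept_len": 3.98, "runtime_s": 4 * 60 + 50}

def pct_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Runtime speedup: the baseline run took longer, so compare baseline vs. TTT runtime.
speedup = pct_gain(baseline["runtime_s"], ttt["runtime_s"])
throughput_gain = pct_gain(ttt["throughput"], baseline["throughput"])
accept_gain = pct_gain(ttt["accept_len"], baseline["accept_len"])

print(f"speedup:           {speedup:.1f}%")          # 11.7%
print(f"throughput:       +{throughput_gain:.1f}%")  # +8.4%
print(f"acceptance length: +{accept_gain:.1f}%")     # +11.5%
```

Note that the 8.4% figure matches the throughput gain, while the acceptance-length gain works out to about 11.5%.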