NickL77/BaldEagle-Llama-3.1-8B-Instruct


NickL77 committed on May 6

Commit eb65091 (verified) · 1 Parent(s): 9103d25

Update README.md

Files changed (1)

README.md +46 -0


@@ -5,3 +5,49 @@ tags: []
 # Eagle Speculative Decoding Model Trained with BaldEagle
 BaldEagle Repo: https://github.com/NickL77/BaldEagle/

+Achieves a 3.17x speedup (49.24 tok/s -> 156.33 tok/s) on the Llama 3.1 8B model.
+### Benchmarking (on RTX 3090)
+1. Start the sglang server with the EAGLE draft model:
+```
+python3 -m sglang.launch_server \
+  --model meta-llama/Meta-Llama-3-8B-Instruct \
+  --speculative-algo EAGLE \
+  --speculative-draft NickL77/BaldEagle-Llama-3.1-8B-Instruct \
+  --speculative-num-steps 5 \
+  --speculative-eagle-topk 8 \
+  --speculative-num-draft-tokens 64 \
+  --dtype bfloat16 \
+  --port 30000 \
+  --mem-fraction-static 0.65
+```
+2. In another terminal, run the [benchmark script](https://github.com/NickL77/BaldEagle/blob/master/benchmark/bench_sglang_eagle_double_turn.py):
+```
+python3 bench_sglang_eagle_double_turn.py
+```
+Output:
+> #questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
+>
+> runtime: 5 min 24 sec
+#### Baseline
+1. Start the sglang server without speculative decoding:
+```
+python3 -m sglang.launch_server \
+  --model meta-llama/Meta-Llama-3-8B-Instruct \
+  --dtype bfloat16 \
+  --port 30000 \
+  --mem-fraction-static 0.65
+```
+2. In another terminal, run the same [benchmark script](https://github.com/NickL77/BaldEagle/blob/master/benchmark/bench_sglang_eagle_double_turn.py):
+```
+python3 bench_sglang_eagle_double_turn.py
+```
+Output:
+> #questions: 80, Throughput: 49.24 token/s, Acceptance length: 1.00
+>
+> runtime: 15 min 5 sec
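
The headline speedup can be re-derived from the two output lines above. The snippet below is a minimal sketch, not part of the BaldEagle repo: `parse_result` and `to_seconds` are hypothetical helpers, and the regular expression assumes the benchmark script prints its summary in exactly the format quoted above. The last ratio additionally assumes that "Acceptance length" is the mean number of tokens emitted per target-model verification step, which would make it a rough ceiling on the speedup if drafting were free.

```python
import re

# Summary lines copied verbatim from the two benchmark runs above.
EAGLE_LINE = "#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57"
BASELINE_LINE = "#questions: 80, Throughput: 49.24 token/s, Acceptance length: 1.00"

# Assumed output format; adjust if the benchmark script prints something else.
_PATTERN = re.compile(
    r"#questions:\s*(?P<questions>\d+),\s*"
    r"Throughput:\s*(?P<throughput>[\d.]+)\s*token/s,\s*"
    r"Acceptance length:\s*(?P<accept>[\d.]+)"
)


def parse_result(line):
    """Hypothetical helper: turn one summary line into a dict of numbers."""
    match = _PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized benchmark line: {line!r}")
    return {
        "questions": int(match.group("questions")),
        "throughput": float(match.group("throughput")),
        "accept_length": float(match.group("accept")),
    }


def to_seconds(minutes, seconds):
    """Hypothetical helper: convert 'M min S sec' into seconds."""
    return minutes * 60 + seconds


eagle = parse_result(EAGLE_LINE)
baseline = parse_result(BASELINE_LINE)

# Ratio of reported decoding throughputs (the headline number).
throughput_speedup = eagle["throughput"] / baseline["throughput"]

# Ratio of reported wall-clock runtimes (15 min 5 sec vs 5 min 24 sec).
runtime_speedup = to_seconds(15, 5) / to_seconds(5, 24)

# Share of the assumed acceptance-length ceiling that shows up as throughput.
efficiency = throughput_speedup / eagle["accept_length"]

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"wall-clock speedup: {runtime_speedup:.2f}x")
print(f"speedup / acceptance length: {efficiency:.2f}")
```

The wall-clock ratio comes out lower than the throughput ratio; the reported runtime may include work outside token generation, so the two figures need not match.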