  # Eagle Speculative Decoding Model Trained with BaldEagle
  BaldEagle Repo: https://github.com/NickL77/BaldEagle/
Achieves a 3.17x speedup (49.24 tok/s -> 156.33 tok/s) on the Llama 3.1 8B model.

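The headline multiplier is simply the ratio of the two throughputs reported in the benchmark runs below; a quick arithmetic check:

```python
# Throughput figures taken from the benchmark output in this README.
baseline_tok_s = 156.33 / 3.17  # sanity anchor, see actual numbers below
baseline_tok_s = 49.24   # plain target model
eagle_tok_s = 156.33     # with the BaldEagle EAGLE draft model

speedup = eagle_tok_s / baseline_tok_s
print(f"{speedup:.2f}x")  # -> 3.17x
```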
### Benchmarking (on RTX 3090)

1. Start the sglang server:

```
python3 -m sglang.launch_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-algo EAGLE \
    --speculative-draft NickL77/BaldEagle-Llama-3.1-8B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --dtype bfloat16 \
    --port 30000 \
    --mem-fraction-static 0.65
```
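Once the server is up, it exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the chosen port, which is handy for a quick smoke test before benchmarking. A minimal sketch (this helper is illustrative, not part of the BaldEagle repo):

```python
import json

# OpenAI-compatible endpoint served by sglang on the port chosen above.
URL = "http://localhost:30000/v1/chat/completions"

def build_request(prompt: str) -> bytes:
    """Build a minimal JSON body for a chat completion request."""
    return json.dumps({
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }).encode()

print(build_request("Say hello.").decode())
# POST this body to URL with header "Content-Type: application/json",
# e.g. via curl or any HTTP client, and check you get a completion back.
```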

2. In another terminal, run the [benchmark script](https://github.com/NickL77/BaldEagle/blob/master/benchmark/bench_sglang_eagle_double_turn.py):

```
python3 bench_sglang_eagle_double_turn.py
```
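The script prints a one-line summary. If you want the numbers programmatically, e.g. to compare runs, a small parser sketch (the regex is an assumption based on the output line shown below):

```python
import re

# Summary line as printed by the benchmark script.
line = "#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57"

m = re.search(
    r"#questions: (\d+), Throughput: ([\d.]+) token/s, Acceptance length: ([\d.]+)",
    line,
)
questions = int(m.group(1))
throughput = float(m.group(2))
accept_len = float(m.group(3))
print(questions, throughput, accept_len)  # -> 80 156.33 3.57
```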

Output:

> #questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
>
> runtime: 5 min 24 sec

#### Baseline

1. Start the sglang server:

```
python3 -m sglang.launch_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype bfloat16 \
    --port 30000 \
    --mem-fraction-static 0.65
```

2. In another terminal, run the [benchmark script](https://github.com/NickL77/BaldEagle/blob/master/benchmark/bench_sglang_eagle_double_turn.py):

```
python3 bench_sglang_eagle_double_turn.py
```

Output:

> #questions: 80, Throughput: 49.24 token/s, Acceptance length: 1.00
>
> runtime: 15 min 5 sec
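If acceptance length is read as the average number of tokens committed per target-model forward pass (an interpretation, not something the script states), the two runs suggest most of the ideal 3.57x is realized, with the remainder lost to draft-model overhead:

```python
# Compare the ideal speedup implied by acceptance length with the measured one.
eagle_accept, baseline_accept = 3.57, 1.00
eagle_tok_s, baseline_tok_s = 156.33, 49.24

ideal = eagle_accept / baseline_accept   # upper bound if drafting were free
measured = eagle_tok_s / baseline_tok_s
print(f"ideal {ideal:.2f}x, measured {measured:.2f}x, "
      f"efficiency {measured / ideal:.0%}")
```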