Update README.md
# Eagle Speculative Decoding Model Trained with BaldEagle
BaldEagle Repo: https://github.com/NickL77/BaldEagle/

Achieves a 3.17x speedup (49.24 tok/s -> 156.33 tok/s) on the Llama 3.1 8B model.

### Benchmarking (on RTX 3090)
1. Start sglang server
```
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-algo EAGLE \
  --speculative-draft NickL77/BaldEagle-Llama-3.1-8B-Instruct \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64 \
  --dtype bfloat16 \
  --port 30000 \
  --mem-fraction-static 0.65
```

2. In another terminal, run [benchmark script](https://github.com/NickL77/BaldEagle/blob/master/benchmark/bench_sglang_eagle_double_turn.py)
```
python3 bench_sglang_eagle_double_turn.py
```

Output:
> #questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
>
> runtime: 5 min 24 sec
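
To compare runs programmatically, the summary line can be parsed into numbers. The sketch below is a hypothetical helper, not part of the benchmark script: the name `parse_bench_line` and the regular expression are assumptions based only on the output line quoted above, so the pattern may need adjusting if the script's output format differs.

```python
import re

# Assumed format, copied from the quoted output line:
#   "#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57"
LINE_RE = re.compile(
    r"#questions:\s*(?P<questions>\d+),\s*"
    r"Throughput:\s*(?P<throughput>[\d.]+)\s*token/s,\s*"
    r"Acceptance length:\s*(?P<accept_len>[\d.]+)"
)


def parse_bench_line(line: str) -> dict:
    """Extract question count, throughput, and acceptance length from one line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized benchmark line: {line!r}")
    return {
        "questions": int(match["questions"]),
        "throughput_tok_s": float(match["throughput"]),
        "acceptance_length": float(match["accept_len"]),
    }


result = parse_bench_line(
    "#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57"
)
print(result)
```

The same helper applies to the baseline output, where the acceptance length is reported as 1.00.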
#### Baseline
1. Start sglang server
```
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --port 30000 \
  --mem-fraction-static 0.65
```

2. In another terminal, run [benchmark script](https://github.com/NickL77/BaldEagle/blob/master/benchmark/bench_sglang_eagle_double_turn.py)
```
python3 bench_sglang_eagle_double_turn.py
```

Output:
> #questions: 80, Throughput: 49.24 token/s, Acceptance length: 1.00
>
> runtime: 15 min 5 sec
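
As a quick check of the headline figure, the speedup can be recomputed from the two reported runs. This is a minimal sketch using only the numbers quoted in the outputs; the variable names and the `to_seconds` helper are illustrative and do not come from the benchmark script.

```python
# Figures copied from the EAGLE and baseline benchmark outputs.
eagle_tok_s = 156.33
baseline_tok_s = 49.24


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M min S sec' runtime into seconds."""
    return minutes * 60 + seconds


eagle_runtime_s = to_seconds(5, 24)      # 324 s
baseline_runtime_s = to_seconds(15, 5)   # 905 s

# Throughput speedup: speculative-decoding tok/s divided by baseline tok/s.
throughput_speedup = eagle_tok_s / baseline_tok_s
# Wall-clock ratio: baseline runtime divided by speculative-decoding runtime.
runtime_ratio = baseline_runtime_s / eagle_runtime_s

print(f"throughput speedup: {throughput_speedup:.2f}x")  # 3.17x
print(f"wall-clock ratio:   {runtime_ratio:.2f}x")       # 2.79x
```

The 3.17x figure is the throughput ratio. The end-to-end runtime ratio for the same two runs comes out lower, at about 2.79x.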