Update README.md

---
tags:
- hunyuan
- eagle3
- eagle
---

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo_light.png?raw=true">
<img alt="AngelSlim" src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo.png?raw=true" width=55%>
</picture>
</p>

<h3 align="center">
Dedicated to building a more intuitive, comprehensive, and efficient LLMs compression toolkit.
</h3>

<p align="center">
📖 <a href="https://angelslim.readthedocs.io/">Documentation</a> &nbsp;|&nbsp; 🤗 <a href="https://huggingface.co/AngelSlim">Hugging Face</a> &nbsp;|&nbsp; 🤖 <a href="https://modelscope.cn/organization/AngelSlim">ModelScope</a> &nbsp;|&nbsp; 💬 <a href="./docs/source/assets/angel_slim_wechat.png">WeChat</a>
<br>
</p>

## Table of Contents

- [Latest Updates](#latest-updates)
- [Technical Discussion](#technical-discussion)

## 📣 Latest Updates

- [25/07/04] We now support quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen and other models, including INT8/FP8/INT4 algorithms.
  We have also open-sourced the Eagle3 model weights for Qwen3-8B.

Coming soon:

- [ ] Support W4A8 quantization for DeepSeek-R1.
- [ ] Support quantization for multimodal models such as Qwen-VL.
- [ ] Release of a new algorithm for speculative sampling.

## 🌟 Key Features

Currently supports the following LLMs, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-Distill-Qwen, and QwQ:

| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| ----- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/tencent/Hunyuan-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Speculative Decoding

The Eagle3 weights for the Qwen3 series models are now available; the released checkpoints are listed in the table below.
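
How you consume these draft weights depends on the serving stack. As a hedged sketch only: recent vLLM releases accept an Eagle3 draft model through `--speculative-config`, roughly as below. The draft-model id and token count here are illustrative, and the flag's JSON schema has changed across vLLM versions, so check the docs for your version:

```shell
# Hedged sketch of Eagle3 speculative decoding with vLLM; the draft-model
# id and num_speculative_tokens are illustrative, not verified values.
vllm serve Qwen/Qwen3-8B \
    --speculative-config '{"method": "eagle3", "model": "AngelSlim/Qwen3-8B_eagle3", "num_speculative_tokens": 4}'
```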

| Qwen3 Models | Hunyuan Models |

For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
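
For orientation only: a quantization job is driven by a YAML config that names the model, the algorithm, and the deployment backend (e.g. the `global` section's `deploy_backend` field). The entry script and config path below are placeholders, not verified repository paths; take the exact command from the Quick Start guide:

```shell
# Placeholder sketch: substitute the entry script and YAML config
# documented in the Quick Start guide for your model and algorithm.
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-8b_fp8_static.yaml
```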

### 🖥️ Deployment and Testing

#### 1. API Service Deployment

After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:

Use the following script to launch a [SGLang](https://github.com/sgl-project/sglang) service:

```shell
bash deploy/run_sglang.sh $MODEL_PATH
```
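
A quantized checkpoint can also be served directly with vLLM's stock OpenAI-compatible entrypoint, independent of the repository's wrapper scripts. This is a generic sketch, not an AngelSlim command; tune the port and tensor-parallel size for your hardware:

```shell
# Generic vLLM OpenAI-compatible server launch; not an AngelSlim wrapper.
python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --port 8000 \
    --tensor-parallel-size 1
```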

#### 2. Service Invocation

Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):

```shell
bash deploy/openai.sh $MODEL_PATH
```
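
Any OpenAI-compatible client can hit the same endpoint. For example, with curl against a server on vLLM's default port 8000 (adjust host, port, and model name to your deployment):

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "'"$MODEL_PATH"'",
          "messages": [{"role": "user", "content": "Hello, AngelSlim!"}],
          "max_tokens": 64
        }'
```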

#### 3. Performance Evaluation

Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (recommended version: `lm-eval>=0.4.8`).
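
A representative invocation is sketched below; the backend and task list are illustrative, not the project's official benchmark settings:

```shell
# Illustrative lm-eval run over the quantized checkpoint; the tasks
# and backend are examples only, not AngelSlim's official settings.
lm_eval --model vllm \
    --model_args pretrained=$MODEL_PATH,tensor_parallel_size=1 \
    --tasks ceval-valid,mmlu,gsm8k \
    --batch_size auto
```
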
#### Hunyuan Series Models

Benchmark results for the `Hunyuan-A13B-Instruct` model with `FP8` and `INT4-GPTQ` quantization algorithms on datasets including `AIME 2024`, `GSM8K`, `BBH`, and `DROP`:

| Bench | Hunyuan-A13B-Instruct | Hunyuan-A13B-Instruct-FP8 | Hunyuan-A13B-Instruct-Int4-GPTQ |
|:---------:|:---------------------:|:-------------------------:|:-------------------------------:|
| AIME 2024 | 87.3 | 86.7 | 86.7 |
| GSM8K | 94.39 | 94.01 | 94.24 |
| BBH | 89.1 | 88.34 | 87.91 |
| DROP | 91.1 | 91.1 | 91.05 |

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms:

#### Other Models
Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU` and `GSM8K`:

Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm:

<table>
<tbody>
<tr><td rowspan="6"><strong>T=0</strong></td>
<td>Qwen3-1.7B</td><td>2.05x</td><td>2.81</td><td>2.07x</td><td>2.93</td><td>2.11x</td><td>2.98</td><td>1.93x</td><td>2.69</td><td>2.04x</td><td>2.85</td></tr>
<tr><td>Qwen3-4B</td><td>2.21x</td><td>3.01</td><td>2.36x</td><td>3.24</td><td>2.42x</td><td>3.13</td><td>2.32x</td><td>2.75</td><td>2.33x</td><td>3.03</td></tr>
<tr><td>Qwen3-8B</td><td>2.65x</td><td>3.87</td><td>2.64x</td><td>3.82</td><td>2.86x</td><td>4.10</td><td>2.58x</td><td>3.55</td><td>2.68x</td><td>3.83</td></tr>
<tr><td>Qwen3-14B</td><td>2.42x</td><td>3.38</td><td>2.57x</td><td>3.58</td><td>2.75x</td><td>3.77</td><td>2.27x</td><td>3.11</td><td>2.50x</td><td>3.46</td></tr>
<tr><td>Qwen3-32B</td><td>2.39x</td><td>2.78</td><td>2.37x</td><td>2.81</td><td>2.47x</td><td>2.92</td><td>2.42x</td><td>2.53</td><td>2.41x</td><td>2.76</td></tr>
<tr><td>Qwen3-30B-A3B</td><td>2.84x</td><td>3.63</td><td>2.27x</td><td>3.09</td><td>2.64x</td><td>3.42</td><td>2.83x</td><td>3.56</td><td>2.64x</td><td>3.42</td></tr>
<!-- <tr><td colspan="12" style="text-align: center; vertical-align: middle;"><strong>Temperature=1</strong></td></tr> -->
<tr><td rowspan="6"><strong>T=1</strong></td>
<td>Qwen3-1.7B</td><td>1.74x</td><td>2.53</td><td>1.86x</td><td>2.70</td><td>1.82x</td><td>2.69</td><td>1.72x</td><td>2.46</td><td>1.93x</td><td>2.60</td></tr>
<tr><td>Qwen3-4B</td><td>1.93x</td><td>2.60</td><td>2.00x</td><td>2.84</td><td>2.11x</td><td>2.82</td><td>2.34x</td><td>2.50</td><td>1.75x</td><td>2.69</td></tr>
<tr><td>Qwen3-8B</td><td>1.91x</td><td>2.84</td><td>2.07x</td><td>3.05</td><td>2.34x</td><td>3.26</td><td>2.09x</td><td>2.92</td><td>2.10x</td><td>3.02</td></tr>
<tr><td>Qwen3-14B</td><td>1.81x</td><td>2.58</td><td>1.96x</td><td>2.81</td><td>2.16x</td><td>3.09</td><td>1.76x</td><td>2.49</td><td>1.92x</td><td>2.74</td></tr>
<tr><td>Qwen3-32B</td><td>1.62x</td><td>1.91</td><td>1.71x</td><td>2.05</td><td>1.78x</td><td>2.10</td><td>1.80x</td><td>1.95</td><td>1.62x</td><td>2.00</td></tr>
<tr><td>Qwen3-30B-A3B</td><td>1.91x</td><td>2.46</td><td>2.00x</td><td>2.64</td><td>1.90x</td><td>2.53</td><td>1.80x</td><td>2.32</td><td>1.90x</td><td>2.48</td></tr>
</tbody>
</table>

## 💬 Technical Discussion

* AngelSlim is continuously iterating, and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub or join our [WeChat technical discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).