---
tags:
- int8
- vllm
- llm-compressor
language:
- en
pipeline_tag: text-generation
license: apache-2.0
base_model:
- Qwen/Qwen2.5-3B
---

# Qwen2.5-3B-quantized.w8a16

## Model Overview
- **Model Architecture:** Qwen2
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
- **Intended Use Cases:** Similar to [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B), this is a base language model.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 10/09/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B).
It achieves an OpenLLMv1 score of 63.8, compared to 63.6 for [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B).

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) to the INT8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.
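As a rough illustration, the model's ~3 billion parameters occupy about 6 GB at 16 bits per weight versus about 3 GB at 8 bits (actual checkpoint sizes differ somewhat, since embeddings and other unquantized tensors remain at 16 bits).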

Only the weights of the linear operators within transformer blocks are quantized.
Symmetric per-channel quantization is applied: a linear scale per output dimension maps the INT8 representation of the quantized weights to their floating-point counterparts, as in the sketch below.
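
The following is a minimal PyTorch sketch of what symmetric per-channel weight quantization means; it uses naive round-to-nearest rather than the GPTQ procedure described next, and the helper names are illustrative:

```python
import torch

def quantize_w8_per_channel(weight: torch.Tensor):
    """Symmetric per-channel INT8 quantization of a [out_features, in_features] weight."""
    # One scale per output channel, chosen so the largest-magnitude weight maps to 127.
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize_w8(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # The per-output-dimension linear scale maps INT8 back to floating point.
    return q.to(scales.dtype) * scales

w = torch.randn(16, 32, dtype=torch.float16)
q, s = quantize_w8_per_channel(w)
print((w - dequantize_w8(q, s)).abs().max())  # small per-channel rounding error
```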

The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
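
A sketch of how such a checkpoint can be produced with llm-compressor is shown below. The recipe mirrors the description above (INT8 weights for the Linear layers, everything else left at 16 bits), but the calibration dataset and all argument values are illustrative assumptions rather than the exact recipe used for this model, and the API may differ across llm-compressor versions:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# GPTQ over the Linear layers inside the transformer blocks;
# the lm_head stays at 16 bits (hence the "w8a16" scheme name).
recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="Qwen/Qwen2.5-3B",
    dataset="open_platypus",    # illustrative calibration data
    recipe=recipe,
    max_seq_length=2048,        # illustrative calibration settings
    num_calibration_samples=512,
    output_dir="Qwen2.5-3B-quantized.w8a16",
)
```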

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Qwen2.5-3B-quantized.w8a16"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

prompt = "Give me a short introduction to large language models."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
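
For example, after starting a server with `vllm serve neuralmagic/Qwen2.5-3B-quantized.w8a16`, the model can be queried through the OpenAI Python client; the local address, port, and placeholder API key below are assumptions for a default local deployment:

```python
from openai import OpenAI

# Point the client at the local vLLM server (no real key is needed by default).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# This is a base model, so the completions endpoint (not chat) is the natural fit.
completion = client.completions.create(
    model="neuralmagic/Qwen2.5-3B-quantized.w8a16",
    prompt="Give me a short introduction to large language models.",
    max_tokens=256,
)
print(completion.choices[0].text)
```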

## Evaluation

The model was evaluated on the OpenLLMv1 benchmark, composed of MMLU, ARC-Challenge, GSM-8K, HellaSwag, WinoGrande, and TruthfulQA.
Evaluation was conducted using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.

### Accuracy

| Category | Benchmark | Qwen2.5-3B | Qwen2.5-3B-quantized.w8a16<br>(this model) | Recovery |
| -------- | --------- | ---------- | ------------------------------------------ | -------- |
| **OpenLLM v1** | MMLU (5-shot) | 65.68 | 65.65 | 100.0% |
| | ARC Challenge (25-shot) | 53.58 | 53.07 | 99.0% |
| | GSM-8K (5-shot, strict-match) | 68.23 | 70.05 | 102.7% |
| | HellaSwag (10-shot) | 51.83 | 51.78 | 99.9% |
| | WinoGrande (5-shot) | 70.64 | 70.56 | 99.9% |
| | TruthfulQA (0-shot, mc2) | 49.93 | 48.88 | 97.9% |
| | **Average** | **63.59** | **63.78** | **100.3%** |
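
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score; e.g., for MMLU, 65.65 / 65.68 ≈ 100.0%.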

### Reproduction

The results were obtained using the following command:

```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-3B-quantized.w8a16",dtype=auto,max_model_len=4096,add_bos_token=True,tensor_parallel_size=1 \
  --tasks openllm \
  --batch_size auto
```