---
license: mit
tags:
- deepseek
- int8
- vllm
- llmcompressor
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
library_name: transformers
---

# DeepSeek-R1-Distill-Llama-8B-quantized.w8a8

## Model Overview
- **Model Architecture:** LlamaForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 2/1/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x): for example, the roughly 16 GB of 16-bit weights of the 8B-parameter base model shrink to roughly 8 GB in INT8.
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, preceded by a [SmoothQuant](https://arxiv.org/abs/2211.10438) transformation, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library; a toy sketch of these quantization schemes is shown below.
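
For intuition, here is a minimal, self-contained sketch of what symmetric per-channel (weights) and per-token (activations) INT8 quantization mean. This is not the llm-compressor implementation; the tensor shapes and the error check are purely illustrative.

```python
import torch

def quantize_symmetric_int8(x: torch.Tensor, dim: int):
    # Symmetric INT8: the scale maps the max |value| along `dim` to 127; the zero-point is 0.
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

# Per-channel weight quantization: one scale per output channel (row of W).
W = torch.randn(4096, 4096)   # [out_features, in_features]
W_q, w_scale = quantize_symmetric_int8(W, dim=1)

# Per-token activation quantization: one scale per token (row of X).
X = torch.randn(16, 4096)     # [num_tokens, hidden_size]
X_q, x_scale = quantize_symmetric_int8(X, dim=1)

# The dequantized matmul approximates the original full-precision result.
Y_approx = (X_q.float() * x_scale) @ (W_q.float() * w_scale).T
print(torch.nn.functional.mse_loss(Y_approx, X @ W.T))
```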

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
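
For example, an OpenAI-compatible server can be started with the `vllm serve` entrypoint (the flag shown is illustrative; exact options depend on your vLLM version):

```
vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8 --tensor-parallel-size 1
```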

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Load the calibration dataset and render each conversation with the chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

## Evaluation

The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), using the following commands:

OpenLLM Leaderboard V1:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

OpenLLM Leaderboard V2:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>deepseek-ai/DeepSeek-R1-Distill-Llama-8B</th>
      <th>neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8</th>
      <th>Recovery</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="7"><b>OpenLLM V1</b></td>
      <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
      <td>45.05</td>
      <td>45.22</td>
      <td>100.4%</td>
    </tr>
    <tr>
      <td>GSM8K (Strict-Match, 5-shot)</td>
      <td>62.77</td>
      <td>62.09</td>
      <td>98.9%</td>
    </tr>
    <tr>
      <td>HellaSwag (Acc-Norm, 10-shot)</td>
      <td>76.78</td>
      <td>76.80</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>MMLU (Acc, 5-shot)</td>
      <td>55.65</td>
      <td>55.53</td>
      <td>99.8%</td>
    </tr>
    <tr>
      <td>TruthfulQA (MC2, 0-shot)</td>
      <td>50.55</td>
      <td>49.89</td>
      <td>98.7%</td>
    </tr>
    <tr>
      <td>Winogrande (Acc, 5-shot)</td>
      <td>68.51</td>
      <td>67.40</td>
      <td>98.4%</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>59.88</b></td>
      <td><b>59.49</b></td>
      <td><b>99.3%</b></td>
    </tr>
    <tr>
      <td rowspan="7"><b>OpenLLM V2</b></td>
      <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
      <td>38.34</td>
      <td>39.07</td>
      <td>101.9%</td>
    </tr>
    <tr>
      <td>BBH (Acc-Norm, 3-shot)</td>
      <td>38.19</td>
      <td>39.57</td>
      <td>103.6%</td>
    </tr>
    <tr>
      <td>Math-Hard (Exact-Match, 4-shot)</td>
      <td>0.00</td>
      <td>0.00</td>
      <td>---</td>
    </tr>
    <tr>
      <td>GPQA (Acc-Norm, 0-shot)</td>
      <td>28.87</td>
      <td>27.28</td>
      <td>94.5%</td>
    </tr>
    <tr>
      <td>MUSR (Acc-Norm, 0-shot)</td>
      <td>33.31</td>
      <td>34.50</td>
      <td>103.6%</td>
    </tr>
    <tr>
      <td>MMLU-Pro (Acc, 5-shot)</td>
      <td>20.10</td>
      <td>20.60</td>
      <td>102.4%</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>26.47</b></td>
      <td><b>26.84</b></td>
      <td><b>101.4%</b></td>
    </tr>
    <tr>
      <td rowspan="4"><b>Coding</b></td>
      <td>HumanEval (pass@1)</td>
      <td>49.90</td>
      <td>50.90</td>
      <td>102.0%</td>
    </tr>
    <tr>
      <td>HumanEval (pass@10)</td>
      <td>68.90</td>
      <td>68.70</td>
      <td>99.7%</td>
    </tr>
    <tr>
      <td>HumanEval+ (pass@1)</td>
      <td>44.10</td>
      <td>46.70</td>
      <td>105.9%</td>
    </tr>
    <tr>
      <td>HumanEval+ (pass@10)</td>
      <td>62.90</td>
      <td>64.30</td>
      <td>102.2%</td>
    </tr>
  </tbody>
</table>
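
The Recovery column reports the quantized model's score as a percentage of the baseline score. A minimal sketch of the computation, checked against the ARC-Challenge figures from the table above:

```python
def recovery(quantized: float, baseline: float) -> float:
    # Recovery = quantized score as a percentage of the baseline score.
    return 100.0 * quantized / baseline

print(f"{recovery(45.22, 45.05):.1f}%")  # -> 100.4%
```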