nm-research committed
Commit 7a7c635 · verified · 1 parent: f781824

Update README.md

Files changed (1): README.md (+39 −10)
README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
  from vllm import LLM, SamplingParams

  max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic-ent/granite-3.1-2b-instruct-quantized.w8a8"
+ model_name = "neuralmagic/granite-3.1-2b-instruct-quantized.w8a8"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
  sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
@@ -66,7 +66,9 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

-
+ <details>
+ <summary>Model Creation Code</summary>
+
  ```bash
  python quantize.py --model_path ibm-granite/granite-3.1-2b-instruct --quant_path "output_dir/granite-3.1-2b-instruct-quantized.w8a8" --calib_size 2048 --dampening_frac 0.01 --observer mse
  ```
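The hunk header above notes that vLLM also supports OpenAI-compatible serving, and the benchmarking command added further down in this diff targets a local `/v1` endpoint. A minimal sketch of that workflow, assuming vLLM's standard `vllm serve` entry point, its default port 8000, and flags mirroring the deployment snippet (none of this is part of the commit itself):

```bash
# Sketch only: serve the quantized model behind vLLM's OpenAI-compatible API.
vllm serve neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 \
  --max-model-len 4096 \
  --tensor-parallel-size 1 \
  --trust-remote-code

# Query the server with a standard OpenAI-style chat completion request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/granite-3.1-2b-instruct-quantized.w8a8",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 256,
        "temperature": 0.3
      }'
```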
@@ -151,16 +153,20 @@ oneshot(
  model.save_pretrained(quant_path, save_compressed=True)
  tokenizer.save_pretrained(quant_path)
  ```
+ </details>

  ## Evaluation

- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:

+ <details>
+ <summary>Evaluation Commands</summary>
+
  OpenLLM Leaderboard V1:
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic-ent/granite-3.1-2b-instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
@@ -168,11 +174,23 @@ lm_eval \
  --show_config
  ```

+ OpenLLM Leaderboard V2:
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks leaderboard \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config
+ ```
+
  #### HumanEval
  ##### Generation
  ```
  python3 codegen/generate.py \
- --model neuralmagic-ent/granite-3.1-2b-instruct-quantized.w8a8 \
+ --model neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
@@ -182,20 +200,21 @@ python3 codegen/generate.py \
  ##### Sanitization
  ```
  python3 evalplus/sanitize.py \
- humaneval/neuralmagic-ent--granite-3.1-2b-instruct-quantized.w8a8_vllm_temp_0.2
+ humaneval/neuralmagic--granite-3.1-2b-instruct-quantized.w8a8_vllm_temp_0.2
  ```
  ##### Evaluation
  ```
  evalplus.evaluate \
  --dataset humaneval \
- --samples humaneval/neuralmagic-ent--granite-3.1-2b-instruct-quantized.w8a8_vllm_temp_0.2-sanitized
+ --samples humaneval/neuralmagic--granite-3.1-2b-instruct-quantized.w8a8_vllm_temp_0.2-sanitized
  ```
+ </details>

  ### Accuracy

  #### OpenLLM Leaderboard V1 evaluation scores

- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-quantized.w8a8 |
+ | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 |
  |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
  | ARC-Challenge (Acc-Norm, 25-shot) | 55.63 | 55.12 |
  | GSM8K (Strict-Match, 5-shot) | 60.96 | 60.58 |
@@ -207,7 +226,7 @@ evalplus.evaluate \
  | **Recovery** | **100.00** | **99.51** |

  #### OpenLLM Leaderboard V2 evaluation scores
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-quantized.w8a8 |
+ | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 |
  |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
  | IFEval (Inst Level Strict Acc, 0-shot)| 67.99 | 67.03 |
  | BBH (Acc-Norm, 3-shot) | 44.11 | 43.53 |
@@ -219,7 +238,7 @@ evalplus.evaluate \
  | **Recovery** | **100.00** | **98.40** |

  #### HumanEval pass@1 scores
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-quantized.w8a8 |
+ | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 |
  |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
  | HumanEval Pass@1 | 53.40 | 54.9 |

@@ -230,6 +249,16 @@ evalplus.evaluate \
  This model achieves up to 1.4x speedup in single-stream deployment and up to 1.1x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
  The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1, and [GuideLLM](https://github.com/neuralmagic/guidellm).

+ <details>
+ <summary>Benchmarking Command</summary>
+
+ ```
+ guidellm --model neuralmagic/granite-3.1-2b-instruct-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max seconds 360 --backend aiohttp_server
+ ```
+
+ </details>
+
+
  ### Single-stream performance (measured with vLLM version 0.6.6.post1)
  <table>
  <tr>
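A note on the Recovery rows in the accuracy tables above: they appear to express the quantized model's scores as a percentage of the unquantized baseline's scores, averaged across each suite. Applying the same ratio to a single metric gives, for example, 100 × 55.12 / 55.63 ≈ 99.1% on ARC-Challenge.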