shubhrapandit committed
Commit f809972 · verified · 1 Parent(s): e9c8f58

Update README.md

Files changed (1): README.md (+33, −25)
README.md CHANGED
```diff
@@ -227,6 +227,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 </tr>
 <tr>
 <th>Hardware</th>
+<th>Number of GPUs</th>
 <th>Model</th>
 <th>Average Cost Reduction</th>
 <th>Latency (s)</th>
@@ -239,7 +240,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 </thead>
 <tbody>
 <tr>
-<td>A100x4</td>
+<th rowspan="3" valign="top">A100</th>
+<td>4</td>
 <td>Qwen/Qwen2-VL-72B-Instruct</td>
 <td></td>
 <td>6.5</td>
@@ -250,7 +252,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>113</td>
 </tr>
 <tr>
-<td>A100x2</td>
+<td>2</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8</td>
 <td>1.85</td>
 <td>7.2</td>
@@ -261,7 +263,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>211</td>
 </tr>
 <tr>
-<td>A100x1</td>
+<td>1</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
 <td>3.32</td>
 <td>10.0</td>
@@ -272,7 +274,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>419</td>
 </tr>
 <tr>
-<td>H100x4</td>
+<th rowspan="3" valign="top">H100</td>
+<td>4</td>
 <td>Qwen/Qwen2-VL-72B-Instruct</td>
 <td></td>
 <td>4.4</td>
@@ -283,7 +286,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>99</td>
 </tr>
 <tr>
-<td>H100x2</td>
+<td>2</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic</td>
 <td>1.79</td>
 <td>4.7</td>
@@ -294,7 +297,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>177</td>
 </tr>
 <tr>
-<td>H100x1</td>
+<td>1</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
 <td>2.60</td>
 <td>6.4</td>
@@ -306,7 +309,10 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 </tr>
 </tbody>
 </table>
-
+
+**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
+
+**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
 
 ### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
 
@@ -334,7 +340,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 </thead>
 <tbody>
 <tr>
-<td>A100x4</td>
+<th rowspan="3" valign="top">A100x4</th>
 <td>Qwen/Qwen2-VL-72B-Instruct</td>
 <td></td>
 <td>0.3</td>
@@ -345,29 +351,27 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>595</td>
 </tr>
 <tr>
-<td>A100x2</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8</td>
 <td>1.84</td>
-<td>0.6</td>
+<td>1.2</td>
 <td>293</td>
-<td>2.0</td>
+<td>4.0</td>
 <td>1021</td>
-<td>2.3</td>
+<td>4.6</td>
 <td>1135</td>
 </tr>
 <tr>
-<td>A100x1</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
 <td>2.73</td>
-<td>0.6</td>
+<td>2.4</td>
 <td>314</td>
-<td>3.2</td>
+<td>12.8</td>
 <td>1591</td>
-<td>4.0</td>
+<td>16.0</td>
 <td>2019</td>
 </tr>
 <tr>
-<td>H100x4</td>
+<th rowspan="3" valign="top">H100x4</td>
 <td>Qwen/Qwen2-VL-72B-Instruct</td>
 <td></td>
 <td>0.5</td>
@@ -378,27 +382,31 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>377</td>
 </tr>
 <tr>
-<td>H100x2</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic</td>
 <td>1.70</td>
-<td>0.8</td>
+<td>1.6</td>
 <td>236</td>
-<td>2.2</td>
+<td>4.4</td>
 <td>623</td>
-<td>2.4</td>
+<td>4.8</td>
 <td>669</td>
 </tr>
 <tr>
-<td>H100x1</td>
 <td>neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16</td>
 <td>2.35</td>
-<td>1.3</td>
+<td>5.2</td>
 <td>350</td>
-<td>3.3</td>
+<td>13.2</td>
 <td>910</td>
-<td>3.6</td>
+<td>14.4</td>
 <td>994</td>
 </tr>
 </tbody>
 </table>
+
+**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
+
+**QPS: Queries per second.
+
+**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
 
```
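The revised tables report single-stream latency and QPD for each quantized variant at the GPU count shown. As a minimal sketch of how one such latency point could be measured with vLLM's offline `LLM` API; this is illustrative, not the benchmark harness behind the table (that harness, and the image inputs implied by the use-case profiles, are not part of this commit), and the prompt and `max_tokens` here are assumptions:

```python
import time

from vllm import LLM, SamplingParams

# Model and tensor_parallel_size follow the w4a16 / 1-GPU row above.
llm = LLM(
    model="neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16",
    tensor_parallel_size=1,
)

# Text-only prompt for brevity; the benchmarked profiles also include an image.
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(["Describe a sunset in one sentence."], params)
latency_s = time.perf_counter() - start

print(f"end-to-end latency: {latency_s:.1f} s")
print(outputs[0].outputs[0].text)
```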
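The QPD footnotes tie throughput to Lambda Labs on-demand pricing, but the README does not spell out the formula. One plausible reading for the single-stream table, with an assumed (not quoted) hourly rate:

```python
def queries_per_dollar(latency_s: float, num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Queries completed per hour, divided by the cluster's hourly cost."""
    queries_per_hour = 3600.0 / latency_s        # one query at a time (single-stream)
    cost_per_hour = num_gpus * usd_per_gpu_hour  # on-demand price scales with GPU count
    return queries_per_hour / cost_per_hour

# Illustrative only: 6.5 s latency on 4 GPUs at an assumed $1.29/GPU-hour.
print(round(queries_per_dollar(6.5, 4, 1.29)))  # -> 107
```

For the multi-stream table the same idea would use QPS × 3600 in the numerator instead of 3600 / latency.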
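For the multi-stream asynchronous table, QPS is measured under concurrent load. A hypothetical client-side sketch, assuming the checkpoint is already served by vLLM's OpenAI-compatible server (e.g. `vllm serve neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 --tensor-parallel-size 4`); the endpoint, concurrency level, and prompt are assumptions, not the configuration used for the table:

```python
import asyncio
import time

from openai import AsyncOpenAI

# Points at a local vLLM OpenAI-compatible endpoint (assumed default port).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_query() -> None:
    await client.chat.completions.create(
        model="neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16",
        messages=[{"role": "user", "content": "Describe a sunset in one sentence."}],
        max_tokens=128,
    )

async def main(num_queries: int = 64) -> None:
    # Fire all requests at once; vLLM batches them server-side.
    start = time.perf_counter()
    await asyncio.gather(*(one_query() for _ in range(num_queries)))
    elapsed = time.perf_counter() - start
    print(f"observed QPS: {num_queries / elapsed:.2f}")

asyncio.run(main())
```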