pkumc and HandH1998 committed · verified
Commit ce60f3d · 1 Parent(s): c6bab3f

Update README.md (#2)


- Update README.md (95642c7919dbf9b4e500cd508d311e3909306f6c)


Co-authored-by: HandH1998 <[email protected]>

Files changed (1)
  1. README.md +34 -0
README.md CHANGED
@@ -2,6 +2,40 @@
license: mit
library_name: transformers
---
+
+ # Channel-wise INT8 DeepSeek-R1
+
+ The INT8 data type is widely supported and efficient on most hardware platforms.
+
+ **We provide channel-wise INT8 weights for DeepSeek-R1.**
+
+ In our benchmarks, we observe **no accuracy loss** and up to a **50%** throughput improvement.
+
+ [SGLang](https://github.com/sgl-project/sglang/tree/main) will support channel-wise INT8 quantization once our [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3888) is merged.
+
+ ## 1. Benchmarking Results (detailed in [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3888))
+
+ | Model | Config | Accuracy (GSM8K) | Accuracy (MMLU) | Output Throughput (qps=128) |
+ |---------|--------------|------------------|-----------------|-----------------------------|
+ | BF16 R1 | A100\*32 | 95.5 | 87.1 | 3342.29 |
+ | INT8 R1 | (A100\*16)x2 | **95.6** | **87.2** | **5035.82 (+50%)** |
+
+ ## 2. Quantization Process
+
+ We apply INT8 quantization to the BF16 checkpoints.
+
+ The quantization scales are determined by dividing the channel-wise maximum of element values by the INT8 type maximum.
+
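For illustration, here is a minimal PyTorch sketch of this channel-wise scheme. It assumes symmetric quantization based on the per-channel absolute maximum; the function name and shapes are illustrative, not the actual `bf16_cast_channel_int8.py` implementation.

```python
import torch

def quantize_channel_int8(weight: torch.Tensor):
    """Quantize a 2-D weight to INT8 with one scale per output channel."""
    # Channel-wise maximum of the element magnitudes, shape (out_features, 1)
    abs_max = weight.abs().amax(dim=1, keepdim=True)
    # Divide by the INT8 type maximum (127) to obtain the per-channel scale
    scale = (abs_max / 127.0).clamp(min=1e-12)
    # Scale, round, and clamp each element into the INT8 range
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

# Round trip: dequantized weights should closely match the BF16 originals
w = torch.randn(128, 256, dtype=torch.bfloat16).float()
q, scale = quantize_channel_int8(w)
print((w - q.float() * scale).abs().max())  # small quantization error
```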
+ To generate these weights, run the provided script in the `./inference` directory:
+
+ ```bash
+ python3 bf16_cast_channel_int8.py --input-bf16-hf-path /path/to/bf16-weights/ --output-int8-hf-path /path/to/save-int8-weight/
+ ```
+
+ ## 3. Troubleshooting
+
+ Before running inference, confirm that there is no `quantization_config` attribute in `config.json`.
+
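A small convenience snippet (not part of the released scripts) can check for, and if necessary drop, the attribute; the checkpoint path below is a placeholder:

```python
import json

config_path = "/path/to/save-int8-weight/config.json"  # placeholder path

with open(config_path) as f:
    config = json.load(f)

# The INT8 checkpoint's config.json should not carry "quantization_config"
if "quantization_config" in config:
    del config["quantization_config"]
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    print("removed quantization_config")
else:
    print("config.json is clean")
```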
+ ---
+
+
# DeepSeek-R1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->