Update README.md
README.md
CHANGED
@@ -83,19 +83,55 @@ Please follow the [Build llama.cpp locally](https://github.com/ggerganov/llama.c

### Generate the model

The previous quantization command, removed by this commit in favor of the updated procedure below, used these flags:

```bash
  --model deepseek-ai/DeepSeek-V3 \
  --device 0 \
  --group_size 32 \
  --bits 4 \
  --disable_eval \
  --iters 200 \
  --nsample 512 \
  --devices 0,1,2,3,4 \
  --format 'gguf:q4_0' \
  --output_dir "./tmp_autoround"
```

**Five 80 GB GPUs are needed (this could be optimized), along with about 1.4 TB of CPU memory.**

We found that the inputs and outputs of certain layers in this model are very large, even exceeding the FP16 range when tested with a few prompts. We recommend excluding these layers from quantization, particularly the `down_proj` modules in layer 60, and running them in BF16 precision instead. However, we have not done this in this INT4 model, because on CPU the compute dtype for INT4 is BF16 or FP32 anyway.

~~~
model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
~~~
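
For reference, the largest finite FP16 value is 65504, which is what the magnitudes above should be compared against. The sketch below is a minimal reconstruction of how such per-layer statistics can be collected, not the exact script used to produce the log: it assumes the local BF16 checkpoint path used later in this card, a single toy prompt, and forward hooks on every `down_proj` module.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the BF16 checkpoint lives at the path used by the tuning command below.
model_path = "/models/DeepSeek-V3-bf16/"
FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

stats = {}  # module name -> (max |input|, max |output|) seen so far

def make_hook(name):
    def hook(module, inputs, output):
        in_max = inputs[0].detach().abs().max().item()
        out_max = output.detach().abs().max().item()
        prev_in, prev_out = stats.get(name, (0.0, 0.0))
        stats[name] = (max(prev_in, in_max), max(prev_out, out_max))
    return hook

# Watch every down_proj, the modules flagged in the log above.
for name, module in model.named_modules():
    if name.endswith("down_proj"):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    batch = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
    model(**batch)

# Report modules whose activations would overflow FP16.
for name, (in_max, out_max) in sorted(stats.items(), key=lambda kv: -max(kv[1])):
    if max(in_max, out_max) > FP16_MAX:
        print(name, in_max, out_max)
```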

**1. Add metadata to the BF16 model** (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16): re-save every safetensors shard with the `format: pt` metadata entry, as the script below does.

~~~python
import safetensors
from safetensors.torch import save_file

# Re-save each of the 163 shards in place, adding the 'format': 'pt' metadata entry.
for i in range(1, 164):
    idx_str = str(i).zfill(5)  # zero-pad to five digits, e.g. 00001
    safetensors_path = f"model-{idx_str}-of-000163.safetensors"
    print(safetensors_path)
    tensors = dict()
    with safetensors.safe_open(safetensors_path, framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    save_file(tensors, safetensors_path, metadata={'format': 'pt'})
~~~
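
To confirm the metadata landed, the header of any shard can be read back with `safe_open`; a quick check, assuming it is run in the same checkpoint directory:

```python
from safetensors import safe_open

# Read back one shard and verify the 'format' metadata entry is present.
with safe_open("model-00001-of-000163.safetensors", framework="pt") as f:
    print(f.metadata())  # expected to include {'format': 'pt'}
```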

**2. Replace `modeling_deepseek.py` with the following file.** The replacement mainly aligns devices and removes `torch.no_grad`, because AutoRound needs gradients during tuning.

https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
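
A possible way to drop the replacement file into the local checkpoint is sketched below; the destination reuses the `/models/DeepSeek-V3-bf16/` directory from the tuning command in step 3, and the download URL is assumed to be the raw.githubusercontent.com counterpart of the blob link above.

```python
import urllib.request

# Assumption: raw counterpart of the blob URL referenced above.
url = "https://raw.githubusercontent.com/intel/auto-round/deepseekv3/modeling_deepseek.py"
dest = "/models/DeepSeek-V3-bf16/modeling_deepseek.py"  # local BF16 checkpoint directory

urllib.request.urlretrieve(url, dest)
print("replaced", dest)
```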

**3. Tuning**

```bash
git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
```

```bash
python3 -m auto_round \
  --model "/models/DeepSeek-V3-bf16/" \
  --group_size 128 \
  --format "gguf:q4_0" \
  --iters 200 \
  --devices 0,1,2,3,4 \
  --nsamples 512 \
  --batch_size 8 \
  --seqlen 512 \
  --low_gpu_mem_usage \
  --output_dir "tmp_autoround" \
  --disable_eval 2>&1 | tee -a seekv3.txt
```
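
For readers who prefer AutoRound's Python API over the CLI, a rough equivalent is sketched below. This is an approximation, not the command actually used: parameter names mirror the CLI flags above, but exact constructor signatures and GGUF export support may differ between AutoRound versions and branches.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "/models/DeepSeek-V3-bf16/"  # BF16 checkpoint prepared in steps 1-2
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Mirrors --group_size 128 --iters 200 --nsamples 512 --batch_size 8 --seqlen 512
# --low_gpu_mem_usage from the CLI call above; 4-bit matches the q4_0 target.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    iters=200,
    nsamples=512,
    seqlen=512,
    batch_size=8,
    low_gpu_mem_usage=True,
)
autoround.quantize()
# Export format assumed to match the CLI's --format "gguf:q4_0".
autoround.save_quantized("tmp_autoround", format="gguf:q4_0")
```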

## Ethical Considerations and Limitations