cicdatopea committed · verified
Commit dfedb9d · 1 parent: 9106678

Update README.md

Files changed (1):
  1. README.md +48 -12
README.md CHANGED
@@ -83,19 +83,55 @@ Please follow the [Build llama.cpp locally](https://github.com/ggerganov/llama.c
 
 ### Generate the model
 
- Here is the sample command to generate the model.
 ```bash
- auto-round \
- --model Pdeepseek-ai/DeepSeek-V3 \
- --device 0 \
- --group_size 32 \
- --bits 4 \
- --disable_eval \
- --iters 200 \
- --nsample 512 \
- -devices 0,1,2,3,4 \
- --format 'gguf:q4_0' \
- --output_dir "./tmp_autoround"
 ```
 
 ## Ethical Considerations and Limitations
 
 
 ### Generate the model
 
+ **Five 80 GB GPUs are needed (this could be optimized), plus 1.4 TB of CPU memory.**
+
+ We discovered that the inputs and outputs of certain layers in this model are very large and can even exceed the FP16 range when tested with a few prompts. It is recommended to exclude these layers from quantization, particularly the `down_proj` in layer 60, and to run them in BF16 precision instead. However, we have not done this in this INT4 model because, on CPU, the compute dtype for INT4 is BF16 or FP32, so the FP16 overflow issue does not arise there.
+
+ ~~~python
+ model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
+ model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
+ model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
+ model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
+ model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
+ model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
+ model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
+ model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
+ ~~~
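The two numbers after each layer name appear to be the largest absolute values observed at that projection's input and output. As an illustrative sketch only (this is not the authors' script; the model id, prompt, and hook-based approach are assumptions), per-layer maxima of this kind can be collected with standard PyTorch forward hooks:

```python
# Illustrative sketch only: gather max |input| / |output| of every down_proj
# with forward hooks. The model id and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "opensourcerelease/DeepSeek-V3-bf16"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

stats = {}  # layer name -> [max |input|, max |output|] over all calibration prompts

def make_hook(name):
    def hook(module, inputs, output):
        in_max = inputs[0].detach().abs().max().item()
        out_max = output.detach().abs().max().item()
        prev = stats.get(name, [0.0, 0.0])
        stats[name] = [max(prev[0], in_max), max(prev[1], out_max)]
    return hook

for name, module in model.named_modules():
    if name.endswith("down_proj"):
        module.register_forward_hook(make_hook(name))

batch = tokenizer("There is a girl who likes adventure.", return_tensors="pt")
with torch.no_grad():
    model(**batch)

# FP16 overflows above 65504, so values near that range are a red flag for FP16 inference.
for name, (in_max, out_max) in sorted(stats.items(), key=lambda kv: -max(kv[1])):
    print(name, in_max, out_max)
```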
+
+ **1. Add metadata to the BF16 model** (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16)
+
+ ~~~python
+ import safetensors
+ from safetensors.torch import save_file
+
+ # Re-save each shard of the BF16 checkpoint with the {'format': 'pt'} metadata.
+ for i in range(1, 164):
+     idx_str = "0" * (5 - len(str(i))) + str(i)
+     safetensors_path = f"model-{idx_str}-of-000163.safetensors"
+     print(safetensors_path)
+     tensors = dict()
+     with safetensors.safe_open(safetensors_path, framework="pt") as f:
+         for key in f.keys():
+             tensors[key] = f.get_tensor(key)
+     save_file(tensors, safetensors_path, metadata={'format': 'pt'})
+ ~~~
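An optional sanity check (not part of the original instructions) can confirm that the metadata landed in every shard; it reuses the file-name pattern from the script above and the `metadata()` accessor of `safetensors.safe_open`:

```python
# Optional check: every re-saved shard should now carry the {'format': 'pt'} metadata.
import safetensors

for i in range(1, 164):
    idx_str = "0" * (5 - len(str(i))) + str(i)
    path = f"model-{idx_str}-of-000163.safetensors"
    with safetensors.safe_open(path, framework="pt") as f:
        meta = f.metadata() or {}
        assert meta.get("format") == "pt", f"metadata missing in {path}"
print("all 163 shards carry format=pt metadata")
```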
+
+ **2. Replace `modeling_deepseek.py` with the following file.** It basically aligns devices and removes `torch.no_grad`, because AutoRound needs gradients for its tuning:
+
+ https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
+
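Purely as a hypothetical illustration of the two kinds of changes described in step 2 (the linked file is the authoritative version; the module below is a toy stand-in, not DeepSeek-V3's actual code), the pattern looks like this:

```python
# Hypothetical toy module, only to illustrate the two edits mentioned in step 2:
# no torch.no_grad() around the forward pass, and inputs moved to the weight's device.
import torch
import torch.nn as nn

class ToyExpert(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.down_proj = nn.Linear(hidden, hidden, bias=False)

    # Note: no @torch.no_grad() here. AutoRound tunes rounding/clipping parameters
    # with gradients, so the forward pass has to stay differentiable.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Align device": when experts are sharded across several GPUs, move the
        # activation onto the same device as this expert's weights before the matmul.
        x = x.to(self.down_proj.weight.device)
        return self.down_proj(x)

layer = ToyExpert()
out = layer(torch.randn(2, 16))
print(out.requires_grad)  # True -> gradients can flow during AutoRound tuning
```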
+ **3. Tuning**
+
+ ```bash
+ git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
+ ```
+
 ```bash
+ python3 -m auto_round --model "/models/DeepSeek-V3-bf16/" --group_size 128 --format "gguf:q4_0" --iters 200 --devices 0,1,2,3,4 --nsamples 512 --batch_size 8 --seqlen 512 --low_gpu_mem_usage --output_dir "tmp_autoround" --disable_eval 2>&1 | tee -a seekv3.txt
 ```
 
 ## Ethical Considerations and Limitations