cicdatopea committed · verified
Commit dfedb9d · 1 parent: 9106678

Update README.md

Files changed (1):
  1. README.md +48 -12
README.md CHANGED
@@ -83,19 +83,55 @@ Please follow the [Build llama.cpp locally](https://github.com/ggerganov/llama.c
 
 ### Generate the model
 
- Here is the sample command to generate the model.
 ```bash
- auto-round \
- --model Pdeepseek-ai/DeepSeek-V3 \
- --device 0 \
- --group_size 32 \
- --bits 4 \
- --disable_eval \
- --iters 200 \
- --nsample 512 \
- -devices 0,1,2,3,4 \
- --format 'gguf:q4_0' \
- --output_dir "./tmp_autoround"
 ```
 
 ## Ethical Considerations and Limitations
 
 
 ### Generate the model
 
+ **Five 80 GB GPUs are needed (this could be optimized), plus 1.4 TB of CPU memory.**
+
+ We discovered that the inputs and outputs of certain layers in this model are very large and can even exceed the FP16 range when tested with a few prompts. It is recommended to exclude these layers from quantization, particularly the `down_proj` in layer 60, and to run them in BF16 precision instead. However, we have not done this in this INT4 model because, on CPU, the compute dtype for INT4 is BF16 or FP32, so the FP16 overflow issue does not arise there.
+
+ ~~~python
+ model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
+ model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
+ model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
+ model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
+ model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
+ model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
+ model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
+ model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
+ ~~~
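The two numbers after each layer name appear to be the largest absolute values observed at that projection's input and output. As an illustrative sketch only (this is not the authors' script; the model id, prompt, and hook-based approach are assumptions), per-layer maxima of this kind can be collected with standard PyTorch forward hooks:

```python
# Illustrative sketch only: gather max |input| / |output| of every down_proj
# with forward hooks. The model id and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "opensourcerelease/DeepSeek-V3-bf16"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

stats = {}  # layer name -> [max |input|, max |output|] over all calibration prompts

def make_hook(name):
    def hook(module, inputs, output):
        in_max = inputs[0].detach().abs().max().item()
        out_max = output.detach().abs().max().item()
        prev = stats.get(name, [0.0, 0.0])
        stats[name] = [max(prev[0], in_max), max(prev[1], out_max)]
    return hook

for name, module in model.named_modules():
    if name.endswith("down_proj"):
        module.register_forward_hook(make_hook(name))

batch = tokenizer("There is a girl who likes adventure.", return_tensors="pt")
with torch.no_grad():
    model(**batch)

# FP16 overflows above 65504, so values near that range are a red flag for FP16 inference.
for name, (in_max, out_max) in sorted(stats.items(), key=lambda kv: -max(kv[1])):
    print(name, in_max, out_max)
```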
+
+ **1. Add metadata to the BF16 model** (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16)
+
+ ~~~python
+ import safetensors
+ from safetensors.torch import save_file
+
+ # Re-save each shard of the BF16 checkpoint with the {'format': 'pt'} metadata.
+ for i in range(1, 164):
+     idx_str = "0" * (5 - len(str(i))) + str(i)
+     safetensors_path = f"model-{idx_str}-of-000163.safetensors"
+     print(safetensors_path)
+     tensors = dict()
+     with safetensors.safe_open(safetensors_path, framework="pt") as f:
+         for key in f.keys():
+             tensors[key] = f.get_tensor(key)
+     save_file(tensors, safetensors_path, metadata={'format': 'pt'})
+ ~~~
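An optional sanity check (not part of the original instructions) can confirm that the metadata landed in every shard; it reuses the file-name pattern from the script above and the `metadata()` accessor of `safetensors.safe_open`:

```python
# Optional check: every re-saved shard should now carry the {'format': 'pt'} metadata.
import safetensors

for i in range(1, 164):
    idx_str = "0" * (5 - len(str(i))) + str(i)
    path = f"model-{idx_str}-of-000163.safetensors"
    with safetensors.safe_open(path, framework="pt") as f:
        meta = f.metadata() or {}
        assert meta.get("format") == "pt", f"metadata missing in {path}"
print("all 163 shards carry format=pt metadata")
```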
+
+ **2. Replace `modeling_deepseek.py` with the following file.** It basically aligns devices and removes `torch.no_grad`, because AutoRound needs gradients for its tuning:
+
+ https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
+
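Purely as a hypothetical illustration of the two kinds of changes described in step 2 (the linked file is the authoritative version; the module below is a toy stand-in, not DeepSeek-V3's actual code), the pattern looks like this:

```python
# Hypothetical toy module, only to illustrate the two edits mentioned in step 2:
# no torch.no_grad() around the forward pass, and inputs moved to the weight's device.
import torch
import torch.nn as nn

class ToyExpert(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.down_proj = nn.Linear(hidden, hidden, bias=False)

    # Note: no @torch.no_grad() here. AutoRound tunes rounding/clipping parameters
    # with gradients, so the forward pass has to stay differentiable.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Align device": when experts are sharded across several GPUs, move the
        # activation onto the same device as this expert's weights before the matmul.
        x = x.to(self.down_proj.weight.device)
        return self.down_proj(x)

layer = ToyExpert()
out = layer(torch.randn(2, 16))
print(out.requires_grad)  # True -> gradients can flow during AutoRound tuning
```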
+ **3. Tuning**
+
+ ```bash
+ git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
+ ```
+
 ```bash
+ python3 -m auto_round --model "/models/DeepSeek-V3-bf16/" --group_size 128 --format "gguf:q4_0" --iters 200 --devices 0,1,2,3,4 --nsamples 512 --batch_size 8 --seqlen 512 --low_gpu_mem_usage --output_dir "tmp_autoround" --disable_eval 2>&1 | tee -a seekv3.txt
 ```
 
 ## Ethical Considerations and Limitations