FILM6912 committed on
Commit 1ac76ab · verified · 1 Parent(s): a32f229

Add files using upload-large-folder tool

Files changed (1)
  1. README_WEIGHTS.md +94 -0
README_WEIGHTS.md ADDED
# DeepSeek-V3 Weight File Documentation

## New Fields in `config.json`

- **model_type**: Specifies the model type, which is updated to `deepseek_v3` in this release.
- **num_nextn_predict_layers**: Indicates the number of Multi-Token Prediction (MTP) Modules. The open-sourced V3 weights include **1 MTP Module**.
- **quantization_config**: Describes the configuration for FP8 quantization.
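
These fields can be inspected directly. Below is a minimal sketch of reading them with the Python standard library; the checkpoint path is illustrative and not part of this release:

```python
import json

# Illustrative path; point this at the config.json shipped with the checkpoint.
with open("DeepSeek-V3/config.json") as f:
    config = json.load(f)

assert config["model_type"] == "deepseek_v3"                  # updated model type
num_mtp_modules = config.get("num_nextn_predict_layers", 0)   # 1 for the open-sourced V3 weights
quant_cfg = config.get("quantization_config")                 # present only for the FP8 weights

print(f"MTP modules: {num_mtp_modules}")
if quant_cfg is not None:
    print(f"FP8 quantization: {quant_cfg['quant_method']}, block size {quant_cfg['weight_block_size']}")
```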

---

## Weight Structure Overview

The DeepSeek-V3 weight file consists of two main components: **Main Model Weights** and **MTP Modules**.

### 1. Main Model Weights

- **Composition**:
  - Input/output embedding layers and a complete set of 61 Transformer hidden layers.
- **Parameter Count**:
  - Total parameters: **671B**
  - Activation parameters: **36.7B** (including 0.9B for Embedding and 0.9B for the output Head).

#### Structural Details

- **Embedding Layer**:
  - `model.embed_tokens.weight`
- **Transformer Hidden Layers**:
  - `model.layers.0` to `model.layers.60`, totaling `num_hidden_layers` layers.
- **Output Layer**:
  - `model.norm.weight`
  - `lm_head.weight`
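
One way to see this layout in the actual checkpoint is to group the tensor names in the sharded-checkpoint index by layer. The sketch below assumes a standard Hugging Face `model.safetensors.index.json` file sits next to the weight shards; the path and grouping logic are illustrative:

```python
import json
import re
from collections import defaultdict

# Assumed location of the standard sharded-checkpoint index file.
with open("DeepSeek-V3/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

groups = defaultdict(list)
for name in weight_map:
    m = re.match(r"model\.layers\.(\d+)\.", name)
    key = f"layer {int(m.group(1)):02d}" if m else "non-layer"  # embed_tokens, norm, lm_head
    groups[key].append(name)

# Layers 00-60 belong to the Main Model; layer 61 is the MTP Module (see the next section).
for key in sorted(groups):
    print(key, len(groups[key]), "tensors")
```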

### 2. Multi-Token Prediction (MTP) Modules

- **Composition**:
  - Additional MTP Modules defined by the `num_nextn_predict_layers` field. In this model, the value is set to 1.
- **Parameter Count**:
  - Parameters: **11.5B** unique parameters (excluding the shared 0.9B Embedding and 0.9B output Head).
  - Activation parameters: **2.4B** (including the shared 0.9B Embedding and 0.9B output Head).

#### Structural Details

- **embed_tokens**: **Shares parameters** with the Embedding layer of the Main Model weights.
- **enorm & hnorm**: RMSNorm parameters required for speculative decoding.
- **eh_proj**: Parameters for the dimensionality-reduction projection applied to the norm results.
- **Additional Transformer Hidden Layer**:
  - `model.layers.61.self_attn & mlp` (structure identical to the Main Model hidden layers).
- **shared_head**: **Shares parameters** with the output Head of the Main Model weights.
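
Read together, these tensors suggest the following wiring for the MTP Module. The sketch below is an illustration only: the module class, tensor shapes, and the exact rule for combining the normalized embedding and hidden state are assumptions inferred from the descriptions above, not code from this release:

```python
import torch
import torch.nn as nn

class MTPModuleSketch(nn.Module):
    """Illustrative wiring of one MTP Module (layer id 61)."""

    def __init__(self, hidden_size: int, embed_tokens: nn.Embedding,
                 shared_head: nn.Linear, block: nn.Module):
        super().__init__()
        self.embed_tokens = embed_tokens      # shared with the Main Model embedding
        self.enorm = nn.RMSNorm(hidden_size)  # norm over the token embedding
        self.hnorm = nn.RMSNorm(hidden_size)  # norm over the incoming hidden state
        self.eh_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # dimensionality reduction
        self.block = block                    # self_attn & mlp, same structure as a Main Model layer
        self.shared_head = shared_head        # shared with the Main Model output head

    def forward(self, hidden_states: torch.Tensor, next_token_ids: torch.Tensor) -> torch.Tensor:
        e = self.enorm(self.embed_tokens(next_token_ids))
        h = self.hnorm(hidden_states)
        x = self.eh_proj(torch.cat([e, h], dim=-1))  # assumed combination rule
        x = self.block(x)
        return self.shared_head(x)  # logits for the additionally predicted token
```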

---

### Loading Rules

- **Main Model Weights**: Loaded via the `num_hidden_layers` parameter in `config.json`.
- **MTP Modules**: Loaded via the `num_nextn_predict_layers` parameter, with layer IDs appended immediately after the Main Model hidden layers. For example:
  - If `num_hidden_layers = 61` and `num_nextn_predict_layers = 1`, the MTP Module's layer ID is `61`.
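
Expressed in code, the same rule is just an offset into the layer index range (a minimal sketch; the variable values are taken from `config.json`):

```python
num_hidden_layers = 61        # from config.json
num_nextn_predict_layers = 1  # from config.json

main_layer_ids = range(num_hidden_layers)                            # 0 .. 60
mtp_layer_ids = range(num_hidden_layers,
                      num_hidden_layers + num_nextn_predict_layers)  # 61

print(list(mtp_layer_ids))  # [61]
```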

---

## FP8 Weight Documentation

DeepSeek-V3 natively supports the FP8 weight format with 128x128 block scaling.

### FP8 Configuration

The FP8 weight file introduces a `quantization_config` field to describe the quantization method. Below is an example configuration:

```json
"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "quant_method": "fp8",
  "weight_block_size": [128, 128]
}
```

- **Quantization Format**:
  - Format type: `fp8` and `e4m3` (corresponding to `torch.float8_e4m3fn`).
  - Weight block size: `128x128`.
- **Activation Quantization Scheme**:
  - Utilizes dynamic activation quantization (`dynamic`).
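
A loader might sanity-check this block and map the format string to the corresponding PyTorch dtype. The helper below is illustrative and not part of the release:

```python
import torch

def resolve_fp8_dtype(quantization_config: dict) -> torch.dtype:
    """Illustrative validation of the quantization_config block described above."""
    assert quantization_config["quant_method"] == "fp8"
    assert quantization_config["activation_scheme"] == "dynamic"
    assert quantization_config["weight_block_size"] == [128, 128]
    # "e4m3" corresponds to torch.float8_e4m3fn.
    assert quantization_config["fmt"] == "e4m3"
    return torch.float8_e4m3fn

dtype = resolve_fp8_dtype({
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
})
print(dtype)  # torch.float8_e4m3fn
```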

### Dequantization Method

The FP8 weight file includes a `weight_scale_inv` field, which stores the dequantization scale for each weight block.

- **Storage Format**: `float32 Tensor`, stored alongside the weight data.
- **Dequantization Formula**:
  - If a weight block is not aligned to 128, it is zero-padded to 128 before the scale is calculated; after quantization, the padded portion is removed.
  - Dequantization is performed as: `(128x128 weight block) * weight_scale_inv`.

Building on this dequantization of the FP8 weights, runtime operations can apply online quantization at a `per-token-per-128-channel` granularity.
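
The block-wise formula can be written as a short reference implementation. The sketch below dequantizes a single 2-D weight tensor on the CPU; the function name, the toy shapes, and the loop-based implementation are illustrative (a real kernel would vectorize or fuse this):

```python
import torch

def dequantize_fp8_block(weight: torch.Tensor,
                         weight_scale_inv: torch.Tensor,
                         block_size: int = 128) -> torch.Tensor:
    """Reference block-wise dequantization: each element is multiplied by the scale of its 128x128 block."""
    rows, cols = weight.shape
    out = weight.to(torch.float32)
    for bi in range(weight_scale_inv.shape[0]):
        for bj in range(weight_scale_inv.shape[1]):
            r0, r1 = bi * block_size, min((bi + 1) * block_size, rows)  # trailing blocks may be smaller,
            c0, c1 = bj * block_size, min((bj + 1) * block_size, cols)  # since padding was removed after quantization
            out[r0:r1, c0:c1] *= weight_scale_inv[bi, bj]
    return out

# Example with a toy shape; real tensors come from the FP8 safetensors shards.
w = torch.randn(256, 300).to(torch.float8_e4m3fn)
s = torch.rand(2, 3)  # ceil(256 / 128) x ceil(300 / 128) float32 scales
w_fp32 = dequantize_fp8_block(w, s)
```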

---