Update README.md
README.md
CHANGED
@@ -30,6 +30,24 @@ This model, designed and pretrained from scratch, was developed without utilizin
 - **Micro Batch Size**: `128`
 - **Sequence Length**: `256`

+## Model Parameters Details
+
+### Decayed Parameters
+
+- **Total Decayed Parameters**: 95,453,184
+
+Decayed parameters typically include weights from the model's various layers (like the transformer blocks), which are subject to weight decay during optimization. This technique helps in regularizing the model, potentially reducing overfitting by penalizing large weights.
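As a rough illustration of how weight decay penalizes large weights, the sketch below applies one plain SGD step to a single scalar weight, with and without a decay term. The README does not state the optimizer, learning rate, or decay coefficient, so `sgd_step`, `lr=0.1`, and `weight_decay=0.1` are hypothetical choices made only for this example.

```python
def sgd_step(weight, grad, lr=0.1, weight_decay=0.0):
    """One SGD step where the decay term is added to the gradient.

    lr and weight_decay are illustrative values, not the model's settings.
    """
    return weight - lr * (grad + weight_decay * weight)


# With a zero gradient, only the decay term can move the weight.
w_plain = sgd_step(2.0, grad=0.0)                       # no decay: unchanged
w_decayed = sgd_step(2.0, grad=0.0, weight_decay=0.1)   # decay: pulled toward 0

print(w_plain, w_decayed)
```

The larger the weight, the larger the pull toward zero, which is the regularizing effect the paragraph refers to.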
+
+### Non-Decayed Parameters
+- **Total Non-Decayed Parameters**: 81,408
+
+Non-decayed parameters generally involve biases and layer normalization parameters. These parameters are excluded from weight decay as applying decay can adversely affect the training process by destabilizing the learning dynamics.
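One common way to produce such a split is by tensor dimensionality: tensors with two or more dimensions (weight matrices, embeddings) are decayed, while one-dimensional tensors (biases, layer normalization scales and shifts) are not. The README does not say which rule this model used, so the sketch below is an assumption, and the tensor names and shapes in `toy_shapes` are hypothetical and far smaller than the real model.

```python
from math import prod

# Hypothetical toy shapes for illustration only, not this model's tensors.
toy_shapes = {
    "block.attn.weight": (8, 24),
    "block.attn.bias": (24,),
    "block.ln.weight": (8,),
    "block.ln.bias": (8,),
    "block.mlp.weight": (8, 32),
}


def split_by_decay(shapes):
    """Assumed rule: 2-D or higher tensors are decayed, 1-D tensors are not."""
    decayed = {name: s for name, s in shapes.items() if len(s) >= 2}
    non_decayed = {name: s for name, s in shapes.items() if len(s) < 2}
    return decayed, non_decayed


def count_params(shapes):
    """Total number of scalar parameters across all tensors."""
    return sum(prod(s) for s in shapes.values())


decayed, non_decayed = split_by_decay(toy_shapes)
num_decayed = count_params(decayed)
num_non_decayed = count_params(non_decayed)

print(f"decayed: {num_decayed}, non-decayed: {num_non_decayed}")
```

In a training script, the two groups would typically be passed to the optimizer as separate parameter groups, with the decay coefficient set to zero for the second group.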
+
+### Total Parameters
+- **Overall Total Parameters**: 95,534,592
+
+The calculated total number of parameters includes both decayed and non-decayed tensors, summing up to over 95 million parameters.
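The three reported figures can be checked against each other with simple arithmetic. The snippet below uses only the numbers stated in this section; the variable names are chosen for illustration.

```python
decayed_params = 95_453_184      # Total Decayed Parameters
non_decayed_params = 81_408      # Total Non-Decayed Parameters

total_params = decayed_params + non_decayed_params
non_decayed_share = non_decayed_params / total_params

print(f"{total_params:,}")  # 95,534,592
print(f"non-decayed share: {non_decayed_share:.4%}")
```

The sum matches the reported overall total, and the non-decayed tensors account for well under 0.1% of all parameters.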
+
 ## Dataset Description

 ### Overview