Update README.md
README.md CHANGED
# Custom GPT Model

## Model Description

This model was designed and pretrained from scratch, without using the Hugging Face library.

## Model Parameters

- **Block Size**: `256` (Maximum sequence length)
- **Micro Batch Size**: `128`
- **Sequence Length**: `256`
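
To make the listed hyperparameters concrete, here is a minimal configuration sketch. The `GPTConfig` name and dataclass layout are illustrative assumptions rather than the model's actual code, and only the parameters shown above are included.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values taken from the parameter list above; the class name and
    # structure are illustrative, not the model's actual implementation.
    block_size: int = 256        # maximum sequence length (context window)
    micro_batch_size: int = 128  # sequences per micro-batch
    sequence_length: int = 256   # tokens per training sequence

config = GPTConfig()
# Training sequences must fit inside the context window.
assert config.sequence_length <= config.block_size
```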

## Dataset Description

### Overview

For the training of this model, a significant subset of the **HuggingFaceFW/fineweb-edu** dataset was utilized. Specifically, the model was pretrained on 3 billion tokens selected from the "Sample 10B" segment of the dataset. This dataset provides a rich corpus compiled from educational and academic web sources, making it an excellent foundation for developing language models with a strong grasp of academic and formal text.
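
As a sketch of how such a subset could be streamed from the Hub with the `datasets` library (the README does not include the actual data-loading code, and the `sample-10BT` configuration name is an assumption based on the dataset's folder layout):

```python
from datasets import load_dataset

# Stream the "Sample 10B" slice of fineweb-edu instead of downloading it in full.
dataset = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",   # assumed configuration name for the sample/10BT folder
    split="train",
    streaming=True,
)

# Each record carries the raw document text plus metadata fields.
for i, example in enumerate(dataset):
    print(example["text"][:200])
    if i == 2:  # peek at a few documents only
        break
```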

### Dataset Source

The dataset is hosted and maintained on Hugging Face's dataset repository. More detailed information and access to the dataset can be found through its dedicated page:

[HuggingFaceFW/fineweb-edu Sample 10B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/10BT)

### Training Details

- **Total Tokens Used for Training**: 3 billion tokens
- **Training Duration**: The model was trained over 3 epochs to ensure sufficient exposure to the data while optimizing the learning trajectory.
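
As back-of-the-envelope arithmetic only (not figures from the actual run), the numbers above imply the following step counts, assuming each micro-batch consumes `micro_batch_size * sequence_length` tokens and that the 3 billion tokens are seen once per epoch:

```python
# Rough training arithmetic based on the parameters listed above.
total_tokens = 3_000_000_000   # assumption: 3B tokens seen once per epoch
micro_batch_size = 128
sequence_length = 256
epochs = 3

tokens_per_micro_batch = micro_batch_size * sequence_length   # 32,768 tokens
steps_per_epoch = total_tokens // tokens_per_micro_batch       # ~91,552 micro-batch steps
total_token_presentations = total_tokens * epochs              # ~9B tokens processed overall

print(f"{tokens_per_micro_batch=}, {steps_per_epoch=}, {total_token_presentations=:,}")
```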

### Tokenization

For tokenization, this model uses:

```python