temporary0-0name committed
Commit 0a36318 · verified · 1 Parent(s): 2f049ca

Update README.md

Files changed (1)
  1. README.md +16 -1
README.md CHANGED
@@ -14,7 +14,7 @@ widget:
  # Custom GPT Model

  ## Model Description
- This model, designed and pretrained from scratch, was developed without utilizing the Hugging Face library. It was independently trained on custom datasets, specifically focusing on tailored NLP tasks. The training process involved meticulous data preprocessing and training strategies to enhance its language understanding capabilities.
+ This model, designed and pretrained from scratch, was developed without utilizing the Hugging Face library.

  ## Model Parameters
  - **Block Size**: `256` (Maximum sequence length)
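Taken together, the hyperparameters shown as context in these two hunks (block size 256, micro batch size 128, sequence length 256) amount to a small training configuration. The sketch below is illustrative only: the dataclass layout and field names are assumptions, and any settings hidden by the diff (README lines 21-29 are not shown) are left out.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values copied from the "Model Parameters" list in this README.
    block_size: int = 256        # maximum sequence length
    micro_batch_size: int = 128  # sequences per micro batch
    sequence_length: int = 256
    # Other settings (vocabulary size, layer count, learning rate, ...)
    # are not visible in this diff and are intentionally omitted.
```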
@@ -30,6 +30,21 @@ This model, designed and pretrained from scratch, was developed without utilizin
  - **Micro Batch Size**: `128`
  - **Sequence Length**: `256`

+ ## Dataset Description
+
+ ### Overview
+ For the training of this model, a significant subset of the **HuggingFaceFW/fineweb-edu** dataset was utilized. Specifically, the model was pretrained on 3 billion tokens selected from the "Sample 10B" segment of the dataset. This dataset provides a rich corpus compiled from educational and academic web sources, making it an excellent foundation for developing language models with a strong grasp of academic and formal text.
+
+ ### Dataset Source
+ The dataset is hosted and maintained on Hugging Face's dataset repository. More detailed information and access to the dataset can be found through its dedicated page:
+ [HuggingFaceFW/fineweb-edu Sample 10B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/10BT)
+
+ ### Training Details
+ - **Total Tokens Used for Training**: 3 billion tokens
+ - **Training Duration**: The model was trained over 3 epochs to ensure sufficient exposure to the data while optimizing the learning trajectory.
+
+
+
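For readers who want to look at the same data, the subset described above can be pulled straight from the Hub with the `datasets` library. This is a minimal sketch, not the pipeline used for this model: the `sample-10BT` configuration name and the streaming access pattern are assumptions, and the commit does not include any loading code.

```python
from datasets import load_dataset

# Stream the ~10B-token educational-web sample instead of downloading it in full.
# "sample-10BT" is the assumed Hub configuration name for this subset.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at a couple of records; each one carries the raw document under "text".
for example in fineweb_edu.take(2):
    print(example["text"][:200])
```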
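The figures under Training Details also support a quick back-of-the-envelope check of the optimizer workload. The arithmetic below assumes the 3 billion tokens form the per-epoch corpus, that every sequence is packed to the full 256 tokens, and that no gradient accumulation is used; none of these details are stated in the README.

```python
# Rough step-count estimate from the values quoted in this README.
total_tokens = 3_000_000_000   # tokens seen per epoch (assumed interpretation)
sequence_length = 256
micro_batch_size = 128
epochs = 3

tokens_per_micro_batch = sequence_length * micro_batch_size   # 256 * 128 = 32,768
steps_per_epoch = total_tokens // tokens_per_micro_batch      # ≈ 91,552 micro-batch steps
total_steps = steps_per_epoch * epochs                        # ≈ 274,656 over 3 epochs

print(f"{steps_per_epoch:,} steps/epoch, {total_steps:,} steps total")
```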
  ### Tokenization
  For tokenization, this model uses:
  ```python