Commit 01b49dc (verified) · 1 Parent(s): f2767f2 · by thecr7guy

Update README.md

Files changed (1): README.md (+84 -3)
README.md CHANGED
---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- common-pile/arxiv_papers_filtered
- tiiuae/falcon-refinedweb
- manu/project_gutenberg
- nampdn-ai/tiny-textbooks
- SciPhi/textbooks-are-all-you-need-lite
- abehandlerorg/ccnews
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# GPT-2 from Scratch

This model is a 125M-parameter implementation of the GPT-2 architecture, trained from scratch.

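For orientation, 125M parameters corresponds to the standard GPT-2 "small" configuration (12 layers, 12 heads, 768-dimensional embeddings, 1024-token context). The sketch below uses Hugging Face `transformers` purely to illustrate that size; it is not how this repo defines or loads the model.

```python
# Illustrative only: the GPT-2 "small" reference configuration that ~125M parameters
# corresponds to. This repo trains its own from-scratch implementation; transformers is
# used here just to instantiate the reference architecture and count its parameters.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,  # GPT-2 BPE vocabulary
    n_positions=1024,  # context length, matching the training setup below
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~124M
```
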
## Model Description

- **Model type:** GPT-2 (125M parameters)
- **Architecture:** Transformer-based autoregressive language model following the original GPT-2 design
- **Training data:** Combined corpus of roughly 18B tokens drawn from the following sources (see the streaming sketch after this list):
  - HuggingFaceFW/fineweb-edu: 7.0B tokens
  - common-pile/arxiv_papers_filtered: 1.5B tokens
  - tiiuae/falcon-refinedweb: 7.0B tokens
  - manu/project_gutenberg: 0.2B tokens
  - nampdn-ai/tiny-textbooks: 0.2B tokens
  - SciPhi/textbooks-are-all-you-need-lite: 0.5B tokens
  - abehandlerorg/ccnews: 1.98B tokens
- **Training approach:** Built and trained from scratch, not fine-tuned from an existing checkpoint
- **Language:** English

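The per-source token counts above describe the sampling budget. As a hedged illustration of how such corpora can be consumed, the sketch below streams one of them with the `datasets` library and tokenizes documents with the GPT-2 BPE from `tiktoken` (both packages appear in the setup commands further down); the repo's actual preprocessing and sharding pipeline may differ.

```python
# Illustrative sketch: stream one of the corpora listed above and tokenize it with the
# GPT-2 BPE. This is not the repo's data pipeline, just a way to see how a token budget
# like the ~18B figure is counted.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

for doc in stream.take(3):
    ids = enc.encode_ordinary(doc["text"]) + [enc.eot_token]  # <|endoftext|> separates documents
    print(f"{len(ids)} tokens")
```
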
## Intended Uses & Limitations

- **Intended use:** Research and experimentation with language models; a reference implementation for reproducing GPT-2 (a minimal sampling sketch follows below)
- **Limitations:** With only 125M parameters (compared to larger models such as GPT-3 at 175B), the model has limited ability to generate coherent long-form text or to follow complex instructions

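For experimentation, text can be sampled with a standard temperature/top-k loop. The sketch below is illustrative and assumption-laden: `model` stands for the loaded from-scratch GPT-2 (checkpoint loading depends on the repo's `train.py`) and is assumed to map a `(batch, time)` tensor of token ids to `(batch, time, vocab)` logits.

```python
# Hypothetical sampling loop. `model` is assumed to be the loaded from-scratch GPT-2 and to
# return logits of shape (batch, time, vocab); checkpoint loading is repo-specific.
import torch
import tiktoken

enc = tiktoken.get_encoding("gpt2")

@torch.no_grad()
def generate(model, prompt, max_new_tokens=64, temperature=0.8, top_k=50, device="cuda"):
    ids = torch.tensor([enc.encode(prompt)], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])            # stay within the 1024-token context window
        if isinstance(logits, tuple):             # some implementations return (logits, loss)
            logits = logits[0]
        logits = logits[:, -1, :] / temperature   # distribution over the next token only
        topk_vals, _ = torch.topk(logits, top_k)
        logits[logits < topk_vals[:, [-1]]] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids[0].tolist())

# Example (once `model` is loaded and on the GPU):
# print(generate(model, "The history of science shows that"))
```
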
## Training Details

- **Training corpus:** Approximately 18B tokens (~120 GB of text)
- **Training duration:** 1 epoch (approximately 8 hours total)
- **Hardware:** 8× NVIDIA A100 PCIe GPUs via runpod.io
- **Estimated cost:** Approximately $108 (8 × $13.52) for the complete training run
- **Token context:** 1024 tokens

### Hyperparameters

- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1

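Taking the hyperparameters above together with the 8-GPU setup at face value, the batch arithmetic works out as follows; this is a back-of-the-envelope sketch, not output from the training code, and the per-GPU interpretation of `batch_size` is an assumption.

```python
# Back-of-the-envelope arithmetic implied by the hyperparameters above (not training code).
context_len = 1024
batch_size = 64             # sequences per micro-batch (assumed to be per GPU)
num_gpus = 8
total_batch_size = 524_288  # tokens per optimizer step
tokens_total = 18e9         # size of the training corpus

tokens_per_micro_step = batch_size * context_len * num_gpus   # 524,288 tokens
grad_accum_steps = total_batch_size // tokens_per_micro_step  # 1 -> no gradient accumulation needed
optimizer_steps = tokens_total / total_batch_size             # ~34,332 steps to see 18B tokens once

print(tokens_per_micro_step, grad_accum_steps, round(optimizer_steps))
```
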
## Performance and Evaluation

This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.

## Commands used during setup and training

- `pip install wandb`
- `pip install tiktoken`
- `pip install --upgrade huggingface_hub`
- `pip install torchinfo`
- `pip install datasets`
- `sudo apt update && sudo apt install tmux`
- `tmux new -s training`
- `wandb login`
- `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py` (one process per GPU; see the DDP sketch below)

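The last command launches distributed data-parallel training: `torchrun` starts eight worker processes (one per GPU) and sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and the rendezvous variables, while `NCCL_P2P_DISABLE=1` disables NCCL peer-to-peer transfers, a common workaround on PCIe hosts. The sketch below shows the setup a `train.py` launched this way typically performs; the repo's actual script may differ.

```python
# Minimal sketch of the process-group setup expected by
# `torchrun --standalone --nproc_per_node=8 train.py`. Illustrative; the repo's train.py may differ.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed() -> int:
    dist.init_process_group(backend="nccl")     # reads RANK / WORLD_SIZE / MASTER_* set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])  # which GPU this process owns
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup_distributed()
# model = GPT2FromScratch(...).to(local_rank)   # hypothetical model class
# model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across the 8 GPUs
# ... training loop ...
dist.destroy_process_group()
```
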
## Contact

GitHub: [thecr7guy2](https://github.com/thecr7guy2)