GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters) and was trained from scratch.

Model Description

  • Model type: GPT-2 (125M parameters)
  • Architecture: Transformer-based autoregressive language model following the original GPT-2 design
  • Training data: Combined dataset (~18B tokens) drawn from the following sources (a data-mixing sketch follows this section):
    • HuggingFaceFW/fineweb-edu - 7.0B tokens
    • common-pile/arxiv_papers_filtered - 1.5B tokens
    • tiiuae/falcon-refinedweb - 7.0B tokens
    • manu/project_gutenberg - 0.2B tokens
    • nampdn-ai/tiny-textbooks - 0.2B tokens
    • SciPhi/textbooks-are-all-you-need-lite - 0.5B tokens
    • abehandlerorg/ccnews - 1.98B tokens
  • Training approach: Built and trained from scratch, not fine-tuned from an existing checkpoint
  • Language: English
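
The exact data-loading pipeline is not documented on this card, but one way to approximate the mixture above is to stream the sources and interleave them in proportion to their listed token counts. The snippet below is a sketch only; the actual preprocessing and sharding scripts live in the author's repository and may differ.

    # Sketch: approximate the data mix from the token counts listed above.
    # The repo's actual preprocessing/sharding code is not shown here.
    from datasets import load_dataset, interleave_datasets

    sources = {
        "HuggingFaceFW/fineweb-edu": 7.0e9,
        "tiiuae/falcon-refinedweb": 7.0e9,
        "abehandlerorg/ccnews": 1.98e9,
        "common-pile/arxiv_papers_filtered": 1.5e9,
        "SciPhi/textbooks-are-all-you-need-lite": 0.5e9,
        "manu/project_gutenberg": 0.2e9,
        "nampdn-ai/tiny-textbooks": 0.2e9,
    }

    total = sum(sources.values())
    probabilities = [tokens / total for tokens in sources.values()]

    # Streaming avoids downloading the full ~120 GB corpus up front.
    # Column names may need normalizing to a common "text" field first.
    streams = [load_dataset(name, split="train", streaming=True) for name in sources]
    mixed = interleave_datasets(streams, probabilities=probabilities, seed=42)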

Intended Uses & Limitations

  • Intended use: Research and experimentation with language models; reference implementation for reproducing GPT-2
  • Limitations: With only 125M parameters (compared to larger models such as GPT-3 at 175B), this model has limited ability to generate coherent long-form text or follow complex instructions
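
Because the weights published in this repository (thecr7guy/gpt2-pretrain) follow the standard GPT-2 design, they can presumably be loaded with the usual transformers classes. The sketch below assumes the repo ships a transformers-compatible config and tokenizer (the training pipeline itself uses tiktoken's GPT-2 encoding), so it may need to be adapted to the repo's own loading code.

    # Sketch only: assumes a transformers-compatible checkpoint and tokenizer.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "thecr7guy/gpt2-pretrain"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("The history of the printing press", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))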

Training Details

  • Training corpus: Approximately 18B tokens (120GB)
  • Training duration: 1 epoch (approximately 8 hours total)
  • Hardware: 8× NVIDIA A100 PCIe GPUs via runpod.io
  • Estimated cost: ~$108 (8 × $13.52) for the complete training run
  • Context length: 1024 tokens (the sketch below shows how this combines with the batch settings)
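
A back-of-envelope calculation, using the numbers above together with the hyperparameters below, shows how the batch settings map onto optimizer steps; the actual values in train.py may differ.

    # Rough numbers implied by the card; the actual train.py may differ.
    total_tokens     = 18_000_000_000   # ~18B training tokens
    total_batch_size = 524_288          # tokens per optimizer step (2**19)
    context_len      = 1024
    micro_batch_size = 64               # sequences per GPU per forward pass
    num_gpus         = 8

    tokens_per_micro_step = micro_batch_size * context_len * num_gpus  # 524,288
    grad_accum_steps = total_batch_size // tokens_per_micro_step       # 1
    optimizer_steps  = total_tokens // total_batch_size                # ~34,332

    print(grad_accum_steps, optimizer_steps)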

Hyperparameters

  • context_len: 1024
  • seed: 42
  • epochs: 2
  • batch_size: 64
  • total_batch_size: 524288 tokens
  • grad_clip: 1.0
  • optimizer: "adamw"
  • max_lr: 6.0e-4
  • min_lr: 6.0e-5
  • beta1: 0.9
  • beta2: 0.95
  • weight_decay: 0.1
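
The max_lr/min_lr pair suggests a decaying learning-rate schedule. A common recipe for GPT-2 reproductions is linear warmup followed by cosine decay to min_lr; the warmup length below is an assumption, since this card does not list one.

    import math

    max_lr = 6e-4
    min_lr = 6e-5
    warmup_steps = 700       # assumption: not listed on this card
    max_steps    = 34_332    # ~18B tokens / 524,288 tokens per step

    def get_lr(step: int) -> float:
        """Linear warmup followed by cosine decay to min_lr (common GPT-2 recipe)."""
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        if step > max_steps:
            return min_lr
        progress = (step - warmup_steps) / (max_steps - warmup_steps)
        coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
        return min_lr + coeff * (max_lr - min_lr)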

Performance and Evaluation

This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.
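
No benchmark numbers are reported here. A quick sanity check for a checkpoint like this is held-out perplexity; the sketch below assumes the transformers loading path from the usage example above works.

    # Sketch: held-out perplexity as a sanity check (no numbers are reported on this card).
    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "thecr7guy/gpt2-pretrain"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    text = open("heldout.txt").read()  # any held-out text file (hypothetical path)
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

    with torch.no_grad():
        loss = model(**ids, labels=ids["input_ids"]).loss
    print("perplexity:", math.exp(loss.item()))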

Commands used during setup and training

  • pip install wandb
  • pip install tiktoken
  • pip install --upgrade huggingface_hub
  • pip install torchinfo
  • pip install datasets
  • sudo apt update && sudo apt install tmux
  • tmux new -s training
  • wandb login
  • CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py
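
torchrun supplies the RANK, LOCAL_RANK, and WORLD_SIZE environment variables to each of the 8 worker processes. A train.py written for this launch command typically initializes an NCCL process group and wraps the model in DistributedDataParallel, roughly as sketched below; the actual script in the author's GitHub repository may differ, and build_model() is a hypothetical stand-in for the repo's GPT-2 constructor.

    # Sketch of the DDP setup a train.py launched with torchrun typically does.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp() -> int:
        dist.init_process_group(backend="nccl")  # torchrun provides RANK/WORLD_SIZE
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank

    local_rank = setup_ddp()
    model = build_model().to(local_rank)  # build_model(): hypothetical GPT-2 constructor
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...
    dist.destroy_process_group()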

Contact

GitHub: thecr7guy2
