GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters) and was trained from scratch.

Model Description

  • Model type: GPT-2 (125M parameters)
  • Architecture: Transformer-based autoregressive language model following the original GPT-2 design
  • Training data: Combined dataset (~18B tokens) drawn from the following sources (a data-mixing sketch follows this section):
    • HuggingFaceFW/fineweb-edu - 7.0B tokens
    • common-pile/arxiv_papers_filtered - 1.5B tokens
    • tiiuae/falcon-refinedweb - 7.0B tokens
    • manu/project_gutenberg - 0.2B tokens
    • nampdn-ai/tiny-textbooks - 0.2B tokens
    • SciPhi/textbooks-are-all-you-need-lite - 0.5B tokens
    • abehandlerorg/ccnews - 1.98B tokens
  • Training approach: Built and trained from scratch, not fine-tuned from an existing checkpoint
  • Language: English
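
The exact data-loading pipeline is not documented on this card, but one way to approximate the mixture above is to stream the sources and interleave them in proportion to their listed token counts. The snippet below is a sketch only; the actual preprocessing and sharding scripts live in the author's repository and may differ.

    # Sketch: approximate the data mix from the token counts listed above.
    # The repo's actual preprocessing/sharding code is not shown here.
    from datasets import load_dataset, interleave_datasets

    sources = {
        "HuggingFaceFW/fineweb-edu": 7.0e9,
        "tiiuae/falcon-refinedweb": 7.0e9,
        "abehandlerorg/ccnews": 1.98e9,
        "common-pile/arxiv_papers_filtered": 1.5e9,
        "SciPhi/textbooks-are-all-you-need-lite": 0.5e9,
        "manu/project_gutenberg": 0.2e9,
        "nampdn-ai/tiny-textbooks": 0.2e9,
    }

    total = sum(sources.values())
    probabilities = [tokens / total for tokens in sources.values()]

    # Streaming avoids downloading the full ~120 GB corpus up front.
    # Column names may need normalizing to a common "text" field first.
    streams = [load_dataset(name, split="train", streaming=True) for name in sources]
    mixed = interleave_datasets(streams, probabilities=probabilities, seed=42)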

Intended Uses & Limitations

  • Intended use: Research and experimentation with language models; reference implementation for reproducing GPT-2
  • Limitations: With only 125M parameters (compared to larger models such as GPT-3 at 175B), this model has limited ability to generate coherent long-form text or follow complex instructions
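
Because the weights published in this repository (thecr7guy/gpt2-pretrain) follow the standard GPT-2 design, they can presumably be loaded with the usual transformers classes. The sketch below assumes the repo ships a transformers-compatible config and tokenizer (the training pipeline itself uses tiktoken's GPT-2 encoding), so it may need to be adapted to the repo's own loading code.

    # Sketch only: assumes a transformers-compatible checkpoint and tokenizer.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "thecr7guy/gpt2-pretrain"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("The history of the printing press", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))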

Training Details

  • Training corpus: Approximately 18B tokens (120GB)
  • Training duration: 1 epoch (approximately 8 hours total)
  • Hardware: 8× NVIDIA A100 PCIe GPUs via runpod.io
  • Estimated cost: ~$108 (8 × $13.52) for the complete training run
  • Context length: 1024 tokens (the sketch below shows how this combines with the batch settings)
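
A back-of-envelope calculation, using the numbers above together with the hyperparameters below, shows how the batch settings map onto optimizer steps; the actual values in train.py may differ.

    # Rough numbers implied by the card; the actual train.py may differ.
    total_tokens     = 18_000_000_000   # ~18B training tokens
    total_batch_size = 524_288          # tokens per optimizer step (2**19)
    context_len      = 1024
    micro_batch_size = 64               # sequences per GPU per forward pass
    num_gpus         = 8

    tokens_per_micro_step = micro_batch_size * context_len * num_gpus  # 524,288
    grad_accum_steps = total_batch_size // tokens_per_micro_step       # 1
    optimizer_steps  = total_tokens // total_batch_size                # ~34,332

    print(grad_accum_steps, optimizer_steps)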

Hyperparameters

  • context_len: 1024
  • seed: 42
  • epochs: 2
  • batch_size: 64
  • total_batch_size: 524288 tokens
  • grad_clip: 1.0
  • optimizer: "adamw"
  • max_lr: 6.0e-4
  • min_lr: 6.0e-5
  • beta1: 0.9
  • beta2: 0.95
  • weight_decay: 0.1
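
The max_lr/min_lr pair suggests a decaying learning-rate schedule. A common recipe for GPT-2 reproductions is linear warmup followed by cosine decay to min_lr; the warmup length below is an assumption, since this card does not list one.

    import math

    max_lr = 6e-4
    min_lr = 6e-5
    warmup_steps = 700       # assumption: not listed on this card
    max_steps    = 34_332    # ~18B tokens / 524,288 tokens per step

    def get_lr(step: int) -> float:
        """Linear warmup followed by cosine decay to min_lr (common GPT-2 recipe)."""
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        if step > max_steps:
            return min_lr
        progress = (step - warmup_steps) / (max_steps - warmup_steps)
        coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
        return min_lr + coeff * (max_lr - min_lr)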

Performance and Evaluation

This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.
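
No benchmark numbers are reported here. A quick sanity check for a checkpoint like this is held-out perplexity; the sketch below assumes the transformers loading path from the usage example above works.

    # Sketch: held-out perplexity as a sanity check (no numbers are reported on this card).
    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "thecr7guy/gpt2-pretrain"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    text = open("heldout.txt").read()  # any held-out text file (hypothetical path)
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

    with torch.no_grad():
        loss = model(**ids, labels=ids["input_ids"]).loss
    print("perplexity:", math.exp(loss.item()))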

Commands used during setup and training

  • pip install wandb
  • pip install tiktoken
  • pip install --upgrade huggingface_hub
  • pip install torchinfo
  • pip install datasets
  • sudo apt update && sudo apt install tmux
  • tmux new -s training
  • wandb login
  • CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py
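
torchrun supplies the RANK, LOCAL_RANK, and WORLD_SIZE environment variables to each of the 8 worker processes. A train.py written for this launch command typically initializes an NCCL process group and wraps the model in DistributedDataParallel, roughly as sketched below; the actual script in the author's GitHub repository may differ, and build_model() is a hypothetical stand-in for the repo's GPT-2 constructor.

    # Sketch of the DDP setup a train.py launched with torchrun typically does.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp() -> int:
        dist.init_process_group(backend="nccl")  # torchrun provides RANK/WORLD_SIZE
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank

    local_rank = setup_ddp()
    model = build_model().to(local_rank)  # build_model(): hypothetical GPT-2 constructor
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...
    dist.destroy_process_group()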

Contact

GitHub: thecr7guy2
