Commit 01b49dc (verified) · 1 Parent(s): f2767f2 · by thecr7guy

Update README.md

Files changed (1): README.md (+84 -3)
README.md CHANGED
---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- common-pile/arxiv_papers_filtered
- tiiuae/falcon-refinedweb
- manu/project_gutenberg
- nampdn-ai/tiny-textbooks
- SciPhi/textbooks-are-all-you-need-lite
- abehandlerorg/ccnews
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# GPT-2 from Scratch

This model is a 125M-parameter implementation of the GPT-2 architecture, trained from scratch.

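For orientation, 125M parameters corresponds to the standard GPT-2 "small" configuration (12 layers, 12 heads, 768-dimensional embeddings, 1024-token context). The sketch below uses Hugging Face `transformers` purely to illustrate that size; it is not how this repo defines or loads the model.

```python
# Illustrative only: the GPT-2 "small" reference configuration that ~125M parameters
# corresponds to. This repo trains its own from-scratch implementation; transformers is
# used here just to instantiate the reference architecture and count its parameters.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,  # GPT-2 BPE vocabulary
    n_positions=1024,  # context length, matching the training setup below
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~124M
```
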
## Model Description

- **Model type:** GPT-2 (125M parameters)
- **Architecture:** Transformer-based autoregressive language model following the original GPT-2 design
- **Training data:** Combined corpus of roughly 18B tokens drawn from the following sources (see the streaming sketch after this list):
  - HuggingFaceFW/fineweb-edu: 7.0B tokens
  - common-pile/arxiv_papers_filtered: 1.5B tokens
  - tiiuae/falcon-refinedweb: 7.0B tokens
  - manu/project_gutenberg: 0.2B tokens
  - nampdn-ai/tiny-textbooks: 0.2B tokens
  - SciPhi/textbooks-are-all-you-need-lite: 0.5B tokens
  - abehandlerorg/ccnews: 1.98B tokens
- **Training approach:** Built and trained from scratch, not fine-tuned from an existing checkpoint
- **Language:** English

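The per-source token counts above describe the sampling budget. As a hedged illustration of how such corpora can be consumed, the sketch below streams one of them with the `datasets` library and tokenizes documents with the GPT-2 BPE from `tiktoken` (both packages appear in the setup commands further down); the repo's actual preprocessing and sharding pipeline may differ.

```python
# Illustrative sketch: stream one of the corpora listed above and tokenize it with the
# GPT-2 BPE. This is not the repo's data pipeline, just a way to see how a token budget
# like the ~18B figure is counted.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

for doc in stream.take(3):
    ids = enc.encode_ordinary(doc["text"]) + [enc.eot_token]  # <|endoftext|> separates documents
    print(f"{len(ids)} tokens")
```
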
## Intended Uses & Limitations

- **Intended use:** Research and experimentation with language models; a reference implementation for reproducing GPT-2 (a minimal sampling sketch follows below)
- **Limitations:** With only 125M parameters (compared to larger models such as GPT-3 at 175B), the model has limited ability to generate coherent long-form text or to follow complex instructions

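For experimentation, text can be sampled with a standard temperature/top-k loop. The sketch below is illustrative and assumption-laden: `model` stands for the loaded from-scratch GPT-2 (checkpoint loading depends on the repo's `train.py`) and is assumed to map a `(batch, time)` tensor of token ids to `(batch, time, vocab)` logits.

```python
# Hypothetical sampling loop. `model` is assumed to be the loaded from-scratch GPT-2 and to
# return logits of shape (batch, time, vocab); checkpoint loading is repo-specific.
import torch
import tiktoken

enc = tiktoken.get_encoding("gpt2")

@torch.no_grad()
def generate(model, prompt, max_new_tokens=64, temperature=0.8, top_k=50, device="cuda"):
    ids = torch.tensor([enc.encode(prompt)], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])            # stay within the 1024-token context window
        if isinstance(logits, tuple):             # some implementations return (logits, loss)
            logits = logits[0]
        logits = logits[:, -1, :] / temperature   # distribution over the next token only
        topk_vals, _ = torch.topk(logits, top_k)
        logits[logits < topk_vals[:, [-1]]] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids[0].tolist())

# Example (once `model` is loaded and on the GPU):
# print(generate(model, "The history of science shows that"))
```
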
## Training Details

- **Training corpus:** Approximately 18B tokens (~120 GB of text)
- **Training duration:** 1 epoch (approximately 8 hours total)
- **Hardware:** 8× NVIDIA A100 PCIe GPUs via runpod.io
- **Estimated cost:** Approximately $108 (8 × $13.52) for the complete training run
- **Token context:** 1024 tokens

### Hyperparameters

- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1

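Taking the hyperparameters above together with the 8-GPU setup at face value, the batch arithmetic works out as follows; this is a back-of-the-envelope sketch, not output from the training code, and the per-GPU interpretation of `batch_size` is an assumption.

```python
# Back-of-the-envelope arithmetic implied by the hyperparameters above (not training code).
context_len = 1024
batch_size = 64             # sequences per micro-batch (assumed to be per GPU)
num_gpus = 8
total_batch_size = 524_288  # tokens per optimizer step
tokens_total = 18e9         # size of the training corpus

tokens_per_micro_step = batch_size * context_len * num_gpus   # 524,288 tokens
grad_accum_steps = total_batch_size // tokens_per_micro_step  # 1 -> no gradient accumulation needed
optimizer_steps = tokens_total / total_batch_size             # ~34,332 steps to see 18B tokens once

print(tokens_per_micro_step, grad_accum_steps, round(optimizer_steps))
```
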
## Performance and Evaluation

This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.

## Commands used during setup and training

- `pip install wandb`
- `pip install tiktoken`
- `pip install --upgrade huggingface_hub`
- `pip install torchinfo`
- `pip install datasets`
- `sudo apt update && sudo apt install tmux`
- `tmux new -s training`
- `wandb login`
- `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py` (one process per GPU; see the DDP sketch below)

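The last command launches distributed data-parallel training: `torchrun` starts eight worker processes (one per GPU) and sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and the rendezvous variables, while `NCCL_P2P_DISABLE=1` disables NCCL peer-to-peer transfers, a common workaround on PCIe hosts. The sketch below shows the setup a `train.py` launched this way typically performs; the repo's actual script may differ.

```python
# Minimal sketch of the process-group setup expected by
# `torchrun --standalone --nproc_per_node=8 train.py`. Illustrative; the repo's train.py may differ.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed() -> int:
    dist.init_process_group(backend="nccl")     # reads RANK / WORLD_SIZE / MASTER_* set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])  # which GPU this process owns
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup_distributed()
# model = GPT2FromScratch(...).to(local_rank)   # hypothetical model class
# model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across the 8 GPUs
# ... training loop ...
dist.destroy_process_group()
```
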
## Contact

GitHub: [thecr7guy2](https://github.com/thecr7guy2)