ChessGPT-2
Model Description
ChessGPT-2 is a series of transformer language models specifically trained on chess game data, demonstrating that language models can learn complex strategic reasoning through chess gameplay. This repository presents large-16 (200M parameters) as our best model.
The large-16 model is a 200-million parameter GPT-2 architecture trained on engine-generated chess games, capable of high-quality move prediction, strategic analysis, and chess reasoning.
Model Details
large-16 (Primary Model)
- Model Type: Autoregressive Transformer Language Model (GPT-2 architecture)
- Parameters: ~200 million
- Architecture:
  - Layers: 16
  - Attention Heads: 16
  - Embedding Dimension: 1024
  - Context Length: 1023 tokens
  - Vocabulary Size: 32 tokens (chess-specific vocabulary)
- Training Framework: NanoGPT (PyTorch)
- Precision: Mixed precision training (bfloat16/float16)
 
Training Data
All models were trained on datasets from @adamkarvonen/chess_games:
Primary Dataset: Stockfish Games
- Dataset: stockfish_dataset_blocks.zip
- Description: 4.5GB of games in which White is played by Stockfish at ELO 3200 and Black by Stockfish at ELO levels ranging from 1300 to 3200
- Format: PGN (Portable Game Notation) games converted to 1024-character blocks
- Tokenization: Each block begins with a ";" delimiter (e.g., ";1.e4 e5 2.Nf3...")
- Data Split: 99% training, 1% validation (random split with seed 2357); see the preparation sketch below
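The exact preprocessing pipeline is not published with this card. The sketch below is a minimal reconstruction under stated assumptions: the blocks have been extracted from stockfish_dataset_blocks.zip into a local text file (path assumed), the vocabulary is built character-by-character from the data itself, and the output follows NanoGPT's train.bin / val.bin / meta.pkl convention.

```python
import pickle
import numpy as np

# Assumption: blocks extracted from stockfish_dataset_blocks.zip into this local file,
# one 1024-character block per line, each starting with ';'. The file name is illustrative.
with open('stockfish_blocks.txt', 'r') as f:
    blocks = f.read().splitlines()

# Character-level vocabulary built from the data (expected to contain 32 symbols:
# ';', digits, '.', ' ', piece letters, files a-h, 'x', '+', '#', '=', '-').
chars = sorted(set(''.join(blocks)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
print(f"vocab size: {len(chars)}")

# 99% / 1% random split with seed 2357, as described above.
rng = np.random.default_rng(2357)
perm = rng.permutation(len(blocks))
n_val = max(1, len(blocks) // 100)
val_idx = set(perm[:n_val].tolist())

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

train_ids, val_ids = [], []
for i, block in enumerate(blocks):
    (val_ids if i in val_idx else train_ids).extend(encode(block))

# NanoGPT-style binary shards plus a meta.pkl holding the vocabulary.
np.array(train_ids, dtype=np.uint16).tofile('train.bin')
np.array(val_ids, dtype=np.uint16).tofile('val.bin')
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}, f)
```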
 
Training Configuration
large-16 Training Settings
- Batch Size: 32 (micro-batch)
- Gradient Accumulation: 4 steps (effective batch size: 128)
- Learning Rate: 3e-4 with cosine decay to 3e-5
- Warmup: 2000 iterations
- Max Iterations: 600,000
- Optimizer: AdamW (β₁=0.9, β₂=0.95)
- Dropout: 0.0 (no dropout for pretraining)
- Training Hardware: RTX 3090/4090 GPUs with distributed training support (a NanoGPT-style config sketch follows this list)
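For reference, the settings above map onto a NanoGPT-style configuration file roughly as follows. The file name, out_dir, and dataset values are assumptions; the hyperparameter names are NanoGPT's own.

```python
# Hypothetical NanoGPT config (e.g. config/train_chessgpt_large16.py); values mirror the
# settings listed above, variable names follow NanoGPT's train.py conventions.
out_dir = 'out-chessgpt-large16'   # assumption: output directory name
dataset = 'chess_stockfish'        # assumption: expects data/<dataset>/train.bin, val.bin

# model: large-16
n_layer = 16
n_head = 16
n_embd = 1024
block_size = 1023
dropout = 0.0
bias = False

# optimization
batch_size = 32                    # micro-batch size per GPU
gradient_accumulation_steps = 4    # effective batch size: 128
learning_rate = 3e-4
min_lr = 3e-5                      # cosine decay floor
warmup_iters = 2000
max_iters = 600000
lr_decay_iters = 600000            # decay over the full run
beta1 = 0.9
beta2 = 0.95

# precision
dtype = 'bfloat16'                 # falls back to float16 on GPUs without bf16 support
```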
 
Usage
Loading the Model
```python
import torch
from model import GPT, GPTConfig  # model.py from the NanoGPT codebase

# Load the large-16 configuration
config = GPTConfig(
    block_size=1023,
    n_layer=16,
    n_head=16,
    n_embd=1024,
    dropout=0.0,
    bias=False,
    vocab_size=32
)

# Initialize the model and load the checkpoint weights
model = GPT(config)
checkpoint = torch.load('ckpt.pt', map_location='cpu')
state_dict = checkpoint['model']
# NanoGPT checkpoints saved from a torch.compile'd model prefix keys with '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith('_orig_mod.'):
        state_dict[k[len('_orig_mod.'):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()

# For GPU inference (recommended)
if torch.cuda.is_available():
    model = model.cuda()

# Generate chess moves (requires proper tokenization)
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
# ... tokenization and generation code (see the sketch below) ...
```
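The elided tokenization and sampling step could be filled in along the following lines, continuing from the loading code above. This sketch assumes a NanoGPT-style meta.pkl holding the character vocabulary (as in the preparation sketch earlier) and uses NanoGPT's GPT.generate helper; the sampling parameters are illustrative.

```python
import pickle
import torch

# Assumption: meta.pkl maps the 32 chess characters to token ids (see preparation sketch).
with open('meta.pkl', 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
idx = torch.tensor(encode(prompt), dtype=torch.long, device=device)[None, ...]

# NanoGPT's GPT.generate samples autoregressively; a low temperature keeps moves conservative.
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=10, temperature=0.2, top_k=5)
print(decode(out[0].tolist()))  # the prompt followed by sampled continuation characters
```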
Input Format
All models expect properly tokenized chess games:
- Must start with ";" delimiter
 - Standard PGN algebraic notation
 - 1024-character blocks for optimal performance
 
Performance Characteristics
The large-16 model, the strongest variant in this series, demonstrates:
- Chess Reasoning: the most advanced modelling of tactical and strategic patterns in the series
- Planning: coherent long-term game planning across full game sequences
- Pattern Recognition: recognition across diverse chess positions
- Scale: 202.5M parameters, 2.3GB checkpoint
- Architecture: 16 layers, 16 attention heads, 1024 embedding dimension (the largest configuration studied)
- Performance: the lowest validation loss of all variants (0.2578), with potential for expert-level chess understanding
 
Model Series & Ablation Studies
This repository represents extensive research into scaling transformer models for chess. Our complete series includes:
Parameter Scaling Ablations
| Model Variant | Parameters | Layers | Heads | Embedding | Model Size | Val Loss | Key Characteristics | 
|---|---|---|---|---|---|---|---|
| small-8 | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline | 
| small-16 | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study | 
| small-24 | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model | 
| small-36 | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth | 
| medium-12 | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium | 
| medium-16 | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium | 
| large-16 | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | Primary model | 
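The parameter counts in the table follow directly from the architecture columns. The sketch below reproduces them with the standard GPT-2 estimate (roughly 12·n_layer·n_embd² for the transformer blocks plus the embedding tables), assuming NanoGPT's bias-free layers and weight-tied output head.

```python
# Approximate parameter count for a bias-free, weight-tied GPT-2 (NanoGPT defaults).
def gpt2_params(n_layer: int, n_embd: int, vocab_size: int = 32, block_size: int = 1023) -> int:
    per_block = 12 * n_embd ** 2 + 2 * n_embd         # attention + MLP + two LayerNorm weight vectors
    embeddings = (vocab_size + block_size) * n_embd   # token + position embedding tables
    return n_layer * per_block + embeddings + n_embd  # plus the final LayerNorm

for name, (layers, embd) in {
    'small-8': (8, 512), 'small-36': (36, 512),
    'medium-16': (16, 768), 'large-16': (16, 1024),
}.items():
    print(f"{name}: {gpt2_params(layers, embd) / 1e6:.1f}M")
# small-8: 25.7M   small-36: 113.8M   medium-16: 114.1M   large-16: 202.4M (table: 202.5M)
```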
Dataset Comparison Studies
| Model | Dataset | Source | Size | Characteristics | 
|---|---|---|---|---|
| All Stockfish Models | Stockfish | Engine games | 4.5GB | Optimal play patterns | 
| Lichess Model | Lichess | Human games | 6GB | Human decision patterns | 
Key Research Findings
- Depth vs Width Trade-offs: Small models (512 emb, 8 heads) scale from 25.7M → 113.8M parameters purely through depth (8 → 36 layers)
- Clear Performance Scaling: Validation loss improves consistently with depth: 0.2944 (8-layer) → 0.2583 (36-layer)
- Architecture Variations: Medium models explore width scaling (768 emb, 12 heads) vs small models' depth scaling
- Parameter Efficiency: small-36 (113.8M) achieves a similar parameter count to medium-16 (114.1M) via a different architecture
- No Overfitting: All models trained to 600k iterations show continued learning potential
- Dataset Impact: Significant behavioral differences between engine vs. human training data
 
Evaluation Metrics
Models should be evaluated on:
- Move Legality: Percentage of generated moves that are legal (a checker sketch follows this list)
 - Game Continuation: Quality and coherence of extended game sequences
 - Tactical Recognition: Ability to identify tactical patterns and combinations
 - Strategic Understanding: Long-term positional planning and evaluation
 - Opening Knowledge: Familiarity with established opening theory
 - Endgame Technique: Performance in simplified positions
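No evaluation harness is bundled with the models. As one illustration of the first metric, the sketch below scores move legality by replaying a generated transcript with the python-chess library (an assumption; any SAN parser would do) and counting how many moves parse as legal before the first failure.

```python
import re
import chess  # pip install python-chess (assumption: not bundled with the model)

def legality_rate(transcript: str) -> float:
    """Fraction of generated SAN moves that replay legally from the starting position."""
    moves = []
    for tok in transcript.lstrip(';').split():
        tok = re.sub(r"^\d+\.", "", tok)  # drop attached move numbers, e.g. '1.' in '1.e4'
        if tok:
            moves.append(tok)
    board = chess.Board()
    legal = 0
    for san in moves:
        try:
            board.push_san(san)   # raises a ValueError subclass if the move is illegal/unparsable
            legal += 1
        except ValueError:
            break                 # one convention: stop scoring at the first illegal move
    return legal / len(moves) if moves else 0.0

print(legality_rate(";1.e4 e5 2.Nf3 Nc6 3.Bb5 a6"))  # 1.0 for a fully legal line
```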
 
Intended Use
Primary Use Cases
- Chess Analysis: High-quality position evaluation and move suggestion
 - Research: Studying emergent reasoning in language models
 - Education: Chess learning and pattern recognition tools
 - AI Development: Baseline for chess AI systems
 
Limitations
- Specialized for chess gameplay only
 - Limited to standard chess rules and notation
 - Requires proper tokenization format
 - GPU recommended for practical inference
 - May not generalize beyond chess domain
 
Alternative Model Variants
For Different Use Cases:
- Fast Inference: Use small-8 for minimal resource requirements
 - Depth vs Width: Compare small-16/24/36 for layer depth ablations
 - Balanced Performance: Use medium-12 or medium-16 for mid-range applications
 - Maximum Performance: Use large-16 for best overall results
 - Human Behavior Studies: Use lichess model for human-like gameplay patterns
 
Computational Requirements:
- Small Models (8-36 layers): CPU inference possible, GPU recommended
 - Medium Models: GPU recommended for practical use
 - Large Model: Single high-end GPU required
 
Technical Implementation
Model Architecture
Based on GPT-2 with chess-specific adaptations:
- Vocabulary: Reduced to 32 chess-specific tokens
 - Context: Optimized for 1023-token chess game sequences
- Training: Custom data loading for chess game blocks (one possible block-to-example mapping is sketched after this list)
 - Framework: Built on NanoGPT for simplicity and efficiency
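The 1023-token context pairs naturally with the 1024-character blocks: one plausible reading (an assumption, not stated in the card) is that each block supplies a single next-character-prediction example, with the target shifted by one position.

```python
import torch

# ids: one encoded 1024-character block (see the preparation sketch earlier).
# Placeholder values stand in for real token ids; only the shapes matter here.
ids = list(range(1024))
x = torch.tensor(ids[:-1])   # 1023 input tokens
y = torch.tensor(ids[1:])    # 1023 targets: the same block shifted left by one character
assert x.shape == y.shape == (1023,)
```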
 
Training Insights
- Convergence: Smooth training curves across all scales
 - Memory Efficiency: Optimized for multi-GPU training
 - Data Processing: Custom tokenization preserving chess structure
 - Evaluation: Chess-specific validation metrics
 
Ethical Considerations
- Models trained exclusively on chess data pose minimal ethical risks
 - No personal data or sensitive information in training datasets
 - Intended for educational, research, and recreational purposes
 - Computational requirements may limit accessibility
 - Models do not generalize beyond chess domain
 
Citation
If you use ChessGPT in your research, please cite:
@misc{chessgpt,
  title={ChessGPT-2},
  author={[Your Name]},
  year={2024},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/[your-username]/chessgpt-2}
}
@dataset{chess_games_dataset,
  title={Chess Games Dataset},
  author={Adam Karvonen},
  year={2024},
  url={https://huggingface.co/datasets/adamkarvonen/chess_games}
}
References
- NanoGPT: karpathy/nanoGPT
 - Chess Dataset: @adamkarvonen/chess_games
 - GPT-2 Paper: Radford et al., 2019
 - Scaling Laws: Kaplan et al., 2020
 
License
[MIT, Apache 2.0]