# GatorGPT2
GatorGPT2 is a small, decoder-only Transformer trained from scratch on a subset of TinyStories for next-token prediction.
It uses RoPE (rotary positional embeddings), GQA (grouped-query attention), RMSNorm, and a SwiGLU MLP.
The tokenizer is tiktoken with the p50k_base encoding.

Repo: `kunjcr2/GatorGPT2`

Intended use: research, experimentation, and educational demos for training/serving custom LMs.
## Architecture
- Type: Decoder-only, causal LM
- Layers: `num_hidden_layers = 10`
- Hidden size: `hidden_size = 448`
- Heads: `num_attention_heads = 8` (GQA with 2 KV heads per query group)
- FFN: SwiGLU, `d_ff ≈ 2 × hidden_size`
- Norm: RMSNorm (pre-norm blocks)
- Positional: RoPE
- Vocab: `vocab_size = 50257` (tiktoken p50k_base)
- Context length: `max_position_embeddings = 1024`
- Weight tying: output head tied with token embeddings
- Files:
  - `pytorch_model.bin` (or `model.safetensors`)
  - `config.json` (`model_type: "gator-transformer"`, `auto_map` provided)
  - `modeling_gator.py`, `configuration_gator.py`, `__init__.py`
  - `tokenizer_manifest.json` → `{ "library": "tiktoken", "encoding": "p50k_base" }`

Custom code is loaded via `trust_remote_code=True`.
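To confirm these hyperparameters on the published checkpoint, you can load the config through the same custom code path. A minimal sketch, assuming the attribute names match the list above (adjust if `GatorConfig` names them differently):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("kunjcr2/GatorGPT2", trust_remote_code=True)
print(cfg.model_type)               # "gator-transformer"
print(cfg.num_hidden_layers)        # 10
print(cfg.hidden_size)              # 448
print(cfg.num_attention_heads)      # 8
print(cfg.max_position_embeddings)  # 1024
```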
## Install

```bash
pip install torch transformers tiktoken
```
## Quickstart (Transformers + tiktoken)
```python
import torch
from transformers import AutoModelForCausalLM
import tiktoken

MODEL_ID = "kunjcr2/GatorGPT2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load model (uses custom modeling code)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float32,
).to(DEVICE).eval()

# Tokenizer (p50k_base via tiktoken)
tok = tiktoken.get_encoding("p50k_base")

def generate_greedy(prompt: str, max_new_tokens: int = 64) -> str:
    # Greedy decoding without a KV cache: re-runs the full forward pass each step.
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(x)
        logits = out["logits"] if isinstance(out, dict) else out.logits
        next_id = int(torch.argmax(logits[0, -1]))
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()

print(generate_greedy("Little girl was"))
```
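The loop above always generates `max_new_tokens` tokens and strips `<|endoftext|>` afterwards. If you prefer to stop as soon as the end-of-text token is produced, a minimal variant looks like this (it assumes tiktoken's `eot_token` attribute, which is 50256 for p50k_base):

```python
EOT_ID = tok.eot_token  # <|endoftext|> id (50256 for p50k_base)

def generate_greedy_stop(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(x).logits
        next_id = int(torch.argmax(logits[0, -1]))
        if next_id == EOT_ID:  # stop at end-of-text instead of using the full budget
            break
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).strip()
```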
### Temperature-only sampling (no top-k/top-p)

```python
def generate_temp(prompt: str, max_new_tokens: int = 64, temperature: float = 0.9) -> str:
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(x).logits[0, -1] / max(temperature, 1e-6)
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item()
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()
```
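For example, reusing the model and tokenizer loaded in the quickstart:

```python
print(generate_temp("Little girl was", max_new_tokens=64, temperature=0.9))
```

Lower temperatures push the samples toward greedy decoding; higher temperatures increase diversity (and incoherence).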
## Serving with vLLM (Optional)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model kunjcr2/GatorGPT2 \
  --tokenizer kunjcr2/GatorGPT2 \
  --trust-remote-code \
  --dtype float32 \
  --max-model-len 1024 \
  --host 0.0.0.0 --port 8000
```

Call it:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kunjcr2/GatorGPT2","prompt":"Little girl was","max_tokens":64,"temperature":0.9}'
```
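Because vLLM exposes an OpenAI-compatible API, you can call the same endpoint from Python. A minimal sketch using `requests`, assuming the server above is running on `localhost:8000`:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "kunjcr2/GatorGPT2",
        "prompt": "Little girl was",
        "max_tokens": 64,
        "temperature": 0.9,
    },
    timeout=60,
)
resp.raise_for_status()
# The completions endpoint returns the generated text under choices[0].text.
print(resp.json()["choices"][0]["text"])
```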
## Training Summary
- Data: `roneneldan/TinyStories` (train split; subset of ~1.5M stories)
- Objective: causal LM (next-token prediction), cross-entropy loss
- Optimizer: AdamW (`lr=3e-4`, `weight_decay=0.01`, `eps=1e-8`)
- Precision: bf16 autocast on CUDA during the forward pass for speed
- Batching: sliding windows via a `FastDataset` (window size e.g. 512, stride 256); see the sketch below
- Eval: periodic validation over fixed batches; train loss downsampled to eval steps for plotting
- Hardware: intended for A100-class GPUs; also runs on CPU for debugging (slow)

This is a from-scratch toy/educational model; quality depends heavily on training steps, data cleaning, and schedule. Expect simple, short English generations.
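The repo's `FastDataset` itself is not reproduced here; as an illustration only, a sliding-window dataset over a single token stream (window 512, stride 256) could be sketched like this. Class and variable names are hypothetical, not the actual implementation:

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Illustrative sliding-window dataset over one long token stream.

    Each item is an (input_ids, target_ids) pair, with targets shifted by
    one position to match the next-token-prediction objective.
    """

    def __init__(self, token_ids, window=512, stride=256):
        self.ids = torch.tensor(token_ids, dtype=torch.long)
        self.window = window
        self.starts = list(range(0, len(self.ids) - window - 1, stride))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        s = self.starts[i]
        chunk = self.ids[s : s + self.window + 1]
        return chunk[:-1], chunk[1:]  # inputs, next-token targets
```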
## Intended Use
- Research on small decoder-only Transformers
- Educational demos (training, saving, pushing to the model hub, vLLM serving)
- Baseline for experimenting with:
  - LoRA/QLoRA, quantization (see the sketch below), distillation
  - Attention variants (FlashAttention, GQA configs)
  - Data curation and scaling laws

Not intended for production or safety-critical use.
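As one example from the list above, PyTorch's dynamic int8 quantization of the linear layers is a quick CPU-only experiment. A sketch, assuming the custom model uses standard `nn.Linear` modules; quality and speed are not validated for this checkpoint:

```python
import torch

# Dynamic int8 quantization of nn.Linear weights for CPU inference.
# Note: this moves `model` to CPU, so the GPU generation helpers above
# would need DEVICE = "cpu" afterwards.
cpu_model = model.to("cpu").eval()
quantized = torch.ao.quantization.quantize_dynamic(
    cpu_model, {torch.nn.Linear}, dtype=torch.qint8
)

ids = tok.encode("Little girl was")
with torch.no_grad():
    out = quantized(torch.tensor([ids]))
logits = out["logits"] if isinstance(out, dict) else out.logits
print(logits.shape)  # (1, prompt_len, vocab_size)
```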
## Limitations & Risks
- Trained on children's story data → limited world knowledge & reasoning
- May output incoherent, repetitive, or undesirable text
- No instruction tuning or RLHF
- Tokenizer is tiktoken `p50k_base` (not a standard HF tokenizer), so the examples use `tiktoken` directly
## Repo Structure
```text
.
├── config.json
├── pytorch_model.bin        # or model.safetensors
├── modeling_gator.py        # custom architecture (RoPE, GQA, RMSNorm, SwiGLU)
├── configuration_gator.py
├── __init__.py
└── tokenizer_manifest.json  # { "library": "tiktoken", "encoding": "p50k_base" }
```
`config.json` includes:
```json
{
  "model_type": "gator-transformer",
  "architectures": ["GatorModel"],
  "auto_map": {
    "AutoConfig": "configuration_gator.GatorConfig",
    "AutoModelForCausalLM": "modeling_gator.GatorModel"
  }
}
```
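Because the tokenizer is described only by `tokenizer_manifest.json` rather than standard tokenizer files, client code can resolve it at load time. A minimal sketch using `huggingface_hub` (installed alongside `transformers`) to fetch the manifest:

```python
import json

import tiktoken
from huggingface_hub import hf_hub_download

manifest_path = hf_hub_download("kunjcr2/GatorGPT2", "tokenizer_manifest.json")
with open(manifest_path) as f:
    manifest = json.load(f)

assert manifest["library"] == "tiktoken"
tok = tiktoken.get_encoding(manifest["encoding"])  # "p50k_base"
```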
## Evaluation
No formal benchmarks reported. You can compute loss/perplexity on your own validation subset:
```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# ...build a DataLoader of (input_ids, target_ids) pairs...

def eval_loss(model, loader, device="cuda"):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x).logits
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1)
            )
            total += loss.item()
            n += 1
    return total / max(n, 1)

val_loss = eval_loss(model, your_val_loader)
print("val loss:", val_loss, " ppl:", math.exp(val_loss))
```
## License
apache-2.0
## Acknowledgements

- TinyStories dataset by Ronen Eldan et al. (`roneneldan/TinyStories`)
- Community tooling: PyTorch, Hugging Face Transformers, tiktoken, vLLM
## Citation

If you use this model, please cite this repository:

```bibtex
@software{GatorGPT2_2025,
  author = {Kunj},
  title  = {GatorGPT2: a small decoder-only Transformer with RoPE+GQA},
  year   = {2025},
  url    = {https://huggingface.co/kunjcr2/GatorGPT2}
}
```