🐊 GatorGPT2

GatorGPT2 is a small, decoder-only Transformer trained from scratch on a subset of TinyStories for next-token prediction.
It uses RoPE (rotary positional embeddings), GQA (grouped-query attention), RMSNorm, and a SwiGLU MLP.
The tokenizer is tiktoken's p50k_base encoding.

Repo: kunjcr2/GatorGPT2
Intended use: research, experimentation, educational demos for training/serving custom LMs


πŸ”§ Architecture

  • Type: Decoder-only, causal LM
  • Layers: num_hidden_layers = 10
  • Hidden size: hidden_size = 448
  • Heads: num_attention_heads = 8 query heads (GQA with 2 KV heads shared across the query heads)
  • FFN: SwiGLU, d_ff β‰ˆ 2Γ— hidden_size
  • Norm: RMSNorm (pre-norm blocks)
  • Positional: RoPE
  • Vocab: vocab_size = 50,257 (tiktoken p50k_base)
  • Context length: max_position_embeddings = 1024
  • Weight tying: output head tied with token embeddings
  • Files:
    • pytorch_model.bin (or model.safetensors)
    • config.json (model_type: "gator-transformer", auto_map provided)
    • modeling_gator.py, configuration_gator.py, __init__.py
    • tokenizer_manifest.json β†’ { "library": "tiktoken", "encoding": "p50k_base" }

Custom code is loaded via trust_remote_code=True.
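
To make the attention shapes concrete, here is a minimal, illustrative sketch of the GQA arithmetic implied by the numbers above (8 query heads, 2 KV heads, head_dim = 448 / 8 = 56). It is not the code in modeling_gator.py, only the shape logic:

import torch
import torch.nn.functional as F

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 56, 16
group = n_q_heads // n_kv_heads                 # 4 query heads share each KV head

q = torch.randn(1, n_q_heads, seq, head_dim)    # stand-in for the projected queries
k = torch.randn(1, n_kv_heads, seq, head_dim)   # fewer KV heads than query heads
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Repeat each KV head so every query head in a group attends to the same K/V
k = k.repeat_interleave(group, dim=1)           # (1, 8, 16, 56)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                                # torch.Size([1, 8, 16, 56])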


πŸ“¦ Install

pip install torch transformers tiktoken

πŸš€ Quickstart (Transformers + tiktoken)

import torch
from transformers import AutoModelForCausalLM
import tiktoken

MODEL_ID = "kunjcr2/GatorGPT2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load model (uses custom modeling code)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float32,
).to(DEVICE).eval()

# Tokenizer (p50k_base via tiktoken)
tok = tiktoken.get_encoding("p50k_base")

def generate_greedy(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(x)
        logits = out["logits"] if isinstance(out, dict) else out.logits
        next_id = int(torch.argmax(logits[0, -1]))
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()

print(generate_greedy("Little girl was"))
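
Note that this loop re-feeds the full sequence on every step (there is no KV cache in the snippet), so generation cost grows roughly quadratically with length; that is fine for short TinyStories-style prompts.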

Temperature-only sampling (no top-k/p)

def generate_temp(prompt, max_new_tokens=64, temperature=0.9):
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(x).logits[0, -1] / max(temperature, 1e-6)
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item()
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()
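
If you do want top-k filtering, a straightforward variant (not part of the published examples, just a sketch building on the helpers above) is:

def generate_topk(prompt, max_new_tokens=64, temperature=0.9, k=50):
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(x).logits[0, -1] / max(temperature, 1e-6)
        topk_vals, topk_idx = torch.topk(logits, k)   # keep only the k most likely tokens
        probs = torch.softmax(topk_vals, dim=-1)
        next_id = topk_idx[torch.multinomial(probs, 1)].item()
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()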

🌐 Serving with vLLM (Optional)

python -m vllm.entrypoints.openai.api_server \
  --model kunjcr2/GatorGPT2 \
  --tokenizer kunjcr2/GatorGPT2 \
  --trust-remote-code \
  --dtype float32 \
  --max-model-len 1024 \
  --host 0.0.0.0 --port 8000

Call it:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kunjcr2/GatorGPT2","prompt":"Little girl was","max_tokens":64,"temperature":0.9}'

πŸ§ͺ Training Summary

  • Data: roneneldan/TinyStories (train split; subset of ~1.5M stories)
  • Objective: causal LM (next-token prediction), cross-entropy
  • Optimizer: AdamW (lr=3e-4, weight_decay=0.01, eps=1e-8)
  • Precision: bf16 autocast on CUDA during forward for speed
  • Batching: sliding windows via a FastDataset (window size e.g. 512, stride 256); see the sketch below
  • Eval: periodic validation over fixed batches; train loss downsampled to eval steps for plotting
  • Hardware: intended for A100-class GPUs; also runs on CPU for debug (slow)

This is a from-scratch toy/educational model; quality depends heavily on the number of training steps, the data cleaning, and the learning-rate schedule. Expect simple, short English generations.
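
As an illustration of the sliding-window batching above: FastDataset itself is not published in this repo, but a minimal stand-in with the same windowing idea looks like this (class name and details are hypothetical):

import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Hypothetical stand-in for FastDataset: fixed-size windows over one long token stream."""
    def __init__(self, token_ids, window=512, stride=256):
        self.ids = torch.tensor(token_ids, dtype=torch.long)
        self.window = window
        # Each window needs window + 1 tokens so that targets are inputs shifted by one.
        self.starts = list(range(0, len(self.ids) - window - 1, stride))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        s = self.starts[i]
        chunk = self.ids[s : s + self.window + 1]
        return chunk[:-1], chunk[1:]    # (input_ids, target_ids)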


βœ… Intended Use

  • Research on small decoder-only Transformers
  • Educational demos (training, saving, model hub, vLLM serving)
  • Baseline for experimenting with:
    • LoRA/QLoRA, quantization, distillation
    • Attention variants (Flash-Attention, GQA configs)
    • Data curation and scaling laws

Not intended for production or safety-critical use.


⚠️ Limitations & Risks

  • Trained on children’s story data β‡’ limited world knowledge & reasoning
  • May output incoherent, repetitive, or undesirable text
  • No instruction-tuning or RLHF
  • Tokenizer is tiktoken p50k_base (not a standard HF tokenizer), so examples use tiktoken directly

πŸ“ Repo Structure

.
β”œβ”€β”€ config.json
β”œβ”€β”€ pytorch_model.bin        # or model.safetensors
β”œβ”€β”€ modeling_gator.py        # custom architecture (RoPE, GQA, RMSNorm, SwiGLU)
β”œβ”€β”€ configuration_gator.py
β”œβ”€β”€ __init__.py
└── tokenizer_manifest.json  # { "library": "tiktoken", "encoding": "p50k_base" }

config.json includes:

{
  "model_type": "gator-transformer",
  "architectures": ["GatorModel"],
  "auto_map": {
    "AutoConfig": "configuration_gator.GatorConfig",
    "AutoModelForCausalLM": "modeling_gator.GatorModel"
  }
}
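
For reference, a configuration class wired into that auto_map could look like the following sketch; the field names mirror the Architecture section, but the actual configuration_gator.py may differ:

from transformers import PretrainedConfig

class GatorConfig(PretrainedConfig):
    model_type = "gator-transformer"

    def __init__(
        self,
        vocab_size=50257,
        hidden_size=448,
        num_hidden_layers=10,
        num_attention_heads=8,
        num_key_value_heads=2,       # assumption: GQA KV-head count
        max_position_embeddings=1024,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.max_position_embeddings = max_position_embeddings
        super().__init__(**kwargs)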

πŸ“Š Evaluation

No formal benchmarks reported. You can compute loss/perplexity on your own validation subset:

import math, torch
from torch.utils.data import DataLoader, TensorDataset

# ...build a DataLoader of (input_ids, target_ids) pairs, where targets are inputs shifted by one token...
def eval_loss(model, loader, device="cuda"):
    model.eval(); total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x).logits
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1)
            )
            total += loss.item(); n += 1
    return total / max(n,1)

val_loss = eval_loss(model, your_val_loader)
print("val loss:", val_loss, "  ppl:", math.exp(val_loss))

πŸ“œ License

apache-2.0


πŸ™Œ Acknowledgements

  • TinyStories dataset by Ronen Eldan et al. (roneneldan/TinyStories)
  • Community tooling: PyTorch, πŸ€— Transformers, tiktoken, vLLM

βœ‰οΈ Citation

If you use this model, please cite this repository:

@software{GatorGPT2_2025,
  author = {Kunj},
  title = {GatorGPT2: a small decoder-only Transformer with RoPE+GQA},
  year = {2025},
  url = {https://huggingface.co/kunjcr2/GatorGPT2}
}