AGILLM2-fast-training · 5L.py
Autoregressive (AR-only) single-file trainer/decoder using the Qwen3 tokenizer
Repo: https://huggingface.co/OpenTransformer/AGILLM2-fast-training
Org: https://huggingface.co/OpenTransformer
Contact: [email protected]
Overview
5L.py is a single-file PyTorch training and inference script for language models with:
- AR-only training/decoding
- Qwen3 tokenizer by default (override via TOKENIZER_ID)
- Progressive block growth, AMP/FP8 autocast, OOM backoff
- Time-based checkpointing only (monotonic, resume-safe)
- Sampling controls: top-k/top-p/min-p, greedy, repetition/presence/frequency penalties, no-repeat-ngrams
- Chinchilla-style target token estimator using all enabled params (core + AR head)
The goal is minimal surface area with production-lean features so you can train quickly, resume safely, and decode reliably on commodity GPUs or cloud nodes.
Features
- Presets: small, smallx2, base
- Attention: Low-rank MHA with ALiBi relative bias (see the sketch after this list)
- Determinism helpers: seed management, checkpoint metadata (RNG states)
- Tokenizer safety: adds [PAD] if missing; handles EOS fallbacks
- Streaming data: uses datasets streaming for large corpora
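The ALiBi bias mentioned in the attention bullet is a per-head linear penalty on key distance added to the attention logits. The sketch below shows the standard formulation with the usual power-of-two head-count slopes; it is an illustration, not the exact code from 5L.py, and the function name is ours.

# Illustrative ALiBi bias (not the exact 5L.py code): each head gets a fixed
# geometric slope, and attention logits are penalized linearly with key distance.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # slopes 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper (exact for power-of-two n_heads)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).float()      # j - i, non-positive for causal keys
    return slopes[:, None, None] * distance[None, :, :]   # (heads, seq, seq), added to scaled QK^T logits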
Requirements
- Python 3.10+
- PyTorch 2.2+ (CUDA build if using NVIDIA GPUs)
- transformers, datasets, tqdm
- CUDA-capable GPU recommended; the script also runs CPU-only for smoke tests
Install:
pip install torch --index-url https://download.pytorch.org/whl/cu121 # pick your CUDA/CPU wheel
pip install transformers datasets tqdm
Quick start
1) Set tokenizer (optional)
Default is Qwen3:
export TOKENIZER_ID="Qwen/Qwen3-235B-A22B-Thinking-2507"
Use any compatible tokenizer:
export TOKENIZER_ID="qwen/qwen2.5-7b"
2) Train
Minimal example on SlimPajama (streaming):
python 5L.py train \
--preset small \
--source cerebras/SlimPajama-627B \
--amp \
--save_dir ckpts_joint \
--save_every_sec 7200
Targets and steps:
# Let script compute Chinchilla-style target tokens automatically
python 5L.py train --preset small --amp
# Or cap by steps
python 5L.py train --preset small --steps 20000 --amp
Warm start / resume:
# Warm-start from a prior final.pt (shape-safe copy of matching tensors)
python 5L.py train --preset small --warmstart_from ckpts_joint/final.pt
# Full resume (optimizer, scaler, seen tokens, timers)
python 5L.py train --resume ckpts_joint/step00050000.pt
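For reference, a shape-safe warm start usually amounts to copying only the tensors whose names and shapes still match between the old checkpoint and the new model. The sketch below illustrates that idea; the checkpoint key layout and function name are assumptions, and 5L.py's internals may differ.

# Hedged sketch of a shape-safe warm start (checkpoint key names are assumptions).
import torch

def warmstart(model: torch.nn.Module, ckpt_path: str) -> None:
    state = torch.load(ckpt_path, map_location="cpu")
    src = state.get("model", state)                        # assumed checkpoint layout
    dst = model.state_dict()
    copied = {k: v for k, v in src.items() if k in dst and dst[k].shape == v.shape}
    dst.update(copied)
    model.load_state_dict(dst)
    print(f"warm-started {len(copied)}/{len(dst)} tensors")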
Progressive block growth:
python 5L.py train \
--preset small \
--auto_grow \
--grow_plan "576,640,768,896,1024" \
--grow_every_steps 50000
FP8 fast path:
# Try FP8; if not supported, fall back to bf16
python 5L.py train --preset small --fp8-only --fp8-fallback
3) Inference
python 5L.py infer \
--mode ar \
--ckpt ckpts_joint/final.pt \
--preset small \
--prompt "Explain ALiBi in simple terms." \
--max_new 120 \
--top_p 0.9 --top_k 50 \
--repetition_penalty 1.1 \
--no_repeat_ngram_size 3
Greedy decode:
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
--prompt "What is progressive block growth in training?" --greedy --max_new 80
FP8 during decode (if supported):
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
--prompt "Summarize transformer attention variants." --fp8-only --fp8-fallback
Presets
small : d=512, layers=8, heads=16, rank=64
smallx2 : d=512, layers=16, heads=16, rank=64
base : d=768, layers=12, heads=24, rank=96
Use --x2 during training to double the layer count of an inferred previous config.
Checkpointing & Resume
- Saves only by time interval (--save_every_sec, default 24h) to avoid step-based drift (sketched below).
- final.pt includes: core, AR head, optimizer, AMP scaler, cfg, RNG states, and metadata.
- Resume with --resume <path> to restore optimizer/scaler/wall-clock cadence.
- Warm start only copies shape-matched tensors (safe if your topology changed).
Artifacts:
- ckpts_joint/stepXXXXXXXX.pt
- ckpts_joint/latest.json with the canonical latest path and step
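The time-based cadence and latest.json pointer can be pictured with the short sketch below; the JSON field names and state layout are assumptions, not necessarily what 5L.py writes.

# Sketch of time-interval checkpointing with a latest.json pointer (field names assumed).
import json, os, time, torch

def maybe_save(state: dict, step: int, save_dir: str, every_sec: float, last_save: float) -> float:
    if time.monotonic() - last_save < every_sec:
        return last_save                                   # not due yet
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"step{step:08d}.pt")
    torch.save(state, path)
    with open(os.path.join(save_dir, "latest.json"), "w") as f:
        json.dump({"path": path, "step": step}, f)
    return time.monotonic()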
Data
Default streaming dataset: cerebras/SlimPajama-627B (train split, streaming enabled). Replace --source with any datasets-compatible corpus that yields {"text": ...}.
EOS handling: if the tokenizer's eos_token_id is missing, the script falls back to sep_token_id; if a sample doesn't end with EOS, one is appended.
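A minimal sketch of that behavior with a Hugging Face tokenizer is shown below; the exact logic in 5L.py may differ.

# Sketch of the EOS/PAD handling described above (assumes a Hugging Face tokenizer).
import os
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(os.environ.get("TOKENIZER_ID", "Qwen/Qwen3-235B-A22B-Thinking-2507"))
if tok.pad_token is None:
    tok.add_special_tokens({"pad_token": "[PAD]"})         # the [PAD] safety noted under Features
eos_id = tok.eos_token_id if tok.eos_token_id is not None else tok.sep_token_id

ids = tok("example training text")["input_ids"]
if eos_id is not None and (not ids or ids[-1] != eos_id):
    ids.append(eos_id)                                      # ensure every sample ends with EOS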
Sampling controls
- --temperature, --top_k, --top_p, --min_p
- --repetition_penalty, --presence_penalty, --frequency_penalty, --penalty_last_n
- --no_repeat_ngram_size
Greedy mode (--greedy) overrides sampling.
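The filters combine in the usual way: mask disallowed logits to -inf, then sample (or take the argmax with --greedy). The sketch below shows a common top-k/top-p/min-p implementation, not necessarily 5L.py's exact code; the penalty flags are omitted for brevity.

# Common top-k / top-p / min-p logit filtering (illustrative, not the script's exact code).
import torch

def filter_logits(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9, min_p: float = 0.0) -> torch.Tensor:
    logits = logits.clone()
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits[logits < kth] = float("-inf")                # keep only the k largest logits
    probs = torch.softmax(logits, dim=-1)
    if min_p > 0.0:
        logits[probs < min_p * probs.max(dim=-1, keepdim=True).values] = float("-inf")
    if top_p < 1.0:
        sorted_probs, idx = torch.sort(probs, descending=True, dim=-1)
        drop_sorted = (sorted_probs.cumsum(-1) - sorted_probs) > top_p   # keep mass up to top_p
        logits[drop_sorted.gather(-1, idx.argsort(-1))] = float("-inf")  # map mask back to vocab order
    return logits

# next_id = torch.multinomial(torch.softmax(filter_logits(logits) / temperature, -1), 1)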
FP8 / AMP
- --fp8-only attempts float8_e4m3fn autocast (see the sketch below)
- --fp8-fallback continues with bf16 if FP8 is unsupported
- Otherwise, use --amp for bf16/fp16 autocast
- torch.backends.cuda.matmul.allow_tf32 = True is enabled when available
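How those flags could map to an autocast dtype is sketched below. The FP8 probe is an assumption: current CUDA autocast may reject float8 dtypes, which is exactly the case --fp8-fallback is meant to handle.

# Hedged sketch of precision selection (not the script's exact code).
import torch

if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True           # TF32 matmul, as noted above

def autocast_dtype(fp8_only: bool, fp8_fallback: bool, amp: bool) -> torch.dtype:
    if fp8_only and hasattr(torch, "float8_e4m3fn"):
        return torch.float8_e4m3fn                          # attempted first
    if fp8_only and not fp8_fallback:
        raise RuntimeError("FP8 requested but unsupported on this build/GPU")
    if fp8_only or amp:
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return torch.float32

# usage: with torch.autocast("cuda", dtype=autocast_dtype(args.fp8_only, args.fp8_fallback, args.amp)): ...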
OOM backoff & block growth
- On CUDA OOM, the script halves BLOCK (down to 128), empties the CUDA cache, and retries the step (see the sketch below).
- With --auto_grow, the script periodically attempts to increase BLOCK along your --grow_plan.
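A minimal sketch of that retry loop, with names chosen for illustration:

# Illustrative OOM backoff: halve the block length (floor 128), clear cache, retry.
import torch

def step_with_backoff(step_fn, block: int, min_block: int = 128) -> int:
    while True:
        try:
            step_fn(block)                                  # run one training step at this block length
            return block
        except torch.cuda.OutOfMemoryError:
            if block <= min_block:
                raise                                       # nothing left to shrink
            block = max(min_block, block // 2)
            torch.cuda.empty_cache()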
Token targets (Chinchilla-style)
If --target_tokens is unspecified, the script computes 25 × (enabled parameters) using all trainable params (core + AR head). This provides a rough target for total tokens to consume.
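Concretely, the rule amounts to counting trainable parameters across the core model and AR head and multiplying by 25; a hedged sketch (module names are ours):

# Sketch of the 25 x parameters token target described above.
import torch

def target_tokens(core: torch.nn.Module, ar_head: torch.nn.Module) -> int:
    n_params = sum(p.numel() for m in (core, ar_head) for p in m.parameters() if p.requires_grad)
    return 25 * n_params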
Repro tips
- Pin a specific tokenizer via TOKENIZER_ID
- Log your --preset, --block, and --grow_plan
- Keep --save_every_sec stable between resumes for monotonic cadence
- Record CUDA/cuDNN versions in your run logs for reproducibility
Limitations
- AR-only trainer (no encoder-decoder, no multimodal)
- Low-rank MHA path; FlashAttention not included
- Single-GPU by default; multi-GPU DDP not wired in this file
- Safety/guardrails are out of scope here (this is a trainer, not a hosted chat product)
Roadmap (planned)
- Optional DDP with NCCL/RCCL/HCCL backends
- FlashAttention path when available across vendors
- Export helpers (Safetensors, GGUF) for downstream serving
Responsible Use
- Ensure your dataset usage complies with its license and applicable laws.
- Models trained with this script can generate incorrect or biased outputs. Evaluate and align according to your deployment requirements.
Citation
If this script or training pipeline helps your work, consider citing the repo:
@software{OpenTransformer_AGILLM2_fast_training_2025,
title = {AGILLM2-fast-training: Single-file AR-only trainer/decoder (5L.py)},
author = {OpenTransformers},
year = {2025},
url = {https://huggingface.co/OpenTransformer/AGILLM2-fast-training}
}
Support / Contracts
We provide custom development and end-to-end training services (data prep → training → evaluation → deployment).
Email: [email protected]
Org page: https://huggingface.co/OpenTransformer