
AGILLM2-fast-training · 5L.py

Autoregressive (AR-only) single-file trainer/decoder using the Qwen3 tokenizer

Repo: https://huggingface.co/OpenTransformer/AGILLM2-fast-training
Org: https://huggingface.co/OpenTransformer
Contact: [email protected]

Overview

5L.py is a single-file PyTorch training and inference script for language models with:

  • AR-only training/decoding
  • Qwen3 tokenizer by default (override via TOKENIZER_ID)
  • Progressive block growth, AMP/FP8 autocast, OOM backoff
  • Time-based checkpointing only (monotonic, resume-safe)
  • Sampling controls: top-k/top-p/min-p, greedy, repetition/presence/frequency penalties, no-repeat-ngrams
  • Chinchilla-style target token estimator using all enabled params (core + AR head)

The goal is minimal surface area with production-lean features so you can train quickly, resume safely, and decode reliably on commodity GPUs or cloud nodes.

Features

  • Presets: small, smallx2, base
  • Attention: Low-rank MHA with ALiBi relative bias (see the sketch after this list)
  • Determinism helpers: seed management, checkpoint metadata (RNG states)
  • Tokenizer safety: adds [PAD] if missing; handles EOS fallbacks
  • Streaming data: uses datasets streaming for large corpora
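
For orientation, the ALiBi bias is a per-head linear penalty on token distance added to the attention logits. A minimal sketch, assuming the standard power-of-two slope schedule and causal masking (5L.py's exact slope handling may differ):

import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper (exact for power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # element [i, j] = j - i, <= 0 under the causal mask
    return slopes[:, None, None] * distance[None]    # (heads, seq, seq); add to pre-softmax logits

Distant keys receive a larger negative bias, which is what lets ALiBi generalize to context lengths beyond those seen in training.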

Requirements

  • Python 3.10+
  • PyTorch 2.2+ (CUDA build if using NVIDIA GPUs)
  • transformers, datasets, tqdm
  • A CUDA-capable GPU is recommended; the script also runs CPU-only for smoke tests

Install:

pip install torch --index-url https://download.pytorch.org/whl/cu121  # pick your CUDA/CPU wheel
pip install transformers datasets tqdm

Quick start

1) Set tokenizer (optional)

Default is Qwen3:

export TOKENIZER_ID="Qwen/Qwen3-235B-A22B-Thinking-2507"

Use any compatible tokenizer:

export TOKENIZER_ID="qwen/qwen2.5-7b"
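
Internally, tokenizer selection amounts to reading TOKENIZER_ID with the Qwen3 default shown above; a minimal sketch (the variable names are illustrative):

import os
from transformers import AutoTokenizer

tokenizer_id = os.environ.get("TOKENIZER_ID", "Qwen/Qwen3-235B-A22B-Thinking-2507")
tok = AutoTokenizer.from_pretrained(tokenizer_id)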

2) Train

Minimal example on SlimPajama (streaming):

python 5L.py train \
  --preset small \
  --source cerebras/SlimPajama-627B \
  --amp \
  --save_dir ckpts_joint \
  --save_every_sec 7200

Targets and steps:

# Let the script compute the Chinchilla-style target token count automatically
python 5L.py train --preset small --amp

# Or cap by steps
python 5L.py train --preset small --steps 20000 --amp

Warm start / resume:

# Warm-start from a prior final.pt (shape-safe copy of matching tensors)
python 5L.py train --preset small --warmstart_from ckpts_joint/final.pt

# Full resume (optimizer, scaler, seen tokens, timers)
python 5L.py train --resume ckpts_joint/step00050000.pt
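
Warm start and full resume differ in scope: --resume restores the complete training state, while --warmstart_from only copies tensors whose names and shapes match the new model. A hedged sketch of that shape-safe copy (the checkpoint key "core" follows the layout described under Checkpointing & Resume; the helper name is illustrative):

import torch

def warmstart(model: torch.nn.Module, ckpt_path: str) -> int:
    """Copy shape-matched tensors from a prior checkpoint; return how many were copied."""
    old = torch.load(ckpt_path, map_location="cpu")["core"]
    new = model.state_dict()
    matched = {k: v for k, v in old.items() if k in new and new[k].shape == v.shape}
    model.load_state_dict(matched, strict=False)   # skip anything the new topology does not have
    return len(matched)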

Progressive block growth:

python 5L.py train \
  --preset small \
  --auto_grow \
  --grow_plan "576,640,768,896,1024" \
  --grow_every_steps 50000

FP8 fast path:

# Try FP8; if not supported, fall back to bf16
python 5L.py train --preset small --fp8-only --fp8-fallback

3) Inference

python 5L.py infer \
  --mode ar \
  --ckpt ckpts_joint/final.pt \
  --preset small \
  --prompt "Explain ALiBi in simple terms." \
  --max_new 120 \
  --top_p 0.9 --top_k 50 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3

Greedy decode:

python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "What is progressive block growth in training?" --greedy --max_new 80

FP8 during decode (if supported):

python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "Summarize transformer attention variants." --fp8-only --fp8-fallback

Presets

small    : d=512, layers=8,  heads=16, rank=64
smallx2  : d=512, layers=16, heads=16, rank=64
base     : d=768, layers=12, heads=24, rank=96

Use --x2 during training to double the layer count of an inferred previous config.
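
For reference, the table above corresponds to configurations along these lines (a hypothetical representation; the actual field names in 5L.py may differ):

# d_model / layers / heads / low-rank dimension per preset (values from the table above).
PRESETS = {
    "small":   dict(d_model=512, layers=8,  heads=16, rank=64),
    "smallx2": dict(d_model=512, layers=16, heads=16, rank=64),
    "base":    dict(d_model=768, layers=12, heads=24, rank=96),
}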

Checkpointing & Resume

  • Saves only by time interval (--save_every_sec, default 24h) to avoid step-based drift.
  • final.pt includes: core, AR head, optimizer, AMP scaler, cfg, RNG states, and metadata.
  • Resume with --resume <path> to restore optimizer/scaler/wall-clock cadence.
  • Warm start only copies shape-matched tensors (safe if your topology changed).

Artifacts:

  • ckpts_joint/stepXXXXXXXX.pt
  • ckpts_joint/latest.json recording the canonical latest checkpoint path and step (see the sketch below)
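
The cadence is wall-clock based rather than step based; a minimal sketch of the idea using time.monotonic() and the artifact names above (the state dictionary fields and function name are illustrative):

import json, os, time, torch

def maybe_save(state: dict, step: int, last_save: float,
               save_dir: str = "ckpts_joint", save_every_sec: float = 86400.0) -> float:
    """Save a checkpoint if save_every_sec has elapsed; return the updated last-save timestamp."""
    now = time.monotonic()
    if now - last_save < save_every_sec:
        return last_save
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"step{step:08d}.pt")
    torch.save(state, path)   # state holds core, AR head, optimizer, scaler, cfg, RNG states, metadata
    with open(os.path.join(save_dir, "latest.json"), "w") as f:
        json.dump({"path": path, "step": step}, f)
    return now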

Data

Default streaming dataset:

  • cerebras/SlimPajama-627B (train split, streaming enabled). Replace --source with any datasets-compatible corpus that yields {"text": ...}.

EOS handling: if the tokenizer's eos_token_id is missing, sep_token_id is used as a fallback; if a sample does not end with EOS, one is appended (see the sketch below).
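
A hedged sketch of the data path implied above: stream the corpus, tokenize the "text" field, and append EOS (falling back to sep_token_id) when a sample does not already end with it. Batching and block packing in 5L.py are more involved than shown here.

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")
if tok.pad_token is None:                    # tokenizer safety: add [PAD] if missing
    tok.add_special_tokens({"pad_token": "[PAD]"})
eos_id = tok.eos_token_id if tok.eos_token_id is not None else tok.sep_token_id

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for sample in stream:
    ids = tok(sample["text"], add_special_tokens=False)["input_ids"]
    if not ids or ids[-1] != eos_id:         # ensure every sample ends with EOS
        ids.append(eos_id)
    break                                    # sketch only: stop after one sample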

Sampling controls

  • --temperature, --top_k, --top_p, --min_p
  • --repetition_penalty, --presence_penalty, --frequency_penalty, --penalty_last_n
  • --no_repeat_ngram_size

Greedy mode (--greedy) overrides sampling.
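
These flags map to standard logit/probability filtering before the final draw; a condensed sketch of the top-k / top-p / min-p stage with the greedy override (repetition, presence, frequency, and no-repeat-ngram penalties are applied to the logits before this point; the filter order shown is an assumption):

import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 0,
                top_p: float = 1.0, min_p: float = 0.0, greedy: bool = False) -> int:
    if greedy:                                # --greedy overrides all sampling controls
        return int(logits.argmax())
    probs = torch.softmax(logits / max(temperature, 1e-8), dim=-1)
    if min_p > 0.0:                           # drop tokens far below the most likely one
        probs[probs < min_p * probs.max()] = 0.0
    if top_k > 0:                             # keep only the k most likely tokens
        probs[probs < torch.topk(probs, top_k).values[-1]] = 0.0
    if top_p < 1.0:                           # nucleus: smallest set whose mass reaches top_p
        sorted_p, idx = probs.sort(descending=True)
        keep = (sorted_p.cumsum(0) - sorted_p) < top_p
        mask = torch.zeros_like(probs, dtype=torch.bool)
        mask[idx[keep]] = True
        probs[~mask] = 0.0
    return int(torch.multinomial(probs / probs.sum(), 1))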

FP8 / AMP

  • --fp8-only attempts float8_e4m3fn autocast
  • --fp8-fallback continues with bf16 if FP8 is unsupported (see the sketch after this list)
  • Otherwise use --amp for bf16/fp16 autocast
  • torch.backends.cuda.matmul.allow_tf32=True is enabled when available
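
Roughly how the flags resolve to a dtype: prefer FP8 only when the PyTorch build exposes float8_e4m3fn and the GPU is new enough, otherwise fall back to bf16 (or fail loudly without --fp8-fallback). The capability check below is a hedged approximation, not 5L.py's exact logic:

import torch

def pick_autocast_dtype(fp8_only: bool, fp8_fallback: bool) -> torch.dtype:
    """Heuristic dtype selection mirroring the --fp8-only / --fp8-fallback behavior."""
    fp8_ok = (
        fp8_only
        and hasattr(torch, "float8_e4m3fn")               # dtype exists in this PyTorch build
        and torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (8, 9)  # Ada / Hopper class hardware
    )
    if fp8_ok:
        return torch.float8_e4m3fn
    if fp8_only and not fp8_fallback:
        raise RuntimeError("FP8 requested but unsupported; pass --fp8-fallback to continue in bf16")
    return torch.bfloat16

if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True          # TF32 matmuls, as noted above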

OOM backoff & block growth

  • On CUDA OOM, the script halves BLOCK (down to 128), empties cache, and retries the step.
  • With --auto_grow, the script periodically attempts to increase BLOCK along your --grow_plan (a sketch of both policies follows).
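
Both behaviors are thin control flow around the training step. A hedged sketch of the two policies, with names and defaults taken from the flags above (the real implementation in 5L.py is more involved):

import torch

def run_step_with_backoff(step_fn, block: int, min_block: int = 128) -> int:
    """Run one training step; on CUDA OOM, empty the cache, halve BLOCK (not below min_block), retry."""
    while True:
        try:
            step_fn(block)
            return block
        except torch.cuda.OutOfMemoryError:
            if block <= min_block:
                raise
            torch.cuda.empty_cache()
            block = max(min_block, block // 2)

def maybe_grow_block(step: int, block: int, grow_plan=(576, 640, 768, 896, 1024),
                     grow_every_steps: int = 50_000) -> int:
    """With --auto_grow, advance to the next planned BLOCK size every grow_every_steps steps."""
    if step > 0 and step % grow_every_steps == 0:
        larger = [b for b in grow_plan if b > block]
        if larger:
            return larger[0]
    return block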

Token targets (Chinchilla-style)

If --target_tokens is unspecified, the script computes 25 × (enabled parameters), counting all trainable parameters (core + AR head). This gives a rough target for the total number of tokens to consume.
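
Concretely, the estimate is just a trainable-parameter count multiplied by 25; for example, a 100M-parameter model (core + AR head) targets roughly 2.5B tokens. A one-line sketch of the count:

def target_tokens(core, ar_head, tokens_per_param: int = 25) -> int:
    """Chinchilla-style target: 25 x all trainable parameters across the core and the AR head."""
    return tokens_per_param * sum(p.numel() for m in (core, ar_head)
                                  for p in m.parameters() if p.requires_grad)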

Repro tips

  • Pin a specific tokenizer via TOKENIZER_ID
  • Log your --preset, --block, and --grow_plan
  • Keep --save_every_sec stable between resumes for monotonic cadence
  • Record CUDA/cuDNN versions in your run logs for reproducibility

Limitations

  • AR-only trainer (no encoder-decoder, no multimodal)
  • Low-rank MHA path; FlashAttention not included
  • Single-GPU by default; multi-GPU DDP not wired in this file
  • Safety/guardrails are out of scope here (this is a trainer, not a hosted chat product)

Roadmap (planned)

  • Optional DDP with NCCL/RCCL/HCCL backends
  • FlashAttention path when available across vendors
  • Export helpers (Safetensors, GGUF) for downstream serving

Responsible Use

  • Ensure your dataset usage complies with its license and applicable laws.
  • Models trained with this script can generate incorrect or biased outputs. Evaluate and align according to your deployment requirements.

Citation

If this script or training pipeline helps your work, consider citing the repo:

@software{OpenTransformer_AGILLM2_fast_training_2025,
  title   = {AGILLM2-fast-training: Single-file AR-only trainer/decoder (5L.py)},
  author  = {OpenTransformer},
  year    = {2025},
  url     = {https://huggingface.co/OpenTransformer/AGILLM2-fast-training}
}

Support / Contracts

We provide custom development and end-to-end training services (data prep → training → evaluation → deployment). Email: [email protected] Org page: https://huggingface.co/OpenTransformer
