
AGILLM2-fast-training · 5L.py

Autoregressive (AR-only) single-file trainer/decoder using the Qwen3 tokenizer

Repo: https://huggingface.co/OpenTransformer/AGILLM2-fast-training
Org: https://huggingface.co/OpenTransformer
Contact: [email protected]

Overview

5L.py is a single-file PyTorch training and inference script for language models with:

  • AR-only training/decoding
  • Qwen3 tokenizer by default (override via TOKENIZER_ID)
  • Progressive block growth, AMP/FP8 autocast, OOM backoff
  • Time-based checkpointing only (monotonic, resume-safe)
  • Sampling controls: top-k/top-p/min-p, greedy, repetition/presence/frequency penalties, no-repeat-ngrams
  • Chinchilla-style target token estimator using all enabled params (core + AR head)

The goal is minimal surface area with production-lean features so you can train quickly, resume safely, and decode reliably on commodity GPUs or cloud nodes.

Features

  • Presets: small, smallx2, base
  • Attention: Low-rank MHA with ALiBi relative bias (see the sketch after this list)
  • Determinism helpers: seed management, checkpoint metadata (RNG states)
  • Tokenizer safety: adds [PAD] if missing; handles EOS fallbacks
  • Streaming data: uses datasets streaming for large corpora
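
For orientation, the ALiBi bias is a per-head linear penalty on token distance added to the attention logits. A minimal sketch, assuming the standard power-of-two slope schedule and causal masking (5L.py's exact slope handling may differ):

import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper (exact for power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # element [i, j] = j - i, <= 0 under the causal mask
    return slopes[:, None, None] * distance[None]    # (heads, seq, seq); add to pre-softmax logits

Distant keys receive a larger negative bias, which is what lets ALiBi generalize to context lengths beyond those seen in training.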

Requirements

  • Python 3.10+
  • PyTorch 2.2+ (CUDA build if using NVIDIA GPUs)
  • transformers, datasets, tqdm
  • A CUDA-capable GPU is recommended; the script also runs CPU-only for smoke tests

Install:

pip install torch --index-url https://download.pytorch.org/whl/cu121  # pick your CUDA/CPU wheel
pip install transformers datasets tqdm

Quick start

1) Set tokenizer (optional)

Default is Qwen3:

export TOKENIZER_ID="Qwen/Qwen3-235B-A22B-Thinking-2507"

Use any compatible tokenizer:

export TOKENIZER_ID="qwen/qwen2.5-7b"
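
Internally, tokenizer selection amounts to reading TOKENIZER_ID with the Qwen3 default shown above; a minimal sketch (the variable names are illustrative):

import os
from transformers import AutoTokenizer

tokenizer_id = os.environ.get("TOKENIZER_ID", "Qwen/Qwen3-235B-A22B-Thinking-2507")
tok = AutoTokenizer.from_pretrained(tokenizer_id)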

2) Train

Minimal example on SlimPajama (streaming):

python 5L.py train \
  --preset small \
  --source cerebras/SlimPajama-627B \
  --amp \
  --save_dir ckpts_joint \
  --save_every_sec 7200

Targets and steps:

# Let the script compute the Chinchilla-style target token count automatically
python 5L.py train --preset small --amp

# Or cap by steps
python 5L.py train --preset small --steps 20000 --amp

Warm start / resume:

# Warm-start from a prior final.pt (shape-safe copy of matching tensors)
python 5L.py train --preset small --warmstart_from ckpts_joint/final.pt

# Full resume (optimizer, scaler, seen tokens, timers)
python 5L.py train --resume ckpts_joint/step00050000.pt
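
Warm start and full resume differ in scope: --resume restores the complete training state, while --warmstart_from only copies tensors whose names and shapes match the new model. A hedged sketch of that shape-safe copy (the checkpoint key "core" follows the layout described under Checkpointing & Resume; the helper name is illustrative):

import torch

def warmstart(model: torch.nn.Module, ckpt_path: str) -> int:
    """Copy shape-matched tensors from a prior checkpoint; return how many were copied."""
    old = torch.load(ckpt_path, map_location="cpu")["core"]
    new = model.state_dict()
    matched = {k: v for k, v in old.items() if k in new and new[k].shape == v.shape}
    model.load_state_dict(matched, strict=False)   # skip anything the new topology does not have
    return len(matched)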

Progressive block growth:

python 5L.py train \
  --preset small \
  --auto_grow \
  --grow_plan "576,640,768,896,1024" \
  --grow_every_steps 50000

FP8 fast path:

# Try FP8; if not supported, fall back to bf16
python 5L.py train --preset small --fp8-only --fp8-fallback

3) Inference

python 5L.py infer \
  --mode ar \
  --ckpt ckpts_joint/final.pt \
  --preset small \
  --prompt "Explain ALiBi in simple terms." \
  --max_new 120 \
  --top_p 0.9 --top_k 50 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3

Greedy decode:

python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "What is progressive block growth in training?" --greedy --max_new 80

FP8 during decode (if supported):

python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "Summarize transformer attention variants." --fp8-only --fp8-fallback

Presets

small    : d=512, layers=8,  heads=16, rank=64
smallx2  : d=512, layers=16, heads=16, rank=64
base     : d=768, layers=12, heads=24, rank=96

Use --x2 during training to double the layer count of an inferred previous config.
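
For reference, the table above corresponds to configurations along these lines (a hypothetical representation; the actual field names in 5L.py may differ):

# d_model / layers / heads / low-rank dimension per preset (values from the table above).
PRESETS = {
    "small":   dict(d_model=512, layers=8,  heads=16, rank=64),
    "smallx2": dict(d_model=512, layers=16, heads=16, rank=64),
    "base":    dict(d_model=768, layers=12, heads=24, rank=96),
}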

Checkpointing & Resume

  • Saves only by time interval (--save_every_sec, default 24h) to avoid step-based drift.
  • final.pt includes: core, AR head, optimizer, AMP scaler, cfg, RNG states, and metadata.
  • Resume with --resume <path> to restore optimizer/scaler/wall-clock cadence.
  • Warm start only copies shape-matched tensors (safe if your topology changed).

Artifacts:

  • ckpts_joint/stepXXXXXXXX.pt
  • ckpts_joint/latest.json recording the canonical latest checkpoint path and step (see the sketch below)
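
The cadence is wall-clock based rather than step based; a minimal sketch of the idea using time.monotonic() and the artifact names above (the state dictionary fields and function name are illustrative):

import json, os, time, torch

def maybe_save(state: dict, step: int, last_save: float,
               save_dir: str = "ckpts_joint", save_every_sec: float = 86400.0) -> float:
    """Save a checkpoint if save_every_sec has elapsed; return the updated last-save timestamp."""
    now = time.monotonic()
    if now - last_save < save_every_sec:
        return last_save
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"step{step:08d}.pt")
    torch.save(state, path)   # state holds core, AR head, optimizer, scaler, cfg, RNG states, metadata
    with open(os.path.join(save_dir, "latest.json"), "w") as f:
        json.dump({"path": path, "step": step}, f)
    return now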

Data

Default streaming dataset:

  • cerebras/SlimPajama-627B (train split, streaming enabled). Replace --source with any datasets-compatible corpus that yields {"text": ...}.

EOS handling: if the tokenizer's eos_token_id is missing, sep_token_id is used as a fallback; if a sample does not end with EOS, one is appended (see the sketch below).
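
A hedged sketch of the data path implied above: stream the corpus, tokenize the "text" field, and append EOS (falling back to sep_token_id) when a sample does not already end with it. Batching and block packing in 5L.py are more involved than shown here.

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")
if tok.pad_token is None:                    # tokenizer safety: add [PAD] if missing
    tok.add_special_tokens({"pad_token": "[PAD]"})
eos_id = tok.eos_token_id if tok.eos_token_id is not None else tok.sep_token_id

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for sample in stream:
    ids = tok(sample["text"], add_special_tokens=False)["input_ids"]
    if not ids or ids[-1] != eos_id:         # ensure every sample ends with EOS
        ids.append(eos_id)
    break                                    # sketch only: stop after one sample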

Sampling controls

  • --temperature, --top_k, --top_p, --min_p
  • --repetition_penalty, --presence_penalty, --frequency_penalty, --penalty_last_n
  • --no_repeat_ngram_size

Greedy mode (--greedy) overrides sampling.
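
These flags map to standard logit/probability filtering before the final draw; a condensed sketch of the top-k / top-p / min-p stage with the greedy override (repetition, presence, frequency, and no-repeat-ngram penalties are applied to the logits before this point; the filter order shown is an assumption):

import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 0,
                top_p: float = 1.0, min_p: float = 0.0, greedy: bool = False) -> int:
    if greedy:                                # --greedy overrides all sampling controls
        return int(logits.argmax())
    probs = torch.softmax(logits / max(temperature, 1e-8), dim=-1)
    if min_p > 0.0:                           # drop tokens far below the most likely one
        probs[probs < min_p * probs.max()] = 0.0
    if top_k > 0:                             # keep only the k most likely tokens
        probs[probs < torch.topk(probs, top_k).values[-1]] = 0.0
    if top_p < 1.0:                           # nucleus: smallest set whose mass reaches top_p
        sorted_p, idx = probs.sort(descending=True)
        keep = (sorted_p.cumsum(0) - sorted_p) < top_p
        mask = torch.zeros_like(probs, dtype=torch.bool)
        mask[idx[keep]] = True
        probs[~mask] = 0.0
    return int(torch.multinomial(probs / probs.sum(), 1))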

FP8 / AMP

  • --fp8-only attempts float8_e4m3fn autocast
  • --fp8-fallback continues with bf16 if FP8 is unsupported (see the sketch after this list)
  • Otherwise use --amp for bf16/fp16 autocast
  • torch.backends.cuda.matmul.allow_tf32=True is enabled when available
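
Roughly how the flags resolve to a dtype: prefer FP8 only when the PyTorch build exposes float8_e4m3fn and the GPU is new enough, otherwise fall back to bf16 (or fail loudly without --fp8-fallback). The capability check below is a hedged approximation, not 5L.py's exact logic:

import torch

def pick_autocast_dtype(fp8_only: bool, fp8_fallback: bool) -> torch.dtype:
    """Heuristic dtype selection mirroring the --fp8-only / --fp8-fallback behavior."""
    fp8_ok = (
        fp8_only
        and hasattr(torch, "float8_e4m3fn")               # dtype exists in this PyTorch build
        and torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (8, 9)  # Ada / Hopper class hardware
    )
    if fp8_ok:
        return torch.float8_e4m3fn
    if fp8_only and not fp8_fallback:
        raise RuntimeError("FP8 requested but unsupported; pass --fp8-fallback to continue in bf16")
    return torch.bfloat16

if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True          # TF32 matmuls, as noted above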

OOM backoff & block growth

  • On CUDA OOM, the script halves BLOCK (down to 128), empties cache, and retries the step.
  • With --auto_grow, the script periodically attempts to increase BLOCK along your --grow_plan (a sketch of both policies follows).
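
Both behaviors are thin control flow around the training step. A hedged sketch of the two policies, with names and defaults taken from the flags above (the real implementation in 5L.py is more involved):

import torch

def run_step_with_backoff(step_fn, block: int, min_block: int = 128) -> int:
    """Run one training step; on CUDA OOM, empty the cache, halve BLOCK (not below min_block), retry."""
    while True:
        try:
            step_fn(block)
            return block
        except torch.cuda.OutOfMemoryError:
            if block <= min_block:
                raise
            torch.cuda.empty_cache()
            block = max(min_block, block // 2)

def maybe_grow_block(step: int, block: int, grow_plan=(576, 640, 768, 896, 1024),
                     grow_every_steps: int = 50_000) -> int:
    """With --auto_grow, advance to the next planned BLOCK size every grow_every_steps steps."""
    if step > 0 and step % grow_every_steps == 0:
        larger = [b for b in grow_plan if b > block]
        if larger:
            return larger[0]
    return block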

Token targets (Chinchilla-style)

If --target_tokens is unspecified, the script computes 25 × (enabled parameters), counting all trainable parameters (core + AR head). This gives a rough target for the total number of tokens to consume.
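
Concretely, the estimate is just a trainable-parameter count multiplied by 25; for example, a 100M-parameter model (core + AR head) targets roughly 2.5B tokens. A one-line sketch of the count:

def target_tokens(core, ar_head, tokens_per_param: int = 25) -> int:
    """Chinchilla-style target: 25 x all trainable parameters across the core and the AR head."""
    return tokens_per_param * sum(p.numel() for m in (core, ar_head)
                                  for p in m.parameters() if p.requires_grad)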

Repro tips

  • Pin a specific tokenizer via TOKENIZER_ID
  • Log your --preset, --block, and --grow_plan
  • Keep --save_every_sec stable between resumes for monotonic cadence
  • Record CUDA/cuDNN versions in your run logs for reproducibility

Limitations

  • AR-only trainer (no encoder-decoder, no multimodal)
  • Low-rank MHA path; FlashAttention not included
  • Single-GPU by default; multi-GPU DDP not wired in this file
  • Safety/guardrails are out of scope here (this is a trainer, not a hosted chat product)

Roadmap (planned)

  • Optional DDP with NCCL/RCCL/HCCL backends
  • FlashAttention path when available across vendors
  • Export helpers (Safetensors, GGUF) for downstream serving

Responsible Use

  • Ensure your dataset usage complies with its license and applicable laws.
  • Models trained with this script can generate incorrect or biased outputs. Evaluate and align according to your deployment requirements.

Citation

If this script or training pipeline helps your work, consider citing the repo:

@software{OpenTransformer_AGILLM2_fast_training_2025,
  title   = {AGILLM2-fast-training: Single-file AR-only trainer/decoder (5L.py)},
  author  = {OpenTransformer},
  year    = {2025},
  url     = {https://huggingface.co/OpenTransformer/AGILLM2-fast-training}
}

Support / Contracts

We provide custom development and end-to-end training services (data prep → training → evaluation → deployment). Email: [email protected] Org page: https://huggingface.co/OpenTransformer
