The Ultra-Scale Playbook: Training LLMs on GPU Clusters


Fueled by the scaling laws, the trend of training ever larger language models on ever vaster amounts of data has been driving progress in AI for the past couple of years. Initially, the development of the largest models happened exclusively behind the closed doors of a handful of research labs, but it has recently opened up with the release of models such as Llama 3.1 405B and DeepSeek R1. While these models have openly shared weights and their training recipes are described in technical reports, the challenging engineering involved in training at the necessary infrastructure scale is still hidden between the lines of a handful of papers and complex training frameworks. This ~~long blog post~~ open-source book is here to open this black box!

In this book we invite you to follow us into the wonderful world of scaling the training of Large Language Models to tens, hundreds, even thousands of GPUs. It assumes you know the basics of LLM architecture and training, but are new to distributed training. This writing can be seen as the second part of a trilogy, following our first blog post on processing data for pre-training, the so-called “FineWeb blog post”. Having read both blog posts, you should have almost all the core knowledge needed to deeply understand how LLMs are built nowadays, only missing a few final spices, like data mixing and architecture choices, to complete the recipe (stay tuned…).

Pre-training LLMs from scratch now requires amounts of compute which, in almost every case, exceed what a single GPU or machine can provide. The clusters used to train these models range from hundreds to thousands of nodes, each usually equipped with 4 to 8 GPUs. To make the best use of such expensive hardware, and to train in a reasonable time, a range of distributed training methods have been developed with the goal of ensuring that GPUs are highly utilized at all times. Efficiently scaling LLM training is also no longer confined to pre-training, as fine-tuning larger models on more domain-specific data is becoming the standard practice to achieve the best results.

In this post we’ll cover these scaling methods exhaustively while keeping a single storyline to understand where each technique comes from. We’ll cover data, tensor, pipeline, and context parallelism as well as ZeRO and kernel fusion. The post is built on the following three foundations:

Quick intros on theory and concepts: before diving into code and experiments, we want to understand how each method works at a high level and what its advantages and limits are. You’ll learn which parts of a language model eat up your memory, and when during training it happens. You’ll learn how we can solve memory constraints by parallelizing the models, and how we can increase throughput by scaling up the number of GPUs. As a result, you'll understand how the following widget to compute the memory breakdown of a transformer model works:
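
As a rough, back-of-the-envelope sketch of the kind of computation such a widget performs, here is the static per-parameter memory footprint under one common assumption: mixed-precision training with Adam, keeping BF16 parameters and gradients plus FP32 master weights and optimizer states (activations are deliberately left out, since they depend on batch size, sequence length, and recomputation):

```python
def static_training_memory_gb(num_params: float) -> dict:
    """Rough static memory (GB) for mixed-precision Adam training.

    Assumes BF16 parameters and gradients plus FP32 master weights and
    two FP32 Adam states (momentum and variance). Activations are excluded:
    they depend on batch size, sequence length, and recomputation strategy.
    """
    GB = 1024**3
    params = 2 * num_params / GB             # BF16 copy used in forward/backward
    grads = 2 * num_params / GB              # BF16 gradients
    master_weights = 4 * num_params / GB     # FP32 master copy for the optimizer
    adam_states = 2 * 4 * num_params / GB    # FP32 first and second moments
    total = params + grads + master_weights + adam_states
    return {"params": params, "grads": grads, "master_weights": master_weights,
            "adam_states": adam_states, "total": total}

# Example: an 8B-parameter model already needs ~120 GB before counting any activations.
print(static_training_memory_gb(8e9))
```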

While this widget gives a theoretical breakdown, the following tool can be used to predict the actual memory usage:

[Screenshot of the memory prediction tool]

Clear code implementations: theory is one thing, but we discover all kinds of edge cases and important details when we implement something. That’s why we link to implementation references where possible. Depending on the case, we’ll use two code references: the picotron repository is built for education, so it usually implements each concept in a single, self-contained short file. On the other hand, to look at production-ready code, we’ll refer to the nanotron implementations, a production training codebase used at Hugging Face.

Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation.

Real training efficiency benchmarks: Finally, how to actually scale your LLM training depends on your infrastructure, such as the kind of chips, the interconnect, etc., so we can’t give a single unified recipe. What we will give, though, is a way to benchmark several setups, which is exactly what we have done on our cluster! We ran over 4100 distributed experiments with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too

An overview of the over 4000 experiments across all Llama architectures, where each data point corresponds to an experiment launch.

As you can see, there’s a lot of ground to be covered. Before getting into the trenches of distributed training, let’s take a quick high-level look at what we’ll cover in this post.

TL;DR

This book is very extensive, so we decided to start with a very general overview of how you can think about distributed training. At a high level, the key challenge in scaling LLM training is to make a training step (forward/backward/optimizer step) with a large batch size as fast as possible.
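
For reference, the training step we keep talking about is nothing more exotic than the usual forward/backward/optimizer-step loop. Here is a minimal single-GPU PyTorch sketch, with a toy model, batch, and loss standing in for a real transformer and its data pipeline:

```python
import torch

# Stand-ins for a real transformer, tokenized batch, and language-modeling loss.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
inputs = torch.randn(32, 1024, device="cuda")    # batch size 32
targets = torch.randn(32, 1024, device="cuda")

# One training step: forward pass, backward pass, optimizer step.
optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()
optimizer.step()
```

Every technique in this book is, in one way or another, about keeping this loop fast once the model and the batch no longer fit comfortably on a single GPU.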

When scaling up models and input batches, we quickly end up in situations where either our target batch size won't fit in memory, and/or the model itself is too large to fit in a single GPU's memory.

To solve this scaling issue we’ll need to carefully evaluate different parallelization strategies and find the optimal balance between three main factors:

  1. Memory Usage
    • Hard limitation - if a training step doesn't fit in memory, training cannot proceed
    • Sometimes we can increase compute (e.g. recomputation) or increase communication (e.g. ZeRO) to reduce memory usage
  2. Compute Efficiency
    • Memory transfer can also decrease compute efficiency.
    • We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or unoptimized kernels.
    • GPUs need a sufficient workload (large enough matrices/batch sizes) to maintain high utilization (compute-bound); otherwise they become memory-bound (limited by memory bandwidth).
  3. Communication overhead
    • Two main types for GPUs: intra-node (NVLink, TODO: bandwidth) and inter-node (network, TODO: bandwidth)
    • Two main attributes: base latency and peak bandwidth. Base latency is a constant overhead that makes us want to do as few communications as possible, while peak bandwidth controls how fast we can move data between GPUs (see the sketch after this list)
    • We prioritize using the fastest communication channels (like NVLink) for operations that occur frequently and/or block computation (e.g. tensor parallelism)
    • We want to minimize communication overhead as it keeps GPUs idle, so we try to overlap communication with compute as much as possible
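
To build intuition for the latency/bandwidth trade-off mentioned above, here is a minimal sketch of the classic α–β communication cost model; the latency and bandwidth numbers below are placeholder orders of magnitude, not measurements from our cluster:

```python
def comm_time_s(message_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Alpha-beta model: a fixed per-message latency plus a bandwidth-limited transfer term."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

# Placeholder numbers, purely for illustration (not measured values):
intra_node = comm_time_s(1e9, latency_s=5e-6, bandwidth_bytes_per_s=400e9)   # ~2.5 ms for 1 GB
inter_node = comm_time_s(1e9, latency_s=30e-6, bandwidth_bytes_per_s=50e9)   # ~20 ms for 1 GB

print(f"intra-node: {intra_node * 1e3:.2f} ms, inter-node: {inter_node * 1e3:.2f} ms")
```

The latency term dominates when we send many small messages (hence bucketing gradients, covered later), while the bandwidth term dominates for large messages (hence our push to overlap them with compute).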

But let’s not get too far ahead of ourselves, and instead scale progressively. To guide you along the journey, and as a practical reference, we’ve summarized the key concepts in a cheatsheet:

[TODO: ADD CHEATSHEET]

Now that we’ve nailed down a few key concepts and terms, let’s get started by revisiting the basic training steps of an LLM!

First Steps: Training on one GPU

Memory usage in Transformers

Memory profiling a training step

Weights/grads/optimizer states memory

Activations memory

Activation recomputation

Gradient accumulation

Data Parallelism

First optimization: Overlap gradient synchronization with backward pass

Second optimization: Bucketing gradients

Third optimization: Interplay with gradient accumulation

Revisit global batch size

Our journey up to now

ZeRO (Zero Redundancy Optimizer)

Memory usage revisited

ZeRO-1: Partitioning Optimizer States

ZeRO-2: Adding Gradient Partitioning

ZeRO-3: Adding Parameter Partitioning

Tensor Parallelism

Tensor Parallelism in a Transformer Block

Sequence Parallelism

Context Parallelism

Introducing Context Parallelism

Discovering Ring Attention

Zig-Zag Ring Attention – A Balanced Compute Implementation

Pipeline Parallelism

Splitting layers on various nodes - All forward, all backward

One-forward-one-backward and Llama 3.1 schemes

Interleaving stages

Zero Bubble and DualPipe

Expert parallelism

5D parallelism in a nutshell

How to Find the Best Training Configuration

Diving in the GPUs – fusing, threading, mixing

A primer on GPU

How to improve performance with Kernels?

Memory Coalescing

Tiling

Thread Coarsening

Minimizing Control Divergence

Flash Attention 1-3

Fused Kernels

Mixed Precision Training

FP16 and BF16 training

FP8 pretraining

Conclusion

What you learned

What we learned

What’s next?

References

Landmark LLM Scaling Papers

Training Frameworks

Debugging

Distribution Techniques

CUDA Kernels

Hardware

Others

Appendix

Citation

For attribution in academic contexts, please cite this work as

XXX, et al., "The Ultra-Scale Playbook: Training LLMs on GPU Clusters", 2025.

BibTeX citation

@misc{TODO,
      title={The Ultra-Scale Playbook: Training LLMs on GPU Clusters},
      author={TODO},
      year={2025},
}