thomwolf (HF staff) committed
Commit 926e541 · verified · 1 Parent(s): 787ae8e
dist/assets/images/memorycoalescing.png CHANGED

Git LFS Details

  • SHA256: 088cd848100ab26abbffdcc7c0e8f18a83facd0a8637c460e3ac88d483b04b46
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB

Git LFS Details

  • SHA256: 1094fe9aeb953c743791445ee6d7e73a5a89fa85fe60f4312266d1265e7c591a
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB
dist/index.html CHANGED
@@ -90,108 +90,141 @@
90
  <p>The book is built on the following <strong>three general foundations</strong>:</p>
91
 
92
  <p><strong>Quick intros on theory and concepts:</strong> before diving into code and experiments, we want to understand how each method works at a high level and what its advantages and limits are. You’ll learn which parts of a language model eat up your memory and when during training this happens. You’ll learn how we can address memory constraints by parallelizing the models and increase throughput by scaling up GPUs. As a result, you'll understand how the following widget for computing the memory breakdown of a transformer model works: </p>
 
93
 
94
- <div id="graph"></div>
95
- <div id="controls">
96
- <div class="cell column-1">
97
- <label for="a">Attention Heads (a):</label>
98
- <input type="range" id="a" name="a" min="1" max="128" value="8">
99
- <input type="number" id="a_input" value="8" min="1" max="128">
100
  </div>
101
- <div class="cell column-2">
102
- <label for="mixed">Mixed Precision:</label>
103
- <input type="checkbox" id="mixed" name="mixed" checked>
104
- <span></span> <!-- Empty span to maintain grid alignment -->
105
- </div>
106
- <div class="cell column-1">
107
- <label for="b">Micro Batch Size (b):</label>
108
- <input type="range" id="b" name="b" min="1" max="53248" value="32">
109
- <input type="number" id="b_input" value="32" min="1" max="53248">
110
- </div>
111
- <div class="cell column-2">
112
- <label for="seq_parallel">Sequence Parallelism:</label>
113
- <input type="checkbox" id="seq_parallel" name="seq_parallel">
114
- <span></span> <!-- Empty span to maintain grid alignment -->
115
- </div>
116
- <div class="cell column-1">
117
- <label for="h">Hidden Dimension (h):</label>
118
- <input type="range" id="h" name="h" min="1" max="16384" value="512">
119
- <input type="number" id="h_input" value="512" min="128" max="16384">
120
- </div>
121
- <div class="cell column-2">
122
- <label for="recomputation">Recomputation:</label>
123
- <select id="recomputation" name="recomputation">
124
- <option value="none">None</option>
125
- <option value="selective">Selective</option>
126
- <option value="full">Full</option>
127
- </select>
128
- <span></span> <!-- Empty span to maintain grid alignment -->
129
- </div>
130
- <div class="cell column-1">
131
- <label for="h_ff">Feedforward Dimension (h_ff):</label>
132
- <input type="range" id="h_ff" name="h_ff" min="1" max="65536" value="2048">
133
- <input type="number" id="h_ff_input" value="2048" min="512" max="65536">
134
- </div>
135
- <div class="cell column-2">
136
- <label for="zero">Zero:</label>
137
- <select id="zero" name="zero">
138
- <option value="0">0</option>
139
- <option value="1">1</option>
140
- <option value="2">2</option>
141
- <option value="3">3</option>
142
- </select>
143
- <span></span> <!-- Empty span to maintain grid alignment -->
144
- </div>
145
- <div class="cell column-1">
146
- <label for="L">Number of Layers (L):</label>
147
- <input type="range" id="L" name="L" min="1" max="126" value="12">
148
- <input type="number" id="L_input" value="12" min="1" max="126">
149
- </div>
150
- <div class="cell column-2">
151
- <label for="ff_activation">FF Activation:</label>
152
- <select id="ff_activation" name="ff_activation">
153
- <option value="relu">ReLU</option>
154
- <option value="gelu">GELU</option>
155
- <option value="swiglu">SwiGLU</option>
156
- </select>
157
- <span></span> <!-- Empty span to maintain grid alignment -->
158
- </div>
159
- <div class="cell column-1">
160
- <label for="s">Sequence Length (s):</label>
161
- <input type="range" id="s" name="s" min="1" max="128000" value="128">
162
- <input type="number" id="s_input" value="128" min="64" max="128000">
163
- </div>
164
- <div class="cell column-2">
165
- <label for="presets">Presets:</label>
166
- <select id="presets" name="presets">
167
- <option value="Llama 3 Tiny">Llama 3 Tiny</option>
168
- <option value="Llama 3 8B">Llama 3 8B</option>
169
- <option value="Llama 3 70B">Llama 3 70B</option>
170
- <option value="Llama 3 405B">Llama 3 405B</option>
171
- </select>
172
- <span></span> <!-- Empty span to maintain grid alignment -->
173
- </div>
174
- <div class="cell column-1">
175
- <label for="v">Vocabulary Size (v):</label>
176
- <input type="range" id="v" name="v" min="1000" max="100000" value="30522">
177
- <input type="number" id="v_input" value="30522" min="1000" max="100000">
178
- </div>
179
- <div class="cell column-2">
180
- <label for="tp">Tensor Parallelism (t):</label>
181
- <input type="range" id="tp" name="tp" min="1" max="16" value="8">
182
- <input type="number" id="tp_input" value="8" min="1" max="16">
183
- </div>
184
- <div class="cell column-1">
185
- <label for="k">Optimizer Parameters (k):</label>
186
- <input type="range" id="k" name="k" min="1" max="16" value="8">
187
- <input type="number" id="k_input" value="8" min="1" max="16">
188
- </div>
189
- <div class="cell column-2">
190
- <label for="dp">Data Parallelism (d):</label>
191
- <input type="range" id="dp" name="dp" min="1" max="256" value="1">
192
- <input type="number" id="dp_input" value="1" min="1" max="256">
193
  </div>
194
  </div>
 
195
 
196
  <p>While this widget gives a theoretical breakdown, the following tool can be used to predict the memory usage:</p>
197
  <ul>
@@ -1724,9 +1757,11 @@
1724
 
1725
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1726
 
1727
- <div class="large-image-background">
 
1728
  <img alt="TP & SP diagram" src="/assets/images/5d_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1729
  </div>
 
1730
 
1731
 
1732
  <p>The main reason we don't want to use TP alone for parallelism is that, in practice, TP has the two limitations we discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point, at which communication overhead begins to dominate. Second, unlike ZeRO and PP, which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
@@ -1737,17 +1772,20 @@
1737
 
1738
  <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in the <a target="_self" href="#context_parallelism">CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1739
 
1740
- <div class="large-image-background">
 
1741
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1200px; max-width: none;" />
1742
  </div>
1743
-
1744
 
1745
  <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to the relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly, since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1746
  <aside>For instance DeepSeek V3 uses 256 experts.</aside>
1747
 
1748
- <div class="large-image-background">
 
1749
  <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1200px; max-width: none;" />
1750
  </div>
 
1751
  <div class="note-box">
1752
  <p class="note-box-title">📝 Note</p>
1753
  <div class="note-box-content">
@@ -1799,15 +1837,19 @@
1799
  <p><strong>Summarizing it all:</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them? Yes, we're up for the challenge!</p>
1800
  <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer (in its MoE variant). We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1801
 
1802
- <div class="large-image-background">
 
1803
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1200px; max-width: none;"/></p>
1804
  </div>
 
1805
 
1806
  <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths, as well as with selective (top) and full (bottom) recomputation, so you can see how they all play with activations:</p>
1807
 
1808
- <div class="large-image-background">
 
1809
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1200px; max-width: none;"/>
1810
  </div>
 
1811
 
1812
  <p>Let's finish this section with a high-level view of all these techniques, their main underlying ideas, and their major bottlenecks:</p>
1813
 
@@ -1874,7 +1916,7 @@
1874
 
1875
  <p>Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point for choosing among (and combining) them? This will be the topic of our next section.</p>
1876
 
1877
- <h2>How to Find the Best Training Configuration</h2>
1878
 
1879
  <p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined. One general question remains: which ones should we choose in the end, and how do we decide on a specific combination?</p>
1880
 
@@ -1958,10 +2000,12 @@
1958
  <p>All the following benchmarks were conducted with a sequence length of 4096 and a global batch size of 1M tokens. We gathered all the top configurations for each model and cluster size and plotted them in the following heatmaps:</p>
1959
  </p>
1960
 
1961
- <div class="large-image-background">
 
1962
  <p><img alt="image.png" src="/assets/images/what_we_learnt_heatmap.svg" /></p>
1963
  </div>
1964
- <div class="figure-legend">
 
1965
  <p>Heatmap visualization showing the optimal training configurations across different model sizes and compute node counts (we have 8 GPUs per node). For each combination, the configuration details include Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Gradient Accumulation Steps (GAS), Micro Batch Size (MBS), and ZeRO optimization stage. The color intensity indicates the Model FLOPs Utilization (MFU), with brighter colors representing higher efficiency.</p>
1966
  </div>
1967
  <p>From this high-level visualization, we can draw several important insights:
@@ -2265,14 +2309,13 @@
2265
 
2266
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2267
 
2268
- <div class="large-image-background">
2269
- <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
 
 
2270
  </div>
2271
- <div class="large-image-background">
2272
- <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
2273
  </div>
2274
 
2275
-
2276
  <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning row elements are in consecutive memory addresses, as shown in the figure below), thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math> in the first iteration <code>i = 0</code>. These elements are not stored close to each other in memory, and this misalignment will be present at each iteration, thereby preventing memory accesses from being coalesced.</p>
2277
 
2278
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
@@ -2297,9 +2340,11 @@
2297
 
2298
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2299
 
2300
- <div class="large-image-background">
 
2301
  <p><img width="1200px" alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2302
  </div>
 
2303
 
2304
  <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>! Amazing.</p>
2305
  <p>Now let's cover another technique you will often see mentioned in the literature: <strong>tiling</strong>.</p>
@@ -2685,7 +2730,15 @@
2685
  </ul>
2686
 
2687
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2688
-
2689
  <h3>Acknowledgements</h3>
2690
 
2691
  <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
@@ -3395,8 +3448,10 @@
3395
 
3396
  <p>This would print aggregated profiling results sorted by the total CUDA time, and the output would be:</p>
3397
 
3398
- <div class="large-image-background">
3399
- <img alt="image.png" src="/assets/images/a1_kernels.png" style="width: 1200px; max-width: none;" />
 
 
3400
  </div>
3401
 
3402
  <p>You can also inspect the trace, as we previously mentioned, in <code>chrome://tracing/</code>.</p>
@@ -3410,8 +3465,10 @@
3410
 
3411
  <p>After zooming in, you can observe the flow of operations when calling <code>layer_norm</code> in this trace:</p>
3412
 
3413
- <div class="large-image-background">
3414
- <img alt="image.png" src="/assets/images/a1_profile_trace.png" style="width: 1200px; max-width: none;" />
 
 
3415
  </div>
3416
 
3417
  <p>The sequence begins in the CPU (the upper section) with <code>aten::layer_norm</code>, progressing to <code>aten::native_layer_norm</code>, and then transitioning to <code>cudaLaunchKernel</code>. From there, we move on to the GPU, where the <code>vectorized_layer_norm_kernel</code> kernel is called.</p>
@@ -3437,8 +3494,10 @@
3437
 
3438
  <p>and open the file <code>output.ncu-rep</code> with Nsight Compute, you will have a view that looks like this:</p>
3439
 
3440
- <div class="large-image-background">
3441
- <img alt="image.png" src="/assets/images/a1_ncu.png" style="width: 1200px; max-width: none;" />
 
 
3442
  </div>
3443
 
3444
  <p>It comes with clear warnings about compute and memory utilization, along with suggestions on how to better balance compute and memory in the kernel and achieve maximal occupancy.</p>
 
90
  <p>The book is built on the following <strong>three general foundations</strong>:</p>
91
 
92
  <p><strong>Quick intros on theory and concepts:</strong> before diving into code and experiments, we want to understand how each method works at a high level and what its advantages and limits are. You’ll learn which parts of a language model eat up your memory and when during training this happens. You’ll learn how we can address memory constraints by parallelizing the models and increase throughput by scaling up GPUs. As a result, you'll understand how the following widget for computing the memory breakdown of a transformer model works: </p>
93
+ <aside>Note that we're still missing Pipeline Parallelism in this widget. To be added as an exercise for the reader.</aside>
94
 
95
+ <div class="large-image-background-transparent">
96
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
97
+ <div id="graph-all">
98
+ <div class="figure-legend">Memory usage breakdown</div>
99
+ <div id="graph"></div>
 
100
  </div>
101
+ <div id="controls">
102
+ <div class="cell column-1">
103
+ <label for="a">Attention Heads (a):</label>
104
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
105
+ <input type="range" id="a" name="a" min="1" max="128" value="8">
106
+ <input type="number" id="a_input" value="8" min="1" max="128">
107
+ </div>
108
+ </div>
109
+ <div class="cell column-2">
110
+ <label for="mixed">Mixed Precision:</label>
111
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
112
+ <input type="checkbox" id="mixed" name="mixed" checked>
113
+ <span></span> <!-- Empty span to maintain grid alignment -->
114
+ </div>
115
+ </div>
116
+ <div class="cell column-1">
117
+ <label for="b">Micro Batch Size (b):</label>
118
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
119
+ <input type="range" id="b" name="b" min="1" max="53248" value="32">
120
+ <input type="number" id="b_input" value="32" min="1" max="53248">
121
+ </div>
122
+ </div>
123
+ <div class="cell column-2">
124
+ <label for="seq_parallel">Sequence Parallelism:</label>
125
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
126
+ <input type="checkbox" id="seq_parallel" name="seq_parallel">
127
+ <span></span> <!-- Empty span to maintain grid alignment -->
128
+ </div>
129
+ </div>
130
+ <div class="cell column-1">
131
+ <label for="h">Hidden Dimension (h):</label>
132
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
133
+ <input type="range" id="h" name="h" min="1" max="16384" value="512">
134
+ <input type="number" id="h_input" value="512" min="128" max="16384">
135
+ </div>
136
+ </div>
137
+ <div class="cell column-2">
138
+ <label for="recomputation">Recomputation:</label>
139
+ <select id="recomputation" name="recomputation">
140
+ <option value="none">None</option>
141
+ <option value="selective">Selective</option>
142
+ <option value="full">Full</option>
143
+ </select>
144
+ <span></span> <!-- Empty span to maintain grid alignment -->
145
+ </div>
146
+ <div class="cell column-1">
147
+ <label for="h_ff">Feedforward Dimension (h_ff):</label>
148
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
149
+ <input type="range" id="h_ff" name="h_ff" min="1" max="65536" value="2048">
150
+ <input type="number" id="h_ff_input" value="2048" min="512" max="65536">
151
+ </div>
152
+ </div>
153
+ <div class="cell column-2">
154
+ <label for="zero">Zero:</label>
155
+ <select id="zero" name="zero">
156
+ <option value="0">0</option>
157
+ <option value="1">1</option>
158
+ <option value="2">2</option>
159
+ <option value="3">3</option>
160
+ </select>
161
+ <span></span> <!-- Empty span to maintain grid alignment -->
162
+ </div>
163
+ <div class="cell column-1">
164
+ <label for="L">Number of Layers (L):</label>
165
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
166
+ <input type="range" id="L" name="L" min="1" max="126" value="12">
167
+ <input type="number" id="L_input" value="12" min="1" max="126">
168
+ </div>
169
+ </div>
170
+ <div class="cell column-2">
171
+ <label for="ff_activation">FF Activation:</label>
172
+ <select id="ff_activation" name="ff_activation">
173
+ <option value="relu">ReLU</option>
174
+ <option value="gelu">GELU</option>
175
+ <option value="swiglu">SwiGLU</option>
176
+ </select>
177
+ <span></span> <!-- Empty span to maintain grid alignment -->
178
+ </div>
179
+ <div class="cell column-1">
180
+ <label for="s">Sequence Length (s):</label>
181
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
182
+ <input type="range" id="s" name="s" min="1" max="128000" value="128">
183
+ <input type="number" id="s_input" value="128" min="64" max="128000">
184
+ </div>
185
+ </div>
186
+ <div class="cell column-2">
187
+ <label for="v">Vocabulary Size (v):</label>
188
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
189
+ <input type="range" id="v" name="v" min="1000" max="100000" value="30522">
190
+ <input type="number" id="v_input" value="30522" min="1000" max="100000">
191
+ </div>
192
+ </div>
193
+ <div class="cell column-1">
194
+ <label for="tp">Tensor Parallelism (t):</label>
195
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
196
+ <input type="range" id="tp" name="tp" min="1" max="16" value="8">
197
+ <input type="number" id="tp_input" value="8" min="1" max="16">
198
+ </div>
199
+ </div>
200
+ <div class="cell column-2">
201
+ <label for="k">Optimizer Parameters (k):</label>
202
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
203
+ <input type="range" id="k" name="k" min="1" max="16" value="8">
204
+ <input type="number" id="k_input" value="8" min="1" max="16">
205
+ </div>
206
+ </div>
207
+ <div class="cell column-1">
208
+ <label for="dp">Data Parallelism (d):</label>
209
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
210
+ <input type="range" id="dp" name="dp" min="1" max="256" value="1">
211
+ <input type="number" id="dp_input" value="1" min="1" max="256">
212
+ </div>
213
+ </div>
214
+ <div class="cell column-2">
215
+ <label for="presets">Presets:</label>
216
+ <select id="presets" name="presets">
217
+ <option value="Llama 3 Tiny">Llama 3 Tiny</option>
218
+ <option value="Llama 3 8B">Llama 3 8B</option>
219
+ <option value="Llama 3 70B">Llama 3 70B</option>
220
+ <option value="Llama 3 405B">Llama 3 405B</option>
221
+ </select>
222
+ <span></span> <!-- Empty span to maintain grid alignment -->
223
+ </div>
224
+ </div>
225
  </div>
226
  </div>
227
+ <p>(Don't worry if you have no idea what's happening in this widget. That's why we're here.)</p>
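<p>As a rough illustration of what the widget computes, here is a minimal Python sketch that estimates a memory breakdown from the same inputs as the sliders above. The formulas are deliberately simplified (they ignore the number of attention heads, recomputation, and pipeline parallelism, among other things), and the function name and constants are illustrative rather than the widget's exact implementation:</p>
<d-code block language="python">
# Simplified sketch of a transformer memory-breakdown estimate; the widget's
# exact formulas may differ. All names and constants here are illustrative.

def memory_breakdown_gb(v=30522, h=512, h_ff=2048, L=12, s=128, b=32,
                        k=8, t=1, d=1, zero=0, mixed=True):
    bytes_per_param = 2 if mixed else 4              # bf16/fp16 vs fp32 weights
    # Rough parameter count: embeddings + per-layer attention and MLP weights
    n_params = v * h + L * (4 * h * h + 2 * h * h_ff)
    params = n_params * bytes_per_param / t          # sharded by tensor parallelism
    grads = n_params * bytes_per_param / t
    optim = n_params * k / t                         # k bytes of optimizer state per param
    if zero >= 1: optim /= d                         # ZeRO-1 shards optimizer states over DP
    if zero >= 2: grads /= d                         # ZeRO-2 also shards gradients
    if zero >= 3: params /= d                        # ZeRO-3 also shards parameters
    # Very rough activation estimate (no recomputation), also sharded by TP
    acts = L * s * b * h * 16 * bytes_per_param / t
    gb = 1024 ** 3
    return {name: round(val / gb, 3) for name, val in
            {"parameters": params, "gradients": grads,
             "optimizer states": optim, "activations": acts}.items()}

print(memory_breakdown_gb())
</d-code>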
228
 
229
  <p>While this widget gives a theoretical breakdown, the following tool can be used to predict the memory usage:</p>
230
  <ul>
 
1757
 
1758
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1759
 
1760
+ <div class="large-image-background-transparent">
1761
+ <div class="boxed-image">
1762
  <img alt="TP & SP diagram" src="/assets/images/5d_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1763
  </div>
1764
+ </div>
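<p>The distributive property mentioned above is easy to verify in a few lines. The following is a toy, single-process sketch (plain NumPy, no real GPUs) of the column-parallel and row-parallel sharding that TP builds on:</p>
<d-code block language="python">
import numpy as np

X = np.random.randn(4, 8)                 # activations
W = np.random.randn(8, 6)                 # weight matrix

# Column-parallel: each "GPU" holds a column shard of W and computes its slice
W0, W1 = np.split(W, 2, axis=1)
Y_col = np.concatenate([X @ W0, X @ W1], axis=1)   # all-gather along the hidden dim
assert np.allclose(Y_col, X @ W)

# Row-parallel: shard X's columns and W's rows, then sum the partial results
# (the sum is what an all-reduce performs across GPUs)
X0, X1 = np.split(X, 2, axis=1)
V0, V1 = np.split(W, 2, axis=0)
assert np.allclose(X0 @ V0 + X1 @ V1, X @ W)
</d-code>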
1765
 
1766
 
1767
  <p>The main reason we don't want to use TP alone for parallelism is that, in practice, TP has the two limitations we discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point, at which communication overhead begins to dominate. Second, unlike ZeRO and PP, which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
1772
 
1773
  <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in the <a target="_self" href="#context_parallelism">CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1774
 
1775
+ <div class="large-image-background-transparent">
1776
+ <div class="boxed-image">
1777
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1200px; max-width: none;" />
1778
  </div>
1779
+ </div>
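<p>To make the ring pattern a bit more concrete, here is a toy, single-process NumPy sketch: the sequence is sharded across ranks, and at each "ring step" a rank attends to the key/value shard it would have just received from its neighbour. Real ring attention additionally uses an online softmax and overlaps these transfers with computation; this sketch only illustrates the communication pattern:</p>
<d-code block language="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

s, dim, world = 8, 4, 4                  # sequence length, head dim, CP degree
Q, K, V = (np.random.randn(s, dim) for _ in range(3))
Qs, Ks, Vs = (np.split(m, world) for m in (Q, K, V))   # shard along the sequence

out = []
for rank in range(world):
    scores = np.zeros((s // world, s))
    for step in range(world):
        src = (rank - step) % world      # KV shard held at this ring step
        cols = slice(src * (s // world), (src + 1) * (s // world))
        scores[:, cols] = Qs[rank] @ Ks[src].T
    out.append(softmax(scores / np.sqrt(dim)) @ V)
out = np.concatenate(out)

# Matches full (non-causal) attention over the unsharded sequence
ref = softmax(Q @ K.T / np.sqrt(dim)) @ V
assert np.allclose(out, ref)
</d-code>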
1780
 
1781
  <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to the relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly, since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1782
  <aside>For instance DeepSeek V3 uses 256 experts.</aside>
1783
 
1784
+ <div class="large-image-background-transparent">
1785
+ <div class="boxed-image">
1786
  <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1200px; max-width: none;" />
1787
  </div>
1788
+ </div>
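<p>Here is an equally small sketch of the dispatch/combine pattern that the all-to-all implements, again simulated in a single NumPy process. In a real setup the two grouping steps below would each be an all-to-all across EP ranks; the router and expert weights here are random placeholders:</p>
<d-code block language="python">
import numpy as np

tokens = np.random.randn(16, 8)                          # 16 tokens, hidden size 8
n_experts = 4                                            # one expert per EP rank here
assignment = np.random.randint(0, n_experts, size=16)    # top-1 routing decision

# "Dispatch" all-to-all: group tokens by the expert (rank) that will process them
dispatched = [tokens[assignment == e] for e in range(n_experts)]

# Each expert only sees the tokens routed to it
expert_weights = [np.random.randn(8, 8) for _ in range(n_experts)]
processed = [x @ w for x, w in zip(dispatched, expert_weights)]

# "Combine" all-to-all: scatter the results back into the original token order
output = np.empty_like(tokens)
for e in range(n_experts):
    output[assignment == e] = processed[e]
</d-code>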
1789
  <div class="note-box">
1790
  <p class="note-box-title">📝 Note</p>
1791
  <div class="note-box-content">
 
1837
  <p><strong>Summarizing it all:</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them? Yes, we're up for the challenge!</p>
1838
  <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer (in its MoE variant). We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1839
 
1840
+ <div class="large-image-background-transparent">
1841
+ <div class="boxed-image">
1842
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1200px; max-width: none;"/></p>
1843
  </div>
1844
+ </div>
1845
 
1846
  <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths, as well as with selective (top) and full (bottom) recomputation, so you can see how they all play with activations:</p>
1847
 
1848
+ <div class="large-image-background-transparent">
1849
+ <div class="boxed-image">
1850
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1200px; max-width: none;"/>
1851
  </div>
1852
+ </div>
1853
 
1854
  <p>Let's finish this section with a high-level view of all these techniques, their main underlying ideas, and their major bottlenecks:</p>
1855
 
 
1916
 
1917
  <p>Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point for choosing among (and combining) them? This will be the topic of our next section.</p>
1918
 
1919
+ <h2>Finding the Best Training Configuration</h2>
1920
 
1921
  <p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined. One general question remains: which ones should we choose in the end, and how do we decide on a specific combination?</p>
1922
 
 
2000
  <p>All the following benchmarks were conducted with a sequence length of 4096 and a global batch size of 1M tokens. We gathered all the top configurations for each model and cluster size and plotted them in the following heatmaps:</p>
2001
  </p>
2002
 
2003
+ <div class="large-image-background-transparent">
2004
+ <div class="boxed-image">
2005
  <p><img alt="image.png" src="/assets/images/what_we_learnt_heatmap.svg" /></p>
2006
  </div>
2007
+ </div>
2008
+ <div class="figure-legend">
2009
  <p>Heatmap visualization showing the optimal training configurations across different model sizes and compute node counts (we have 8 GPUs per node). For each combination, the configuration details include Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Gradient Accumulation Steps (GAS), Micro Batch Size (MBS), and ZeRO optimization stage. The color intensity indicates the Model FLOPs Utilization (MFU), with brighter colors representing higher efficiency.</p>
2010
  </div>
2011
  <p>From this high-level visualization, we can draw several important insights:
 
2309
 
2310
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2311
 
2312
+ <div class="large-image-background-transparent">
2313
+ <div class="boxed-image">
2314
+ <img width="1400px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
2315
+ <img width="1400px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
2316
  </div>
 
 
2317
  </div>
2318
 
 
2319
  <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning row elements are in consecutive memory addresses, as shown in the figure below), thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math> in the first iteration <code>i = 0</code>. These elements are not stored close to each other in memory, and this misalignment will be present at each iteration, thereby preventing memory accesses from being coalesced.</p>
2320
 
2321
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
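<p>A quick way to see this without writing any CUDA is to list the flat (row-major) element indices that the first threads of a warp would touch at <code>i = 0</code>. This small Python sketch is not part of the kernel above; it only models the difference between a strided and a contiguous per-warp access pattern:</p>
<d-code block language="python">
# Toy model of the addresses touched by one warp at iteration i = 0,
# for a row-major M x K matrix A stored as a flat array.
M, K = 128, 128
addr = lambda row, col: row * K + col            # flat element index

warp = range(32)                                 # 32 consecutive threads of a warp

# Naive mapping described above: consecutive threads read different *rows* of A
naive = [addr(row=t, col=0) for t in warp]       # [0, 128, 256, ...]: strided, cannot coalesce

# Coalesced-friendly mapping: consecutive threads read consecutive elements of a row
coalesced = [addr(row=0, col=t) for t in warp]   # [0, 1, 2, ...]: one contiguous segment per warp
</d-code>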
 
2340
 
2341
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2342
 
2343
+ <div class="large-image-background-transparent">
2344
+ <div class="boxed-image">
2345
  <p><img width="1200px" alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2346
  </div>
2347
+ </div>
2348
 
2349
  <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>! Amazing.</p>
2350
  <p>Now let's cover another technique you will often see mentioned in the literature: <strong>tiling</strong>.</p>
 
2730
  </ul>
2731
 
2732
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2733
+
2734
+ <hr>
2735
+
2736
+ <p><strong>One last word</strong> for our first readers. We're so happy with this piece of writing that we've decided to distribute a limited number of physical printed editions of it as a gift to you, our first readers.</p>
2737
+ <p>If you are among the first 50 people to fill in your email address below, we'll contact you later in the year to send you a real physical edition once we've formatted it as a printed copy.</p>
2738
+ <p>We expect the book to be around 100-150 pages and to cover the same content as the blog post, but we may also decide to shorten or lengthen it depending on what makes sense as a printed object.</p>
2739
+ <p>To get your physical copy, please fill in your email address in the following <a target="_blank" href="https://forms.gle/e1GkAShUCtgcwnne8">Google form</a>.</p>
2740
+ <p>Whether you are one of our first readers or coming much later to this blog post, we're very happy to see that you enjoyed this sharing of knowledge. May the force of open-source and open-science always be with you.</p>
2741
+
2742
  <h3>Acknowledgements</h3>
2743
 
2744
  <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
 
3448
 
3449
  <p>This would print aggregated profiling results sorted by the total CUDA time, and the output would be:</p>
3450
 
3451
+ <div class="large-image-background-transparent">
3452
+ <div class="boxed-image">
3453
+ <img alt="image.png" src="/assets/images/a1_kernels.png" style="width: 1200px; max-width: none;" />
3454
+ </div>
3455
  </div>
3456
 
3457
  <p>You can also inspect the trace, as we previously mentioned, in <code>chrome://tracing/</code>.</p>
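<p>For reference, a profiling call along the following lines produces this kind of table and trace. This is a hedged sketch rather than the exact code used for the screenshots, and it assumes PyTorch with a CUDA GPU available:</p>
<d-code block language="python">
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = torch.nn.functional.layer_norm(x, (4096,))
    torch.cuda.synchronize()

# Aggregated results sorted by total CUDA time, as in the table above
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export a trace that can be opened in chrome://tracing/
prof.export_chrome_trace("trace.json")
</d-code>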
 
3465
 
3466
  <p>After zooming in, you can observe the flow of operations when calling <code>layer_norm</code> in this trace:</p>
3467
 
3468
+ <div class="large-image-background-transparent">
3469
+ <div class="boxed-image">
3470
+ <img alt="image.png" src="/assets/images/a1_profile_trace.png" style="width: 1200px; max-width: none;" />
3471
+ </div>
3472
  </div>
3473
 
3474
  <p>The sequence begins in the CPU (the upper section) with <code>aten::layer_norm</code>, progressing to <code>aten::native_layer_norm</code>, and then transitioning to <code>cudaLaunchKernel</code>. From there, we move on to the GPU, where the <code>vectorized_layer_norm_kernel</code> kernel is called.</p>
 
3494
 
3495
  <p>and open the file <code>output.ncu-rep</code> with Nsight Compute, you will have a view that looks like this:</p>
3496
 
3497
+ <div class="large-image-background-transparent">
3498
+ <div class="boxed-image">
3499
+ <img alt="image.png" src="/assets/images/a1_ncu.png" style="width: 1200px; max-width: none;" />
3500
+ </div>
3501
  </div>
3502
 
3503
  <p>It comes with clear warnings about compute and memory utilization, along with suggestions on how to better balance compute and memory in the kernel and achieve maximal occupancy.</p>
dist/main.bundle.js CHANGED
@@ -4920,8 +4920,8 @@ function updateGraph() {
4920
  }]
4921
  };
4922
  console.log('Data for treemap:', data);
4923
- var width = 700;
4924
- var height = 450;
4925
  var legendHeight = 50;
4926
  var svg = src_select("#graph").select("svg");
4927
  svg.selectAll("*").remove();
@@ -4952,10 +4952,10 @@ function updateGraph() {
4952
 
4953
  // Give distinct colors to the main section containers
4954
  case 'Activation Memory':
4955
- return 'rgb(78, 165, 183)';
4956
  // Orange
4957
  case 'Parameters / Gradients / Optimizer States':
4958
- return 'rgb(232, 137, 171)';
4959
  // Teal Blue
4960
 
4961
  // Parameters / Gradients / Optimizer States branch
 
4920
  }]
4921
  };
4922
  console.log('Data for treemap:', data);
4923
+ var width = 600;
4924
+ var height = 600;
4925
  var legendHeight = 50;
4926
  var svg = src_select("#graph").select("svg");
4927
  svg.selectAll("*").remove();
 
4952
 
4953
  // Give distinct colors to the main section containers
4954
  case 'Activation Memory':
4955
+ return 'rgb(61, 198, 159)';
4956
  // Orange
4957
  case 'Parameters / Gradients / Optimizer States':
4958
+ return 'rgba(232, 137, 170, 0.85)';
4959
  // Teal Blue
4960
 
4961
  // Parameters / Gradients / Optimizer States branch
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/style.css CHANGED
@@ -182,7 +182,7 @@ toggle-icon {
182
  }
183
 
184
  toggle-icon.collapsed {
185
- transform: rotate(-90deg);
186
  }
187
 
188
  .toc-content {
@@ -296,80 +296,6 @@ d-contents nav > ul > li > a:hover {
296
  text-decoration: none;
297
  }
298
 
299
- /* memory */
300
- #controls {
301
- display: grid;
302
- grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
303
- column-gap: 10px;
304
- margin-bottom: 20px;
305
- max-width: 100%;
306
- @supports (container-type: inline-size) {
307
- container-type: inline-size;
308
- }
309
- }
310
-
311
- #controls .cell {
312
- padding: 1px;
313
- box-sizing: border-box;
314
- }
315
-
316
- #controls .column-1 {
317
- display: flex;
318
- align-items: center;
319
- }
320
-
321
- #controls .column-2 {
322
- display: flex;
323
- align-items: center;
324
- }
325
- @supports (container-type: inline-size) {
326
- @container (max-width: 600px) {
327
- #controls .column-2 {
328
- order: 2;
329
- }
330
- }
331
- }
332
-
333
- #controls label {
334
- text-align: right;
335
- padding-right: 10px;
336
- flex: 0 0 auto;
337
- width: 150px;
338
- line-height: 1.5em;
339
- font-size: 0.8em;
340
- }
341
-
342
- #controls input[type="range"] {
343
- width: 50%;
344
- margin: 0 10px;
345
- }
346
-
347
- #controls input[type="number"] {
348
- flex-shrink: 0;
349
- width: 60px;
350
- height: 24px;
351
- border: 1px solid var(--distill-gray-light);
352
- border-radius: 0.2rem;
353
- }
354
-
355
- #controls select {
356
- width: 100%;
357
- min-height: 28px;
358
- border: 1px solid var(--distill-gray-light);
359
- border-radius: 0.2rem;
360
- }
361
-
362
- #controls .column {
363
- display: contents;
364
- }
365
-
366
- #graph svg {
367
- font-family: sans-serif;
368
- }
369
-
370
- #graph svg rect {
371
- cursor: pointer;
372
- }
373
  .note-box {
374
  background-color: #f6f8fa;
375
  border-left: 4px solid #444444;
@@ -437,6 +363,28 @@ d-code {
437
  justify-content: center; /* This will center your image */
438
  }
439
 
440
  d-article li {
441
  margin-bottom: 0.0em;
442
  }
@@ -452,3 +400,200 @@ d-article ol ol {
452
  d-article hr {
453
  grid-column: text;
454
  }
182
  }
183
 
184
  toggle-icon.collapsed {
185
+ transform: rotate(90deg);
186
  }
187
 
188
  .toc-content {
 
296
  text-decoration: none;
297
  }
298
 
299
  .note-box {
300
  background-color: #f6f8fa;
301
  border-left: 4px solid #444444;
 
363
  justify-content: center; /* This will center your image */
364
  }
365
 
366
+ .large-image-background-transparent {
367
+ /* width: 100vw; */
368
+ padding-top: 10px;
369
+ padding-bottom: 10px;
370
+ /* margin-left: calc(-50vw + 50%); */
371
+ margin-left:-100px;
372
+ margin-right: -100px;
373
+ /* margin-right: calc(-50vw + 50%); */
374
+ /* background: white; */
375
+ height: fit-content; /* This will make it match the image height */
376
+ display: flex;
377
+ justify-content: center; /* This will center your image */
378
+ }
379
+
380
+ .boxed-image {
381
+ padding: 0.5rem;
382
+ background: white;
383
+ border-radius: 12px;
384
+ border: 1px solid #e5e7eb;
385
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
386
+ }
387
+
388
  d-article li {
389
  margin-bottom: 0.0em;
390
  }
 
400
  d-article hr {
401
  grid-column: text;
402
  }
403
+
404
+ /* Memory visualization */
405
+ #graph-all {
406
+ min-width: 500px;
407
+ margin-right: 10px;
408
+ margin-bottom: 2rem;
409
+ padding: 0.5rem;
410
+ background: #f9fafb;
411
+ border-radius: 12px;
412
+ border: 1px solid #e5e7eb;
413
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
414
+ }
415
+
416
+
417
+ /* Main container styles */
418
+ #controls {
419
+ max-width: 1200px;
420
+ /* margin: 2rem auto; */
421
+ margin-bottom: 2rem;
422
+ margin-left: 10px;
423
+ padding: 0.6rem;
424
+ background: #f9fafb;
425
+ border-radius: 12px;
426
+ border: 1px solid #e5e7eb;
427
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
428
+ }
429
+
430
+ /* Grid layout */
431
+ #controls {
432
+ display: grid;
433
+ grid-template-columns: 1fr 1fr;
434
+ /* gap: 2rem; */
435
+ }
436
+
437
+ /* Cell styles */
438
+ .cell {
439
+ margin-bottom: 0.2rem;
440
+ }
441
+
442
+ /* Label styles */
443
+ label {
444
+ display: block;
445
+ /* margin-bottom: 0.5rem; */
446
+ font-size: 0.8rem;
447
+ font-weight: 500;
448
+ color: #374151;
449
+ }
450
+
451
+ /* Input container for range + number combination */
452
+ .input-container {
453
+ display: flex;
454
+ gap: 1rem;
455
+ align-items: center;
456
+ }
457
+
458
+ /* Range input styling */
459
+ input[type="range"] {
460
+ flex: 1;
461
+ height: 6px;
462
+ background: #e5e7eb;
463
+ border-radius: 3px;
464
+ appearance: none;
465
+ outline: none;
466
+ }
467
+
468
+ input[type="range"]::-webkit-slider-thumb {
469
+ appearance: none;
470
+ width: 16px;
471
+ height: 16px;
472
+ background: #3b82f6;
473
+ border-radius: 50%;
474
+ cursor: pointer;
475
+ transition: background 0.15s ease;
476
+ }
477
+
478
+ input[type="range"]::-webkit-slider-thumb:hover {
479
+ background: #2563eb;
480
+ }
481
+
482
+ /* Number input styling */
483
+ input[type="number"] {
484
+ width: 80px;
485
+ padding: 0.5rem;
486
+ border: 1px solid #e5e7eb;
487
+ border-radius: 6px;
488
+ font-size: 0.9rem;
489
+ color: #374151;
490
+ }
491
+
492
+ /* Select styling */
493
+ select {
494
+ width: 100%;
495
+ padding: 0.5rem;
496
+ border: 1px solid #e5e7eb;
497
+ border-radius: 6px;
498
+ background: white;
499
+ font-size: 0.9rem;
500
+ color: #374151;
501
+ cursor: pointer;
502
+ }
503
+
504
+ /* Checkbox styling */
505
+ input[type="checkbox"] {
506
+ width: 1.2rem;
507
+ height: 1.2rem;
508
+ margin-right: 0.5rem;
509
+ border: 2px solid #e5e7eb;
510
+ border-radius: 4px;
511
+ cursor: pointer;
512
+ }
513
+
514
+ /* Column specific styles */
515
+ .column-1 {
516
+ padding-right: 0.5rem;
517
+ }
518
+
519
+ .column-2 {
520
+ padding-left: 0.5rem;
521
+ }
522
+
523
+ /* Checkbox container */
524
+ .checkbox-container {
525
+ display: flex;
526
+ align-items: center;
527
+ margin-bottom: 1rem;
528
+ }
529
+
530
+ /* Memory visualization styles */
531
+ .memory-block {
532
+ background: #fff;
533
+ border-radius: 8px;
534
+ padding: 1rem;
535
+ margin-bottom: 1rem;
536
+ box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05);
537
+ }
538
+
539
+ .memory-title {
540
+ font-size: 1.1rem;
541
+ font-weight: 500;
542
+ color: #374151;
543
+ margin-bottom: 0.5rem;
544
+ }
545
+
546
+ .memory-value {
547
+ font-size: 1.5rem;
548
+ font-weight: 600;
549
+ color: #3b82f6;
550
+ }
551
+
552
+ /* Responsive adjustments */
553
+ @media (max-width: 768px) {
554
+ #controls {
555
+ grid-template-columns: 1fr;
556
+ padding: 1rem;
557
+ }
558
+
559
+ .column-1, .column-2 {
560
+ padding: 0;
561
+ }
562
+ }
563
+
564
+ /* Hover states and transitions */
565
+ input:hover, select:hover {
566
+ border-color: #3b82f6;
567
+ }
568
+
569
+ input:focus, select:focus {
570
+ border-color: #2563eb;
571
+ outline: none;
572
+ box-shadow: 0 0 0 2px rgba(59, 130, 246, 0.1);
573
+ }
574
+
575
+ /* Add smooth transitions */
576
+ input, select, button {
577
+ transition: all 0.15s ease;
578
+ }
579
+
580
+ /* Preset dropdown special styling */
581
+ select[name="presets"] {
582
+ background-color: #f3f4f6;
583
+ font-weight: 500;
584
+ }
585
+
586
+ /* Memory graph enhancements */
587
+ .activation-memory {
588
+ background: #dbeafe;
589
+ padding: 1rem;
590
+ border-radius: 8px;
591
+ margin-bottom: 1rem;
592
+ }
593
+
594
+ .gradient-memory {
595
+ background: #ede9fe;
596
+ padding: 1rem;
597
+ border-radius: 8px;
598
+ }
599
+
src/index.html CHANGED
@@ -90,108 +90,141 @@
90
  <p>The book is built on the following <strong>three general foundations</strong>:</p>
91
 
92
  <p><strong>Quick intros on theory and concepts:</strong> before diving into code and experiments, we want to understand how each method works at a high level and what its advantages and limits are. You’ll learn which parts of a language model eat up your memory and when during training this happens. You’ll learn how we can address memory constraints by parallelizing the models and increase throughput by scaling up GPUs. As a result, you'll understand how the following widget for computing the memory breakdown of a transformer model works: </p>
 
93
 
94
- <div id="graph"></div>
95
- <div id="controls">
96
- <div class="cell column-1">
97
- <label for="a">Attention Heads (a):</label>
98
- <input type="range" id="a" name="a" min="1" max="128" value="8">
99
- <input type="number" id="a_input" value="8" min="1" max="128">
100
  </div>
101
- <div class="cell column-2">
102
- <label for="mixed">Mixed Precision:</label>
103
- <input type="checkbox" id="mixed" name="mixed" checked>
104
- <span></span> <!-- Empty span to maintain grid alignment -->
105
- </div>
106
- <div class="cell column-1">
107
- <label for="b">Micro Batch Size (b):</label>
108
- <input type="range" id="b" name="b" min="1" max="53248" value="32">
109
- <input type="number" id="b_input" value="32" min="1" max="53248">
110
- </div>
111
- <div class="cell column-2">
112
- <label for="seq_parallel">Sequence Parallelism:</label>
113
- <input type="checkbox" id="seq_parallel" name="seq_parallel">
114
- <span></span> <!-- Empty span to maintain grid alignment -->
115
- </div>
116
- <div class="cell column-1">
117
- <label for="h">Hidden Dimension (h):</label>
118
- <input type="range" id="h" name="h" min="1" max="16384" value="512">
119
- <input type="number" id="h_input" value="512" min="128" max="16384">
120
- </div>
121
- <div class="cell column-2">
122
- <label for="recomputation">Recomputation:</label>
123
- <select id="recomputation" name="recomputation">
124
- <option value="none">None</option>
125
- <option value="selective">Selective</option>
126
- <option value="full">Full</option>
127
- </select>
128
- <span></span> <!-- Empty span to maintain grid alignment -->
129
- </div>
130
- <div class="cell column-1">
131
- <label for="h_ff">Feedforward Dimension (h_ff):</label>
132
- <input type="range" id="h_ff" name="h_ff" min="1" max="65536" value="2048">
133
- <input type="number" id="h_ff_input" value="2048" min="512" max="65536">
134
- </div>
135
- <div class="cell column-2">
136
- <label for="zero">Zero:</label>
137
- <select id="zero" name="zero">
138
- <option value="0">0</option>
139
- <option value="1">1</option>
140
- <option value="2">2</option>
141
- <option value="3">3</option>
142
- </select>
143
- <span></span> <!-- Empty span to maintain grid alignment -->
144
- </div>
145
- <div class="cell column-1">
146
- <label for="L">Number of Layers (L):</label>
147
- <input type="range" id="L" name="L" min="1" max="126" value="12">
148
- <input type="number" id="L_input" value="12" min="1" max="126">
149
- </div>
150
- <div class="cell column-2">
151
- <label for="ff_activation">FF Activation:</label>
152
- <select id="ff_activation" name="ff_activation">
153
- <option value="relu">ReLU</option>
154
- <option value="gelu">GELU</option>
155
- <option value="swiglu">SwiGLU</option>
156
- </select>
157
- <span></span> <!-- Empty span to maintain grid alignment -->
158
- </div>
159
- <div class="cell column-1">
160
- <label for="s">Sequence Length (s):</label>
161
- <input type="range" id="s" name="s" min="1" max="128000" value="128">
162
- <input type="number" id="s_input" value="128" min="64" max="128000">
163
- </div>
164
- <div class="cell column-2">
165
- <label for="presets">Presets:</label>
166
- <select id="presets" name="presets">
167
- <option value="Llama 3 Tiny">Llama 3 Tiny</option>
168
- <option value="Llama 3 8B">Llama 3 8B</option>
169
- <option value="Llama 3 70B">Llama 3 70B</option>
170
- <option value="Llama 3 405B">Llama 3 405B</option>
171
- </select>
172
- <span></span> <!-- Empty span to maintain grid alignment -->
173
- </div>
174
- <div class="cell column-1">
175
- <label for="v">Vocabulary Size (v):</label>
176
- <input type="range" id="v" name="v" min="1000" max="100000" value="30522">
177
- <input type="number" id="v_input" value="30522" min="1000" max="100000">
178
- </div>
179
- <div class="cell column-2">
180
- <label for="tp">Tensor Parallelism (t):</label>
181
- <input type="range" id="tp" name="tp" min="1" max="16" value="8">
182
- <input type="number" id="tp_input" value="8" min="1" max="16">
183
- </div>
184
- <div class="cell column-1">
185
- <label for="k">Optimizer Parameters (k):</label>
186
- <input type="range" id="k" name="k" min="1" max="16" value="8">
187
- <input type="number" id="k_input" value="8" min="1" max="16">
188
- </div>
189
- <div class="cell column-2">
190
- <label for="dp">Data Parallelism (d):</label>
191
- <input type="range" id="dp" name="dp" min="1" max="256" value="1">
192
- <input type="number" id="dp_input" value="1" min="1" max="256">
193
  </div>
194
  </div>
 
195
 
196
  <p>While this widget gives a theoretical breakdown, the following tool can be used to predict the memory usage:</p>
197
  <ul>
@@ -1724,9 +1757,11 @@
1724
 
1725
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1726
 
1727
- <div class="large-image-background">
 
1728
  <img alt="TP & SP diagram" src="/assets/images/5d_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1729
  </div>
 
1730
 
1731
 
1732
  <p>The main reason we don't want to use TP alone for parallelism is that, in practice, TP has the two limitations we discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point, at which communication overhead begins to dominate. Second, unlike ZeRO and PP, which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
@@ -1737,17 +1772,20 @@
1737
 
1738
  <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in the <a target="_self" href="#context_parallelism">CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1739
 
1740
- <div class="large-image-background">
 
1741
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1200px; max-width: none;" />
1742
  </div>
1743
-
1744
 
1745
  <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to the relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly, since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1746
  <aside>For instance DeepSeek V3 uses 256 experts.</aside>
1747
 
1748
- <div class="large-image-background">
 
1749
  <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1200px; max-width: none;" />
1750
  </div>
 
1751
  <div class="note-box">
1752
  <p class="note-box-title">📝 Note</p>
1753
  <div class="note-box-content">
@@ -1799,15 +1837,19 @@
1799
  <p><strong>Summarizing it all:</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them? Yes, we're up for the challenge!</p>
1800
  <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer (in its MoE variant). We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1801
 
1802
- <div class="large-image-background">
 
1803
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1200px; max-width: none;"/></p>
1804
  </div>
 
1805
 
1806
  <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths, as well as with selective (top) and full (bottom) recomputation, so you can see how they all play with activations:</p>
1807
 
1808
- <div class="large-image-background">
 
1809
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1200px; max-width: none;"/>
1810
  </div>
 
1811
 
1812
  <p>Let's finish this section with a high-level view of all these techniques, their main underlying ideas, and their major bottlenecks:</p>
1813
 
@@ -1958,10 +2000,12 @@
1958
  <p>All the following benchmarks were conducted with a sequence length of 4096 and a global batch size of 1M tokens. We gathered all the top configurations for each model and cluster size and plotted them in the following heatmaps:</p>
1959
  </p>
1960
 
1961
- <div class="large-image-background">
 
1962
  <p><img alt="image.png" src="/assets/images/what_we_learnt_heatmap.svg" /></p>
1963
  </div>
1964
- <div class="figure-legend">
 
1965
  <p>Heatmap visualization showing the optimal training configurations across different model sizes and compute node counts (we have 8 GPUs per node). For each combination, the configuration details include Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Gradient Accumulation Steps (GAS), Micro Batch Size (MBS), and ZeRO optimization stage. The color intensity indicates the Model FLOPs Utilization (MFU), with brighter colors representing higher efficiency.</p>
1966
  </div>
1967
  <p>From this high-level visualization, we can draw several important insights:
@@ -2265,14 +2309,13 @@
2265
 
2266
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2267
 
2268
- <div class="large-image-background">
2269
- <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
 
 
2270
  </div>
2271
- <div class="large-image-background">
2272
- <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
2273
  </div>
2274
 
2275
-
2276
  <p>The reason for this is that in this kernel, two threads in the same block with thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning row elements are in consecutive memory addresses, as shown in the figure below), thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math> in the first iteration <code>i = 0</code>. These elements are not stored close to each other in memory, and this misalignment will be present at each iteration, thereby preventing memory accesses from being coalesced.</p>
2277
 
2278
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
@@ -2297,9 +2340,11 @@
2297
 
2298
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2299
 
2300
- <div class="large-image-background">
 
2301
  <p><img width="1200px" alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2302
  </div>
 
2303
 
2304
  <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>! Amazing.</p>
2305
  <p>Now let's cover another technique you will often see mentioned in the literature: <strong>tiling</strong>.</p>
@@ -2685,7 +2730,15 @@
2685
  </ul>
2686
 
2687
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2688
-
2689
  <h3>Acknowledgements</h3>
2690
 
2691
  <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
@@ -3395,8 +3448,10 @@
3395
 
3396
  <p>This would print aggregated profiling results sorted by the total CUDA time, and the output would be:</p>
3397
 
3398
- <div class="large-image-background">
3399
- <img alt="image.png" src="/assets/images/a1_kernels.png" style="width: 1200px; max-width: none;" />
 
 
3400
  </div>
3401
 
3402
  <p>You can also inspect the trace, as mentioned previously, at <code>chrome://tracing/</code>.</p>
@@ -3410,8 +3465,10 @@
3410
 
3411
  <p>After zooming in, you can observe the flow of operations when calling <code>layer_norm</code> in this trace:</p>
3412
 
3413
- <div class="large-image-background">
3414
- <img alt="image.png" src="/assets/images/a1_profile_trace.png" style="width: 1200px; max-width: none;" />
 
 
3415
  </div>
3416
 
3417
  <p>The sequence begins in the CPU (the upper section) with <code>aten::layer_norm</code>, progressing to <code>aten::native_layer_norm</code>, and then transitioning to <code>cudaLaunchKernel</code>. From there, we move on to the GPU, where the <code>vectorized_layer_norm_kernel</code> kernel is called.</p>
@@ -3437,8 +3494,10 @@
3437
 
3438
  <p>and open the file <code>output.ncu-rep</code> with Nsight Compute, you will have a view that looks like this:</p>
3439
 
3440
- <div class="large-image-background">
3441
- <img alt="image.png" src="/assets/images/a1_ncu.png" style="width: 1200px; max-width: none;" />
 
 
3442
  </div>
3443
 
3444
  <p>It gives clear warnings about compute and memory utilization, along with suggestions on how to better balance compute and memory in the kernel and achieve maximal occupancy.</p>
 
90
  <p>The book is built on the following <strong>three general foundations</strong>:</p>
91
 
92
  <p><strong>Quick intros on theory and concepts:</strong> before diving into code and experiments, we want to understand how each method works at a high level and what it’s advantages and limits are. You’ll learn about which parts of a language model eat away your memory and when during training it happens. You’ll learn how we can solve memory constraints by parallelizing the models and increase the throughput by scaling up GPUs. As a result you'll understand how the following widget to compute the memory breakdown of a transformer model works: </p>
93
+ <aside>Note that we're still missing Pipeline Parallelism in this widget. It's left as an exercise for the reader.</aside>
94
 
95
+ <div class="large-image-background-transparent">
96
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
97
+ <div id="graph-all">
98
+ <div class="figure-legend">Memory usage breakdown</div>
99
+ <div id="graph"></div>
 
100
  </div>
101
+ <div id="controls">
102
+ <div class="cell column-1">
103
+ <label for="a">Attention Heads (a):</label>
104
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
105
+ <input type="range" id="a" name="a" min="1" max="128" value="8">
106
+ <input type="number" id="a_input" value="8" min="1" max="128">
107
+ </div>
108
+ </div>
109
+ <div class="cell column-2">
110
+ <label for="mixed">Mixed Precision:</label>
111
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
112
+ <input type="checkbox" id="mixed" name="mixed" checked>
113
+ <span></span> <!-- Empty span to maintain grid alignment -->
114
+ </div>
115
+ </div>
116
+ <div class="cell column-1">
117
+ <label for="b">Micro Batch Size (b):</label>
118
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
119
+ <input type="range" id="b" name="b" min="1" max="53248" value="32">
120
+ <input type="number" id="b_input" value="32" min="1" max="53248">
121
+ </div>
122
+ </div>
123
+ <div class="cell column-2">
124
+ <label for="seq_parallel">Sequence Parallelism:</label>
125
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
126
+ <input type="checkbox" id="seq_parallel" name="seq_parallel">
127
+ <span></span> <!-- Empty span to maintain grid alignment -->
128
+ </div>
129
+ </div>
130
+ <div class="cell column-1">
131
+ <label for="h">Hidden Dimension (h):</label>
132
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
133
+ <input type="range" id="h" name="h" min="1" max="16384" value="512">
134
+ <input type="number" id="h_input" value="512" min="128" max="16384">
135
+ </div>
136
+ </div>
137
+ <div class="cell column-2">
138
+ <label for="recomputation">Recomputation:</label>
139
+ <select id="recomputation" name="recomputation">
140
+ <option value="none">None</option>
141
+ <option value="selective">Selective</option>
142
+ <option value="full">Full</option>
143
+ </select>
144
+ <span></span> <!-- Empty span to maintain grid alignment -->
145
+ </div>
146
+ <div class="cell column-1">
147
+ <label for="h_ff">Feedforward Dimension (h_ff):</label>
148
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
149
+ <input type="range" id="h_ff" name="h_ff" min="1" max="65536" value="2048">
150
+ <input type="number" id="h_ff_input" value="2048" min="512" max="65536">
151
+ </div>
152
+ </div>
153
+ <div class="cell column-2">
154
+ <label for="zero">Zero:</label>
155
+ <select id="zero" name="zero">
156
+ <option value="0">0</option>
157
+ <option value="1">1</option>
158
+ <option value="2">2</option>
159
+ <option value="3">3</option>
160
+ </select>
161
+ <span></span> <!-- Empty span to maintain grid alignment -->
162
+ </div>
163
+ <div class="cell column-1">
164
+ <label for="L">Number of Layers (L):</label>
165
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
166
+ <input type="range" id="L" name="L" min="1" max="126" value="12">
167
+ <input type="number" id="L_input" value="12" min="1" max="126">
168
+ </div>
169
+ </div>
170
+ <div class="cell column-2">
171
+ <label for="ff_activation">FF Activation:</label>
172
+ <select id="ff_activation" name="ff_activation">
173
+ <option value="relu">ReLU</option>
174
+ <option value="gelu">GELU</option>
175
+ <option value="swiglu">SwiGLU</option>
176
+ </select>
177
+ <span></span> <!-- Empty span to maintain grid alignment -->
178
+ </div>
179
+ <div class="cell column-1">
180
+ <label for="s">Sequence Length (s):</label>
181
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
182
+ <input type="range" id="s" name="s" min="1" max="128000" value="128">
183
+ <input type="number" id="s_input" value="128" min="64" max="128000">
184
+ </div>
185
+ </div>
186
+ <div class="cell column-2">
187
+ <label for="v">Vocabulary Size (v):</label>
188
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
189
+ <input type="range" id="v" name="v" min="1000" max="100000" value="30522">
190
+ <input type="number" id="v_input" value="30522" min="1000" max="100000">
191
+ </div>
192
+ </div>
193
+ <div class="cell column-1">
194
+ <label for="tp">Tensor Parallelism (t):</label>
195
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
196
+ <input type="range" id="tp" name="tp" min="1" max="16" value="8">
197
+ <input type="number" id="tp_input" value="8" min="1" max="16">
198
+ </div>
199
+ </div>
200
+ <div class="cell column-2">
201
+ <label for="k">Optimizer Parameters (k):</label>
202
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
203
+ <input type="range" id="k" name="k" min="1" max="16" value="8">
204
+ <input type="number" id="k_input" value="8" min="1" max="16">
205
+ </div>
206
+ </div>
207
+ <div class="cell column-1">
208
+ <label for="dp">Data Parallelism (d):</label>
209
+ <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
210
+ <input type="range" id="dp" name="dp" min="1" max="256" value="1">
211
+ <input type="number" id="dp_input" value="1" min="1" max="256">
212
+ </div>
213
+ </div>
214
+ <div class="cell column-2">
215
+ <label for="presets">Presets:</label>
216
+ <select id="presets" name="presets">
217
+ <option value="Llama 3 Tiny">Llama 3 Tiny</option>
218
+ <option value="Llama 3 8B">Llama 3 8B</option>
219
+ <option value="Llama 3 70B">Llama 3 70B</option>
220
+ <option value="Llama 3 405B">Llama 3 405B</option>
221
+ </select>
222
+ <span></span> <!-- Empty span to maintain grid alignment -->
223
+ </div>
224
+ </div>
225
  </div>
226
  </div>
227
+ <p>(Don't worry if you have no idea what's happening in this widget. That's why we're here.)</p>
228
 
229
  <p>While this widget gives a theoretical breakdown, the following tool can be used to predict the memory usage:</p>
230
  <ul>
 
1757
 
1758
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary to, and can be combined with, both Pipeline Parallelism and ZeRO-3, as it relies on the distributive property of matrix multiplication, which allows weights and activations to be sharded and computed independently before being combined.</p>
1759
 
1760
+ <div class="large-image-background-transparent">
1761
+ <div class="boxed-image">
1762
  <img alt="TP & SP diagram" src="/assets/images/5d_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1763
  </div>
1764
+ </div>
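  <p>To make this concrete, here is a minimal NumPy sketch (purely illustrative, with toy shapes of our own choosing, not the actual implementation) of a column-sharded linear layer: each "device" holds a slice of the weight columns, computes its partial output independently, and a final gather reproduces the unsharded result exactly:</p>
  <d-code block language="python">
import numpy as np

# Toy dimensions: 4 tokens, hidden size 8, output size 6
x = np.random.randn(4, 8)
W = np.random.randn(8, 6)

# Column-parallel sharding: each "device" owns half of the output columns
W_shard_0, W_shard_1 = np.split(W, 2, axis=1)

# Each device computes its partial output with no communication
y_0 = x @ W_shard_0
y_1 = x @ W_shard_1

# Gathering the shards along the output dimension recovers the full result
y_tp = np.concatenate([y_0, y_1], axis=1)
assert np.allclose(y_tp, x @ W)
  </d-code>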
1765
 
1766
 
1767
  <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
1772
 
1773
  <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in the <a target="_self" href="#context_parallelism">CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1774
 
1775
+ <div class="large-image-background-transparent">
1776
+ <div class="boxed-image">
1777
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1200px; max-width: none;" />
1778
  </div>
1779
+ </div>
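  <p>To see why attention is the one place where communication is unavoidable, here is a small NumPy simulation (illustrative only; it ignores causal masking, scaling, and the online-softmax trick that real ring attention uses): each rank's query shard must visit every key/value shard before its output matches unsharded attention:</p>
  <d-code block language="python">
import numpy as np

seq, d, num_ranks = 8, 4, 2
Q, K, V = (np.random.randn(seq, d) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Each rank owns one contiguous shard of the sequence
Q_shards = np.split(Q, num_ranks)
K_shards = np.split(K, num_ranks)
V_shards = np.split(V, num_ranks)

outputs = []
for rank in range(num_ranks):
    q = Q_shards[rank]
    # MLPs/LayerNorm only need the local shard, but attention must see every
    # K/V shard -- in ring attention these shards arrive one hop at a time
    scores = np.concatenate([q @ K_shards[r].T for r in range(num_ranks)], axis=1)
    outputs.append(softmax(scores) @ np.concatenate(V_shards, axis=0))

assert np.allclose(np.concatenate(outputs), softmax(Q @ K.T) @ V)
  </d-code>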
1780
 
1781
  <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to the relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation, which routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly, since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1782
  <aside>For instance DeepSeek V3 uses 256 experts.</aside>
1783
 
1784
+ <div class="large-image-background-transparent">
1785
+ <div class="boxed-image">
1786
  <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1200px; max-width: none;" />
1787
  </div>
1788
+ </div>
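  <p>As a rough sketch of this dispatch/combine pattern, here is a single-process NumPy simulation (illustrative only, with a random top-1 router of our own making); the two grouping steps below are what the actual <code>all-to-all</code> collectives perform across expert-parallel ranks:</p>
  <d-code block language="python">
import numpy as np

num_experts, num_tokens, hidden = 4, 8, 16
x = np.random.randn(num_tokens, hidden)

# A (random) top-1 router assigns every token to one expert
expert_ids = np.random.randint(0, num_experts, size=num_tokens)

# "Dispatch": group tokens by expert -- on a real cluster this exchange of
# token buffers between GPUs is the first all-to-all
tokens_per_expert = [x[expert_ids == e] for e in range(num_experts)]

# Each expert (here just a linear layer) only processes its own tokens
expert_weights = [np.random.randn(hidden, hidden) for _ in range(num_experts)]
expert_outputs = [t @ W for t, W in zip(tokens_per_expert, expert_weights)]

# "Combine": route the results back to their original positions -- the
# second all-to-all in a real implementation
y = np.empty_like(x)
for e in range(num_experts):
    y[expert_ids == e] = expert_outputs[e]
  </d-code>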
1789
  <div class="note-box">
1790
  <p class="note-box-title">📝 Note</p>
1791
  <div class="note-box-content">
 
1837
  <p><strong>Summarizing it all:</strong> now, how about gathering all the techniques we've seen and combining them in a single diagram? Yes, we're up for the challenge!</p>
1838
  <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer (in its MoE variant). We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1839
 
1840
+ <div class="large-image-background-transparent">
1841
+ <div class="boxed-image">
1842
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1200px; max-width: none;"/></p>
1843
  </div>
1844
+ </div>
1845
 
1846
  <p>We can also show, side by side, a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them at different sequence lengths, with both selective (top) and full (bottom) recomputation, so you can see how they all interact with activations:</p>
1847
 
1848
+ <div class="large-image-background-transparent">
1849
+ <div class="boxed-image">
1850
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1200px; max-width: none;"/>
1851
  </div>
1852
+ </div>
1853
 
1854
  <p>Let's finish this section with a high-level view of all of these techniques, their main underlying ideas and major bottlenecks:</p>
1855
 
 
2000
  <p>All the following benchmarks were conducted with a sequence length of 4096 and a global batch size of 1M tokens. We gathered all the top configurations for each model and cluster size and plotted them in the following heatmaps:</p>
2001
2002
 
2003
+ <div class="large-image-background-transparent">
2004
+ <div class="boxed-image">
2005
  <p><img alt="image.png" src="/assets/images/what_we_learnt_heatmap.svg" /></p>
2006
  </div>
2007
+ </div>
2008
+ <div class="figure-legend">
2009
  <p>Heatmap visualization showing the optimal training configurations across different model sizes and compute node counts (we have 8 GPUs per node). For each combination, the configuration details include Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Gradient Accumulation Steps (GAS), Micro Batch Size (MBS), and ZeRO optimization stage. The color intensity indicates the Model FLOPs Utilization (MFU), with brighter colors representing higher efficiency.</p>
2010
  </div>
2011
  <p>From this high-level visualization, we can draw several important insights:
 
2309
 
2310
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2311
 
2312
+ <div class="large-image-background-transparent">
2313
+ <div class="boxed-image">
2314
+ <img width="1400px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
2315
+ <img width="1400px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
2316
  </div>
 
 
2317
  </div>
2318
 
 
2319
  <p>The reason for this is that in this kernel, two threads in the same block with thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning row elements are in consecutive memory addresses, as shown in the figure below), thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math> in the first iteration <code>i = 0</code>. These elements are not stored close to each other in memory, and this misalignment will be present at each iteration, thereby preventing memory accesses from being coalesced.</p>
2320
 
2321
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
 
2340
 
2341
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2342
 
2343
+ <div class="large-image-background-transparent">
2344
+ <div class="boxed-image">
2345
  <p><img width="1200px" alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2346
  </div>
2347
+ </div>
2348
 
2349
  <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>! Amazing.</p>
2350
  <p>Now let's cover another technique you will often see mentioned in the literature: <strong>tiling</strong>.</p>
 
2730
  </ul>
2731
 
2732
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2733
+
2734
+ <hr>
2735
+
2736
+ <p><strong>One last word</strong> for our first readers: we're so happy with this piece of writing that we've decided to distribute a limited number of physical printed editions of it as a gift to you.</p>
2737
+ <p>If you are among the first 50 people to fill in your email address below, we'll contact you later in the year to send you a real physical edition once we've formatted it as a printed copy.</p>
2738
+ <p>We expect the book to be around 100-150 pages and to cover the same content as the blog post, but we may also decide to shorten or lengthen it depending on what makes sense as a printed object.</p>
2739
+ <p>To get your physical copy, please fill in your email address in the following <a target="_blank" href="https://forms.gle/e1GkAShUCtgcwnne8">Google form</a>.</p>
2740
+ <p>Whether you are one of our first readers or coming to this blog post much later, we're very happy to see that you enjoyed this sharing of knowledge. May the force of open source and open science always be with you.</p>
2741
+
2742
  <h3>Acknowledgements</h3>
2743
 
2744
  <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
 
3448
 
3449
  <p>This would print aggregated profiling results sorted by the total CUDA time, and the output would be:</p>
3450
 
3451
+ <div class="large-image-background-transparent">
3452
+ <div class="boxed-image">
3453
+ <img alt="image.png" src="/assets/images/a1_kernels.png" style="width: 1200px; max-width: none;" />
3454
+ </div>
3455
  </div>
3456
 
3457
  <p>You can also try to inspect the trace as we previously mentioned on <code>chrome://tracing/</code></p>
 
3465
 
3466
  <p>After zooming in, you can observe the flow of operations when calling <code>layer_norm</code> in this trace:</p>
3467
 
3468
+ <div class="large-image-background-transparent">
3469
+ <div class="boxed-image">
3470
+ <img alt="image.png" src="/assets/images/a1_profile_trace.png" style="width: 1200px; max-width: none;" />
3471
+ </div>
3472
  </div>
3473
 
3474
  <p>The sequence begins in the CPU (the upper section) with <code>aten::layer_norm</code>, progressing to <code>aten::native_layer_norm</code>, and then transitioning to <code>cudaLaunchKernel</code>. From there, we move on to the GPU, where the <code>vectorized_layer_norm_kernel</code> kernel is called.</p>
 
3494
 
3495
  <p>and open the file <code>output.ncu-rep</code> with Nsight Compute, you will have a view that looks like this:</p>
3496
 
3497
+ <div class="large-image-background-transparent">
3498
+ <div class="boxed-image">
3499
+ <img alt="image.png" src="/assets/images/a1_ncu.png" style="width: 1200px; max-width: none;" />
3500
+ </div>
3501
  </div>
3502
 
3503
  <p>It gives clear warnings about compute and memory utilization, along with suggestions on how to better balance compute and memory in the kernel and achieve maximal occupancy.</p>
src/memory.js CHANGED
@@ -189,8 +189,8 @@ export function updateGraph() {
189
 
190
  console.log('Data for treemap:', data);
191
 
192
- const width = 700;
193
- const height = 450;
194
  const legendHeight = 50;
195
 
196
  const svg = d3.select("#graph").select("svg");
@@ -225,8 +225,8 @@ export function updateGraph() {
225
  case 'Total': return 'rgb(225, 225, 225)'; // Light Grey
226
 
227
  // Give distinct colors to the main section containers
228
- case 'Activation Memory': return 'rgb(78, 165, 183)'; // Orange
229
- case 'Parameters / Gradients / Optimizer States': return 'rgb(232, 137, 171)'; // Teal Blue
230
 
231
  // Parameters / Gradients / Optimizer States branch
232
  case 'Parameters': return 'rgb(206, 192, 250)'; // Blue
 
189
 
190
  console.log('Data for treemap:', data);
191
 
192
+ const width = 600;
193
+ const height = 600;
194
  const legendHeight = 50;
195
 
196
  const svg = d3.select("#graph").select("svg");
 
225
  case 'Total': return 'rgb(225, 225, 225)'; // Light Grey
226
 
227
  // Give distinct colors to the main section containers
228
+ case 'Activation Memory': return 'rgb(61, 198, 159)'; // Green
229
+ case 'Parameters / Gradients / Optimizer States': return 'rgba(232, 137, 170, 0.85)'; // Pink
230
 
231
  // Parameters / Gradients / Optimizer States branch
232
  case 'Parameters': return 'rgb(206, 192, 250)'; // Blue
src/style.css CHANGED
@@ -182,7 +182,7 @@ toggle-icon {
182
  }
183
 
184
  toggle-icon.collapsed {
185
- transform: rotate(-90deg);
186
  }
187
 
188
  .toc-content {
@@ -296,80 +296,6 @@ d-contents nav > ul > li > a:hover {
296
  text-decoration: none;
297
  }
298
 
299
- /* memory */
300
- #controls {
301
- display: grid;
302
- grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
303
- column-gap: 10px;
304
- margin-bottom: 20px;
305
- max-width: 100%;
306
- @supports (container-type: inline-size) {
307
- container-type: inline-size;
308
- }
309
- }
310
-
311
- #controls .cell {
312
- padding: 1px;
313
- box-sizing: border-box;
314
- }
315
-
316
- #controls .column-1 {
317
- display: flex;
318
- align-items: center;
319
- }
320
-
321
- #controls .column-2 {
322
- display: flex;
323
- align-items: center;
324
- }
325
- @supports (container-type: inline-size) {
326
- @container (max-width: 600px) {
327
- #controls .column-2 {
328
- order: 2;
329
- }
330
- }
331
- }
332
-
333
- #controls label {
334
- text-align: right;
335
- padding-right: 10px;
336
- flex: 0 0 auto;
337
- width: 150px;
338
- line-height: 1.5em;
339
- font-size: 0.8em;
340
- }
341
-
342
- #controls input[type="range"] {
343
- width: 50%;
344
- margin: 0 10px;
345
- }
346
-
347
- #controls input[type="number"] {
348
- flex-shrink: 0;
349
- width: 60px;
350
- height: 24px;
351
- border: 1px solid var(--distill-gray-light);
352
- border-radius: 0.2rem;
353
- }
354
-
355
- #controls select {
356
- width: 100%;
357
- min-height: 28px;
358
- border: 1px solid var(--distill-gray-light);
359
- border-radius: 0.2rem;
360
- }
361
-
362
- #controls .column {
363
- display: contents;
364
- }
365
-
366
- #graph svg {
367
- font-family: sans-serif;
368
- }
369
-
370
- #graph svg rect {
371
- cursor: pointer;
372
- }
373
  .note-box {
374
  background-color: #f6f8fa;
375
  border-left: 4px solid #444444;
@@ -437,6 +363,28 @@ d-code {
437
  justify-content: center; /* This will center your image */
438
  }
439
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
440
  d-article li {
441
  margin-bottom: 0.0em;
442
  }
@@ -452,3 +400,200 @@ d-article ol ol {
452
  d-article hr {
453
  grid-column: text;
454
  }
 
182
  }
183
 
184
  toggle-icon.collapsed {
185
+ transform: rotate(90deg);
186
  }
187
 
188
  .toc-content {
 
296
  text-decoration: none;
297
  }
298
 
299
  .note-box {
300
  background-color: #f6f8fa;
301
  border-left: 4px solid #444444;
 
363
  justify-content: center; /* This will center your image */
364
  }
365
 
366
+ .large-image-background-transparent {
367
+ /* width: 100vw; */
368
+ padding-top: 10px;
369
+ padding-bottom: 10px;
370
+ /* margin-left: calc(-50vw + 50%); */
371
+ margin-left:-100px;
372
+ margin-right: -100px;
373
+ /* margin-right: calc(-50vw + 50%); */
374
+ /* background: white; */
375
+ height: fit-content; /* This will make it match the image height */
376
+ display: flex;
377
+ justify-content: center; /* This will center your image */
378
+ }
379
+
380
+ .boxed-image {
381
+ padding: 0.5rem;
382
+ background: white;
383
+ border-radius: 12px;
384
+ border: 1px solid #e5e7eb;
385
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
386
+ }
387
+
388
  d-article li {
389
  margin-bottom: 0.0em;
390
  }
 
400
  d-article hr {
401
  grid-column: text;
402
  }
403
+
404
+ /* Memory visualization */
405
+ #graph-all {
406
+ min-width: 500px;
407
+ margin-right: 10px;
408
+ margin-bottom: 2rem;
409
+ padding: 0.5rem;
410
+ background: #f9fafb;
411
+ border-radius: 12px;
412
+ border: 1px solid #e5e7eb;
413
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
414
+ }
415
+
416
+
417
+ /* Main container styles */
418
+ #controls {
419
+ max-width: 1200px;
420
+ /* margin: 2rem auto; */
421
+ margin-bottom: 2rem;
422
+ margin-left: 10px;
423
+ padding: 0.6rem;
424
+ background: #f9fafb;
425
+ border-radius: 12px;
426
+ border: 1px solid #e5e7eb;
427
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
428
+ }
429
+
430
+ /* Grid layout */
431
+ #controls {
432
+ display: grid;
433
+ grid-template-columns: 1fr 1fr;
434
+ /* gap: 2rem; */
435
+ }
436
+
437
+ /* Cell styles */
438
+ .cell {
439
+ margin-bottom: 0.2rem;
440
+ }
441
+
442
+ /* Label styles */
443
+ label {
444
+ display: block;
445
+ /* margin-bottom: 0.5rem; */
446
+ font-size: 0.8rem;
447
+ font-weight: 500;
448
+ color: #374151;
449
+ }
450
+
451
+ /* Input container for range + number combination */
452
+ .input-container {
453
+ display: flex;
454
+ gap: 1rem;
455
+ align-items: center;
456
+ }
457
+
458
+ /* Range input styling */
459
+ input[type="range"] {
460
+ flex: 1;
461
+ height: 6px;
462
+ background: #e5e7eb;
463
+ border-radius: 3px;
464
+ appearance: none;
465
+ outline: none;
466
+ }
467
+
468
+ input[type="range"]::-webkit-slider-thumb {
469
+ appearance: none;
470
+ width: 16px;
471
+ height: 16px;
472
+ background: #3b82f6;
473
+ border-radius: 50%;
474
+ cursor: pointer;
475
+ transition: background 0.15s ease;
476
+ }
477
+
478
+ input[type="range"]::-webkit-slider-thumb:hover {
479
+ background: #2563eb;
480
+ }
481
+
482
+ /* Number input styling */
483
+ input[type="number"] {
484
+ width: 80px;
485
+ padding: 0.5rem;
486
+ border: 1px solid #e5e7eb;
487
+ border-radius: 6px;
488
+ font-size: 0.9rem;
489
+ color: #374151;
490
+ }
491
+
492
+ /* Select styling */
493
+ select {
494
+ width: 100%;
495
+ padding: 0.5rem;
496
+ border: 1px solid #e5e7eb;
497
+ border-radius: 6px;
498
+ background: white;
499
+ font-size: 0.9rem;
500
+ color: #374151;
501
+ cursor: pointer;
502
+ }
503
+
504
+ /* Checkbox styling */
505
+ input[type="checkbox"] {
506
+ width: 1.2rem;
507
+ height: 1.2rem;
508
+ margin-right: 0.5rem;
509
+ border: 2px solid #e5e7eb;
510
+ border-radius: 4px;
511
+ cursor: pointer;
512
+ }
513
+
514
+ /* Column specific styles */
515
+ .column-1 {
516
+ padding-right: 0.5rem;
517
+ }
518
+
519
+ .column-2 {
520
+ padding-left: 0.5rem;
521
+ }
522
+
523
+ /* Checkbox container */
524
+ .checkbox-container {
525
+ display: flex;
526
+ align-items: center;
527
+ margin-bottom: 1rem;
528
+ }
529
+
530
+ /* Memory visualization styles */
531
+ .memory-block {
532
+ background: #fff;
533
+ border-radius: 8px;
534
+ padding: 1rem;
535
+ margin-bottom: 1rem;
536
+ box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05);
537
+ }
538
+
539
+ .memory-title {
540
+ font-size: 1.1rem;
541
+ font-weight: 500;
542
+ color: #374151;
543
+ margin-bottom: 0.5rem;
544
+ }
545
+
546
+ .memory-value {
547
+ font-size: 1.5rem;
548
+ font-weight: 600;
549
+ color: #3b82f6;
550
+ }
551
+
552
+ /* Responsive adjustments */
553
+ @media (max-width: 768px) {
554
+ #controls {
555
+ grid-template-columns: 1fr;
556
+ padding: 1rem;
557
+ }
558
+
559
+ .column-1, .column-2 {
560
+ padding: 0;
561
+ }
562
+ }
563
+
564
+ /* Hover states and transitions */
565
+ input:hover, select:hover {
566
+ border-color: #3b82f6;
567
+ }
568
+
569
+ input:focus, select:focus {
570
+ border-color: #2563eb;
571
+ outline: none;
572
+ box-shadow: 0 0 0 2px rgba(59, 130, 246, 0.1);
573
+ }
574
+
575
+ /* Add smooth transitions */
576
+ input, select, button {
577
+ transition: all 0.15s ease;
578
+ }
579
+
580
+ /* Preset dropdown special styling */
581
+ select[name="presets"] {
582
+ background-color: #f3f4f6;
583
+ font-weight: 500;
584
+ }
585
+
586
+ /* Memory graph enhancements */
587
+ .activation-memory {
588
+ background: #dbeafe;
589
+ padding: 1rem;
590
+ border-radius: 8px;
591
+ margin-bottom: 1rem;
592
+ }
593
+
594
+ .gradient-memory {
595
+ background: #ede9fe;
596
+ padding: 1rem;
597
+ border-radius: 8px;
598
+ }
599
+