lvwerra (HF staff) committed
Commit 0ccc803 · 1 Parent(s): 6b8b1ef

add sections

Files changed (2)
  1. dist/index.html +114 -0
  2. src/index.html +114 -0
dist/index.html CHANGED
@@ -212,6 +212,120 @@
  <p>Now that we nailed a few key concept and terms let’s get started by revisiting the basic training steps of an LLM!</p>

  <h2>First Steps: Training on one GPU</h2>
+
+ <h3>Memory usage in Transformers</h3>
+
+ <h4>Memory profiling a training step</h4>
+
+ <h4>Weights/grads/optimizer states memory</h4>
+
+ <h4>Activations memory</h4>
+
+ <h3>Activation recomputation</h3>
+
+ <h3>Gradient accumulation</h3>
+
+ <h2>Data Parallelism</h2>
+
+ <h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
+
+ <h4><strong>Second optimization:</strong> Bucketing gradients</h4>
+
+ <h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
+
+ <h3>Revisit global batch size</h3>
+
+ <h3>Our journey up to now</h3>
+
+ <h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
+
+ <h4>Memory usage revisited</h4>
+
+ <h4>ZeRO-1: Partitioning Optimizer States</h4>
+
+ <h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
+
+ <h4>ZeRO-3: Adding <strong>Parameter Partitioning</strong></h4>
+
+ <h2>Tensor Parallelism</h2>
+
+ <h3>Tensor Parallelism in a Transformer Block</h3>
+
+ <h3>Sequence Parallelism</h3>
+
+ <h2>Context Parallelism</h2>
+
+ <h3>Introducing Context Parallelism</h3>
+
+ <h3>Discovering Ring Attention</h3>
+
+ <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
+
+ <h2>Pipeline Parallelism</h2>
+
+ <h3>Splitting layers on various nodes - All forward, all backward</h3>
+
+ <h3>One-forward-one-backward and LLama 3.1 schemes</h3>
+
+ <h3>Interleaving stages</h3>
+
+ <h3>Zero Bubble and DualPipe</h3>
+
+ <h2>Expert parallelism</h2>
+
+ <h2>5D parallelism in a nutshell</h2>
+
+ <h2>How to Find the Best Training Configuration</h2>
+
+ <h2>Diving in the GPUs – fusing, threading, mixing</h2>
+
+ <h4>A primer on GPU</h4>
+
+ <h3>How to improve performance with Kernels ?</h3>
+
+ <h4>Memory Coalescing</h4>
+
+ <h4>Tiling</h4>
+
+ <h4>Thread Coarsening</h4>
+
+ <h4>Minimizing Control Divergence</h4>
+
+ <h3>Flash Attention 1-3</h3>
+
+ <h3>Fused Kernels</h3>
+
+ <h3>Mixed Precision Training</h3>
+
+ <h4>FP16 and BF16 training</h4>
+
+ <h4>FP8 pretraining</h4>
+
+ <h2>Conclusion</h2>
+
+ <h3>What you learned</h3>
+
+ <h3>What we learned</h3>
+
+ <h3>What’s next?</h3>
+
+ <h2>References</h2>
+
+ <h3>Landmark LLM Scaling Papers</h3>
+
+ <h3>Training Frameworks</h3>
+
+ <h3>Debugging</h3>
+
+ <h3>Distribution Techniques</h3>
+
+ <h3>CUDA Kernels</h3>
+
+ <h3>Hardware</h3>
+
+ <h3>Others</h3>
+
+ <h2>Appendix</h2>

  </d-article>

src/index.html CHANGED
@@ -212,6 +212,120 @@
  <p>Now that we nailed a few key concept and terms let’s get started by revisiting the basic training steps of an LLM!</p>

  <h2>First Steps: Training on one GPU</h2>
+
+ <h3>Memory usage in Transformers</h3>
+
+ <h4>Memory profiling a training step</h4>
+
+ <h4>Weights/grads/optimizer states memory</h4>
+
+ <h4>Activations memory</h4>
+
+ <h3>Activation recomputation</h3>
+
+ <h3>Gradient accumulation</h3>
+
+ <h2>Data Parallelism</h2>
+
+ <h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
+
+ <h4><strong>Second optimization:</strong> Bucketing gradients</h4>
+
+ <h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
+
+ <h3>Revisit global batch size</h3>
+
+ <h3>Our journey up to now</h3>
+
+ <h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
+
+ <h4>Memory usage revisited</h4>
+
+ <h4>ZeRO-1: Partitioning Optimizer States</h4>
+
+ <h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
+
+ <h4>ZeRO-3: Adding <strong>Parameter Partitioning</strong></h4>
+
+ <h2>Tensor Parallelism</h2>
+
+ <h3>Tensor Parallelism in a Transformer Block</h3>
+
+ <h3>Sequence Parallelism</h3>
+
+ <h2>Context Parallelism</h2>
+
+ <h3>Introducing Context Parallelism</h3>
+
+ <h3>Discovering Ring Attention</h3>
+
+ <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
+
+ <h2>Pipeline Parallelism</h2>
+
+ <h3>Splitting layers on various nodes - All forward, all backward</h3>
+
+ <h3>One-forward-one-backward and LLama 3.1 schemes</h3>
+
+ <h3>Interleaving stages</h3>
+
+ <h3>Zero Bubble and DualPipe</h3>
+
+ <h2>Expert parallelism</h2>
+
+ <h2>5D parallelism in a nutshell</h2>
+
+ <h2>How to Find the Best Training Configuration</h2>
+
+ <h2>Diving in the GPUs – fusing, threading, mixing</h2>
+
+ <h4>A primer on GPU</h4>
+
+ <h3>How to improve performance with Kernels ?</h3>
+
+ <h4>Memory Coalescing</h4>
+
+ <h4>Tiling</h4>
+
+ <h4>Thread Coarsening</h4>
+
+ <h4>Minimizing Control Divergence</h4>
+
+ <h3>Flash Attention 1-3</h3>
+
+ <h3>Fused Kernels</h3>
+
+ <h3>Mixed Precision Training</h3>
+
+ <h4>FP16 and BF16 training</h4>
+
+ <h4>FP8 pretraining</h4>
+
+ <h2>Conclusion</h2>
+
+ <h3>What you learned</h3>
+
+ <h3>What we learned</h3>
+
+ <h3>What’s next?</h3>
+
+ <h2>References</h2>
+
+ <h3>Landmark LLM Scaling Papers</h3>
+
+ <h3>Training Frameworks</h3>
+
+ <h3>Debugging</h3>
+
+ <h3>Distribution Techniques</h3>
+
+ <h3>CUDA Kernels</h3>
+
+ <h3>Hardware</h3>
+
+ <h3>Others</h3>
+
+ <h2>Appendix</h2>

  </d-article>