nanotron/ultrascale-playbook

lvwerra (HF staff) committed 14 days ago

Commit 0ccc803 (1 parent: 6b8b1ef)

add sections


Files changed (2)

dist/index.html +114 -0
src/index.html +114 -0

dist/index.html

@@ -212,6 +212,120 @@
         <p>Now that we nailed a few key concept and terms let’s get started by revisiting the basic training steps of an LLM!</p>
         <h2>First Steps: Training on one GPU</h2>
     </d-article>

         <p>Now that we nailed a few key concept and terms let’s get started by revisiting the basic training steps of an LLM!</p>
         <h2>First Steps: Training on one GPU</h2>
+        <h3>Memory usage in Transformers</h3>
+        <h4>Memory profiling a training step</h4>
+        <h4>Weights/grads/optimizer states memory</h4>
+        <h4>Activations memory</h4>
+        <h3>Activation recomputation</h3>
+        <h3>Gradient accumulation</h3>
+        <h2>Data Parallelism</h2>
+        <h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
+        <h4><strong>Second optimization:</strong> Bucketing gradients</h4>
+        <h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
+        <h3>Revisit global batch size</h3>
+        <h3>Our journey up to now</h3>
+        <h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
+        <h4>Memory usage revisited</h4>
+        <h4>ZeRO-1: Partitioning Optimizer States</h4>
+        <h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
+        <h4>ZeRO-3: Adding <strong>Parameter Partitioning</strong></h4>
+        <h2>Tensor Parallelism</h2>
+        <h3>Tensor Parallelism in a Transformer Block</h3>
+        <h3>Sequence Parallelism</h3>
+        <h2>Context Parallelism</h2>
+        <h3>Introducing Context Parallelism</h3>
+        <h3>Discovering Ring Attention</h3>
+        <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
+        <h2>Pipeline Parallelism</h2>
+        <h3>Splitting layers on various nodes - All forward, all backward</h3>
+        <h3>One-forward-one-backward and LLama 3.1 schemes</h3>
+        <h3>Interleaving stages</h3>
+        <h3>Zero Bubble and DualPipe</h3>
+        <h2>Expert parallelism</h2>
+        <h2>5D parallelism in a nutshell</h2>
+        <h2>How to Find the Best Training Configuration</h2>
+        <h2>Diving in the GPUs – fusing, threading, mixing</h2>
+        <h4>A primer on GPU</h4>
+        <h3>How to improve performance with Kernels ?</h3>
+        <h4>Memory Coalescing</h4>
+        <h4>Tiling</h4>
+        <h4>Thread Coarsening</h4>
+        <h4>Minimizing Control Divergence</h4>
+        <h3>Flash Attention 1-3</h3>
+        <h3>Fused Kernels</h3>
+        <h3>Mixed Precision Training</h3>
+        <h4>FP16 and BF16 training</h4>
+        <h4>FP8 pretraining</h4>
+        <h2>Conclusion</h2>
+        <h3>What you learned</h3>
+        <h3>What we learned</h3>
+        <h3>What’s next?</h3>
+        <h2>References</h2>
+        <h3>Landmark LLM Scaling Papers</h3>
+        <h3>Training Frameworks</h3>
+        <h3>Debugging</h3>
+        <h3>Distribution Techniques</h3>
+        <h3>CUDA Kernels</h3>
+        <h3>Hardware</h3>
+        <h3>Others</h3>
+        <h2>Appendix</h2>
     </d-article>

src/index.html

@@ -212,6 +212,120 @@
         <p>Now that we nailed a few key concept and terms let’s get started by revisiting the basic training steps of an LLM!</p>
         <h2>First Steps: Training on one GPU</h2>
     </d-article>

         <p>Now that we nailed a few key concept and terms let’s get started by revisiting the basic training steps of an LLM!</p>
         <h2>First Steps: Training on one GPU</h2>
+        <h3>Memory usage in Transformers</h3>
+        <h4>Memory profiling a training step</h4>
+        <h4>Weights/grads/optimizer states memory</h4>
+        <h4>Activations memory</h4>
+        <h3>Activation recomputation</h3>
+        <h3>Gradient accumulation</h3>
+        <h2>Data Parallelism</h2>
+        <h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
+        <h4><strong>Second optimization:</strong> Bucketing gradients</h4>
+        <h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
+        <h3>Revisit global batch size</h3>
+        <h3>Our journey up to now</h3>
+        <h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
+        <h4>Memory usage revisited</h4>
+        <h4>ZeRO-1: Partitioning Optimizer States</h4>
+        <h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
+        <h4>ZeRO-3: Adding <strong>Parameter Partitioning</strong></h4>
+        <h2>Tensor Parallelism</h2>
+        <h3>Tensor Parallelism in a Transformer Block</h3>
+        <h3>Sequence Parallelism</h3>
+        <h2>Context Parallelism</h2>
+        <h3>Introducing Context Parallelism</h3>
+        <h3>Discovering Ring Attention</h3>
+        <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
+        <h2>Pipeline Parallelism</h2>
+        <h3>Splitting layers on various nodes - All forward, all backward</h3>
+        <h3>One-forward-one-backward and LLama 3.1 schemes</h3>
+        <h3>Interleaving stages</h3>
+        <h3>Zero Bubble and DualPipe</h3>
+        <h2>Expert parallelism</h2>
+        <h2>5D parallelism in a nutshell</h2>
+        <h2>How to Find the Best Training Configuration</h2>
+        <h2>Diving in the GPUs – fusing, threading, mixing</h2>
+        <h4>A primer on GPU</h4>
+        <h3>How to improve performance with Kernels ?</h3>
+        <h4>Memory Coalescing</h4>
+        <h4>Tiling</h4>
+        <h4>Thread Coarsening</h4>
+        <h4>Minimizing Control Divergence</h4>
+        <h3>Flash Attention 1-3</h3>
+        <h3>Fused Kernels</h3>
+        <h3>Mixed Precision Training</h3>
+        <h4>FP16 and BF16 training</h4>
+        <h4>FP8 pretraining</h4>
+        <h2>Conclusion</h2>
+        <h3>What you learned</h3>
+        <h3>What we learned</h3>
+        <h3>What’s next?</h3>
+        <h2>References</h2>
+        <h3>Landmark LLM Scaling Papers</h3>
+        <h3>Training Frameworks</h3>
+        <h3>Debugging</h3>
+        <h3>Distribution Techniques</h3>
+        <h3>CUDA Kernels</h3>
+        <h3>Hardware</h3>
+        <h3>Others</h3>
+        <h2>Appendix</h2>
     </d-article>