hynky (HF staff) committed on
Commit
d3b8b05
·
2 Parent(s): 1fa3117 e7323a7

Merge branch 'main' of hf.co:spaces/nanotron/Nanotron-Gigablogpost

Files changed (2)
  1. dist/index.html +101 -90
  2. src/index.html +101 -90
dist/index.html CHANGED
@@ -1592,21 +1592,28 @@
1592
 
1593
  <p>Congratulation reader, you have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
- <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
1596
- <li>Tensor Parallelism - along the hidden dimension</li>
1597
- <li>Sequence and Context Parallelism - along the sequence dimension</li>
1598
- <li>Pipeline Parallelism - along the model layers</li>
1599
- <li>Expert Parallelism - along the model experts</li>
1600
  </ol>
1601
 
1602
- <p>At this stage, one aspect you are probably curious about is how all these parallelism strategies (and ZeRO) compare to each other and how they interact with each other? In a nutshell, which one should we use and combine?</p>
 
 
 
 
 
1603
 
1604
- <p>Let’s take a look at the similarities and interplay. We'll start by bringing Pipeline parallelism are ZeRO-3 side-by-side as they have interesting similarities and differences.</p>
 
 
1605
 
1606
- <p><strong>Pipeline parallelism vs. ZeRO-3 -</strong> Both are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases the full layers are computed on device, as opposed to TP, where the layers are sharded for the computation.</p>
1607
  <aside>In the following we say “a layer” to simplify what should be in general called “a set of layer” (as the basis sharding unit of the model).</aside>
1608
 
1609
- <p>However, there are a few major differences between the two:</p>
1610
 
1611
  <div class="l-body">
1612
  <table>
@@ -1647,50 +1654,50 @@
1647
  </table>
1648
  </div>
1649
 
1650
- <p>As you can see, ZeRO-3 and PP sove the same challenge through quite different approaches, whether you decide to focus communication either on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory for each micro-batch in PP to minimize as much as possible the communication overhead.</p>
1651
 
1652
- <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them don't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1!</p>
1653
 
1654
- <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined.</p>
1655
 
1656
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1657
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1658
 
1659
 
1660
- <p>In practice TP has two important limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
1661
-
1662
- <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
1663
 
 
1664
 
1665
- <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us sharding activations, and can be seen as complimentary to TP The former handles long sequences while the latter enables distributed Mixture of Experts training.</p>
1666
 
1667
- <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
1668
 
1669
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1670
 
1671
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1672
 
1673
 
1674
- <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
 
1675
 
1676
- <p>It's worth noting the scope of impact for these different parallelism strategies:</p>
 
 
 
 
 
 
 
 
 
1677
 
1678
  <ul>
1679
- <li>Tensor Parallelism (with Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1680
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1681
  <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 
1682
  </ul>
1683
 
1684
- <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1685
-
1686
- <div class="note-box">
1687
- <p class="note-box-title">📝 Note</p>
1688
- <div class="note-box-content">
1689
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1690
- </div>
1691
- </div>
1692
-
1693
-
1694
  <table>
1695
  <thead>
1696
  <tr>
@@ -1723,25 +1730,22 @@
1723
  </tbody>
1724
  </table>
1725
 
1726
- <p>Which leads us to this beautiful diagram to summarize all what weve seen:</p>
 
1727
 
1728
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1729
 
1730
- <p>And to have an idea of the memory benefits of each parallelism:</p>
1731
 
1732
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1733
 
1734
- <h2>How to Find the Best Training Configuration</h2>
1735
-
1736
- <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models. There remain a general question: which ones should we choose and which ones are best combined? We touched a little bit on this at the end of the last section but in this section we will walk through the decision process step by step.</p>
1737
-
1738
- <p>First let's have a quick look at each parallel strategy and how it helps and at what cost it comes:</p>
1739
 
1740
  <table>
1741
  <thead>
1742
  <tr>
1743
  <th><strong>Method</strong></th>
1744
- <th><strong>Memory savings</strong></th>
1745
  <th><strong>Parallel/sharding dimension</strong></th>
1746
  <th><strong>Disadvantage</strong></th>
1747
  </tr>
@@ -1749,37 +1753,19 @@
1749
  <tbody>
1750
  <tr>
1751
  <td>DP</td>
1752
- <td>None (replicates everything)</td>
1753
  <td>Batch</td>
1754
  <td>Limited by max batch size</td>
1755
  </tr>
1756
- <tr>
1757
- <td>ZeRO-1</td>
1758
- <td>Optimizer states</td>
1759
- <td>Batch</td>
1760
- <td>Params communication overhead</td>
1761
- </tr>
1762
- <tr>
1763
- <td>ZeRO-2</td>
1764
- <td>Optimizer states and gradients</td>
1765
- <td>Batch</td>
1766
- <td>Params communication overhead</td>
1767
- </tr>
1768
- <tr>
1769
- <td>ZeRO-3</td>
1770
- <td>Optimizer states, gradients, and model parameters</td>
1771
- <td>Batch and Model Params</td>
1772
- <td>Params communication overhead</td>
1773
- </tr>
1774
  <tr>
1775
  <td>PP</td>
1776
- <td>Model</td>
1777
  <td>Model layers</td>
1778
  <td>Idle bubble and complex schedules</td>
1779
  </tr>
1780
  <tr>
1781
  <td>TP/SP</td>
1782
- <td>Model and activations</td>
1783
  <td>Hidden dimension / Sequence length</td>
1784
  <td>Requires high bandwidth communication</td>
1785
  </tr>
@@ -1787,86 +1773,111 @@
1787
  <td>CP</td>
1788
  <td>Activations</td>
1789
  <td>Sequence length</td>
1790
- <td>Communication overhead in attention</td>
1791
  </tr>
1792
  <tr>
1793
  <td>EP</td>
1794
  <td>Experts parameters</td>
1795
  <td>Expert dimension</td>
1796
- <td>Requires MoE layers, routing overhead</td>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1797
  </tr>
1798
  </tbody>
1799
  </table>
1800
 
1801
- <p>Clearly, there is no free lunch for any of those methods but we can actually come up with a few rules that help finding a good starting point. To find the definitive optimal setup you'll have to run a few experiments in any case.</p>
 
 
 
 
 
 
1802
 
1803
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1804
 
1805
- <p>First, we need to figure out how we can fit a single model instance on GPUs. There are two general cases.</p>
1806
 
1807
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1808
  <ul>
1809
- <li>For models under 10B parameters, you can use either Tensor Parallelism or Data Parallelism with ZeRO-3 and Full Recompute across 8 GPUs</li>
1810
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
1811
  <ul>
1812
- <li>Tensor Parallelism (TP=8) combined with Pipeline Parallelism</li>
1813
- <li>Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
1814
- <li>Pure Data Parallelism with ZeRO-3</li>
1815
  </ul>
1816
- <li>At 512+ GPU scale, pure Data Parallelism becomes inefficient - better to combine DP with either Tensor or Pipeline Parallelism</li>
1817
- <li>At 1024+ GPU scale, the recommended setup is TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1818
  </ul>
1819
 
1820
  <p>Special considerations:</p>
1821
  <ul>
1822
- <li>For very long sequences, add Context Parallelism (CP) across nodes</li>
1823
- <li>For Mixture of Experts architectures, use Expert Parallelism (EP) across nodes</li>
1824
  </ul>
1825
 
1826
- <p>GPU-poor case 😭 - when running out of GPU resources:</p>
1827
  <ul>
1828
- <li>Enable full activation recomputation to trade compute for memory</li>
1829
- <li>Use gradient accumulation to process larger batches with limited memory
1830
  </li>
1831
  </ul>
1832
 
1833
- <p>Now that we have a single model instance training, we need to make sure we have the right batch size.</p>
1834
 
1835
  <h3>Step 2: Achieving Target Global Batch Size </h3>
1836
 
1837
- <p>Depending on how we setup in step one in terms of micro batch size and DP, our current batch size might be too small or big. </p>
1838
 
1839
- <p>To increase global batch size:</p>
1840
  <ul>
1841
- <li>Scale up Data Parallelism or gradient accumulation steps</li>
1842
- <li>For long sequences, leverage Context Parallelism</li>
1843
  </ul>
1844
 
1845
- <p>To decrease global batch size:</p>
1846
  <ul>
1847
- <li>Reduce Data Parallelism in favor of other parallelization strategies</li>
1848
- <li>For long sequences, reduce Context Parallelism</li>
1849
  </ul>
1850
 
1851
- <p>Ok, now we have the model running in the configuration we want, but is it the fastest way? Let's optimize throughput next.</p>
1852
 
1853
  <h3>Step 3: Optimizing Training Throughput</h3>
1854
 
1855
  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>
1856
 
1857
  <ul>
1858
- <li>Scale up Tensor Parallelism within node to reduce other parallelism requirements</li>
1859
- <li>Increase Data Parallelism with ZeRO-3 while maintaining target batch size</li>
1860
- <li>When Data Parallelism communication becomes a bottleneck, transition to Pipeline Parallelism</li>
1861
- <li>Try scaling up different parallelisms, and fitting max micro batch size (mbs) to find optimal balance between max GBS, model size, compute, and communication.</li>
 
1862
  </ul>
1863
 
1864
- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>
1865
 
1866
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
1867
 
1868
-
1869
- <p>This concludes our very deep dive into the distribution methods of 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. </p>
1870
 
1871
  <p>Time to turn the lights off and activate CUDA mode! </p>
1872
 
 
1592
 
1593
 <p>Congratulations, reader! You have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
+ <li>Data Parallelism (DP) – along the batch dimension</li>
1596
+ <li>Tensor Parallelism (TP) - along the hidden dimension</li>
1597
+ <li>Sequence and Context Parallelism (SP/CP) - along the sequence dimension</li>
1598
+ <li>Pipeline Parallelism (PP) - along the model layers</li>
1599
+ <li>Expert Parallelism (EP) - along the model experts</li>
1600
  </ol>
1601
 
1602
+ <p>As well as the 3 ZeRO strategies which can be combined with Data Parallelism for memory reduction: </p>
1603
+ <ol>
1604
+ <li>ZeRO-1 – sharding optimizer states among the DP replicas</li>
1605
+ <li>ZeRO-2 – sharding optimizer states and gradients among the DP replicas</li>
1606
+ <li>ZeRO-3 – sharding optimizer states, gradients and parameters among the DP replicas</li>
1607
+ </ol>
1608
 
1609
+ <p>At this stage, one aspect you are probably curious about is how all these parallelism and ZeRO strategies compare to, and interact with, each other. In other words, which ones should we use and efficiently combine, and which ones should we rather keep separate?</p>
1610
+
1611
+ <p>Let’s take a look at the similarities and interplay. We'll start by comparing Pipeline Parallelism and ZeRO-3 side-by-side, as they have some very close similarities but also important differences.</p>
1612
 
1613
+ <p><strong>Pipeline Parallelism vs. ZeRO-3 -</strong> Both PP and ZeRO-3 are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases full layer operations are computed on each device, as opposed to TP or EP, for instance, in which computations are performed on sub-layer units.</p>
1614
 <aside>In the following we say “a layer” to simplify what should in general be called “a set of layers” (the basic sharding unit of the model).</aside>
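<p>To make this concrete, here is a toy sketch of ZeRO-3-style prefetching on a small stack of linear layers (a hypothetical, minimal example, not Nanotron's actual implementation): each rank stores only a flat shard of every layer's weights, all-gathers the next layer's weights while the current layer computes, and frees the full weights right after use. PP would instead keep its own layers resident and communicate activations between stages.</p>
<pre><code class="language-python">
import torch
import torch.distributed as dist

def zero3_mlp_forward(weight_shards, shapes, x, group):
    """Toy ZeRO-3-style forward over a stack of linear layers.
    weight_shards[i]: this rank's flat shard of layer i's weight matrix.
    shapes[i]: (d_in, d_out) of layer i."""
    world = dist.get_world_size(group)

    def gather(i):
        bufs = [torch.empty_like(weight_shards[i]) for _ in range(world)]
        handle = dist.all_gather(bufs, weight_shards[i], group=group, async_op=True)
        return handle, bufs

    handle, bufs = gather(0)                   # start fetching layer 0's weights
    for i, (d_in, d_out) in enumerate(shapes):
        handle.wait()                          # layer i's full weights are now materialized
        w = torch.cat(bufs).view(d_in, d_out)  # rebuild the full weight matrix
        if i + 1 != len(shapes):
            handle, bufs = gather(i + 1)       # prefetch layer i+1 while computing layer i
        x = torch.relu(x @ w)                  # the full layer runs on this device
        del w                                  # ZeRO-3: release the full weights right away
    return x
</code></pre>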
1615
 
1616
+ <p>However, there are a few major differences between PP and ZeRO-3 approaches:</p>
1617
 
1618
  <div class="l-body">
1619
  <table>
 
1654
  </table>
1655
  </div>
1656
 
1657
+ <p>As you can see, ZeRO-3 and PP solve the same challenge but involve different approaches, and the choice between the two will depend on whether you decide to focus communication on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory during the series of PP micro-batches to minimize unnecessary communication overhead as much as possible.</p>
1658
 
1659
+ <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them doesn't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1 (sic).</p>
1660
 
1661
+ <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1662
 
1663
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1664
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
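<p>As a tiny, framework-independent illustration of that distributive property: splitting a weight matrix column-wise and concatenating the partial results reproduces the full matmul, which is exactly what lets TP shard the computation across ranks and recombine it afterwards.</p>
<pre><code class="language-python">
import torch

x = torch.randn(4, 16)                  # activations (batch, hidden)
w = torch.randn(16, 32)                 # a weight matrix we "shard" over 2 ranks
w0, w1 = w.chunk(2, dim=1)              # column-parallel shards

full = x @ w
sharded = torch.cat([x @ w0, x @ w1], dim=1)  # the concat plays the role of the all-gather
assert torch.allclose(full, sharded, atol=1e-5)
</code></pre>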
1665
 
1666
 
1667
+ <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
 
1668
 
1669
+ <p>As a consequence, when combining parallelism strategies, TP will typically be kept for high-speed intra-node communications, while ZeRO-3 or PP can be used for parallelism groups spanning lower-speed inter-node communications, as their communication patterns require less bandwidth (for PP) or can be more easily overlapped with computation (for ZeRO-3). The main consideration when combining these techniques is to organize the GPUs efficiently into groups for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations. For instance, the groups of GPUs communicating for TP should be kept inside nodes.</p>
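<p>As an illustration, here is a minimal sketch (hypothetical sizes, using PyTorch's generic <code>DeviceMesh</code> utility rather than Nanotron's own process-group setup) of such a layout for a 4-node cluster with 8 GPUs per node: TP groups stay inside nodes, DP/ZeRO groups span nodes.</p>
<pre><code class="language-python">
# assumes torchrun has launched 32 processes (4 nodes x 8 GPUs) and NCCL is available
from torch.distributed.device_mesh import init_device_mesh

# ranks laid out as (dp=4, tp=8): ranks 0-7 share a node and form one TP group,
# while ranks {0, 8, 16, 24} form one DP group spanning the four nodes
mesh = init_device_mesh("cuda", mesh_shape=(4, 8), mesh_dim_names=("dp", "tp"))
tp_group = mesh.get_group("tp")  # intra-node: TP all-reduce / all-gather on fast links
dp_group = mesh.get_group("dp")  # inter-node: ZeRO/DP gradient communication
</code></pre>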
1670
 
1671
+ <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us shard activations, and can be seen as complementary to TP. The first handles long sequences while the second enables distributed Mixture of Experts training, and they can be combined without any particular issue.</p>
1672
 
1673
+ <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in <a target="_self" href="#context_parallelism"> CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1674
 
1675
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1676
 
1677
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
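<p>To show where that communication sits, here is a deliberately naive context-parallel attention sketch that simply all-gathers K/V instead of using the ring pattern described above (a toy illustration with no causal masking, not the production algorithm): activations stay sharded along the sequence dimension, and communication only appears inside attention.</p>
<pre><code class="language-python">
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cp_attention(q_local, k_local, v_local, group):
    """q/k/v_local: [batch, heads, seq_local, head_dim], this rank's sequence shard."""
    world = dist.get_world_size(group)
    k_bufs = [torch.empty_like(k_local) for _ in range(world)]
    v_bufs = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_bufs, k_local, group=group)  # the only communication: K/V exchange
    dist.all_gather(v_bufs, v_local, group=group)
    k_full = torch.cat(k_bufs, dim=2)              # keys over the full sequence
    v_full = torch.cat(v_bufs, dim=2)              # values over the full sequence
    # local queries attend to the full sequence; the output stays sequence-sharded
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
</code></pre>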
1678
 
1679
 
1680
+ <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1681
+ <aside>For instance DeepSeek V3 uses 256 experts.</aside>
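<p>A minimal sketch of the dispatch half of that all-to-all (a hypothetical helper assuming experts are split evenly across the EP group; the return path after the expert MLPs is symmetric and not shown):</p>
<pre><code class="language-python">
import torch
import torch.distributed as dist

def ep_dispatch(tokens, expert_ids, num_experts, group):
    """tokens: [num_tokens, hidden]; expert_ids: [num_tokens], chosen by the router."""
    world = dist.get_world_size(group)
    experts_per_rank = num_experts // world
    dest_rank = expert_ids // experts_per_rank      # which rank hosts each token's expert

    order = torch.argsort(dest_rank)                # group tokens by destination rank
    send_buf = tokens[order].contiguous()
    send_counts = torch.bincount(dest_rank, minlength=world)

    # exchange the counts first so every rank knows how many tokens it will receive
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)

    recv_buf = tokens.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist(),
                           group=group)
    # recv_buf now holds the tokens assigned to this rank's local experts
    return recv_buf, order, send_counts, recv_counts
</code></pre>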
1682
 
1683
+ <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1684
+
1685
+ <div class="note-box">
1686
+ <p class="note-box-title">📝 Note</p>
1687
+ <div class="note-box-content">
1688
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1689
+ </div>
1690
+ </div>
1691
+
1692
+ <p><strong>Scope and focus -</strong> Let's also quickly summarize the sub-parts of the model where these different parallelism strategies have the most impact:</p>
1693
 
1694
  <ul>
1695
+ <li>Tensor Parallelism (and Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1696
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1697
  <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
1698
+ <li>Pipeline Parallelism and ZeRO are not especially specific to any sub-module or component, with the exception that modules and layers need to be balanced in Pipeline Parallelism; the first and last layers are thus often treated differently due to the additional embedding layers.</li>
1699
  </ul>
1700
 
 
 
 
 
 
 
 
 
 
 
1701
  <table>
1702
  <thead>
1703
  <tr>
 
1730
  </tbody>
1731
  </table>
1732
 
1733
+ <p><strong>Summarizing it all -</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them all? Yes, we're up for the challenge!</p>
1734
+ <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer, in its MoE variant. We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1735
 
1736
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1737
 
1738
+ <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths as well as with selective (top) and full (bottom) recomputation so you can see how they all play with activations:</p>
1739
 
1740
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1741
 
1742
+ <p>Let's finish this section with a high-level view of all these techniques, their main underlying ideas and major bottlenecks:</p>
 
 
 
 
1743
 
1744
  <table>
1745
  <thead>
1746
  <tr>
1747
  <th><strong>Method</strong></th>
1748
+ <th><strong>Memory savings apply specifically to</strong></th>
1749
  <th><strong>Parallel/sharding dimension</strong></th>
1750
  <th><strong>Disadvantage</strong></th>
1751
  </tr>
 
1753
  <tbody>
1754
  <tr>
1755
  <td>DP</td>
1756
+ <td>Activations (reduce local batch size)</td>
1757
  <td>Batch</td>
1758
  <td>Limited by max batch size</td>
1759
  </tr>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1760
  <tr>
1761
  <td>PP</td>
1762
+ <td>Model parameters</td>
1763
  <td>Model layers</td>
1764
  <td>Idle bubble and complex schedules</td>
1765
  </tr>
1766
  <tr>
1767
  <td>TP/SP</td>
1768
+ <td>Model parameters and activations</td>
1769
  <td>Hidden dimension / Sequence length</td>
1770
  <td>Requires high bandwidth communication</td>
1771
  </tr>
 
1773
  <td>CP</td>
1774
  <td>Activations</td>
1775
  <td>Sequence length</td>
1776
+ <td>Adds communication overhead in attention modules</td>
1777
  </tr>
1778
  <tr>
1779
  <td>EP</td>
1780
  <td>Experts parameters</td>
1781
  <td>Expert dimension</td>
1782
+ <td>Requires MoE layers, adds routing communication overhead</td>
1783
+ </tr>
1784
+ <tr>
1785
+ <td>ZeRO-1</td>
1786
+ <td>Optimizer states</td>
1787
+ <td>Sharded among DP replicas</td>
1788
+ <td>Params communication overhead</td>
1789
+ </tr>
1790
+ <tr>
1791
+ <td>ZeRO-2</td>
1792
+ <td>Optimizer states and gradients</td>
1793
+ <td>Sharded among DP replicas</td>
1794
+ <td>Params communication overhead</td>
1795
+ </tr>
1796
+ <tr>
1797
+ <td>ZeRO-3</td>
1798
+ <td>Optimizer states, gradients, and model parameters</td>
1799
+ <td>Sharded among DP replicas</td>
1800
+ <td>Params communication overhead</td>
1801
  </tr>
1802
  </tbody>
1803
  </table>
1804
 
1805
+ <p>Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we actually come up with a few rules that help us find a good starting point for selecting and combining them? This will be the topic of our next section.</p>
1806
+
1807
+ <h2>How to Find the Best Training Configuration</h2>
1808
+
1809
+ <p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined together. There remains a general question: which ones should we choose in the end, and how should we decide on a specific combination?</p>
1810
+
1811
+ <p>We touched on this a little in the previous section, but let's now walk in detail through a possible decision process, step by step, keeping in mind that you'll always have to run a few experiments to find the definitive optimal setup for your compute cluster given its various physical properties: network bandwidth, GPUs per node, memory per GPU, etc.</p>
1812
 
1813
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1814
 
1815
+ <p>First, we need to figure out how we can fit a full model instance on our GPUs (we focus on a single instance for now, even though we may also be using DP with ZeRO across several replicas). There are two general cases.</p>
1816
 
1817
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1818
  <ul>
1819
+ <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
1820
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
1821
  <ul>
1822
+ <li>Combining Tensor Parallelism (TP=8) with Pipeline Parallelism</li>
1823
+ <li>Combining Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
1824
+ <li>Using only ZeRO-3 (i.e. only pure Data Parallelism) </li>
1825
  </ul>
1826
+ <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to become inefficient due to communication cost - it is then often better to combine DP with either Tensor or Pipeline Parallelism</li>
1827
+ <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism (TP=8) combined with Data Parallelism (ZeRO-2) and Pipeline Parallelism (see the sizing sketch after this list)</li>
1828
  </ul>
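<p>As a purely illustrative sanity check for the 1024+ GPU setup mentioned in the list above (all numbers are hypothetical), the parallelism degrees have to multiply to the cluster size, and the resulting DP degree then sets the global batch size together with the micro batch size and gradient accumulation:</p>
<pre><code class="language-python">
world_size = 1024
tp, pp, cp, ep = 8, 4, 1, 1              # intra-node TP, inter-node PP, no CP/EP here
dp = world_size // (tp * pp * cp * ep)   # -> 32 data-parallel replicas
assert tp * pp * cp * ep * dp == world_size

mbs, grad_acc, seq_len = 2, 16, 4096
gbs_samples = dp * mbs * grad_acc        # 1024 samples per optimizer step
gbs_tokens = gbs_samples * seq_len       # about 4.2M tokens per step
print(f"dp={dp}, global batch = {gbs_samples} samples = {gbs_tokens:,} tokens")
</code></pre>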
1829
 
1830
  <p>Special considerations:</p>
1831
  <ul>
1832
+ <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
1833
+ <li>For Mixture of Experts architectures, you will advantageously use Expert Parallelism (EP) across nodes.</li>
1834
  </ul>
1835
 
1836
+ <p>GPU-poor case 😭 - when you might be low on GPU resources:</p>
1837
  <ul>
1838
+ <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
1839
+ <li>You can increase gradient accumulation to process larger batches with limited memory.
1840
  </li>
1841
  </ul>
1842
 
1843
+ <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>
1844
 
1845
  <h3>Step 2: Achieving Target Global Batch Size </h3>
1846
 
1847
+ <p>Depending on where step 1 left us in terms of micro batch size and DP, our current batch size might be too small or too big. It's now time to hit our target batch size.</p>
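<p>Since the global batch size follows the relation gbs = dp × mbs × grad_acc we used in the data-parallelism section, a tiny (hypothetical) helper is enough to pick the gradient accumulation steps that hit a given target:</p>
<pre><code class="language-python">
def grad_acc_for_target(target_gbs: int, dp: int, mbs: int) -> int:
    """Return the gradient accumulation steps so that dp * mbs * grad_acc == target_gbs."""
    per_step = dp * mbs                  # samples processed per accumulation step
    if target_gbs % per_step != 0:
        raise ValueError(f"target gbs {target_gbs} is not divisible by dp*mbs={per_step}; "
                         "adjust dp, mbs or the target")
    return target_gbs // per_step

# e.g. dp=32 replicas with mbs=2 need 16 accumulation steps for a 1024-sample global batch
print(grad_acc_for_target(1024, dp=32, mbs=2))   # -> 16
</code></pre>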
1848
 
1849
+ <p>To increase our current global batch size:</p>
1850
  <ul>
1851
+ <li>We can scale up Data Parallelism or gradient accumulation steps</li>
1852
+ <li>For long sequences, we can leverage Context Parallelism</li>
1853
  </ul>
1854
 
1855
+ <p>To decrease our current global batch size:</p>
1856
  <ul>
1857
+ <li>We can reduce Data Parallelism in favor of other parallelization strategies</li>
1858
+ <li>For long sequences, we can reduce Context Parallelism</li>
1859
  </ul>
1860
 
1861
+ <p>Ok, now we have the model running in the general configuration we want in terms of model size and batch size, but are we training it the fastest way? Let's now start to optimize throughput as much as possible.</p>
1862
 
1863
  <h3>Step 3: Optimizing Training Throughput</h3>
1864
 
1865
  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>
1866
 
1867
  <ul>
1868
+ <li>Scale up Tensor Parallelism (using the fast intra-node bandwidth) until we reach a degree close to the node size, so that we can reduce the other parallelism degrees</li>
1869
+ <li>Increase Data Parallelism with ZeRO-3 while keeping target batch size</li>
1870
+ <li>When Data Parallelism communication starts to become a bottleneck, transition to using Pipeline Parallelism</li>
1871
+ <li>Try scaling up different parallelisms one by one</li>
1872
+ <li>Experiment with several micro batch sizes (mbs) to aim for an optimal balance between max GBS, model size, compute, and communication (see the sweep sketch after this list).</li>
1873
  </ul>
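<p>The sweep mentioned in the last bullet could look like the following sketch (illustrative only, with assumed cluster sizes): enumerate the (tp, pp, dp) factorizations that keep TP inside a node and still divide the target global batch size, then benchmark each candidate on the real cluster.</p>
<pre><code class="language-python">
def candidate_configs(world_size, gpus_per_node, target_gbs, mbs):
    """Yield (tp, pp, dp, grad_acc) combinations worth benchmarking; all sizes are assumptions."""
    for tp in (1, 2, 4, 8):
        if tp > gpus_per_node:
            continue                     # keep TP on the fast intra-node links
        for pp in (1, 2, 4, 8, 16):
            if world_size % (tp * pp):
                continue
            dp = world_size // (tp * pp)
            if target_gbs % (dp * mbs):
                continue                 # gradient accumulation must come out as an integer
            yield dict(tp=tp, pp=pp, dp=dp, grad_acc=target_gbs // (dp * mbs))

for cfg in candidate_configs(world_size=512, gpus_per_node=8, target_gbs=2048, mbs=2):
    print(cfg)                           # each candidate would then be timed on the cluster
</code></pre>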
1874
 
1875
+ <!-- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>
1876
 
1877
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1878
+ -->
1879
 
1880
+ <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs, there is another way to improve model throughput and memory management. It involves a better understanding of how GPUs operate at a low level and is necessary knowledge for taking maximal advantage of large GPU clusters.</p>
 
1881
 
1882
  <p>Time to turn the lights off and activate CUDA mode! </p>
1883
 
src/index.html CHANGED
@@ -1592,21 +1592,28 @@
1592
 
1593
  <p>Congratulation reader, you have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
- <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
1596
- <li>Tensor Parallelism - along the hidden dimension</li>
1597
- <li>Sequence and Context Parallelism - along the sequence dimension</li>
1598
- <li>Pipeline Parallelism - along the model layers</li>
1599
- <li>Expert Parallelism - along the model experts</li>
1600
  </ol>
1601
 
1602
- <p>At this stage, one aspect you are probably curious about is how all these parallelism strategies (and ZeRO) compare to each other and how they interact with each other? In a nutshell, which one should we use and combine?</p>
 
 
 
 
 
1603
 
1604
- <p>Let’s take a look at the similarities and interplay. We'll start by bringing Pipeline parallelism are ZeRO-3 side-by-side as they have interesting similarities and differences.</p>
 
 
1605
 
1606
- <p><strong>Pipeline parallelism vs. ZeRO-3 -</strong> Both are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases the full layers are computed on device, as opposed to TP, where the layers are sharded for the computation.</p>
1607
  <aside>In the following we say “a layer” to simplify what should be in general called “a set of layer” (as the basis sharding unit of the model).</aside>
1608
 
1609
- <p>However, there are a few major differences between the two:</p>
1610
 
1611
  <div class="l-body">
1612
  <table>
@@ -1647,50 +1654,50 @@
1647
  </table>
1648
  </div>
1649
 
1650
- <p>As you can see, ZeRO-3 and PP sove the same challenge through quite different approaches, whether you decide to focus communication either on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory for each micro-batch in PP to minimize as much as possible the communication overhead.</p>
1651
 
1652
- <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them don't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1!</p>
1653
 
1654
- <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined.</p>
1655
 
1656
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1657
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1658
 
1659
 
1660
- <p>In practice TP has two important limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
1661
-
1662
- <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
1663
 
 
1664
 
1665
- <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us sharding activations, and can be seen as complimentary to TP The former handles long sequences while the latter enables distributed Mixture of Experts training.</p>
1666
 
1667
- <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
1668
 
1669
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1670
 
1671
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1672
 
1673
 
1674
- <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
 
1675
 
1676
- <p>It's worth noting the scope of impact for these different parallelism strategies:</p>
 
 
 
 
 
 
 
 
 
1677
 
1678
  <ul>
1679
- <li>Tensor Parallelism (with Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1680
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1681
  <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 
1682
  </ul>
1683
 
1684
- <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1685
-
1686
- <div class="note-box">
1687
- <p class="note-box-title">📝 Note</p>
1688
- <div class="note-box-content">
1689
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1690
- </div>
1691
- </div>
1692
-
1693
-
1694
  <table>
1695
  <thead>
1696
  <tr>
@@ -1723,25 +1730,22 @@
1723
  </tbody>
1724
  </table>
1725
 
1726
- <p>Which leads us to this beautiful diagram to summarize all what weve seen:</p>
 
1727
 
1728
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1729
 
1730
- <p>And to have an idea of the memory benefits of each parallelism:</p>
1731
 
1732
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1733
 
1734
- <h2>How to Find the Best Training Configuration</h2>
1735
-
1736
- <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models. There remain a general question: which ones should we choose and which ones are best combined? We touched a little bit on this at the end of the last section but in this section we will walk through the decision process step by step.</p>
1737
-
1738
- <p>First let's have a quick look at each parallel strategy and how it helps and at what cost it comes:</p>
1739
 
1740
  <table>
1741
  <thead>
1742
  <tr>
1743
  <th><strong>Method</strong></th>
1744
- <th><strong>Memory savings</strong></th>
1745
  <th><strong>Parallel/sharding dimension</strong></th>
1746
  <th><strong>Disadvantage</strong></th>
1747
  </tr>
@@ -1749,37 +1753,19 @@
1749
  <tbody>
1750
  <tr>
1751
  <td>DP</td>
1752
- <td>None (replicates everything)</td>
1753
  <td>Batch</td>
1754
  <td>Limited by max batch size</td>
1755
  </tr>
1756
- <tr>
1757
- <td>ZeRO-1</td>
1758
- <td>Optimizer states</td>
1759
- <td>Batch</td>
1760
- <td>Params communication overhead</td>
1761
- </tr>
1762
- <tr>
1763
- <td>ZeRO-2</td>
1764
- <td>Optimizer states and gradients</td>
1765
- <td>Batch</td>
1766
- <td>Params communication overhead</td>
1767
- </tr>
1768
- <tr>
1769
- <td>ZeRO-3</td>
1770
- <td>Optimizer states, gradients, and model parameters</td>
1771
- <td>Batch and Model Params</td>
1772
- <td>Params communication overhead</td>
1773
- </tr>
1774
  <tr>
1775
  <td>PP</td>
1776
- <td>Model</td>
1777
  <td>Model layers</td>
1778
  <td>Idle bubble and complex schedules</td>
1779
  </tr>
1780
  <tr>
1781
  <td>TP/SP</td>
1782
- <td>Model and activations</td>
1783
  <td>Hidden dimension / Sequence length</td>
1784
  <td>Requires high bandwidth communication</td>
1785
  </tr>
@@ -1787,86 +1773,111 @@
1787
  <td>CP</td>
1788
  <td>Activations</td>
1789
  <td>Sequence length</td>
1790
- <td>Communication overhead in attention</td>
1791
  </tr>
1792
  <tr>
1793
  <td>EP</td>
1794
  <td>Experts parameters</td>
1795
  <td>Expert dimension</td>
1796
- <td>Requires MoE layers, routing overhead</td>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1797
  </tr>
1798
  </tbody>
1799
  </table>
1800
 
1801
- <p>Clearly, there is no free lunch for any of those methods but we can actually come up with a few rules that help finding a good starting point. To find the definitive optimal setup you'll have to run a few experiments in any case.</p>
 
 
 
 
 
 
1802
 
1803
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1804
 
1805
- <p>First, we need to figure out how we can fit a single model instance on GPUs. There are two general cases.</p>
1806
 
1807
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1808
  <ul>
1809
- <li>For models under 10B parameters, you can use either Tensor Parallelism or Data Parallelism with ZeRO-3 and Full Recompute across 8 GPUs</li>
1810
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
1811
  <ul>
1812
- <li>Tensor Parallelism (TP=8) combined with Pipeline Parallelism</li>
1813
- <li>Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
1814
- <li>Pure Data Parallelism with ZeRO-3</li>
1815
  </ul>
1816
- <li>At 512+ GPU scale, pure Data Parallelism becomes inefficient - better to combine DP with either Tensor or Pipeline Parallelism</li>
1817
- <li>At 1024+ GPU scale, the recommended setup is TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1818
  </ul>
1819
 
1820
  <p>Special considerations:</p>
1821
  <ul>
1822
- <li>For very long sequences, add Context Parallelism (CP) across nodes</li>
1823
- <li>For Mixture of Experts architectures, use Expert Parallelism (EP) across nodes</li>
1824
  </ul>
1825
 
1826
- <p>GPU-poor case 😭 - when running out of GPU resources:</p>
1827
  <ul>
1828
- <li>Enable full activation recomputation to trade compute for memory</li>
1829
- <li>Use gradient accumulation to process larger batches with limited memory
1830
  </li>
1831
  </ul>
1832
 
1833
- <p>Now that we have a single model instance training, we need to make sure we have the right batch size.</p>
1834
 
1835
  <h3>Step 2: Achieving Target Global Batch Size </h3>
1836
 
1837
- <p>Depending on how we setup in step one in terms of micro batch size and DP, our current batch size might be too small or big. </p>
1838
 
1839
- <p>To increase global batch size:</p>
1840
  <ul>
1841
- <li>Scale up Data Parallelism or gradient accumulation steps</li>
1842
- <li>For long sequences, leverage Context Parallelism</li>
1843
  </ul>
1844
 
1845
- <p>To decrease global batch size:</p>
1846
  <ul>
1847
- <li>Reduce Data Parallelism in favor of other parallelization strategies</li>
1848
- <li>For long sequences, reduce Context Parallelism</li>
1849
  </ul>
1850
 
1851
- <p>Ok, now we have the model running in the configuration we want, but is it the fastest way? Let's optimize throughput next.</p>
1852
 
1853
  <h3>Step 3: Optimizing Training Throughput</h3>
1854
 
1855
  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>
1856
 
1857
  <ul>
1858
- <li>Scale up Tensor Parallelism within node to reduce other parallelism requirements</li>
1859
- <li>Increase Data Parallelism with ZeRO-3 while maintaining target batch size</li>
1860
- <li>When Data Parallelism communication becomes a bottleneck, transition to Pipeline Parallelism</li>
1861
- <li>Try scaling up different parallelisms, and fitting max micro batch size (mbs) to find optimal balance between max GBS, model size, compute, and communication.</li>
 
1862
  </ul>
1863
 
1864
- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>
1865
 
1866
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
1867
 
1868
-
1869
- <p>This concludes our very deep dive into the distribution methods of 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. </p>
1870
 
1871
  <p>Time to turn the lights off and activate CUDA mode! </p>
1872
 
 
1592
 
1593
 <p>Congratulations, reader! You have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
+ <li>Data Parallelism (DP) – along the batch dimension</li>
1596
+ <li>Tensor Parallelism (TP) - along the hidden dimension</li>
1597
+ <li>Sequence and Context Parallelism (SP/CP) - along the sequence dimension</li>
1598
+ <li>Pipeline Parallelism (PP) - along the model layers</li>
1599
+ <li>Expert Parallelism (EP) - along the model experts</li>
1600
  </ol>
1601
 
1602
+ <p>As well as the 3 ZeRO strategies which can be combined with Data Parallelism for memory reduction: </p>
1603
+ <ol>
1604
+ <li>ZeRO-1 – sharding optimizer states among the DP replicas</li>
1605
+ <li>ZeRO-2 – sharding optimizer states and gradients among the DP replicas</li>
1606
+ <li>ZeRO-3 – sharding optimizer states, gradients and parameters among the DP replicas</li>
1607
+ </ol>
1608
 
1609
+ <p>At this stage, one aspect you are probably curious about is how all these parallelism and ZeRO strategies compare to, and interact with, each other. In other words, which ones should we use and efficiently combine, and which ones should we rather keep separate?</p>
1610
+
1611
+ <p>Let’s take a look at the similarities and interplay. We'll start by comparing Pipeline Parallelism and ZeRO-3 side-by-side, as they have some very close similarities but also important differences.</p>
1612
 
1613
+ <p><strong>Pipeline Parallelism vs. ZeRO-3 -</strong> Both PP and ZeRO-3 are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases full layer operations are computed on each device, as opposed to TP or EP, for instance, in which computations are performed on sub-layer units.</p>
1614
 <aside>In the following we say “a layer” to simplify what should in general be called “a set of layers” (the basic sharding unit of the model).</aside>
1615
 
1616
+ <p>However, there are a few major differences between PP and ZeRO-3 approaches:</p>
1617
 
1618
  <div class="l-body">
1619
  <table>
 
1654
  </table>
1655
  </div>
1656
 
1657
+ <p>As you can see, ZeRO-3 and PP solve the same challenge but involve different approaches, and the choice between the two will depend on whether you decide to focus communication on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory during the series of PP micro-batches to minimize unnecessary communication overhead as much as possible.</p>
1658
 
1659
+ <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them doesn't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1 (sic).</p>
1660
 
1661
+ <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1662
 
1663
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1664
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1665
 
1666
 
1667
+ <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
 
1668
 
1669
+ <p>As a consequence, when combining parallelism strategies, TP will typically be kept for high-speed intra-node communications, while ZeRO-3 or PP can be used for parallelism groups spanning lower-speed inter-node communications, as their communication patterns require less bandwidth (for PP) or can be more easily overlapped with computation (for ZeRO-3). The main consideration when combining these techniques is to organize the GPUs efficiently into groups for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations. For instance, the groups of GPUs communicating for TP should be kept inside nodes.</p>
1670
 
1671
+ <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us shard activations, and can be seen as complementary to TP. The first handles long sequences while the second enables distributed Mixture of Experts training, and they can be combined without any particular issue.</p>
1672
 
1673
+ <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in <a target="_self" href="#context_parallelism"> CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1674
 
1675
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1676
 
1677
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1678
 
1679
 
1680
+ <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1681
+ <aside>For instance DeepSeek V3 uses 256 experts.</aside>
1682
 
1683
+ <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1684
+
1685
+ <div class="note-box">
1686
+ <p class="note-box-title">📝 Note</p>
1687
+ <div class="note-box-content">
1688
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1689
+ </div>
1690
+ </div>
1691
+
1692
+ <p><strong>Scope and focus -</strong> Let's also quickly summarize the sub-parts of the model where these different parallelism strategies have the most impact:</p>
1693
 
1694
  <ul>
1695
+ <li>Tensor Parallelism (and Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1696
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1697
 <li>Expert Parallelism primarily affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged.</li>
1698
+ <li>Pipeline Parallelism and ZeRO are not especially specific to any sub-module or component, with the exception that modules and layers need to be balanced in Pipeline Parallelism; the first and last layers are thus often treated differently due to the additional embedding layers.</li>
1699
  </ul>
1700
 
 
 
 
 
 
 
 
 
 
 
1701
  <table>
1702
  <thead>
1703
  <tr>
 
1730
  </tbody>
1731
  </table>
1732
 
1733
+ <p><strong>Summarizing it all -</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them all? Yes, we're up for the challenge!</p>
1734
+ <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer, in its MoE variant. We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1735
 
1736
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1737
 
1738
+ <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths as well as with selective (top) and full (bottom) recomputation so you can see how they all play with activations:</p>
1739
 
1740
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1741
 
1742
  <p>Let's finish this section with a high-level view of all of these techniques, their main underlying ideas, and their major bottlenecks:</p>

  <table>
  <thead>
  <tr>
  <th><strong>Method</strong></th>
  <th><strong>Memory savings apply specifically to</strong></th>
  <th><strong>Parallel/sharding dimension</strong></th>
  <th><strong>Disadvantage</strong></th>
  </tr>
  </thead>
  <tbody>
  <tr>
  <td>DP</td>
  <td>Activations (reduce local batch size)</td>
  <td>Batch</td>
  <td>Limited by max batch size</td>
  </tr>
  <tr>
  <td>PP</td>
  <td>Model parameters</td>
  <td>Model layers</td>
  <td>Idle bubble and complex schedules</td>
  </tr>
  <tr>
  <td>TP/SP</td>
  <td>Model parameters and activations</td>
  <td>Hidden dimension / Sequence length</td>
  <td>Requires high-bandwidth communication</td>
  </tr>
  <tr>
  <td>CP</td>
  <td>Activations</td>
  <td>Sequence length</td>
  <td>Adds communication overhead in attention modules</td>
  </tr>
  <tr>
  <td>EP</td>
  <td>Expert parameters</td>
  <td>Expert dimension</td>
  <td>Requires MoE layers, adds routing communication overhead</td>
  </tr>
  <tr>
  <td>ZeRO-1</td>
  <td>Optimizer states</td>
  <td>Sharded among DP replicas</td>
  <td>Parameter communication overhead</td>
  </tr>
  <tr>
  <td>ZeRO-2</td>
  <td>Optimizer states and gradients</td>
  <td>Sharded among DP replicas</td>
  <td>Parameter communication overhead</td>
  </tr>
  <tr>
  <td>ZeRO-3</td>
  <td>Optimizer states, gradients, and model parameters</td>
  <td>Sharded among DP replicas</td>
  <td>Parameter communication overhead</td>
  </tr>
  </tbody>
  </table>
 
  <p>Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we come up with a few rules that help us find a good starting point for selecting and combining them? This will be the topic of our next section.</p>

  <h2>How to Find the Best Training Configuration</h2>

  <p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined together. One general question remains: which ones should we choose in the end, and how do we decide on a specific combination?</p>

  <p>We touched on this a little in the previous section, but let's now walk in detail through a possible decision process, step by step, keeping in mind that you'll always have to run a few experiments to find the definitive optimal setup for your compute cluster given its various physical properties: network bandwidth, GPUs per node, memory per GPU, etc.</p>

  <h3>Step 1: Fitting a Training Step in Memory</h3>

  <p>First, we need to figure out how we can fit a full model instance on our GPUs (we focus on a single model instance for now, even though we may already use DP for ZeRO sharding). There are two general cases.</p>
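  <p>Before walking through the two cases, a rough back-of-the-envelope check helps: estimate whether the model states fit once sharded by your chosen parallelism degrees. The sketch below is illustrative only (the function name and fixed byte counts are our assumptions): it accounts for roughly 16 bytes per parameter, i.e. bf16 weights and gradients plus fp32 master weights and Adam moments, and ignores activations, buffers, and fragmentation.</p>

  <pre><code class="language-python">
# Back-of-the-envelope memory check for model states only (activations ignored).
# Mixed-precision Adam: 2 bytes (bf16 weights) + 2 bytes (bf16 grads)
# + 12 bytes (fp32 master weights + two Adam moments) per parameter.
def model_states_gib_per_gpu(n_params, tp=1, pp=1, dp=1, zero_stage=0):
    params_b, grads_b, optim_b = 2.0, 2.0, 12.0    # bytes per parameter
    shard = tp * pp                                 # TP and PP both split the model states
    optim_shard = shard * (dp if zero_stage >= 1 else 1)
    grad_shard = shard * (dp if zero_stage >= 2 else 1)
    param_shard = shard * (dp if zero_stage >= 3 else 1)
    total_bytes = n_params * (params_b / param_shard
                              + grads_b / grad_shard
                              + optim_b / optim_shard)
    return total_bytes / 1024**3

# Example: 70B parameters with TP=8, PP=4, DP=8 and ZeRO-1
print(f"{model_states_gib_per_gpu(70e9, tp=8, pp=4, dp=8, zero_stage=1):.1f} GiB per GPU")
  </code></pre>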
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
  <ul>
  <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
  <ul>
  <li>Combining Tensor Parallelism (TP=8) with Pipeline Parallelism</li>
  <li>Combining Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
  <li>Using only ZeRO-3 (i.e. only pure Data Parallelism)</li>
  </ul>
  <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to become inefficient due to communication cost - it can be better to then combine DP with either Tensor or Pipeline Parallelism</li>
  <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
  </ul>
 
  <p>Special considerations:</p>
  <ul>
  <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
  <li>For Mixture of Experts architectures, it will be advantageous to use Expert Parallelism (EP) across nodes.</li>
  </ul>
 
  <p>GPU-poor case 😭 - when you might be low on GPU resources:</p>
  <ul>
  <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
  <li>You can increase gradient accumulation to process larger batches with limited memory.</li>
  </ul>
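  <p>To make these two levers concrete, here is a minimal sketch in plain PyTorch (illustrative only, not nanotron's training loop) that combines full activation recomputation via <code>torch.utils.checkpoint</code> with gradient accumulation:</p>

  <pre><code class="language-python">
# Minimal sketch: full activation recomputation + gradient accumulation in plain PyTorch.
import torch
from torch.utils.checkpoint import checkpoint

def train_step(model_blocks, lm_head, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    accum_steps = len(micro_batches)
    for inputs, targets in micro_batches:
        hidden = inputs
        # Recompute each block's activations during backward instead of storing them.
        for block in model_blocks:
            hidden = checkpoint(block, hidden, use_reentrant=False)
        loss = loss_fn(lm_head(hidden), targets) / accum_steps  # average over micro-batches
        loss.backward()                                         # gradients accumulate in .grad
    optimizer.step()
  </code></pre>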
 
  <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>

 
  <h3>Step 2: Achieving Target Global Batch Size</h3>

  <p>Depending on where Step 1 left us in terms of micro-batch size and DP, our current batch size might be too small or too big. It's now time to hit our target batch size.</p>

  <p>To increase our current global batch size:</p>
  <ul>
  <li>We can scale up Data Parallelism or gradient accumulation steps</li>
  <li>For long sequences, we can leverage Context Parallelism</li>
  </ul>

  <p>To decrease our current global batch size:</p>
  <ul>
  <li>We can reduce Data Parallelism in favor of other parallelization strategies</li>
  <li>For long sequences, we can reduce Context Parallelism</li>
  </ul>
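  <p>As a quick reminder of the bookkeeping involved, the global batch size is simply the product of the micro-batch size, the gradient accumulation steps, and the data-parallel degree (in samples; multiply by the sequence length to count tokens). A small illustrative helper:</p>

  <pre><code class="language-python">
# Global batch size bookkeeping (in samples; multiply by sequence length for tokens).
def global_batch_size(micro_batch_size, grad_accum_steps, dp):
    return micro_batch_size * grad_accum_steps * dp

# Example: hit a ~4M-token target at sequence length 4096
mbs, grad_acc, dp, seq_len = 2, 4, 128, 4096
gbs_tokens = global_batch_size(mbs, grad_acc, dp) * seq_len
print(gbs_tokens)  # 1024 samples -> 4,194,304 tokens
  </code></pre>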
 
  <p>Ok, now we have the model running in the general configuration we want in terms of model size and batch size, but are we training it the fastest way? Let's now start to optimize throughput as much as possible.</p>

 
  <h3>Step 3: Optimizing Training Throughput</h3>

  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>

  <ul>
  <li>Scale up Tensor Parallelism (using the fast intra-node bandwidth) until we reach a degree close to the node size, so that we can reduce other parallelism</li>
  <li>Increase Data Parallelism with ZeRO-3 while keeping the target batch size</li>
  <li>When Data Parallelism communication starts to become a bottleneck, transition to using Pipeline Parallelism</li>
  <li>Try scaling up different parallelisms one by one</li>
  <li>Experiment with several micro-batch sizes (mbs) to aim for an optimal balance between max GBS, model size, compute, and communication (see the small search sketch after this list).</li>
  </ul>
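  <p>One simple way to organize these experiments is a brute-force sweep over the parallelism degrees that divide your cluster size, briefly benchmarking each candidate. The sketch below is purely illustrative: <code>benchmark_tokens_per_sec</code> is a placeholder you would replace with a short profiling run of your own training setup.</p>

  <pre><code class="language-python">
# Illustrative sweep over parallelism configurations; benchmark_tokens_per_sec is a
# placeholder for a short real profiling run on the cluster.
from itertools import product

def best_config(world_size, target_gbs, seq_len, benchmark_tokens_per_sec):
    best = None
    for tp, pp, mbs in product([1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4]):
        if world_size % (tp * pp) != 0:
            continue                                 # degrees must tile the cluster
        dp = world_size // (tp * pp)
        if target_gbs % (mbs * dp) != 0:
            continue                                 # keep grad_acc an integer
        grad_acc = target_gbs // (mbs * dp)
        throughput = benchmark_tokens_per_sec(tp=tp, pp=pp, dp=dp,
                                              mbs=mbs, grad_acc=grad_acc,
                                              seq_len=seq_len)
        if best is None or throughput > best[0]:
            best = (throughput, dict(tp=tp, pp=pp, dp=dp, mbs=mbs, grad_acc=grad_acc))
    return best
  </code></pre>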
 
  <!-- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>

  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
  -->

  <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs, there is another way to improve model throughput and memory management: understanding how GPUs operate at a low level. This knowledge is necessary to take maximal advantage of large GPU clusters.</p>

  <p>Time to turn the lights off and activate CUDA mode! </p>