hynky (HF staff) committed on
Commit
d3b8b05
·
2 Parent(s): 1fa3117 e7323a7

Merge branch 'main' of hf.co:spaces/nanotron/Nanotron-Gigablogpost

Files changed (2)
  1. dist/index.html +101 -90
  2. src/index.html +101 -90
dist/index.html CHANGED
@@ -1592,21 +1592,28 @@
1592
 
1593
  <p>Congratulation reader, you have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
- <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
1596
- <li>Tensor Parallelism - along the hidden dimension</li>
1597
- <li>Sequence and Context Parallelism - along the sequence dimension</li>
1598
- <li>Pipeline Parallelism - along the model layers</li>
1599
- <li>Expert Parallelism - along the model experts</li>
1600
  </ol>
1601
 
1602
- <p>At this stage, one aspect you are probably curious about is how all these parallelism strategies (and ZeRO) compare to each other and how they interact with each other? In a nutshell, which one should we use and combine?</p>
 
 
 
 
 
1603
 
1604
- <p>Let’s take a look at the similarities and interplay. We'll start by bringing Pipeline parallelism are ZeRO-3 side-by-side as they have interesting similarities and differences.</p>
 
 
1605
 
1606
- <p><strong>Pipeline parallelism vs. ZeRO-3 -</strong> Both are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases the full layers are computed on device, as opposed to TP, where the layers are sharded for the computation.</p>
1607
  <aside>In the following we say “a layer” to simplify what should be in general called “a set of layer” (as the basis sharding unit of the model).</aside>
1608
 
1609
- <p>However, there are a few major differences between the two:</p>
1610
 
1611
  <div class="l-body">
1612
  <table>
@@ -1647,50 +1654,50 @@
1647
  </table>
1648
  </div>
1649
 
1650
- <p>As you can see, ZeRO-3 and PP sove the same challenge through quite different approaches, whether you decide to focus communication either on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory for each micro-batch in PP to minimize as much as possible the communication overhead.</p>
1651
 
1652
- <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them don't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1!</p>
1653
 
1654
- <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined.</p>
1655
 
1656
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1657
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1658
 
1659
 
1660
- <p>In practice TP has two important limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
1661
-
1662
- <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
1663
 
 
1664
 
1665
- <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us sharding activations, and can be seen as complimentary to TP The former handles long sequences while the latter enables distributed Mixture of Experts training.</p>
1666
 
1667
- <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
1668
 
1669
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1670
 
1671
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1672
 
1673
 
1674
- <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
 
1675
 
1676
- <p>It's worth noting the scope of impact for these different parallelism strategies:</p>
 
 
 
 
 
 
 
 
 
1677
 
1678
  <ul>
1679
- <li>Tensor Parallelism (with Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1680
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1681
  <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 
1682
  </ul>
1683
 
1684
- <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1685
-
1686
- <div class="note-box">
1687
- <p class="note-box-title">📝 Note</p>
1688
- <div class="note-box-content">
1689
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1690
- </div>
1691
- </div>
1692
-
1693
-
1694
  <table>
1695
  <thead>
1696
  <tr>
@@ -1723,25 +1730,22 @@
1723
  </tbody>
1724
  </table>
1725
 
1726
- <p>Which leads us to this beautiful diagram to summarize all what weve seen:</p>
 
1727
 
1728
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1729
 
1730
- <p>And to have an idea of the memory benefits of each parallelism:</p>
1731
 
1732
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1733
 
1734
- <h2>How to Find the Best Training Configuration</h2>
1735
-
1736
- <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models. There remain a general question: which ones should we choose and which ones are best combined? We touched a little bit on this at the end of the last section but in this section we will walk through the decision process step by step.</p>
1737
-
1738
- <p>First let's have a quick look at each parallel strategy and how it helps and at what cost it comes:</p>
1739
 
1740
  <table>
1741
  <thead>
1742
  <tr>
1743
  <th><strong>Method</strong></th>
1744
- <th><strong>Memory savings</strong></th>
1745
  <th><strong>Parallel/sharding dimension</strong></th>
1746
  <th><strong>Disadvantage</strong></th>
1747
  </tr>
@@ -1749,37 +1753,19 @@
1749
  <tbody>
1750
  <tr>
1751
  <td>DP</td>
1752
- <td>None (replicates everything)</td>
1753
  <td>Batch</td>
1754
  <td>Limited by max batch size</td>
1755
  </tr>
1756
- <tr>
1757
- <td>ZeRO-1</td>
1758
- <td>Optimizer states</td>
1759
- <td>Batch</td>
1760
- <td>Params communication overhead</td>
1761
- </tr>
1762
- <tr>
1763
- <td>ZeRO-2</td>
1764
- <td>Optimizer states and gradients</td>
1765
- <td>Batch</td>
1766
- <td>Params communication overhead</td>
1767
- </tr>
1768
- <tr>
1769
- <td>ZeRO-3</td>
1770
- <td>Optimizer states, gradients, and model parameters</td>
1771
- <td>Batch and Model Params</td>
1772
- <td>Params communication overhead</td>
1773
- </tr>
1774
  <tr>
1775
  <td>PP</td>
1776
- <td>Model</td>
1777
  <td>Model layers</td>
1778
  <td>Idle bubble and complex schedules</td>
1779
  </tr>
1780
  <tr>
1781
  <td>TP/SP</td>
1782
- <td>Model and activations</td>
1783
  <td>Hidden dimension / Sequence length</td>
1784
  <td>Requires high bandwidth communication</td>
1785
  </tr>
@@ -1787,86 +1773,111 @@
1787
  <td>CP</td>
1788
  <td>Activations</td>
1789
  <td>Sequence length</td>
1790
- <td>Communication overhead in attention</td>
1791
  </tr>
1792
  <tr>
1793
  <td>EP</td>
1794
  <td>Experts parameters</td>
1795
  <td>Expert dimension</td>
1796
- <td>Requires MoE layers, routing overhead</td>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1797
  </tr>
1798
  </tbody>
1799
  </table>
1800
 
1801
- <p>Clearly, there is no free lunch for any of those methods but we can actually come up with a few rules that help finding a good starting point. To find the definitive optimal setup you'll have to run a few experiments in any case.</p>
 
 
 
 
 
 
1802
 
1803
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1804
 
1805
- <p>First, we need to figure out how we can fit a single model instance on GPUs. There are two general cases.</p>
1806
 
1807
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1808
  <ul>
1809
- <li>For models under 10B parameters, you can use either Tensor Parallelism or Data Parallelism with ZeRO-3 and Full Recompute across 8 GPUs</li>
1810
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
1811
  <ul>
1812
- <li>Tensor Parallelism (TP=8) combined with Pipeline Parallelism</li>
1813
- <li>Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
1814
- <li>Pure Data Parallelism with ZeRO-3</li>
1815
  </ul>
1816
- <li>At 512+ GPU scale, pure Data Parallelism becomes inefficient - better to combine DP with either Tensor or Pipeline Parallelism</li>
1817
- <li>At 1024+ GPU scale, the recommended setup is TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1818
  </ul>
1819
 
1820
  <p>Special considerations:</p>
1821
  <ul>
1822
- <li>For very long sequences, add Context Parallelism (CP) across nodes</li>
1823
- <li>For Mixture of Experts architectures, use Expert Parallelism (EP) across nodes</li>
1824
  </ul>
1825
 
1826
- <p>GPU-poor case 😭 - when running out of GPU resources:</p>
1827
  <ul>
1828
- <li>Enable full activation recomputation to trade compute for memory</li>
1829
- <li>Use gradient accumulation to process larger batches with limited memory
1830
  </li>
1831
  </ul>
1832
 
1833
- <p>Now that we have a single model instance training, we need to make sure we have the right batch size.</p>
1834
 
1835
  <h3>Step 2: Achieving Target Global Batch Size </h3>
1836
 
1837
- <p>Depending on how we setup in step one in terms of micro batch size and DP, our current batch size might be too small or big. </p>
1838
 
1839
- <p>To increase global batch size:</p>
1840
  <ul>
1841
- <li>Scale up Data Parallelism or gradient accumulation steps</li>
1842
- <li>For long sequences, leverage Context Parallelism</li>
1843
  </ul>
1844
 
1845
- <p>To decrease global batch size:</p>
1846
  <ul>
1847
- <li>Reduce Data Parallelism in favor of other parallelization strategies</li>
1848
- <li>For long sequences, reduce Context Parallelism</li>
1849
  </ul>
1850
 
1851
- <p>Ok, now we have the model running in the configuration we want, but is it the fastest way? Let's optimize throughput next.</p>
1852
 
1853
  <h3>Step 3: Optimizing Training Throughput</h3>
1854
 
1855
  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>
1856
 
1857
  <ul>
1858
- <li>Scale up Tensor Parallelism within node to reduce other parallelism requirements</li>
1859
- <li>Increase Data Parallelism with ZeRO-3 while maintaining target batch size</li>
1860
- <li>When Data Parallelism communication becomes a bottleneck, transition to Pipeline Parallelism</li>
1861
- <li>Try scaling up different parallelisms, and fitting max micro batch size (mbs) to find optimal balance between max GBS, model size, compute, and communication.</li>
 
1862
  </ul>
1863
 
1864
- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>
1865
 
1866
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
1867
 
1868
-
1869
- <p>This concludes our very deep dive into the distribution methods of 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. </p>
1870
 
1871
  <p>Time to turn the lights off and activate CUDA mode! </p>
1872
 
 
1592
 
1593
 <p>Congratulations, reader! You have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
+ <li>Data Parallelism (DP) – along the batch dimension</li>
1596
+ <li>Tensor Parallelism (TP) - along the hidden dimension</li>
1597
+ <li>Sequence and Context Parallelism (SP/CP) - along the sequence dimension</li>
1598
+ <li>Pipeline Parallelism (PP) - along the model layers</li>
1599
+ <li>Expert Parallelism (EP) - along the model experts</li>
1600
  </ol>
1601
 
1602
+ <p>As well as the 3 ZeRO strategies which can be combined with Data Parallelism for memory reduction: </p>
1603
+ <ol>
1604
+ <li>ZeRO-1 – sharding optimizer states among the DP replicas</li>
1605
+ <li>ZeRO-2 – sharding optimizer states and gradients among the DP replicas</li>
1606
+ <li>ZeRO-3 – sharding optimizer states, gradients and parameters among the DP replicas</li>
1607
+ </ol>
1608
 
1609
+ <p>At this stage, one aspect you are probably curious about is how all these parallelism and ZeRO strategies compare to, and interact with, each other. In other words, which ones should we use and efficiently combine, and which ones should we rather keep separate?</p>
1610
+
1611
+ <p>Let’s take a look at the similarities and interplay. We'll start by comparing Pipeline Parallelism and ZeRO-3 side-by-side, as they have some very close similarities but also important differences.</p>
1612
 
1613
+ <p><strong>Pipeline Parallelism vs. ZeRO-3 -</strong> Both PP and ZeRO-3 are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases full layer operations are computed on each device, as opposed to TP or EP, for instance, in which computations are performed on sub-layer units.</p>
1614
 <aside>In the following we say “a layer” to simplify what should in general be called “a set of layers” (the basic sharding unit of the model).</aside>
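<p>To make this concrete, here is a toy sketch of ZeRO-3-style prefetching on a small stack of linear layers (a hypothetical, minimal example, not Nanotron's actual implementation): each rank stores only a flat shard of every layer's weights, all-gathers the next layer's weights while the current layer computes, and frees the full weights right after use. PP would instead keep its own layers resident and communicate activations between stages.</p>
<pre><code class="language-python">
import torch
import torch.distributed as dist

def zero3_mlp_forward(weight_shards, shapes, x, group):
    """Toy ZeRO-3-style forward over a stack of linear layers.
    weight_shards[i]: this rank's flat shard of layer i's weight matrix.
    shapes[i]: (d_in, d_out) of layer i."""
    world = dist.get_world_size(group)

    def gather(i):
        bufs = [torch.empty_like(weight_shards[i]) for _ in range(world)]
        handle = dist.all_gather(bufs, weight_shards[i], group=group, async_op=True)
        return handle, bufs

    handle, bufs = gather(0)                   # start fetching layer 0's weights
    for i, (d_in, d_out) in enumerate(shapes):
        handle.wait()                          # layer i's full weights are now materialized
        w = torch.cat(bufs).view(d_in, d_out)  # rebuild the full weight matrix
        if i + 1 != len(shapes):
            handle, bufs = gather(i + 1)       # prefetch layer i+1 while computing layer i
        x = torch.relu(x @ w)                  # the full layer runs on this device
        del w                                  # ZeRO-3: release the full weights right away
    return x
</code></pre>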
1615
 
1616
+ <p>However, there are a few major differences between PP and ZeRO-3 approaches:</p>
1617
 
1618
  <div class="l-body">
1619
  <table>
 
1654
  </table>
1655
  </div>
1656
 
1657
+ <p>As you can see, ZeRO-3 and PP solve the same challenge but involve different approaches, and the choice between the two will depend on whether you decide to focus communication on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory during the series of PP micro-batches to minimize unnecessary communication overhead as much as possible.</p>
1658
 
1659
+ <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them doesn't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1 (sic).</p>
1660
 
1661
+ <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1662
 
1663
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1664
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
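<p>As a tiny, framework-independent illustration of that distributive property: splitting a weight matrix column-wise and concatenating the partial results reproduces the full matmul, which is exactly what lets TP shard the computation across ranks and recombine it afterwards.</p>
<pre><code class="language-python">
import torch

x = torch.randn(4, 16)                  # activations (batch, hidden)
w = torch.randn(16, 32)                 # a weight matrix we "shard" over 2 ranks
w0, w1 = w.chunk(2, dim=1)              # column-parallel shards

full = x @ w
sharded = torch.cat([x @ w0, x @ w1], dim=1)  # the concat plays the role of the all-gather
assert torch.allclose(full, sharded, atol=1e-5)
</code></pre>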
1665
 
1666
 
1667
+ <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
 
1668
 
1669
+ <p>As a consequence, when combining parallelism strategies, TP will typically be kept for high-speed intra-node communications, while ZeRO-3 or PP can be used for parallelism groups spanning lower-speed inter-node communications, as their communication patterns require less bandwidth (for PP) or can be more easily overlapped with computation (for ZeRO-3). The main consideration when combining these techniques is to organize the GPUs efficiently into groups for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations. For instance, the groups of GPUs communicating for TP should be kept inside nodes.</p>
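<p>As an illustration, here is a minimal sketch (hypothetical sizes, using PyTorch's generic <code>DeviceMesh</code> utility rather than Nanotron's own process-group setup) of such a layout for a 4-node cluster with 8 GPUs per node: TP groups stay inside nodes, DP/ZeRO groups span nodes.</p>
<pre><code class="language-python">
# assumes torchrun has launched 32 processes (4 nodes x 8 GPUs) and NCCL is available
from torch.distributed.device_mesh import init_device_mesh

# ranks laid out as (dp=4, tp=8): ranks 0-7 share a node and form one TP group,
# while ranks {0, 8, 16, 24} form one DP group spanning the four nodes
mesh = init_device_mesh("cuda", mesh_shape=(4, 8), mesh_dim_names=("dp", "tp"))
tp_group = mesh.get_group("tp")  # intra-node: TP all-reduce / all-gather on fast links
dp_group = mesh.get_group("dp")  # inter-node: ZeRO/DP gradient communication
</code></pre>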
1670
 
1671
+ <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us shard activations, and can be seen as complementary to TP. The first handles long sequences while the second enables distributed Mixture of Experts training, and they can be combined without any particular issue.</p>
1672
 
1673
+ <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in <a target="_self" href="#context_parallelism"> CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1674
 
1675
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1676
 
1677
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
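<p>To show where that communication sits, here is a deliberately naive context-parallel attention sketch that simply all-gathers K/V instead of using the ring pattern described above (a toy illustration with no causal masking, not the production algorithm): activations stay sharded along the sequence dimension, and communication only appears inside attention.</p>
<pre><code class="language-python">
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cp_attention(q_local, k_local, v_local, group):
    """q/k/v_local: [batch, heads, seq_local, head_dim], this rank's sequence shard."""
    world = dist.get_world_size(group)
    k_bufs = [torch.empty_like(k_local) for _ in range(world)]
    v_bufs = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_bufs, k_local, group=group)  # the only communication: K/V exchange
    dist.all_gather(v_bufs, v_local, group=group)
    k_full = torch.cat(k_bufs, dim=2)              # keys over the full sequence
    v_full = torch.cat(v_bufs, dim=2)              # values over the full sequence
    # local queries attend to the full sequence; the output stays sequence-sharded
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
</code></pre>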
1678
 
1679
 
1680
+ <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1681
+ <aside>For instance DeepSeek V3 uses 256 experts.</aside>
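<p>A minimal sketch of the dispatch half of that all-to-all (a hypothetical helper assuming experts are split evenly across the EP group; the return path after the expert MLPs is symmetric and not shown):</p>
<pre><code class="language-python">
import torch
import torch.distributed as dist

def ep_dispatch(tokens, expert_ids, num_experts, group):
    """tokens: [num_tokens, hidden]; expert_ids: [num_tokens], chosen by the router."""
    world = dist.get_world_size(group)
    experts_per_rank = num_experts // world
    dest_rank = expert_ids // experts_per_rank      # which rank hosts each token's expert

    order = torch.argsort(dest_rank)                # group tokens by destination rank
    send_buf = tokens[order].contiguous()
    send_counts = torch.bincount(dest_rank, minlength=world)

    # exchange the counts first so every rank knows how many tokens it will receive
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)

    recv_buf = tokens.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist(),
                           group=group)
    # recv_buf now holds the tokens assigned to this rank's local experts
    return recv_buf, order, send_counts, recv_counts
</code></pre>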
1682
 
1683
+ <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1684
+
1685
+ <div class="note-box">
1686
+ <p class="note-box-title">📝 Note</p>
1687
+ <div class="note-box-content">
1688
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1689
+ </div>
1690
+ </div>
1691
+
1692
+ <p><strong>Scope and focus -</strong> Let's also quickly summarize the sub-parts of the model where these different parallelism strategies have the most impact:</p>
1693
 
1694
  <ul>
1695
+ <li>Tensor Parallelism (and Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1696
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1697
  <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
1698
+ <li>Pipeline Parallelism and ZeRO are not especially specific to any sub-module or component, with the exception that modules and layers need to be balanced in Pipeline Parallelism; the first and last layers are thus often treated differently due to the additional embedding layers.</li>
1699
  </ul>
1700
 
 
 
 
 
 
 
 
 
 
 
1701
  <table>
1702
  <thead>
1703
  <tr>
 
1730
  </tbody>
1731
  </table>
1732
 
1733
+ <p><strong>Summarizing it all -</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them all? Yes, we're up for the challenge!</p>
1734
+ <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer, in its MoE variant. We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1735
 
1736
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1737
 
1738
+ <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths as well as with selective (top) and full (bottom) recomputation so you can see how they all play with activations:</p>
1739
 
1740
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1741
 
1742
+ <p>Let's finish this section with a high-level view of all these techniques, their main underlying ideas and major bottlenecks:</p>
 
 
 
 
1743
 
1744
  <table>
1745
  <thead>
1746
  <tr>
1747
  <th><strong>Method</strong></th>
1748
+ <th><strong>Memory savings apply specifically to</strong></th>
1749
  <th><strong>Parallel/sharding dimension</strong></th>
1750
  <th><strong>Disadvantage</strong></th>
1751
  </tr>
 
1753
  <tbody>
1754
  <tr>
1755
  <td>DP</td>
1756
+ <td>Activations (reduce local batch size)</td>
1757
  <td>Batch</td>
1758
  <td>Limited by max batch size</td>
1759
  </tr>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1760
  <tr>
1761
  <td>PP</td>
1762
+ <td>Model parameters</td>
1763
  <td>Model layers</td>
1764
  <td>Idle bubble and complex schedules</td>
1765
  </tr>
1766
  <tr>
1767
  <td>TP/SP</td>
1768
+ <td>Model parameters and activations</td>
1769
  <td>Hidden dimension / Sequence length</td>
1770
  <td>Requires high bandwidth communication</td>
1771
  </tr>
 
1773
  <td>CP</td>
1774
  <td>Activations</td>
1775
  <td>Sequence length</td>
1776
+ <td>Adds communication overhead in attention modules</td>
1777
  </tr>
1778
  <tr>
1779
  <td>EP</td>
1780
  <td>Experts parameters</td>
1781
  <td>Expert dimension</td>
1782
+ <td>Requires MoE layers, adds routing communication overhead</td>
1783
+ </tr>
1784
+ <tr>
1785
+ <td>ZeRO-1</td>
1786
+ <td>Optimizer states</td>
1787
+ <td>Sharded among DP replicas</td>
1788
+ <td>Params communication overhead</td>
1789
+ </tr>
1790
+ <tr>
1791
+ <td>ZeRO-2</td>
1792
+ <td>Optimizer states and gradients</td>
1793
+ <td>Sharded among DP replicas</td>
1794
+ <td>Params communication overhead</td>
1795
+ </tr>
1796
+ <tr>
1797
+ <td>ZeRO-3</td>
1798
+ <td>Optimizer states, gradients, and model parameters</td>
1799
+ <td>Sharded among DP replicas</td>
1800
+ <td>Params communication overhead</td>
1801
  </tr>
1802
  </tbody>
1803
  </table>
1804
 
1805
+ <p>Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we actually come up with a few rules that help us find a good starting point for selecting and combining them? This will be the topic of our next section.</p>
1806
+
1807
+ <h2>How to Find the Best Training Configuration</h2>
1808
+
1809
+ <p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined together. There remains a general question: which ones should we choose in the end, and how should we decide on a specific combination?</p>
1810
+
1811
+ <p>We touched on this a little in the previous section, but let's now walk in detail through a possible decision process, step by step, keeping in mind that you'll always have to run a few experiments to find the definitive optimal setup for your compute cluster given its various physical properties: network bandwidth, GPUs per node, memory per GPU, etc.</p>
1812
 
1813
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1814
 
1815
+ <p>First, we need to figure out how we can fit a full model instance on our GPUs (we focus on a single instance for now, even though we may also be using DP with ZeRO across several replicas). There are two general cases.</p>
1816
 
1817
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1818
  <ul>
1819
+ <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
1820
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
1821
  <ul>
1822
+ <li>Combining Tensor Parallelism (TP=8) with Pipeline Parallelism</li>
1823
+ <li>Combining Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
1824
+ <li>Using only ZeRO-3 (i.e. only pure Data Parallelism) </li>
1825
  </ul>
1826
+ <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to become inefficient due to communication cost - it is then often better to combine DP with either Tensor or Pipeline Parallelism</li>
1827
+ <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism (TP=8) combined with Data Parallelism (ZeRO-2) and Pipeline Parallelism (see the sizing sketch after this list)</li>
1828
  </ul>
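<p>As a purely illustrative sanity check for the 1024+ GPU setup mentioned in the list above (all numbers are hypothetical), the parallelism degrees have to multiply to the cluster size, and the resulting DP degree then sets the global batch size together with the micro batch size and gradient accumulation:</p>
<pre><code class="language-python">
world_size = 1024
tp, pp, cp, ep = 8, 4, 1, 1              # intra-node TP, inter-node PP, no CP/EP here
dp = world_size // (tp * pp * cp * ep)   # -> 32 data-parallel replicas
assert tp * pp * cp * ep * dp == world_size

mbs, grad_acc, seq_len = 2, 16, 4096
gbs_samples = dp * mbs * grad_acc        # 1024 samples per optimizer step
gbs_tokens = gbs_samples * seq_len       # about 4.2M tokens per step
print(f"dp={dp}, global batch = {gbs_samples} samples = {gbs_tokens:,} tokens")
</code></pre>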
1829
 
1830
  <p>Special considerations:</p>
1831
  <ul>
1832
+ <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
1833
+ <li>For Mixture of Experts architectures, you will advantageously use Expert Parallelism (EP) across nodes.</li>
1834
  </ul>
1835
 
1836
+ <p>GPU-poor case 😭 - when you might be low on GPU resources:</p>
1837
  <ul>
1838
+ <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
1839
+ <li>You can increase gradient accumulation to process larger batches with limited memory.
1840
  </li>
1841
  </ul>
1842
 
1843
+ <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>
1844
 
1845
  <h3>Step 2: Achieving Target Global Batch Size </h3>
1846
 
1847
+ <p>Depending on where step 1 left us in terms of micro batch size and DP, our current batch size might be too small or too big. It's now time to hit our target batch size.</p>
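<p>Since the global batch size follows the relation gbs = dp × mbs × grad_acc we used in the data-parallelism section, a tiny (hypothetical) helper is enough to pick the gradient accumulation steps that hit a given target:</p>
<pre><code class="language-python">
def grad_acc_for_target(target_gbs: int, dp: int, mbs: int) -> int:
    """Return the gradient accumulation steps so that dp * mbs * grad_acc == target_gbs."""
    per_step = dp * mbs                  # samples processed per accumulation step
    if target_gbs % per_step != 0:
        raise ValueError(f"target gbs {target_gbs} is not divisible by dp*mbs={per_step}; "
                         "adjust dp, mbs or the target")
    return target_gbs // per_step

# e.g. dp=32 replicas with mbs=2 need 16 accumulation steps for a 1024-sample global batch
print(grad_acc_for_target(1024, dp=32, mbs=2))   # -> 16
</code></pre>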
1848
 
1849
+ <p>To increase our current global batch size:</p>
1850
  <ul>
1851
+ <li>We can scale up Data Parallelism or gradient accumulation steps</li>
1852
+ <li>For long sequences, we can leverage Context Parallelism</li>
1853
  </ul>
1854
 
1855
+ <p>To decrease our current global batch size:</p>
1856
  <ul>
1857
+ <li>We can reduce Data Parallelism in favor of other parallelization strategies</li>
1858
+ <li>For long sequences, we can reduce Context Parallelism</li>
1859
  </ul>
1860
 
1861
+ <p>Ok, now we have the model running in the general configuration we want in terms of model size and batch size, but are we training it the fastest way? Let's now start to optimize throughput as much as possible.</p>
1862
 
1863
  <h3>Step 3: Optimizing Training Throughput</h3>
1864
 
1865
  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>
1866
 
1867
  <ul>
1868
+ <li>Scale up Tensor Parallelism (using the fast intra-node bandwidth) until we reach a degree close to the node size, so that we can reduce the other parallelism degrees</li>
1869
+ <li>Increase Data Parallelism with ZeRO-3 while keeping target batch size</li>
1870
+ <li>When Data Parallelism communication starts to become a bottleneck, transition to using Pipeline Parallelism</li>
1871
+ <li>Try scaling up different parallelisms one by one</li>
1872
+ <li>Experiment with several micro batch sizes (mbs) to aim for an optimal balance between max GBS, model size, compute, and communication (see the sweep sketch after this list).</li>
1873
  </ul>
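<p>The sweep mentioned in the last bullet could look like the following sketch (illustrative only, with assumed cluster sizes): enumerate the (tp, pp, dp) factorizations that keep TP inside a node and still divide the target global batch size, then benchmark each candidate on the real cluster.</p>
<pre><code class="language-python">
def candidate_configs(world_size, gpus_per_node, target_gbs, mbs):
    """Yield (tp, pp, dp, grad_acc) combinations worth benchmarking; all sizes are assumptions."""
    for tp in (1, 2, 4, 8):
        if tp > gpus_per_node:
            continue                     # keep TP on the fast intra-node links
        for pp in (1, 2, 4, 8, 16):
            if world_size % (tp * pp):
                continue
            dp = world_size // (tp * pp)
            if target_gbs % (dp * mbs):
                continue                 # gradient accumulation must come out as an integer
            yield dict(tp=tp, pp=pp, dp=dp, grad_acc=target_gbs // (dp * mbs))

for cfg in candidate_configs(world_size=512, gpus_per_node=8, target_gbs=2048, mbs=2):
    print(cfg)                           # each candidate would then be timed on the cluster
</code></pre>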
1874
 
1875
+ <!-- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>
1876
 
1877
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1878
+ -->
1879
 
1880
+ <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs, there is another way to improve model throughput and memory management. It involves a better understanding of how GPUs operate at a low level and is necessary knowledge for taking maximal advantage of large GPU clusters.</p>
 
1881
 
1882
  <p>Time to turn the lights off and activate CUDA mode! </p>
1883
 
src/index.html CHANGED
@@ -1592,21 +1592,28 @@
1592
 
1593
  <p>Congratulation reader, you have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
- <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
1596
- <li>Tensor Parallelism - along the hidden dimension</li>
1597
- <li>Sequence and Context Parallelism - along the sequence dimension</li>
1598
- <li>Pipeline Parallelism - along the model layers</li>
1599
- <li>Expert Parallelism - along the model experts</li>
1600
  </ol>
1601
 
1602
- <p>At this stage, one aspect you are probably curious about is how all these parallelism strategies (and ZeRO) compare to each other and how they interact with each other? In a nutshell, which one should we use and combine?</p>
 
 
 
 
 
1603
 
1604
- <p>Let’s take a look at the similarities and interplay. We'll start by bringing Pipeline parallelism are ZeRO-3 side-by-side as they have interesting similarities and differences.</p>
 
 
1605
 
1606
- <p><strong>Pipeline parallelism vs. ZeRO-3 -</strong> Both are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases the full layers are computed on device, as opposed to TP, where the layers are sharded for the computation.</p>
1607
  <aside>In the following we say “a layer” to simplify what should be in general called “a set of layer” (as the basis sharding unit of the model).</aside>
1608
 
1609
- <p>However, there are a few major differences between the two:</p>
1610
 
1611
  <div class="l-body">
1612
  <table>
@@ -1647,50 +1654,50 @@
1647
  </table>
1648
  </div>
1649
 
1650
- <p>As you can see, ZeRO-3 and PP sove the same challenge through quite different approaches, whether you decide to focus communication either on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory for each micro-batch in PP to minimize as much as possible the communication overhead.</p>
1651
 
1652
- <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them don't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1!</p>
1653
 
1654
- <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined.</p>
1655
 
1656
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1657
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1658
 
1659
 
1660
- <p>In practice TP has two important limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
1661
-
1662
- <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
1663
 
 
1664
 
1665
- <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us sharding activations, and can be seen as complimentary to TP The former handles long sequences while the latter enables distributed Mixture of Experts training.</p>
1666
 
1667
- <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
1668
 
1669
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1670
 
1671
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1672
 
1673
 
1674
- <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
 
1675
 
1676
- <p>It's worth noting the scope of impact for these different parallelism strategies:</p>
 
 
 
 
 
 
 
 
 
1677
 
1678
  <ul>
1679
- <li>Tensor Parallelism (with Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1680
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1681
  <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 
1682
  </ul>
1683
 
1684
- <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1685
-
1686
- <div class="note-box">
1687
- <p class="note-box-title">📝 Note</p>
1688
- <div class="note-box-content">
1689
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1690
- </div>
1691
- </div>
1692
-
1693
-
1694
  <table>
1695
  <thead>
1696
  <tr>
@@ -1723,25 +1730,22 @@
1723
  </tbody>
1724
  </table>
1725
 
1726
- <p>Which leads us to this beautiful diagram to summarize all what weve seen:</p>
 
1727
 
1728
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1729
 
1730
- <p>And to have an idea of the memory benefits of each parallelism:</p>
1731
 
1732
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1733
 
1734
- <h2>How to Find the Best Training Configuration</h2>
1735
-
1736
- <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models. There remain a general question: which ones should we choose and which ones are best combined? We touched a little bit on this at the end of the last section but in this section we will walk through the decision process step by step.</p>
1737
-
1738
- <p>First let's have a quick look at each parallel strategy and how it helps and at what cost it comes:</p>
1739
 
1740
  <table>
1741
  <thead>
1742
  <tr>
1743
  <th><strong>Method</strong></th>
1744
- <th><strong>Memory savings</strong></th>
1745
  <th><strong>Parallel/sharding dimension</strong></th>
1746
  <th><strong>Disadvantage</strong></th>
1747
  </tr>
@@ -1749,37 +1753,19 @@
1749
  <tbody>
1750
  <tr>
1751
  <td>DP</td>
1752
- <td>None (replicates everything)</td>
1753
  <td>Batch</td>
1754
  <td>Limited by max batch size</td>
1755
  </tr>
1756
- <tr>
1757
- <td>ZeRO-1</td>
1758
- <td>Optimizer states</td>
1759
- <td>Batch</td>
1760
- <td>Params communication overhead</td>
1761
- </tr>
1762
- <tr>
1763
- <td>ZeRO-2</td>
1764
- <td>Optimizer states and gradients</td>
1765
- <td>Batch</td>
1766
- <td>Params communication overhead</td>
1767
- </tr>
1768
- <tr>
1769
- <td>ZeRO-3</td>
1770
- <td>Optimizer states, gradients, and model parameters</td>
1771
- <td>Batch and Model Params</td>
1772
- <td>Params communication overhead</td>
1773
- </tr>
1774
  <tr>
1775
  <td>PP</td>
1776
- <td>Model</td>
1777
  <td>Model layers</td>
1778
  <td>Idle bubble and complex schedules</td>
1779
  </tr>
1780
  <tr>
1781
  <td>TP/SP</td>
1782
- <td>Model and activations</td>
1783
  <td>Hidden dimension / Sequence length</td>
1784
  <td>Requires high bandwidth communication</td>
1785
  </tr>
@@ -1787,86 +1773,111 @@
1787
  <td>CP</td>
1788
  <td>Activations</td>
1789
  <td>Sequence length</td>
1790
- <td>Communication overhead in attention</td>
1791
  </tr>
1792
  <tr>
1793
  <td>EP</td>
1794
  <td>Experts parameters</td>
1795
  <td>Expert dimension</td>
1796
- <td>Requires MoE layers, routing overhead</td>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1797
  </tr>
1798
  </tbody>
1799
  </table>
1800
 
1801
- <p>Clearly, there is no free lunch for any of those methods but we can actually come up with a few rules that help finding a good starting point. To find the definitive optimal setup you'll have to run a few experiments in any case.</p>
 
 
 
 
 
 
1802
 
1803
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1804
 
1805
- <p>First, we need to figure out how we can fit a single model instance on GPUs. There are two general cases.</p>
1806
 
1807
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1808
  <ul>
1809
- <li>For models under 10B parameters, you can use either Tensor Parallelism or Data Parallelism with ZeRO-3 and Full Recompute across 8 GPUs</li>
1810
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
1811
  <ul>
1812
- <li>Tensor Parallelism (TP=8) combined with Pipeline Parallelism</li>
1813
- <li>Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
1814
- <li>Pure Data Parallelism with ZeRO-3</li>
1815
  </ul>
1816
- <li>At 512+ GPU scale, pure Data Parallelism becomes inefficient - better to combine DP with either Tensor or Pipeline Parallelism</li>
1817
- <li>At 1024+ GPU scale, the recommended setup is TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1818
  </ul>
1819
 
1820
  <p>Special considerations:</p>
1821
  <ul>
1822
- <li>For very long sequences, add Context Parallelism (CP) across nodes</li>
1823
- <li>For Mixture of Experts architectures, use Expert Parallelism (EP) across nodes</li>
1824
  </ul>
1825
 
1826
- <p>GPU-poor case 😭 - when running out of GPU resources:</p>
1827
  <ul>
1828
- <li>Enable full activation recomputation to trade compute for memory</li>
1829
- <li>Use gradient accumulation to process larger batches with limited memory
1830
  </li>
1831
  </ul>
1832
 
1833
- <p>Now that we have a single model instance training, we need to make sure we have the right batch size.</p>
1834
 
1835
  <h3>Step 2: Achieving Target Global Batch Size </h3>
1836
 
1837
- <p>Depending on how we setup in step one in terms of micro batch size and DP, our current batch size might be too small or big. </p>
1838
 
1839
- <p>To increase global batch size:</p>
1840
  <ul>
1841
- <li>Scale up Data Parallelism or gradient accumulation steps</li>
1842
- <li>For long sequences, leverage Context Parallelism</li>
1843
  </ul>
1844
 
1845
- <p>To decrease global batch size:</p>
1846
  <ul>
1847
- <li>Reduce Data Parallelism in favor of other parallelization strategies</li>
1848
- <li>For long sequences, reduce Context Parallelism</li>
1849
  </ul>
1850
 
1851
- <p>Ok, now we have the model running in the configuration we want, but is it the fastest way? Let's optimize throughput next.</p>
1852
 
1853
  <h3>Step 3: Optimizing Training Throughput</h3>
1854
 
1855
  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>
1856
 
1857
  <ul>
1858
- <li>Scale up Tensor Parallelism within node to reduce other parallelism requirements</li>
1859
- <li>Increase Data Parallelism with ZeRO-3 while maintaining target batch size</li>
1860
- <li>When Data Parallelism communication becomes a bottleneck, transition to Pipeline Parallelism</li>
1861
- <li>Try scaling up different parallelisms, and fitting max micro batch size (mbs) to find optimal balance between max GBS, model size, compute, and communication.</li>
 
1862
  </ul>
1863
 
1864
- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>
1865
 
1866
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
1867
 
1868
-
1869
- <p>This concludes our very deep dive into the distribution methods of 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. </p>
1870
 
1871
  <p>Time to turn the lights off and activate CUDA mode! </p>
1872
 
 
1592
 
1593
 <p>Congratulations, reader! You have now seen all 5 parallelism strategies you can use to scale model training: </p>
1594
  <ol>
1595
+ <li>Data Parallelism (DP) – along the batch dimension</li>
1596
+ <li>Tensor Parallelism (TP) - along the hidden dimension</li>
1597
+ <li>Sequence and Context Parallelism (SP/CP) - along the sequence dimension</li>
1598
+ <li>Pipeline Parallelism (PP) - along the model layers</li>
1599
+ <li>Expert Parallelism (EP) - along the model experts</li>
1600
  </ol>
1601
 
1602
+ <p>As well as the 3 ZeRO strategies which can be combined with Data Parallelism for memory reduction: </p>
1603
+ <ol>
1604
+ <li>ZeRO-1 – sharding optimizer states among the DP replicas</li>
1605
+ <li>ZeRO-2 – sharding optimizer states and gradients among the DP replicas</li>
1606
+ <li>ZeRO-3 – sharding optimizer states, gradients and parameters among the DP replicas</li>
1607
+ </ol>
1608
 
1609
+ <p>At this stage, one aspect you are probably curious about is how all these parallelism and ZeRO strategies compare to, and interact with, each other. In other words, which ones should we use and efficiently combine, and which ones should we rather keep separate?</p>
1610
+
1611
+ <p>Let’s take a look at the similarities and interplay. We'll start by comparing Pipeline Parallelism and ZeRO-3 side-by-side, as they have some very close similarities but also important differences.</p>
1612
 
1613
+ <p><strong>Pipeline Parallelism vs. ZeRO-3 -</strong> Both PP and ZeRO-3 are ways to partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example in ZeRO-3, we prefetch the next layer while computing). This means in both cases full layer operations are computed on each device, as opposed to TP or EP, for instance, in which computations are performed on sub-layer units.</p>
1614
 <aside>In the following we say “a layer” to simplify what should in general be called “a set of layers” (the basic sharding unit of the model).</aside>
1615
 
1616
+ <p>However, there are a few major differences between PP and ZeRO-3 approaches:</p>
1617
 
1618
  <div class="l-body">
1619
  <table>
 
1654
  </table>
1655
  </div>
1656
 
1657
+ <p>As you can see, ZeRO-3 and PP solve the same challenge but involve different approaches, and the choice between the two will depend on whether you decide to focus communication on weights or on activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory during the series of PP micro-batches to minimize unnecessary communication overhead as much as possible.</p>
1658
 
1659
+ <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with Pipeline Parallelism and are complementary to it. Combining them doesn't raise any particular new challenge. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1 (sic).</p>
1660
 
1661
+ <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1662
 
1663
  <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1000px; max-width: none;" />
1664
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1665
 
1666
 
1667
+ <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
 
1668
 
1669
+ <p>As a consequence, when combining parallelism strategies, TP will typically be kept for high-speed intra-node communications, while ZeRO-3 or PP can be used for parallelism groups spanning lower-speed inter-node communications, as their communication patterns require less bandwidth (for PP) or can be more easily overlapped with computation (for ZeRO-3). The main consideration when combining these techniques is to organize the GPUs efficiently into groups for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations. For instance, the groups of GPUs communicating for TP should be kept inside nodes.</p>
1670
 
1671
+ <p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us shard activations, and can be seen as complementary to TP. The first handles long sequences while the second enables distributed Mixture of Experts training, and they can be combined without any particular issue.</p>
1672
 
1673
+ <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in <a target="_self" href="#context_parallelism"> CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
1674
 
1675
  <img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1000px; max-width: none;" />
1676
 
1677
  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
1678
 
1679
 
1680
+ <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication operation in EP is the <code>all-to-all</code> operation that routes tokens to their assigned experts and gathers the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scale to a large number of experts.</p>
1681
+ <aside>For instance DeepSeek V3 uses 256 experts.</aside>
1682
 
1683
+ <img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1000px; max-width: none;" />
1684
+
1685
+ <div class="note-box">
1686
+ <p class="note-box-title">📝 Note</p>
1687
+ <div class="note-box-content">
1688
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
1689
+ </div>
1690
+ </div>
1691
+
1692
+ <p><strong>Scope and focus -</strong> Let's also quickly summarize the sub-parts of the model where these different parallelism strategies have the most impact:</p>
1693
 
1694
  <ul>
1695
+ <li>Tensor Parallelism (and Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
1696
  <li>Context Parallelism primarily impacts attention layers since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
1697
 <li>Expert Parallelism primarily affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged.</li>
1698
+ <li>Pipeline Parallelism and ZeRO are not especially specific to any sub-module or component, with the exception that modules and layers need to be balanced in Pipeline Parallelism; the first and last layers are thus often treated differently due to the additional embedding layers.</li>
1699
  </ul>
1700
 
 
 
 
 
 
 
 
 
 
 
1701
  <table>
1702
  <thead>
1703
  <tr>
 
1730
  </tbody>
1731
  </table>
1732
 
1733
+ <p><strong>Summarizing it all -</strong> Now, what about gathering all the techniques we've seen into a single diagram that combines them all? Yes, we're up for the challenge!</p>
1734
+ <p>In this summary diagram, you will find illustrated the activations and modules of a single transformer layer, in its MoE variant. We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
1735
 
1736
  <p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1000px; max-width: none;"/></p>
1737
 
1738
+ <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence lengths as well as with selective (top) and full (bottom) recomputation so you can see how they all play with activations:</p>
1739
 
1740
  <img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1000px; max-width: none;"/>
1741
 
1742
  <p>Let's finish this section with a high-level view of all of these techniques, their main underlying ideas, and their major bottlenecks:</p>

  <table>
  <thead>
  <tr>
  <th><strong>Method</strong></th>
  <th><strong>Memory savings apply specifically to</strong></th>
  <th><strong>Parallel/sharding dimension</strong></th>
  <th><strong>Disadvantage</strong></th>
  </tr>
  </thead>
  <tbody>
  <tr>
  <td>DP</td>
  <td>Activations (reduce local batch size)</td>
  <td>Batch</td>
  <td>Limited by max batch size</td>
  </tr>
  <tr>
  <td>PP</td>
  <td>Model parameters</td>
  <td>Model layers</td>
  <td>Idle bubble and complex schedules</td>
  </tr>
  <tr>
  <td>TP/SP</td>
  <td>Model parameters and activations</td>
  <td>Hidden dimension / Sequence length</td>
  <td>Requires high-bandwidth communication</td>
  </tr>
  <tr>
  <td>CP</td>
  <td>Activations</td>
  <td>Sequence length</td>
  <td>Adds communication overhead in attention modules</td>
  </tr>
  <tr>
  <td>EP</td>
  <td>Expert parameters</td>
  <td>Expert dimension</td>
  <td>Requires MoE layers, adds routing communication overhead</td>
  </tr>
  <tr>
  <td>ZeRO-1</td>
  <td>Optimizer states</td>
  <td>Sharded among DP replicas</td>
  <td>Parameter communication overhead</td>
  </tr>
  <tr>
  <td>ZeRO-2</td>
  <td>Optimizer states and gradients</td>
  <td>Sharded among DP replicas</td>
  <td>Parameter communication overhead</td>
  </tr>
  <tr>
  <td>ZeRO-3</td>
  <td>Optimizer states, gradients, and model parameters</td>
  <td>Sharded among DP replicas</td>
  <td>Parameter communication overhead</td>
  </tr>
  </tbody>
  </table>
 
  <p>Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we come up with a few rules that help us find a good starting point for selecting and combining them? This will be the topic of our next section.</p>

  <h2>How to Find the Best Training Configuration</h2>

  <p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined together. One general question remains: which ones should we choose in the end, and how do we decide on a specific combination?</p>

  <p>We touched on this a little in the previous section, but let's now walk in detail through a possible decision process, step by step, keeping in mind that you'll always have to run a few experiments to find the definitive optimal setup for your compute cluster given its various physical properties: network bandwidth, GPUs per node, memory per GPU, etc.</p>

  <h3>Step 1: Fitting a Training Step in Memory</h3>

  <p>First, we need to figure out how we can fit a full model instance on our GPUs (we focus on a single model instance for now, even though we may already use DP for ZeRO sharding). There are two general cases.</p>
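  <p>Before walking through the two cases, a rough back-of-the-envelope check helps: estimate whether the model states fit once sharded by your chosen parallelism degrees. The sketch below is illustrative only (the function name and fixed byte counts are our assumptions): it accounts for roughly 16 bytes per parameter, i.e. bf16 weights and gradients plus fp32 master weights and Adam moments, and ignores activations, buffers, and fragmentation.</p>

  <pre><code class="language-python">
# Back-of-the-envelope memory check for model states only (activations ignored).
# Mixed-precision Adam: 2 bytes (bf16 weights) + 2 bytes (bf16 grads)
# + 12 bytes (fp32 master weights + two Adam moments) per parameter.
def model_states_gib_per_gpu(n_params, tp=1, pp=1, dp=1, zero_stage=0):
    params_b, grads_b, optim_b = 2.0, 2.0, 12.0    # bytes per parameter
    shard = tp * pp                                 # TP and PP both split the model states
    optim_shard = shard * (dp if zero_stage >= 1 else 1)
    grad_shard = shard * (dp if zero_stage >= 2 else 1)
    param_shard = shard * (dp if zero_stage >= 3 else 1)
    total_bytes = n_params * (params_b / param_shard
                              + grads_b / grad_shard
                              + optim_b / optim_shard)
    return total_bytes / 1024**3

# Example: 70B parameters with TP=8, PP=4, DP=8 and ZeRO-1
print(f"{model_states_gib_per_gpu(70e9, tp=8, pp=4, dp=8, zero_stage=1):.1f} GiB per GPU")
  </code></pre>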
  <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
  <ul>
  <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
  <ul>
  <li>Combining Tensor Parallelism (TP=8) with Pipeline Parallelism</li>
  <li>Combining Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
  <li>Using only ZeRO-3 (i.e. only pure Data Parallelism)</li>
  </ul>
  <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to become inefficient due to communication cost - it can be better to then combine DP with either Tensor or Pipeline Parallelism</li>
  <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
  </ul>
 
  <p>Special considerations:</p>
  <ul>
  <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
  <li>For Mixture of Experts architectures, it will be advantageous to use Expert Parallelism (EP) across nodes.</li>
  </ul>
 
  <p>GPU-poor case 😭 - when you might be low on GPU resources:</p>
  <ul>
  <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
  <li>You can increase gradient accumulation to process larger batches with limited memory.</li>
  </ul>
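  <p>To make these two levers concrete, here is a minimal sketch in plain PyTorch (illustrative only, not nanotron's training loop) that combines full activation recomputation via <code>torch.utils.checkpoint</code> with gradient accumulation:</p>

  <pre><code class="language-python">
# Minimal sketch: full activation recomputation + gradient accumulation in plain PyTorch.
import torch
from torch.utils.checkpoint import checkpoint

def train_step(model_blocks, lm_head, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    accum_steps = len(micro_batches)
    for inputs, targets in micro_batches:
        hidden = inputs
        # Recompute each block's activations during backward instead of storing them.
        for block in model_blocks:
            hidden = checkpoint(block, hidden, use_reentrant=False)
        loss = loss_fn(lm_head(hidden), targets) / accum_steps  # average over micro-batches
        loss.backward()                                         # gradients accumulate in .grad
    optimizer.step()
  </code></pre>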
 
  <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>

 
  <h3>Step 2: Achieving Target Global Batch Size</h3>

  <p>Depending on where Step 1 left us in terms of micro-batch size and DP, our current batch size might be too small or too big. It's now time to hit our target batch size.</p>

  <p>To increase our current global batch size:</p>
  <ul>
  <li>We can scale up Data Parallelism or gradient accumulation steps</li>
  <li>For long sequences, we can leverage Context Parallelism</li>
  </ul>

  <p>To decrease our current global batch size:</p>
  <ul>
  <li>We can reduce Data Parallelism in favor of other parallelization strategies</li>
  <li>For long sequences, we can reduce Context Parallelism</li>
  </ul>
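  <p>As a quick reminder of the bookkeeping involved, the global batch size is simply the product of the micro-batch size, the gradient accumulation steps, and the data-parallel degree (in samples; multiply by the sequence length to count tokens). A small illustrative helper:</p>

  <pre><code class="language-python">
# Global batch size bookkeeping (in samples; multiply by sequence length for tokens).
def global_batch_size(micro_batch_size, grad_accum_steps, dp):
    return micro_batch_size * grad_accum_steps * dp

# Example: hit a ~4M-token target at sequence length 4096
mbs, grad_acc, dp, seq_len = 2, 4, 128, 4096
gbs_tokens = global_batch_size(mbs, grad_acc, dp) * seq_len
print(gbs_tokens)  # 1024 samples -> 4,194,304 tokens
  </code></pre>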
 
  <p>Ok, now we have the model running in the general configuration we want in terms of model size and batch size, but are we training it the fastest way? Let's now start to optimize throughput as much as possible.</p>

 
  <h3>Step 3: Optimizing Training Throughput</h3>

  <p>So we want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks we can try the following:</p>

  <ul>
  <li>Scale up Tensor Parallelism (using the fast intra-node bandwidth) until we reach a degree close to the node size, so that we can reduce other parallelism</li>
  <li>Increase Data Parallelism with ZeRO-3 while keeping the target batch size</li>
  <li>When Data Parallelism communication starts to become a bottleneck, transition to using Pipeline Parallelism</li>
  <li>Try scaling up different parallelisms one by one</li>
  <li>Experiment with several micro-batch sizes (mbs) to aim for an optimal balance between max GBS, model size, compute, and communication (see the small search sketch after this list).</li>
  </ul>
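  <p>One simple way to organize these experiments is a brute-force sweep over the parallelism degrees that divide your cluster size, briefly benchmarking each candidate. The sketch below is purely illustrative: <code>benchmark_tokens_per_sec</code> is a placeholder you would replace with a short profiling run of your own training setup.</p>

  <pre><code class="language-python">
# Illustrative sweep over parallelism configurations; benchmark_tokens_per_sec is a
# placeholder for a short real profiling run on the cluster.
from itertools import product

def best_config(world_size, target_gbs, seq_len, benchmark_tokens_per_sec):
    best = None
    for tp, pp, mbs in product([1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4]):
        if world_size % (tp * pp) != 0:
            continue                                 # degrees must tile the cluster
        dp = world_size // (tp * pp)
        if target_gbs % (mbs * dp) != 0:
            continue                                 # keep grad_acc an integer
        grad_acc = target_gbs // (mbs * dp)
        throughput = benchmark_tokens_per_sec(tp=tp, pp=pp, dp=dp,
                                              mbs=mbs, grad_acc=grad_acc,
                                              seq_len=seq_len)
        if best is None or throughput > best[0]:
            best = (throughput, dict(tp=tp, pp=pp, dp=dp, mbs=mbs, grad_acc=grad_acc))
    return best
  </code></pre>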
 
  <!-- <p>We can roughly summarize the journey to the best configuration in the following diagram:</p>

  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
  -->

  <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs, there is another way to improve model throughput and memory management: understanding how GPUs operate at a low level. This knowledge is necessary to take maximal advantage of large GPU clusters.</p>

  <p>Time to turn the lights off and activate CUDA mode! </p>