thomwolf (HF staff) committed · verified
Commit 2cb0f7e · 1 Parent(s): b6b4552

continuing work on updates (#44)


- update (8e45c5c44a5832fdcf46078789ff4bd2e08afb28)

assets/.DS_Store DELETED
Binary file (6.15 kB)
 
dist/assets/.DS_Store DELETED
Binary file (6.15 kB)
 
dist/index.html CHANGED
@@ -1352,7 +1352,7 @@
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
- <p>In the TP section we saw that if we try to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) we hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we perform it across several nodes:</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
@@ -1366,9 +1366,11 @@
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
- <p>Sequence and context parallelism can help for long sequences but don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large model (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
- <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of a 8B model:</p>
 
 
1372
 
1373
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1374
  <script>
@@ -1380,23 +1382,23 @@
1380
  </script>
1381
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1382
 
1383
- <p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
1384
 
1385
- <p>This introduces a new type of communication pattern: instead of communicating parameters like in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>
1386
 
1387
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1388
 
1389
  <p>So, let’s say we simply spread the layers on several devices, e.g. a first GPU will take the first few layers and a second GPU will take the second part of the models and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
1390
 
1391
- <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of location along the model depth. This is a huge difference e.g. compared to the communication in Tensor Parallelism, happening several times within each layer.</p>
1392
 
1393
- <p>But maybe you start feeling a glimpse of the troubles to come: sequentially and successively”?!? This doesn’t sound very efficient in the world of parallel computation, especially after our discussion about computation and communication overlap.</p>
1394
 
1395
- <p>Indeed reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPU busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization is looking when doing a naive and simple forward and backward pass through the model where the numbers indicate the model layers:</p>
1396
 
1397
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1398
- <p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1399
-
1400
  <p>The remaining idle time is indicated in grey and usually called the “bubble” and the sight of this probably break your heart after we spent so much time optimizing throughput.</p>
1401
 
1402
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we loose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can count on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e the number of GPU on the above graph) ie. the time each GPU is waiting while other GPUs are computing.</p>
@@ -1408,8 +1410,8 @@
1408
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1409
  </d-math>
1410
 
1411
- <p>As we add more stages the bubble time thus increases and the utilization drops.</p>
1412
- <p>Thankfully, various pipeline parallelism schemes have been designed to reduce the size of the bubble which as you can see on this naive example can be very large in a naive implementation.</p>
1413
 
1414
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bit-sized portions which can be processed in parallel or almost, like we did before in data parallel for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1415
 
@@ -1417,7 +1419,7 @@
1417
 
1418
  <aside>Before the numbers in the diagram indicated the layers but in all pipeline parallel plots from now including this one it indicates a microbatch. You can think of each square here to contain several layers as seen in the previous figure. </aside>
1419
 
1420
- <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
1421
 
1422
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1423
 
@@ -1448,15 +1450,13 @@
1448
 
1449
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1450
 
1451
- <p>The bubble still has the same size so our training efficiency is not significantly improved. However we only need to store activations for <d-math>p</d-math> micro-batches instead of <d-math>m</d-math> which quite reduce the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches which then will actually reduce the bubble.</p>
1452
 
1453
- <p>A major complexity of this setup, visible on the above graph is how forward and backward passes are not cleanly consecutive anymore but performed in parallel across devices. This means we will have to schedule the switch from forward to backward passes independently on each device instead of in a simple and common central training loop as usual.</p>
1454
 
1455
  <p>This is one of the reason implementing Pipeline Parallelism usually requires rather extensive modifications to training code as well as modeling code.</p>
1456
 
1457
- <p>Here is the example training loop from the above gist:</p>
1458
-
1459
- <p>You can find the full implementation in picotron as well:</p>
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -1467,28 +1467,31 @@
1467
  </div>
1468
  </details>
1469
 
1470
- <p>Let's look at how the 1F1B Pipeline Parallelism schedule scales in practice:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
- <p>On the left, with microbatches equal to PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance drops significantly as we scale PP. The right plot shows that using many more microbatches than PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps reduce this effect. However, we can't maintain this ratio of <d-math>m \gg p - 1</d-math> indefinitely since we're ultimately constrained by our target global batch size - as we add more PP degree, we're increasing the bubble size according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
1475
 
1476
- <p>Interestingly, when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>), the performance only drops by 14% - a much better scaling than Tensor Parallelism which typically sees around 43% performance degradation in similar cross-node scenarios. This makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
- <p>While 1F1B significantly reduces our activation memory footprint, the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
- <p>The 1F1B schedule has let us improved memory usage but not much the size of the idle buddle. Can we also also reduce the time spent in the bubble?</p>
1483
 
1484
- <p>Well it turns out this is possible if we are willing to bring in a few additional communications. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
- <p>Up to now we’ve sliced our model naively along the model depth dimensions, locating for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
1487
 
1488
- <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it goes through the forward pass through the model.</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
 
 
 
1492
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously just took one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPUs as we are able to better interleave forward and backward passes. </p>
1493
 
1494
 
@@ -1513,32 +1516,42 @@
1513
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1514
 
1515
 
1516
- <p>Scheduling also becomes more complex here as we need to decide on a GPU whether we are prioritizing at a given moment earlier micro-batches meaning that we close the forward and backward loops as fast as possible (so called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) or we prioritize to first complete the forward passes of all microbatches in the queue before going over to backward passes (so called “breadth-first” i.e. prioritizing filling in the pipeline as much as possible). This is explained in detail in the "Breadth-Fist Pipeline" paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1517
 
1518
- <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1 which is using a one-forward-one-backward setup with interleaved stages and a priority setting tuneable between depth-first and bread-first.</p>
1519
 
1520
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1521
 
1522
- <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to reduce the bubble to virtually zero! Peaked your curiosity? Let’s have a look!</p>
1523
 
1524
  <h3>Zero Bubble and DualPipe</h3>
1525
 
1526
- <p>There are even more sophisticated ways to reduce the bubble more and reached close to a “zero bubble” regime. The secret here is to split at an even finer-grained level the operations involved in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe reach close to a zero bubble regime.</p>
1527
-
1528
- <p>Let’s very quickly see how this can work by detailing briefly the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work which is a precursor to DualPipe. The base observation of ZeroBubble is that a backward through a matrix multiplication involve actually two separated operations: backward for the inputs (B) and the backward for the weights (W):</p>
 
 
1529
 
 
 
1530
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
1531
- <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1532
 
1533
- <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only need to be performed before the optimiser step. This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule on the top right is an example of (theoretical) schedule with zero bubble taking advantage for this fine-grained decomposition.</p>
1534
 
1535
- <p>DeepSeek’s DualPipe introduced with V3 proposes an extension of this decomposed approach to the case of two stream propagating from both sides of the PP ranks and being interleaved to minimize even further idle time in the GPUs are displayed in the following scheduling graph:</p>
 
 
 
 
 
1536
 
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1538
 
1539
- <p>The ZeroBubble and DualPipe schedules are a bit too complex for us to give here code snippets but you should start to have a general idea of the concepts involved. In practice, optimizing these schedules requires careful measurements of the time for each operations followed by a scheduling algorithm able to find the most optimal allocation of time given the constrains. See for instance in the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms to perform such a scheduling.</p>
1540
 
1541
- <p>This concludes our tour into the world of pipeline schedules and bubbles. Let's turn to the last parallelism method we can use to train large models efficiently: Expert parallelism.</p>
 
 
1542
 
1543
  <h2>Expert parallelism</h2>
1544
  <p>One more <s>thing</s> parallelism.</p>
@@ -2283,7 +2296,7 @@
2283
 
2284
  <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
2285
 
2286
- <iframe class="l-body-outset" id="plotFP8Loss" src="assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
2287
 
2288
  <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
2289
 
 
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
+ <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor Parallelism past the number of GPUs per node (typically 4 or 8) forces us onto a lower-bandwidth network called the “inter-node connection”, which can quite strongly impair our performance. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
 
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
+ <p>Sequence and context parallelism can help for long sequences but don’t help much if the root cause of our memory issues is not the sequence length but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
+ <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's see the effect of Pipeline Parallelism in action on the memory usage for an 8B model:</p>
1372
+
1373
+ <aside>This technique may remind you of our discussion on <a target="_self" href="#zero-redundancy-optimizer">ZeRO-3</a> where we split the model parameters across GPUs. We compare both techniques in detail later in the <a target="_self" href="#5d_parallelism_in_a_nutshell">5D parallelism in a nutshell</a> section.</aside>
1374
 
1375
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1376
  <script>
 
1382
  </script>
1383
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1384
 
1385
+ <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers will be sent to the next GPU to continue the forward pass.</p>
1386
 
1387
+ <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
1388
 
1389
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1390
 
1391
  <p>So, let’s say we simply spread the layers across several devices, e.g. a first GPU will take the first few layers, a second GPU will take the second part of the model, and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
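  <p>As a minimal illustration of this naive split (a toy sketch, not the picotron implementation, assuming two GPUs are available), we can place contiguous blocks of layers on different devices and simply move the activation tensor from one device to the next during the forward pass:</p>

  <d-code block language="python">
import torch
from torch import nn

# Toy model: 8 layers split contiguously across 2 (assumed) devices.
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)])
devices = ["cuda:0", "cuda:1"]
layers_per_stage = len(layers) // len(devices)

for i, layer in enumerate(layers):
    layer.to(devices[i // layers_per_stage])   # layers 0-3 on cuda:0, layers 4-7 on cuda:1

def pipeline_forward(x):
    # Only the activation tensor travels between devices, never the weights.
    for i, layer in enumerate(layers):
        x = x.to(devices[i // layers_per_stage])
        x = layer(x)
    return x

out = pipeline_forward(torch.randn(32, 1024, device=devices[0]))
  </d-code>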
1392
 
1393
+ <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of locations along the model depth. This can make a huge difference versus e.g. the communications in Tensor Parallelism, which happen several times within each layer.</p>
1394
 
1395
+ <p>But maybe you are starting to get a glimpse of the troubles to come: <strong>“sequentially”</strong> and <strong>“successively”</strong>?!? This doesn’t sound very efficient in the world of parallel computations, especially after our discussion on computation and communication overlap.</p>
1396
 
1397
+ <p>Indeed reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPUs busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization looks when doing a naive and simple forward and backward pass through the model (here the numbers indicate the model layers):</p>
1398
 
1399
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1400
+ <div class="figure-legend"><p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1401
+ </div>
1402
  <p>The remaining idle time is indicated in grey and usually called the “bubble”, and the sight of it probably breaks your heart after we spent so much time optimizing throughput.</p>
1403
 
1404
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we loose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can count on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e the number of GPU on the above graph) ie. the time each GPU is waiting while other GPUs are computing.</p>
 
1410
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1411
  </d-math>
1412
 
1413
+ <p>As we add more stages the bubble time thus increases and the utilization drops. As we can see, the bubble can be very large in a naive implementation!</p>
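  <p>To put a number on this: with the <d-math>p=4</d-math> GPUs of the figure above and <d-math>t_b \approx 2 \times t_f</d-math>, the bubble adds <d-math>t_{pb}=(p-1)*(t_f+t_b)=9 \times t_f</d-math> of idle time against only <d-math>t_{id}=t_f+t_b=3 \times t_f</d-math> of useful compute, i.e. <d-math>r_{bubble}=3</d-math>: each GPU spends three times longer waiting than computing.</p>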
1414
+ <p>Thankfully, various pipeline parallelism schemes have been designed to <strong>reduce the size of the bubble</strong>.</p>
1415
 
1416
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bite-sized portions which can be processed in parallel (or almost), like we did before in data parallelism for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1417
 
 
1419
 
1420
  <aside>Previously the numbers in the diagram indicated layers, but in all pipeline parallel plots from now on, including this one, the numbers indicate a microbatch. You can think of each square here as containing several layers, as seen in the previous figure. </aside>
1421
 
1422
+ <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential, so we're preserving the general organization of our model training code. This makes AFAB one of the simplest pipeline parallelism schedules to implement.</p>
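  <p>For intuition, here is roughly what the AFAB scheduling boils down to on a single pipeline rank (a simplified sketch with hypothetical send/recv helpers, not the actual picotron code): all forward passes first, keeping the activations around, then all backward passes:</p>

  <d-code block language="python">
import torch

def train_step_afab(model_stage, num_microbatches,
                    recv_forward, send_forward, recv_backward, send_backward):
    # All-forward: run every microbatch through this stage and keep the
    # inputs/outputs alive for the later backward passes.
    saved = []
    for _ in range(num_microbatches):
        x = recv_forward()          # activations from the previous stage (or the dataloader)
        x.requires_grad_(True)      # so we can later get the gradient w.r.t. this input
        out = model_stage(x)
        send_forward(out)           # hand activations to the next stage
        saved.append((x, out))

    # All-backward: propagate gradients back through the saved activations.
    for x, out in reversed(saved):
        grad_out = recv_backward()  # gradient coming back from the next stage
        torch.autograd.backward(out, grad_tensors=grad_out)
        send_backward(x.grad)       # gradient w.r.t. this stage's input, sent upstream
  </d-code>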
1423
 
1424
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1425
 
 
1450
 
1451
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1452
 
1453
+ <p>If you count carefully you'll see that the bubble still has the same size so our training efficiency is not significantly improved. However we only need to store activations for <d-math>p</d-math> micro-batches (where <d-math>p</d-math> is the degree of pipeline parallelism) instead of <d-math>m</d-math> (where <d-math>m</d-math> was the number of microbatches) which can reduce the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches which then will actually reduce the bubble.</p>
1454
 
1455
+ <p>A major complexity of this setup, visible on the above graph, is how forward and backward passes are no longer cleanly sequential but are performed in parallel across devices and interleaved. This means we will have to schedule the switch from forward to backward passes independently on each device instead of in a simple and common central training loop as usual.</p>
1456
 
1457
  <p>This is one of the reasons why implementing Pipeline Parallelism usually requires rather extensive modifications to training code as well as modeling code.</p>
1458
 
1459
+ <p>You can find a full implementation of 1F1B in picotron as well:</p>
 
 
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
1467
  </div>
1468
  </details>
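  <p>Schematically, the per-rank logic of 1F1B boils down to something like the following condensed sketch (hypothetical run_forward/run_backward helpers, communication omitted; see the picotron implementation above for the real thing):</p>

  <d-code block language="python">
def train_step_1f1b(model_stage, microbatches, rank, pp_size, run_forward, run_backward):
    num_microbatches = len(microbatches)
    # Later stages need fewer warmup forward passes before they can start 1F1B.
    num_warmup = min(pp_size - rank - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    pending = []  # activations waiting for their backward pass (at most ~p at a time)

    # Warmup: forward-only passes that fill the pipeline.
    for i in range(num_warmup):
        pending.append(run_forward(model_stage, microbatches[i]))

    # Steady state: alternate one forward and one backward pass.
    for i in range(num_steady):
        pending.append(run_forward(model_stage, microbatches[num_warmup + i]))
        run_backward(model_stage, pending.pop(0))

    # Cooldown: drain the remaining backward passes.
    while pending:
        run_backward(model_stage, pending.pop(0))
  </d-code>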
1469
 
1470
+ <p>Let's take a look at how the 1F1B Pipeline Parallelism schedule scales in practice with some benchmarks on our cluster:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
+ <p>On the left, with a number of microbatches equal to –or less than– PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance is low and even drops as we scale PP. The right plot shows that using many more microbatches than PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps improve performance at low PP degree while still staying limited at very large PP degree. In practice it's not possible to arbitrarily increase the number of microbatches to maintain the ratio <d-math>m \gg p - 1</d-math> since we're ultimately constrained by the target global batch size. Once we reach the maximal possible number of microbatches, adding more PP degree will therefore increase the bubble size according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
1475
 
1476
+ <p>Interestingly, at a small number of micro-batches the performance only drops by 14% when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>) - a much better scaling than Tensor Parallelism, which typically sees around 43% performance degradation in similar cross-node scenarios. This type of behavior when hitting the lower-bandwidth inter-node network makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
+ <p>While 1F1B significantly reduces our activation memory footprint, we see on this last graph that the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
+ <p>The 1F1B schedule has let us improve memory usage but not so much the size of the idle bubble. Is there any way we can push this frontier further?</p>
1483
 
1484
+ <p>Well it turns out this is possible if we are willing to bring in a few additional communication operations. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
+ <p>Up to now we’ve sliced our model naively along the model depth dimension, hosting for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
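  <p>To make the interleaved placement concrete, here is a tiny (hypothetical) helper computing which GPU hosts which layer when each GPU holds <d-math>v</d-math> model chunks - with <d-math>v=1</d-math> we recover the plain contiguous split from before:</p>

  <d-code block language="python">
def assign_layers(num_layers, pp_size, v):
    """Interleaved split: the model is cut into pp_size * v chunks of
    consecutive layers, dealt out round-robin to the pp_size ranks."""
    chunk_size = num_layers // (pp_size * v)   # assumes an even split
    assignment = {}
    for layer in range(num_layers):
        chunk = layer // chunk_size
        assignment[layer] = chunk % pp_size
    return assignment

# 8 layers, 2 GPUs, 2 chunks per GPU:
# GPU 0 hosts layers 0-1 and 4-5, GPU 1 hosts layers 2-3 and 6-7.
# (The odd/even example above corresponds to assign_layers(8, 2, 4).)
print(assign_layers(8, 2, 2))
  </d-code>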
1487
 
1488
+ <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it goes through the model during the forward pass. Let's take a graphical look at how this works:</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
1492
+ <div class="figure-legend"><p>An example of interleaved pipeline parallelism for a model with layers distributed across 4 GPUs. Numbers still correspond to the microbatch IDs but for clarity we've colored the first and the last layers of the model differently to illustrate how layers are spread across GPUs.</p>
1493
+ </div>
1494
+
1495
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously just took one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPU, as we are able to better interleave forward and backward passes. </p>
1496
 
1497
 
 
1516
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1517
 
1518
 
1519
+ <p>Scheduling also becomes more complex here as we have to decide, on a given GPU and at a given moment, whether we prioritize earlier micro-batches going through later layers – meaning that we close the forward and backward loops as fast as possible (so-called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) – or whether we first prioritize later micro-batches going through earlier layers (so-called “breadth-first”, i.e. prioritizing filling in the pipeline as much as possible). This choice is explained in detail in the nice "Breadth-First Pipeline" paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1520
 
1521
+ <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1 which is using a one-forward-one-backward setup with interleaved stages and a priority setting tuneable between depth-first and breadth-first.</p>
1522
 
1523
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1524
 
1525
+ <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to <strong>reduce the bubble to virtually zero</strong>! These techniques were for instance used in the DeepSeek V3/R1 implementation<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. Piqued your curiosity? Let’s have a final quick look at these magical schedules before we leave the world of Pipeline Parallelism!</p>
1526
 
1527
  <h3>Zero Bubble and DualPipe</h3>
1528
 
1529
+ <p>Even more sophisticated ways to reduce the bubble have recently been proposed which reach close to a “zero bubble” regime. The secret here is to split the operations involved at an even finer-grained level in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe, reaches close to a zero bubble regime.</p>
1530
+
1531
+ <aside>Ultimate "flex" in the DeepSeek V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, where the authors indicate that their setup "achiev[ed] a near-zero all-to-all communication overhead".</aside>
1532
+
1533
+ <p>Let’s briefly see how this can work by summarizing the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work which is a precursor to DualPipe. The base observation of ZeroBubble is that the backward pass through a matrix multiplication actually involves two separate operations: the backward operation for the inputs (B) and the backward operation for the weights (W):</p>
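  <p>For a single linear layer <d-math>y = x W</d-math> these two operations are just two independent matrix multiplications, as the following small sketch illustrates (an illustrative example, not the ZeroBubble or DualPipe code):</p>

  <d-code block language="python">
import torch

x = torch.randn(16, 1024)            # input activations of the layer
W = torch.randn(1024, 1024)          # layer weights
grad_output = torch.randn(16, 1024)  # gradient arriving from the layer above

# B: backward w.r.t. the input - needed right away so the gradient
# can keep flowing down to the lower layers.
grad_input = grad_output @ W.t()

# W: backward w.r.t. the weights - only needed before the optimizer
# step, so it can be scheduled later to fill pipeline bubbles.
grad_weight = x.t() @ grad_output
  </d-code>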
1534
 
1535
+ <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only needs to be performed before the optimiser step. We can see that in the following diagram: </p>
1536
+
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
 
1538
 
1539
+ <p>This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule (the last schedule in the figure below) is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1540
 
1541
+ <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1542
+
1543
+ <div class="figure-legend"><p>On the top (Figure 2 from the ZeroBubble paper): the classical 1F1B schedule, interleaving forward and backward passes but keeping a coarse-grained backward pass. On the bottom two graphs (Figure 3 from the ZeroBubble paper): two variants of the ZeroBubble schedule, splitting the backward pass into finer-grained "B" and "W" operations. The last schedule, the so-called "ZB-H2", is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1544
+ </div>
1545
+
1546
+ <p>DeepSeek’s DualPipe, introduced with its V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, extends this decomposed approach to the additional case of two streams propagating from both ends of the PP dimension, with these streams interleaved to minimize idle time on the GPUs even further. This schedule is displayed in the following scheduling graph and is even more complex than the previous ones:</p>
1547
 
1548
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1549
 
1550
+ <p>In general, fully optimizing such complex schedules involves carefully measuring the duration of the various fine-grained operations and solving an ILP to minimize the final bubble time. See for instance the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms used to perform such scheduling. As a result, the ZeroBubble and DualPipe schedules are too complex for us to give code snippets here, but you should by now have a general idea of the concepts involved. </p>
1551
 
1552
+ <p>This concludes our tour into the world of pipeline schedules and bubbles. We hope you enjoyed this guided tour!</p>
1553
+
1554
+ <p>It's now time to turn to the last parallelism method we'll detail and which we can use to train large models efficiently: <strong>Expert parallelism</strong>.</p>
1555
 
1556
  <h2>Expert parallelism</h2>
1557
  <p>One more <s>thing</s> parallelism.</p>
 
2296
 
2297
  <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
2298
 
2299
+ <iframe class="l-body-outset" id="plotFP8Loss" src="/assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
2300
 
2301
  <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
2302
 
dist/style.css CHANGED
@@ -414,6 +414,13 @@ d-article {
414
  font-size: 1.0em;
415
  }
416
 
 
 
 
 
 
 
 
417
  d-code {
418
  font-size: 12px;
419
  }
 
414
  font-size: 1.0em;
415
  }
416
 
417
+ .figure-legend {
418
+ font-size: 0.9em;
419
+ font-style: italic;
420
+ color: var(--distill-gray);
421
+ line-height: 1.5em;
422
+ }
423
+
424
  d-code {
425
  font-size: 12px;
426
  }
src/index.html CHANGED
@@ -1352,7 +1352,7 @@
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
- <p>In the TP section we saw that if we try to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) we hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we perform it across several nodes:</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
@@ -1366,9 +1366,11 @@
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
- <p>Sequence and context parallelism can help for long sequences but don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large model (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
- <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of a 8B model:</p>
 
 
1372
 
1373
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1374
  <script>
@@ -1380,23 +1382,23 @@
1380
  </script>
1381
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1382
 
1383
- <p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
1384
 
1385
- <p>This introduces a new type of communication pattern: instead of communicating parameters like in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>
1386
 
1387
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1388
 
1389
  <p>So, let’s say we simply spread the layers on several devices, e.g. a first GPU will take the first few layers and a second GPU will take the second part of the models and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
1390
 
1391
- <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of location along the model depth. This is a huge difference e.g. compared to the communication in Tensor Parallelism, happening several times within each layer.</p>
1392
 
1393
- <p>But maybe you start feeling a glimpse of the troubles to come: sequentially and successively”?!? This doesn’t sound very efficient in the world of parallel computation, especially after our discussion about computation and communication overlap.</p>
1394
 
1395
- <p>Indeed reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPU busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization is looking when doing a naive and simple forward and backward pass through the model where the numbers indicate the model layers:</p>
1396
 
1397
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1398
- <p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1399
-
1400
  <p>The remaining idle time is indicated in grey and usually called the “bubble” and the sight of this probably break your heart after we spent so much time optimizing throughput.</p>
1401
 
1402
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we loose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can count on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e the number of GPU on the above graph) ie. the time each GPU is waiting while other GPUs are computing.</p>
@@ -1408,8 +1410,8 @@
1408
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1409
  </d-math>
1410
 
1411
- <p>As we add more stages the bubble time thus increases and the utilization drops.</p>
1412
- <p>Thankfully, various pipeline parallelism schemes have been designed to reduce the size of the bubble which as you can see on this naive example can be very large in a naive implementation.</p>
1413
 
1414
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bit-sized portions which can be processed in parallel or almost, like we did before in data parallel for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1415
 
@@ -1417,7 +1419,7 @@
1417
 
1418
  <aside>Before the numbers in the diagram indicated the layers but in all pipeline parallel plots from now including this one it indicates a microbatch. You can think of each square here to contain several layers as seen in the previous figure. </aside>
1419
 
1420
- <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
1421
 
1422
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1423
 
@@ -1448,15 +1450,13 @@
1448
 
1449
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1450
 
1451
- <p>The bubble still has the same size so our training efficiency is not significantly improved. However we only need to store activations for <d-math>p</d-math> micro-batches instead of <d-math>m</d-math> which quite reduce the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches which then will actually reduce the bubble.</p>
1452
 
1453
- <p>A major complexity of this setup, visible on the above graph is how forward and backward passes are not cleanly consecutive anymore but performed in parallel across devices. This means we will have to schedule the switch from forward to backward passes independently on each device instead of in a simple and common central training loop as usual.</p>
1454
 
1455
  <p>This is one of the reason implementing Pipeline Parallelism usually requires rather extensive modifications to training code as well as modeling code.</p>
1456
 
1457
- <p>Here is the example training loop from the above gist:</p>
1458
-
1459
- <p>You can find the full implementation in picotron as well:</p>
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -1467,28 +1467,31 @@
1467
  </div>
1468
  </details>
1469
 
1470
- <p>Let's look at how the 1F1B Pipeline Parallelism schedule scales in practice:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
- <p>On the left, with microbatches equal to PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance drops significantly as we scale PP. The right plot shows that using many more microbatches than PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps reduce this effect. However, we can't maintain this ratio of <d-math>m \gg p - 1</d-math> indefinitely since we're ultimately constrained by our target global batch size - as we add more PP degree, we're increasing the bubble size according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
1475
 
1476
- <p>Interestingly, when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>), the performance only drops by 14% - a much better scaling than Tensor Parallelism which typically sees around 43% performance degradation in similar cross-node scenarios. This makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
- <p>While 1F1B significantly reduces our activation memory footprint, the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
- <p>The 1F1B schedule has let us improved memory usage but not much the size of the idle buddle. Can we also also reduce the time spent in the bubble?</p>
1483
 
1484
- <p>Well it turns out this is possible if we are willing to bring in a few additional communications. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
- <p>Up to now we’ve sliced our model naively along the model depth dimensions, locating for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
1487
 
1488
- <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it goes through the forward pass through the model.</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
 
 
 
1492
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously just took one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPUs as we are able to better interleave forward and backward passes. </p>
1493
 
1494
 
@@ -1513,32 +1516,42 @@
1513
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1514
 
1515
 
1516
- <p>Scheduling also becomes more complex here as we need to decide on a GPU whether we are prioritizing at a given moment earlier micro-batches meaning that we close the forward and backward loops as fast as possible (so called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) or we prioritize to first complete the forward passes of all microbatches in the queue before going over to backward passes (so called “breadth-first” i.e. prioritizing filling in the pipeline as much as possible). This is explained in detail in the "Breadth-Fist Pipeline" paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1517
 
1518
- <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1 which is using a one-forward-one-backward setup with interleaved stages and a priority setting tuneable between depth-first and bread-first.</p>
1519
 
1520
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1521
 
1522
- <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to reduce the bubble to virtually zero! Peaked your curiosity? Let’s have a look!</p>
1523
 
1524
  <h3>Zero Bubble and DualPipe</h3>
1525
 
1526
- <p>There are even more sophisticated ways to reduce the bubble more and reached close to a “zero bubble” regime. The secret here is to split at an even finer-grained level the operations involved in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe reach close to a zero bubble regime.</p>
1527
-
1528
- <p>Let’s very quickly see how this can work by detailing briefly the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work which is a precursor to DualPipe. The base observation of ZeroBubble is that a backward through a matrix multiplication involve actually two separated operations: backward for the inputs (B) and the backward for the weights (W):</p>
 
 
1529
 
 
 
1530
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
1531
- <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1532
 
1533
- <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only need to be performed before the optimiser step. This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule on the top right is an example of (theoretical) schedule with zero bubble taking advantage for this fine-grained decomposition.</p>
1534
 
1535
- <p>DeepSeek’s DualPipe introduced with V3 proposes an extension of this decomposed approach to the case of two stream propagating from both sides of the PP ranks and being interleaved to minimize even further idle time in the GPUs are displayed in the following scheduling graph:</p>
 
 
 
 
 
1536
 
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1538
 
1539
- <p>The ZeroBubble and DualPipe schedules are a bit too complex for us to give here code snippets but you should start to have a general idea of the concepts involved. In practice, optimizing these schedules requires careful measurements of the time for each operations followed by a scheduling algorithm able to find the most optimal allocation of time given the constrains. See for instance in the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms to perform such a scheduling.</p>
1540
 
1541
- <p>This concludes our tour into the world of pipeline schedules and bubbles. Let's turn to the last parallelism method we can use to train large models efficiently: Expert parallelism.</p>
 
 
1542
 
1543
  <h2>Expert parallelism</h2>
1544
  <p>One more <s>thing</s> parallelism.</p>
 
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
+ <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor Parallelism past the number of GPUs per node (typically 4 or 8) forces us onto a lower-bandwidth network called the “inter-node connection”, which can quite strongly impair our performance. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
 
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
+ <p>Sequence and context parallelism can help for long sequences but don’t help much if the root cause of our memory issues is not the sequence length but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
+ <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's see the effect of Pipeline Parallelism in action on the memory usage for an 8B model:</p>
1372
+
1373
+ <aside>This technique may remind you of our discussion on <a target="_self" href="#zero-redundancy-optimizer">ZeRO-3</a> where we split the model parameters across GPUs. We compare both techniques in detail later in the <a target="_self" href="#5d_parallelism_in_a_nutshell">5D parallelism in a nutshell</a> section.</aside>
1374
 
1375
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1376
  <script>
 
1382
  </script>
1383
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1384
 
1385
+ <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers will be sent to the next GPU to continue the forward pass.</p>
1386
 
1387
+ <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
1388
 
1389
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1390
 
1391
  <p>So, let’s say we simply spread the layers across several devices, e.g. a first GPU will take the first few layers, a second GPU will take the second part of the model, and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
1392
 
1393
+ <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of locations along the model depth. This can make a huge difference versus e.g. the communications in Tensor Parallelism, which happen several times within each layer.</p>
1394
 
1395
+ <p>But maybe you are starting to catch a glimpse of the troubles to come: <strong>“sequentially”</strong> and <strong>“successively”</strong>?!? This doesn’t sound very efficient in the world of parallel computations, especially after our discussion on computation and communication overlap.</p>
1396
 
1397
+ <p>Indeed, reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPUs busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization looks when doing a naive and simple forward and backward pass through the model (here the numbers indicate the model layers):</p>
1398
 
1399
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1400
+ <div class="figure-legend"><p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1401
+ </div>
1402
  <p>The remaining idle time is indicated in grey and usually called the “bubble”, and the sight of it probably breaks your heart after we spent so much time optimizing throughput.</p>
1403
 
1404
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we lose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize, the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can see on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e. the number of GPUs on the above graph), i.e. the time during which each GPU is waiting while the other GPUs are computing. We can compute the ratio of this additional bubble time over the ideal time:</p>
 
1410
  <d-math block="">
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1411
  </d-math>
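<p>For instance, with <d-math>p = 4</d-math> stages and a single batch, each GPU computes for <d-math>t_f + t_b</d-math> but waits for <d-math>3 \times (t_f + t_b)</d-math> while the other stages work, so only a quarter of the available compute is actually used.</p>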
1412
 
1413
+ <p>As we add more stages the bubble time thus increases and the utilization drops. As we can see, the bubble can be very large in a naive implementation!</p>
1414
+ <p>Thankfully, various pipeline parallelism schemes have been designed to <strong>reduce the size of the bubble</strong>.</p>
1415
 
1416
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bite-sized portions which can be processed in parallel, or almost, like we did before with data parallelism for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1417
 
 
1419
 
1420
  <aside>In the previous figure the numbers indicated model layers, but from now on (this plot included) the numbers in all pipeline-parallel plots indicate microbatches. You can think of each square here as containing several layers, as seen in the previous figure.</aside>
1421
 
1422
+ <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all the forward passes and only then all the backward passes. The advantage is that forward and backward steps remain generally sequential, so we preserve the overall organization of our model training code, which makes this one of the simplest flavors of pipeline parallelism to implement.</p>
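<p>Schematically, an AFAB step on one pipeline stage looks like the following sketch (the <code>recv_forward</code>/<code>send_forward</code>/<code>recv_backward</code>/<code>send_backward</code> and <code>compute_loss</code> helpers are hypothetical placeholders for point-to-point communication and loss computation, not picotron's actual functions):</p>
<d-code block language="python">
import torch

def train_step_afab(model_stage, microbatches, targets, is_first_stage, is_last_stage):
    # Sketch of one all-forward-all-backward (AFAB) step on a single pipeline stage.
    inputs, outputs = [], []

    # Phase 1: all forward passes, one per microbatch.
    for mb in microbatches:
        x = mb if is_first_stage else recv_forward().requires_grad_(True)
        y = model_stage(x)
        if not is_last_stage:
            send_forward(y.detach())
        inputs.append(x)   # activations for *all* microbatches stay alive until phase 2
        outputs.append(y)

    # Phase 2: all backward passes.
    for i, (x, y) in enumerate(zip(inputs, outputs)):
        if is_last_stage:
            compute_loss(y, targets[i]).backward()
        else:
            torch.autograd.backward(y, grad_tensors=recv_backward())
        if not is_first_stage:
            send_backward(x.grad)
</d-code>
<p>Note how the activations of all the microbatches are kept in memory until the backward phase starts - this is the activation memory pressure that the next schedule will address.</p>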
1423
 
1424
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1425
 
 
1450
 
1451
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1452
 
1453
+ <p>If you count carefully you'll see that the bubble still has the same size, so our training efficiency is not significantly improved. However, we now only need to store activations for <d-math>p</d-math> micro-batches (where <d-math>p</d-math> is the degree of pipeline parallelism) instead of <d-math>m</d-math> (where <d-math>m</d-math> is the number of microbatches), which reduces the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches, which will then actually reduce the bubble.</p>
1454
 
1455
+ <p>A major complexity of this setup, visible on the above graph, is that forward and backward passes are no longer cleanly sequential but are instead performed in parallel and interleaved across devices. This means each device has to schedule its own switches between forward and backward passes instead of following a simple, common central training loop as usual.</p>
1456
 
1457
  <p>This is one of the reasons why implementing Pipeline Parallelism usually requires rather extensive modifications to the training code as well as the modeling code.</p>
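<p>The scheduling logic itself can still be sketched compactly. Here is a rough outline of a one-forward-one-backward (1F1B) step on one stage, assuming hypothetical <code>forward_microbatch</code>/<code>backward_microbatch</code> helpers that wrap the forward/backward passes and the associated point-to-point communications (again, not picotron's actual API):</p>
<d-code block language="python">
def train_step_1f1b(model_stage, microbatches, pp_rank, pp_size):
    # Sketch of the one-forward-one-backward (1F1B) schedule on a single pipeline stage.
    num_microbatches = len(microbatches)
    num_warmup = min(pp_size - pp_rank - 1, num_microbatches)  # forwards needed to fill the pipeline
    pending = []  # activations waiting for their backward pass (at most ~pp_size entries)

    # Warmup phase: forward passes only, until the pipeline is full.
    for i in range(num_warmup):
        pending.append(forward_microbatch(model_stage, microbatches[i]))

    # Steady state: one forward immediately followed by one backward.
    for i in range(num_warmup, num_microbatches):
        pending.append(forward_microbatch(model_stage, microbatches[i]))
        backward_microbatch(model_stage, pending.pop(0))

    # Cooldown phase: drain the remaining backward passes.
    while pending:
        backward_microbatch(model_stage, pending.pop(0))
</d-code>
<p>The <code>pending</code> list never holds more than roughly <d-math>p</d-math> sets of activations, which is exactly where the memory saving over AFAB comes from, while each stage now follows its own interleaved view of the schedule.</p>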
1458
 
1459
+ <p>You can find a full implementation of 1F1B in picotron as well:</p>
 
 
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
1467
  </div>
1468
  </details>
1469
 
1470
+ <p>Let's take a look at how the 1F1B Pipeline Parallelism schedule scales in practice with some benchmarks on our cluster:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
+ <p>On the left, with a number of microbatches equal to –or less than– the PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance is low and even drops as we scale PP. The right plot shows that using many more microbatches than the PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps improve performance at low PP degrees, while it still stays limited at very large PP degrees. In practice it's not possible to arbitrarily increase the number of microbatches to maintain the ratio <d-math>m \gg p - 1</d-math> since we're ultimately constrained by the target global batch size. With the number of microbatches capped, adding more PP degree will thus ultimately grow the bubble according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
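<p>To put a number on this: with <d-math>m = 32</d-math> microbatches, going to <d-math>p = 16</d-math> already gives <d-math>r_{bubble} = \frac{15}{32} \approx 0.47</d-math>, i.e. close to 50% of bubble overhead relative to the ideal compute time.</p>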
1475
 
1476
+ <p>Interestingly, at a small number of micro-batches the performance only drops by 14% when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>) - a much better scaling than Tensor Parallelism, which typically sees around 43% performance degradation in similar cross-node scenarios. This behavior when hitting the lower-bandwidth inter-node network makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
+ <p>While 1F1B significantly reduces our activation memory footprint, we see on this last graph that the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
+ <p>The 1F1B schedule has let us improve memory usage but did not do much for the size of the idle bubble. Is there any way we can still push this frontier?</p>
1483
 
1484
+ <p>Well it turns out this is possible if we are willing to bring in a few additional communication operations. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
+ <p>Up to now we’ve sliced our model naively along the model-depth dimension, hosting for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
1487
 
1488
+ <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it progresses through the forward pass of the model. Let's take a graphical look at how this works:</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
1492
+ <div class="figure-legend"><p>An example of interleaved pipeline parallelism for a model with layers distributed across 4 GPUs. Numbers still correspond to the microbatch IDs, but for clarity we've colored the first and the last layers of the model differently to illustrate how layers are spread across GPUs.</p>
1493
+ </div>
1494
+
1495
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously took just one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPU, as we are able to better interleave forward and backward passes. </p>
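<p>To make the layer-to-GPU mapping concrete, here is a small sketch of the kind of round-robin assignment used with interleaved stages (the exact mapping is an implementation choice, this is just an illustration):</p>
<d-code block language="python">
def interleaved_layer_assignment(num_layers: int, pp_size: int, v: int) -> dict:
    # Map each layer index to a (gpu_rank, model_chunk) pair.
    # E.g. with num_layers=16, pp_size=4 and v=2 chunks per GPU, GPU 0 hosts
    # layers [0, 1] (chunk 0) and [8, 9] (chunk 1), GPU 1 hosts [2, 3] and [10, 11], etc.
    layers_per_chunk = num_layers // (pp_size * v)
    assignment = {}
    for layer_id in range(num_layers):
        chunk = layer_id // (layers_per_chunk * pp_size)     # which "loop" around the pipeline
        gpu = (layer_id // layers_per_chunk) % pp_size       # which GPU inside that loop
        assignment[layer_id] = (gpu, chunk)
    return assignment
</d-code>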
1496
 
1497
 
 
1516
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1517
 
1518
 
1519
+ <p>Scheduling also becomes more complex here, as we have to decide on a given GPU and at a given moment whether we prioritize earlier micro-batches going through later layers – meaning that we close the forward and backward loops as fast as possible (so-called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) – or whether we prioritize later micro-batches going through earlier layers (so-called “breadth-first”, i.e. prioritizing filling in the pipeline as much as possible). This choice is explained in detail in the nice “Breadth-First Pipeline Parallelism” paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1520
 
1521
+ <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1, which uses a one-forward-one-backward setup with interleaved stages and a priority setting tunable between depth-first and breadth-first.</p>
1522
 
1523
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1524
 
1525
+ <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to <strong>reduce the bubble to virtually zero</strong>! These techniques were for instance used in the DeepSeek V3/R1 implementation<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. Piqued your curiosity? Let’s have a final quick look at these magical schedules before we leave the world of Pipeline Parallelism!</p>
1526
 
1527
  <h3>Zero Bubble and DualPipe</h3>
1528
 
1529
+ <p>Even more sophisticated ways to reduce the bubble have recently been proposed which reach close to a “zero bubble” regime. The secret here is to split the operations involved at an even finer-grained level in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe, reaches close to a zero-bubble regime.</p>
1530
+
1531
+ <aside>The ultimate "flex" can be found in the DeepSeek V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, where the authors indicate that their setup "achiev[ed] a near-zero all-to-all communication overhead".</aside>
1532
+
1533
+ <p>Let’s briefly see how this can work by summarizing the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work, which is a precursor to DualPipe. The base observation of ZeroBubble is that the backward pass through a matrix multiplication actually involves two separate operations: the backward operation for the inputs (B) and the backward operation for the weights (W):</p>
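<p>Concretely, for a linear layer computing <d-math>Y = XW</d-math>, these two pieces of the backward pass are:</p>
<d-math block="">
\underbrace{\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top}_{\text{B: needed by the earlier layers}} \qquad \underbrace{\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}}_{\text{W: only needed for the optimizer step}}
</d-math>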
1534
 
1535
+ <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only needs to be performed before the optimizer step. We can see this in the following diagram:</p>
1536
+
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
 
1538
 
1539
+ <p>This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule (the last one in the figure below) is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1540
 
1541
+ <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1542
+
1543
+ <div class="figure-legend"><p>On the top (Figure 2 from the ZeroBubble paper): the classical 1F1B schedule, interleaving forward and backward passes but keeping a coarse-grained backward pass. On the bottom two graphs (Figure 3 from the ZeroBubble paper): two variants of the ZeroBubble schedule, splitting the backward operation into finer-grained “B” and “W” operations. The last schedule, the so-called “ZB-H2”, is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1544
+ </div>
1545
+
1546
+ <p>DeepSeek’s DualPipe, introduced with its V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, extends this decomposed approach to the additional case of two streams propagating from both ends of the PP dimension, with these streams interleaved to minimize idle time in the GPUs even further. This schedule is displayed in the following scheduling graph and is even more complex than the previous ones:</p>
1547
 
1548
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1549
 
1550
+ <p>In general, fully optimizing such complex schedules involves carefully measuring the duration of the various fine-grained operations and solving an ILP (Integer Linear Program) to minimize the final bubble time. See for instance the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms used to perform such scheduling. As a result, the ZeroBubble and DualPipe schedules are too complex for us to give code snippets here, but you should start to have a general idea of the concepts involved. </p>
1551
 
1552
+ <p>This concludes our tour of the world of pipeline schedules and bubbles. We hope you enjoyed it!</p>
1553
+
1554
+ <p>It's now time to turn to the last parallelism method we'll detail and which we can use to train large models efficiently: <strong>Expert parallelism</strong>.</p>
1555
 
1556
  <h2>Expert parallelism</h2>
1557
  <p>One more <s>thing</s> parallelism.</p>
src/style.css CHANGED
@@ -414,6 +414,13 @@ d-article {
414
  font-size: 1.0em;
415
  }
416
 
417
+ .figure-legend {
418
+ font-size: 0.9em;
419
+ font-style: italic;
420
+ color: var(--distill-gray);
421
+ line-height: 1.5em;
422
+ }
423
+
424
  d-code {
425
  font-size: 12px;
426
  }