thomwolf (HF staff) committed · verified
Commit 2cb0f7e · 1 Parent(s): b6b4552

continuing work on updates (#44)


- update (8e45c5c44a5832fdcf46078789ff4bd2e08afb28)

assets/.DS_Store DELETED
Binary file (6.15 kB)
 
dist/assets/.DS_Store DELETED
Binary file (6.15 kB)
 
dist/index.html CHANGED
@@ -1352,7 +1352,7 @@
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
- <p>In the TP section we saw that if we try to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) we hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we perform it across several nodes:</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
@@ -1366,9 +1366,11 @@
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
- <p>Sequence and context parallelism can help for long sequences but don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large model (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
- <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of a 8B model:</p>
 
 
1372
 
1373
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1374
  <script>
@@ -1380,23 +1382,23 @@
1380
  </script>
1381
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1382
 
1383
- <p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
1384
 
1385
- <p>This introduces a new type of communication pattern: instead of communicating parameters like in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>
1386
 
1387
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1388
 
1389
  <p>So, let’s say we simply spread the layers on several devices, e.g. a first GPU will take the first few layers and a second GPU will take the second part of the models and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
1390
 
1391
- <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of location along the model depth. This is a huge difference e.g. compared to the communication in Tensor Parallelism, happening several times within each layer.</p>
1392
 
1393
- <p>But maybe you start feeling a glimpse of the troubles to come: sequentially and successively”?!? This doesn’t sound very efficient in the world of parallel computation, especially after our discussion about computation and communication overlap.</p>
1394
 
1395
- <p>Indeed reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPU busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization is looking when doing a naive and simple forward and backward pass through the model where the numbers indicate the model layers:</p>
1396
 
1397
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1398
- <p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1399
-
1400
  <p>The remaining idle time is indicated in grey and usually called the “bubble” and the sight of this probably break your heart after we spent so much time optimizing throughput.</p>
1401
 
1402
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we loose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can count on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e the number of GPU on the above graph) ie. the time each GPU is waiting while other GPUs are computing.</p>
@@ -1408,8 +1410,8 @@
1408
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1409
  </d-math>
1410
 
1411
- <p>As we add more stages the bubble time thus increases and the utilization drops.</p>
1412
- <p>Thankfully, various pipeline parallelism schemes have been designed to reduce the size of the bubble which as you can see on this naive example can be very large in a naive implementation.</p>
1413
 
1414
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bit-sized portions which can be processed in parallel or almost, like we did before in data parallel for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1415
 
@@ -1417,7 +1419,7 @@
1417
 
1418
  <aside>Before the numbers in the diagram indicated the layers but in all pipeline parallel plots from now including this one it indicates a microbatch. You can think of each square here to contain several layers as seen in the previous figure. </aside>
1419
 
1420
- <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
1421
 
1422
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1423
 
@@ -1448,15 +1450,13 @@
1448
 
1449
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1450
 
1451
- <p>The bubble still has the same size so our training efficiency is not significantly improved. However we only need to store activations for <d-math>p</d-math> micro-batches instead of <d-math>m</d-math> which quite reduce the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches which then will actually reduce the bubble.</p>
1452
 
1453
- <p>A major complexity of this setup, visible on the above graph is how forward and backward passes are not cleanly consecutive anymore but performed in parallel across devices. This means we will have to schedule the switch from forward to backward passes independently on each device instead of in a simple and common central training loop as usual.</p>
1454
 
1455
  <p>This is one of the reason implementing Pipeline Parallelism usually requires rather extensive modifications to training code as well as modeling code.</p>
1456
 
1457
- <p>Here is the example training loop from the above gist:</p>
1458
-
1459
- <p>You can find the full implementation in picotron as well:</p>
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -1467,28 +1467,31 @@
1467
  </div>
1468
  </details>
1469
 
1470
- <p>Let's look at how the 1F1B Pipeline Parallelism schedule scales in practice:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
- <p>On the left, with microbatches equal to PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance drops significantly as we scale PP. The right plot shows that using many more microbatches than PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps reduce this effect. However, we can't maintain this ratio of <d-math>m \gg p - 1</d-math> indefinitely since we're ultimately constrained by our target global batch size - as we add more PP degree, we're increasing the bubble size according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
1475
 
1476
- <p>Interestingly, when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>), the performance only drops by 14% - a much better scaling than Tensor Parallelism which typically sees around 43% performance degradation in similar cross-node scenarios. This makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
- <p>While 1F1B significantly reduces our activation memory footprint, the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
- <p>The 1F1B schedule has let us improved memory usage but not much the size of the idle buddle. Can we also also reduce the time spent in the bubble?</p>
1483
 
1484
- <p>Well it turns out this is possible if we are willing to bring in a few additional communications. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
- <p>Up to now we’ve sliced our model naively along the model depth dimensions, locating for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
1487
 
1488
- <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it goes through the forward pass through the model.</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
 
 
 
1492
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously just took one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPUs as we are able to better interleave forward and backward passes. </p>
1493
 
1494
 
@@ -1513,32 +1516,42 @@
1513
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1514
 
1515
 
1516
- <p>Scheduling also becomes more complex here as we need to decide on a GPU whether we are prioritizing at a given moment earlier micro-batches meaning that we close the forward and backward loops as fast as possible (so called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) or we prioritize to first complete the forward passes of all microbatches in the queue before going over to backward passes (so called “breadth-first” i.e. prioritizing filling in the pipeline as much as possible). This is explained in detail in the "Breadth-Fist Pipeline" paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1517
 
1518
- <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1 which is using a one-forward-one-backward setup with interleaved stages and a priority setting tuneable between depth-first and bread-first.</p>
1519
 
1520
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1521
 
1522
- <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to reduce the bubble to virtually zero! Peaked your curiosity? Let’s have a look!</p>
1523
 
1524
  <h3>Zero Bubble and DualPipe</h3>
1525
 
1526
- <p>There are even more sophisticated ways to reduce the bubble more and reached close to a “zero bubble” regime. The secret here is to split at an even finer-grained level the operations involved in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe reach close to a zero bubble regime.</p>
1527
-
1528
- <p>Let’s very quickly see how this can work by detailing briefly the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work which is a precursor to DualPipe. The base observation of ZeroBubble is that a backward through a matrix multiplication involve actually two separated operations: backward for the inputs (B) and the backward for the weights (W):</p>
 
 
1529
 
 
 
1530
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
1531
- <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1532
 
1533
- <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only need to be performed before the optimiser step. This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule on the top right is an example of (theoretical) schedule with zero bubble taking advantage for this fine-grained decomposition.</p>
1534
 
1535
- <p>DeepSeek’s DualPipe introduced with V3 proposes an extension of this decomposed approach to the case of two stream propagating from both sides of the PP ranks and being interleaved to minimize even further idle time in the GPUs are displayed in the following scheduling graph:</p>
 
 
 
 
 
1536
 
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1538
 
1539
- <p>The ZeroBubble and DualPipe schedules are a bit too complex for us to give here code snippets but you should start to have a general idea of the concepts involved. In practice, optimizing these schedules requires careful measurements of the time for each operations followed by a scheduling algorithm able to find the most optimal allocation of time given the constrains. See for instance in the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms to perform such a scheduling.</p>
1540
 
1541
- <p>This concludes our tour into the world of pipeline schedules and bubbles. Let's turn to the last parallelism method we can use to train large models efficiently: Expert parallelism.</p>
 
 
1542
 
1543
  <h2>Expert parallelism</h2>
1544
  <p>One more <s>thing</s> parallelism.</p>
@@ -2283,7 +2296,7 @@
2283
 
2284
  <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
2285
 
2286
- <iframe class="l-body-outset" id="plotFP8Loss" src="assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
2287
 
2288
  <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
2289
 
 
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
+ <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor Parallelism past the number of GPUs per node (typically 4 or 8) forces us onto a lower-bandwidth network called the “inter-node connection”, which can quite strongly impair our performance. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
 
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
+ <p>Sequence and context parallelism can help for long sequences but don’t help much if the root cause of our memory issues is not the sequence length but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
+ <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's see the effect of Pipeline Parallelism in action on the memory usage for an 8B model:</p>
1372
+
1373
+ <aside>This technique may remind you of our discussion on <a target="_self" href="#zero-redundancy-optimizer">ZeRO-3</a> where we split the model parameters across GPUs. We compare both techniques in detail later in the <a target="_self" href="#5d_parallelism_in_a_nutshell">5D parallelism in a nutshell</a> section.</aside>
1374
 
1375
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1376
  <script>
 
1382
  </script>
1383
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1384
 
1385
+ <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers will be sent to the next GPU to continue the forward pass.</p>
1386
 
1387
+ <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
1388
 
1389
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1390
 
1391
  <p>So, let’s say we simply spread the layers across several devices, e.g. a first GPU will take the first few layers, a second GPU will take the second part of the model, and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
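  <p>As a minimal illustration of this naive split (a toy sketch, not the picotron implementation, assuming two GPUs are available), we can place contiguous blocks of layers on different devices and simply move the activation tensor from one device to the next during the forward pass:</p>

  <d-code block language="python">
import torch
from torch import nn

# Toy model: 8 layers split contiguously across 2 (assumed) devices.
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)])
devices = ["cuda:0", "cuda:1"]
layers_per_stage = len(layers) // len(devices)

for i, layer in enumerate(layers):
    layer.to(devices[i // layers_per_stage])   # layers 0-3 on cuda:0, layers 4-7 on cuda:1

def pipeline_forward(x):
    # Only the activation tensor travels between devices, never the weights.
    for i, layer in enumerate(layers):
        x = x.to(devices[i // layers_per_stage])
        x = layer(x)
    return x

out = pipeline_forward(torch.randn(32, 1024, device=devices[0]))
  </d-code>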
1392
 
1393
+ <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of locations along the model depth. This can make a huge difference versus e.g. the communications in Tensor Parallelism, which happen several times within each layer.</p>
1394
 
1395
+ <p>But maybe you are starting to get a glimpse of the troubles to come: <strong>“sequentially”</strong> and <strong>“successively”</strong>?!? This doesn’t sound very efficient in the world of parallel computations, especially after our discussion on computation and communication overlap.</p>
1396
 
1397
+ <p>Indeed reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPUs busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization looks when doing a naive and simple forward and backward pass through the model (here the numbers indicate the model layers):</p>
1398
 
1399
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1400
+ <div class="figure-legend"><p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1401
+ </div>
1402
  <p>The remaining idle time is indicated in grey and usually called the “bubble”, and the sight of it probably breaks your heart after we spent so much time optimizing throughput.</p>
1403
 
1404
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we loose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can count on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e the number of GPU on the above graph) ie. the time each GPU is waiting while other GPUs are computing.</p>
 
1410
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1411
  </d-math>
1412
 
1413
+ <p>As we add more stages the bubble time thus increases and the utilization drops. As we can see, the bubble can be very large in a naive implementation!</p>
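  <p>To put a number on this: with the <d-math>p=4</d-math> GPUs of the figure above and <d-math>t_b \approx 2 \times t_f</d-math>, the bubble adds <d-math>t_{pb}=(p-1)*(t_f+t_b)=9 \times t_f</d-math> of idle time against only <d-math>t_{id}=t_f+t_b=3 \times t_f</d-math> of useful compute, i.e. <d-math>r_{bubble}=3</d-math>: each GPU spends three times longer waiting than computing.</p>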
1414
+ <p>Thankfully, various pipeline parallelism schemes have been designed to <strong>reduce the size of the bubble</strong>.</p>
1415
 
1416
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bite-sized portions which can be processed in parallel (or almost), like we did before in data parallelism for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1417
 
 
1419
 
1420
  <aside>Previously the numbers in the diagram indicated layers, but in all pipeline parallel plots from now on, including this one, the numbers indicate a microbatch. You can think of each square here as containing several layers, as seen in the previous figure. </aside>
1421
 
1422
+ <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential, so we're preserving the general organization of our model training code. This makes AFAB one of the simplest pipeline parallelism schedules to implement.</p>
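  <p>For intuition, here is roughly what the AFAB scheduling boils down to on a single pipeline rank (a simplified sketch with hypothetical send/recv helpers, not the actual picotron code): all forward passes first, keeping the activations around, then all backward passes:</p>

  <d-code block language="python">
import torch

def train_step_afab(model_stage, num_microbatches,
                    recv_forward, send_forward, recv_backward, send_backward):
    # All-forward: run every microbatch through this stage and keep the
    # inputs/outputs alive for the later backward passes.
    saved = []
    for _ in range(num_microbatches):
        x = recv_forward()          # activations from the previous stage (or the dataloader)
        x.requires_grad_(True)      # so we can later get the gradient w.r.t. this input
        out = model_stage(x)
        send_forward(out)           # hand activations to the next stage
        saved.append((x, out))

    # All-backward: propagate gradients back through the saved activations.
    for x, out in reversed(saved):
        grad_out = recv_backward()  # gradient coming back from the next stage
        torch.autograd.backward(out, grad_tensors=grad_out)
        send_backward(x.grad)       # gradient w.r.t. this stage's input, sent upstream
  </d-code>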
1423
 
1424
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1425
 
 
1450
 
1451
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1452
 
1453
+ <p>If you count carefully you'll see that the bubble still has the same size so our training efficiency is not significantly improved. However we only need to store activations for <d-math>p</d-math> micro-batches (where <d-math>p</d-math> is the degree of pipeline parallelism) instead of <d-math>m</d-math> (where <d-math>m</d-math> was the number of microbatches) which can reduce the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches which then will actually reduce the bubble.</p>
1454
 
1455
+ <p>A major complexity of this setup, visible on the above graph, is how forward and backward passes are no longer cleanly sequential but are performed in parallel across devices and interleaved. This means we will have to schedule the switch from forward to backward passes independently on each device instead of in a simple and common central training loop as usual.</p>
1456
 
1457
  <p>This is one of the reasons why implementing Pipeline Parallelism usually requires rather extensive modifications to training code as well as modeling code.</p>
1458
 
1459
+ <p>You can find a full implementation of 1F1B in picotron as well:</p>
 
 
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
1467
  </div>
1468
  </details>
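  <p>Schematically, the per-rank logic of 1F1B boils down to something like the following condensed sketch (hypothetical run_forward/run_backward helpers, communication omitted; see the picotron implementation above for the real thing):</p>

  <d-code block language="python">
def train_step_1f1b(model_stage, microbatches, rank, pp_size, run_forward, run_backward):
    num_microbatches = len(microbatches)
    # Later stages need fewer warmup forward passes before they can start 1F1B.
    num_warmup = min(pp_size - rank - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    pending = []  # activations waiting for their backward pass (at most ~p at a time)

    # Warmup: forward-only passes that fill the pipeline.
    for i in range(num_warmup):
        pending.append(run_forward(model_stage, microbatches[i]))

    # Steady state: alternate one forward and one backward pass.
    for i in range(num_steady):
        pending.append(run_forward(model_stage, microbatches[num_warmup + i]))
        run_backward(model_stage, pending.pop(0))

    # Cooldown: drain the remaining backward passes.
    while pending:
        run_backward(model_stage, pending.pop(0))
  </d-code>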
1469
 
1470
+ <p>Let's take a look at how the 1F1B Pipeline Parallelism schedule scales in practice with some benchmarks on our cluster:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
+ <p>On the left, with a number of microbatches equal to –or less than– PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance is low and even drops as we scale PP. The right plot shows that using many more microbatches than PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps improve performance at low PP degree while still staying limited at very large PP degree. In practice it's not possible to arbitrarily increase the number of microbatches to maintain the ratio <d-math>m \gg p - 1</d-math> since we're ultimately constrained by the target global batch size. Once we reach the maximal possible number of microbatches, adding more PP degree will therefore increase the bubble size according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
1475
 
1476
+ <p>Interestingly, at a small number of micro-batches the performance only drops by 14% when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>) - a much better scaling than Tensor Parallelism, which typically sees around 43% performance degradation in similar cross-node scenarios. This type of behavior when hitting the lower-bandwidth inter-node network makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
+ <p>While 1F1B significantly reduces our activation memory footprint, we see on this last graph that the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
+ <p>The 1F1B schedule has let us improve memory usage but not so much the size of the idle bubble. Is there any way we can push this frontier further?</p>
1483
 
1484
+ <p>Well it turns out this is possible if we are willing to bring in a few additional communication operations. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
+ <p>Up to now we’ve sliced our model naively along the model depth dimension, hosting for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
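  <p>To make the interleaved placement concrete, here is a tiny (hypothetical) helper computing which GPU hosts which layer when each GPU holds <d-math>v</d-math> model chunks - with <d-math>v=1</d-math> we recover the plain contiguous split from before:</p>

  <d-code block language="python">
def assign_layers(num_layers, pp_size, v):
    """Interleaved split: the model is cut into pp_size * v chunks of
    consecutive layers, dealt out round-robin to the pp_size ranks."""
    chunk_size = num_layers // (pp_size * v)   # assumes an even split
    assignment = {}
    for layer in range(num_layers):
        chunk = layer // chunk_size
        assignment[layer] = chunk % pp_size
    return assignment

# 8 layers, 2 GPUs, 2 chunks per GPU:
# GPU 0 hosts layers 0-1 and 4-5, GPU 1 hosts layers 2-3 and 6-7.
# (The odd/even example above corresponds to assign_layers(8, 2, 4).)
print(assign_layers(8, 2, 2))
  </d-code>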
1487
 
1488
+ <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it goes through the model during the forward pass. Let's take a graphical look at how this works:</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
1492
+ <div class="figure-legend"><p>An example of interleaved pipeline parallelism for a model with layers distributed across 4 GPUs. Numbers still correspond to the microbatch IDs but for clarity we've colored the first and the last layers of the model differently to illustrate how layers are spread across GPUs.</p>
1493
+ </div>
1494
+
1495
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously just took one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPU, as we are able to better interleave forward and backward passes. </p>
1496
 
1497
 
 
1516
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1517
 
1518
 
1519
+ <p>Scheduling also becomes more complex here as we have to decide, on a given GPU and at a given moment, whether we prioritize earlier micro-batches going through later layers – meaning that we close the forward and backward loops as fast as possible (so-called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) – or whether we first prioritize later micro-batches going through earlier layers (so-called “breadth-first”, i.e. prioritizing filling in the pipeline as much as possible). This choice is explained in detail in the nice "Breadth-First Pipeline" paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1520
 
1521
+ <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1 which is using a one-forward-one-backward setup with interleaved stages and a priority setting tuneable between depth-first and breadth-first.</p>
1522
 
1523
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1524
 
1525
+ <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to <strong>reduce the bubble to virtually zero</strong>! These techniques were for instance used in the DeepSeek V3/R1 implementation<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. Piqued your curiosity? Let’s have a final quick look at these magical schedules before we leave the world of Pipeline Parallelism!</p>
1526
 
1527
  <h3>Zero Bubble and DualPipe</h3>
1528
 
1529
+ <p>Even more sophisticated ways to reduce the bubble have recently been proposed which reach close to a “zero bubble” regime. The secret here is to split the operations involved at an even finer-grained level in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe, reaches close to a zero bubble regime.</p>
1530
+
1531
+ <aside>Ultimate "flex" in the DeepSeek V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, where the authors indicate that their setup "achiev[ed] a near-zero all-to-all communication overhead".</aside>
1532
+
1533
+ <p>Let’s briefly see how this can work by summarizing the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work which is a precursor to DualPipe. The base observation of ZeroBubble is that the backward pass through a matrix multiplication actually involves two separate operations: the backward operation for the inputs (B) and the backward operation for the weights (W):</p>
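  <p>For a single linear layer <d-math>y = x W</d-math> these two operations are just two independent matrix multiplications, as the following small sketch illustrates (an illustrative example, not the ZeroBubble or DualPipe code):</p>

  <d-code block language="python">
import torch

x = torch.randn(16, 1024)            # input activations of the layer
W = torch.randn(1024, 1024)          # layer weights
grad_output = torch.randn(16, 1024)  # gradient arriving from the layer above

# B: backward w.r.t. the input - needed right away so the gradient
# can keep flowing down to the lower layers.
grad_input = grad_output @ W.t()

# W: backward w.r.t. the weights - only needed before the optimizer
# step, so it can be scheduled later to fill pipeline bubbles.
grad_weight = x.t() @ grad_output
  </d-code>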
1534
 
1535
+ <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only needs to be performed before the optimiser step. We can see that in the following diagram: </p>
1536
+
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
 
1538
 
1539
+ <p>This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule (the last schedule in the figure below) is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1540
 
1541
+ <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1542
+
1543
+ <div class="figure-legend"><p>On the top (Figure 2 from the ZeroBubble paper): the classical 1F1B schedule, interleaving forward and backward passes but keeping a coarse-grained backward pass. On the bottom two graphs (Figure 3 from the ZeroBubble paper): two variants of the ZeroBubble schedule, splitting the backward pass into finer-grained "B" and "W" operations. The last schedule, the so-called "ZB-H2", is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1544
+ </div>
1545
+
1546
+ <p>DeepSeek’s DualPipe, introduced with its V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, extends this decomposed approach to the additional case of two streams propagating from both ends of the PP dimension, with these streams interleaved to minimize idle time on the GPUs even further. This schedule is displayed in the following scheduling graph and is even more complex than the previous ones:</p>
1547
 
1548
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1549
 
1550
+ <p>In general, fully optimizing such complex schedules involves carefully measuring the duration of the various fine-grained operations and solving an ILP to minimize the final bubble time. See for instance the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms used to perform such scheduling. As a result, the ZeroBubble and DualPipe schedules are too complex for us to give code snippets here, but you should by now have a general idea of the concepts involved. </p>
1551
 
1552
+ <p>This concludes our tour into the world of pipeline schedules and bubbles. We hope you enjoyed this guided tour!</p>
1553
+
1554
+ <p>It's now time to turn to the last parallelism method we'll detail and which we can use to train large models efficiently: <strong>Expert parallelism</strong>.</p>
1555
 
1556
  <h2>Expert parallelism</h2>
1557
  <p>One more <s>thing</s> parallelism.</p>
 
2296
 
2297
  <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
2298
 
2299
+ <iframe class="l-body-outset" id="plotFP8Loss" src="/assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
2300
 
2301
  <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
2302
 
dist/style.css CHANGED
@@ -414,6 +414,13 @@ d-article {
414
  font-size: 1.0em;
415
  }
416
 
 
 
 
 
 
 
 
417
  d-code {
418
  font-size: 12px;
419
  }
 
414
  font-size: 1.0em;
415
  }
416
 
417
+ .figure-legend {
418
+ font-size: 0.9em;
419
+ font-style: italic;
420
+ color: var(--distill-gray);
421
+ line-height: 1.5em;
422
+ }
423
+
424
  d-code {
425
  font-size: 12px;
426
  }
src/index.html CHANGED
@@ -1352,7 +1352,7 @@
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
- <p>In the TP section we saw that if we try to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) we hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we perform it across several nodes:</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
@@ -1366,9 +1366,11 @@
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
- <p>Sequence and context parallelism can help for long sequences but don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large model (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
- <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of a 8B model:</p>
 
 
1372
 
1373
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1374
  <script>
@@ -1380,23 +1382,23 @@
1380
  </script>
1381
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1382
 
1383
- <p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
1384
 
1385
- <p>This introduces a new type of communication pattern: instead of communicating parameters like in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>
1386
 
1387
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1388
 
1389
  <p>So, let’s say we simply spread the layers on several devices, e.g. a first GPU will take the first few layers and a second GPU will take the second part of the models and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
1390
 
1391
- <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of location along the model depth. This is a huge difference e.g. compared to the communication in Tensor Parallelism, happening several times within each layer.</p>
1392
 
1393
- <p>But maybe you start feeling a glimpse of the troubles to come: sequentially and successively”?!? This doesn’t sound very efficient in the world of parallel computation, especially after our discussion about computation and communication overlap.</p>
1394
 
1395
- <p>Indeed reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPU busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization is looking when doing a naive and simple forward and backward pass through the model where the numbers indicate the model layers:</p>
1396
 
1397
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1398
- <p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1399
-
1400
  <p>The remaining idle time is indicated in grey and usually called the “bubble” and the sight of this probably break your heart after we spent so much time optimizing throughput.</p>
1401
 
1402
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we loose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can count on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e the number of GPU on the above graph) ie. the time each GPU is waiting while other GPUs are computing.</p>
@@ -1408,8 +1410,8 @@
1408
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1409
  </d-math>
1410
 
1411
- <p>As we add more stages the bubble time thus increases and the utilization drops.</p>
1412
- <p>Thankfully, various pipeline parallelism schemes have been designed to reduce the size of the bubble which as you can see on this naive example can be very large in a naive implementation.</p>
1413
 
1414
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bit-sized portions which can be processed in parallel or almost, like we did before in data parallel for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1415
 
@@ -1417,7 +1419,7 @@
1417
 
1418
  <aside>Before the numbers in the diagram indicated the layers but in all pipeline parallel plots from now including this one it indicates a microbatch. You can think of each square here to contain several layers as seen in the previous figure. </aside>
1419
 
1420
- <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
1421
 
1422
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1423
 
@@ -1448,15 +1450,13 @@
1448
 
1449
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1450
 
1451
- <p>The bubble still has the same size so our training efficiency is not significantly improved. However we only need to store activations for <d-math>p</d-math> micro-batches instead of <d-math>m</d-math> which quite reduce the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches which then will actually reduce the bubble.</p>
1452
 
1453
- <p>A major complexity of this setup, visible on the above graph is how forward and backward passes are not cleanly consecutive anymore but performed in parallel across devices. This means we will have to schedule the switch from forward to backward passes independently on each device instead of in a simple and common central training loop as usual.</p>
1454
 
1455
  <p>This is one of the reason implementing Pipeline Parallelism usually requires rather extensive modifications to training code as well as modeling code.</p>
1456
 
1457
- <p>Here is the example training loop from the above gist:</p>
1458
-
1459
- <p>You can find the full implementation in picotron as well:</p>
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -1467,28 +1467,31 @@
1467
  </div>
1468
  </details>
1469
 
1470
- <p>Let's look at how the 1F1B Pipeline Parallelism schedule scales in practice:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
- <p>On the left, with microbatches equal to PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance drops significantly as we scale PP. The right plot shows that using many more microbatches than PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps reduce this effect. However, we can't maintain this ratio of <d-math>m \gg p - 1</d-math> indefinitely since we're ultimately constrained by our target global batch size - as we add more PP degree, we're increasing the bubble size according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
1475
 
1476
- <p>Interestingly, when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>), the performance only drops by 14% - a much better scaling than Tensor Parallelism which typically sees around 43% performance degradation in similar cross-node scenarios. This makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
- <p>While 1F1B significantly reduces our activation memory footprint, the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
- <p>The 1F1B schedule has let us improved memory usage but not much the size of the idle buddle. Can we also also reduce the time spent in the bubble?</p>
1483
 
1484
- <p>Well it turns out this is possible if we are willing to bring in a few additional communications. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
- <p>Up to now we’ve sliced our model naively along the model depth dimensions, locating for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
1487
 
1488
- <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it goes through the forward pass through the model.</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
 
 
 
1492
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously just took one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPUs as we are able to better interleave forward and backward passes. </p>
1493
 
1494
 
@@ -1513,32 +1516,42 @@
1513
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1514
 
1515
 
1516
- <p>Scheduling also becomes more complex here as we need to decide on a GPU whether we are prioritizing at a given moment earlier micro-batches meaning that we close the forward and backward loops as fast as possible (so called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) or we prioritize to first complete the forward passes of all microbatches in the queue before going over to backward passes (so called “breadth-first” i.e. prioritizing filling in the pipeline as much as possible). This is explained in detail in the "Breadth-Fist Pipeline" paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1517
 
1518
- <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1 which is using a one-forward-one-backward setup with interleaved stages and a priority setting tuneable between depth-first and bread-first.</p>
1519
 
1520
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1521
 
1522
- <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to reduce the bubble to virtually zero! Peaked your curiosity? Let’s have a look!</p>
1523
 
1524
  <h3>Zero Bubble and DualPipe</h3>
1525
 
1526
- <p>There are even more sophisticated ways to reduce the bubble more and reached close to a “zero bubble” regime. The secret here is to split at an even finer-grained level the operations involved in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe reach close to a zero bubble regime.</p>
1527
-
1528
- <p>Let’s very quickly see how this can work by detailing briefly the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work which is a precursor to DualPipe. The base observation of ZeroBubble is that a backward through a matrix multiplication involve actually two separated operations: backward for the inputs (B) and the backward for the weights (W):</p>
 
 
1529
 
 
 
1530
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
1531
- <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1532
 
1533
- <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only need to be performed before the optimiser step. This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule on the top right is an example of (theoretical) schedule with zero bubble taking advantage for this fine-grained decomposition.</p>
1534
 
1535
- <p>DeepSeek’s DualPipe introduced with V3 proposes an extension of this decomposed approach to the case of two stream propagating from both sides of the PP ranks and being interleaved to minimize even further idle time in the GPUs are displayed in the following scheduling graph:</p>
 
 
 
 
 
1536
 
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1538
 
1539
- <p>The ZeroBubble and DualPipe schedules are a bit too complex for us to give here code snippets but you should start to have a general idea of the concepts involved. In practice, optimizing these schedules requires careful measurements of the time for each operations followed by a scheduling algorithm able to find the most optimal allocation of time given the constrains. See for instance in the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms to perform such a scheduling.</p>
1540
 
1541
- <p>This concludes our tour into the world of pipeline schedules and bubbles. Let's turn to the last parallelism method we can use to train large models efficiently: Expert parallelism.</p>
 
 
1542
 
1543
  <h2>Expert parallelism</h2>
1544
  <p>One more <s>thing</s> parallelism.</p>
 
1352
 
1353
  <h2>Pipeline Parallelism</h2>
1354
 
1355
+ <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor Parallelism past the number of GPUs per node (typically 4 or 8) forces us onto a lower-bandwidth network called the “inter-node connection”, which can quite strongly impair our performance. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>
1356
 
1357
  <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe>
1358
  <script>
 
1366
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
1367
  <p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather and ReduceScatter operations.</p>
1368
 
1369
+ <p>Sequence and context parallelism can help for long sequences but don’t help much if the root cause of our memory issues is not the sequence length but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
1370
 
1371
+ <p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's see the effect of Pipeline Parallelism in action on the memory usage for an 8B model:</p>
1372
+
1373
+ <aside>This technique may remind you of our discussion on <a target="_self" href="#zero-redundancy-optimizer">ZeRO-3</a> where we split the model parameters across GPUs. We compare both techniques in detail later in the <a target="_self" href="#5d_parallelism_in_a_nutshell">5D parallelism in a nutshell</a> section.</aside>
1374
 
1375
  <iframe class="l-body" id="plotFrame12" src="assets/data/benchmarks/pp_memoryusage.html" width="90%" scrolling="no" frameborder="0"></iframe>
1376
  <script>
 
1382
  </script>
1383
  <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
1384
 
1385
+ <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers will be sent to the next GPU to continue the forward pass.</p>
1386
 
1387
+ <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
1388
 
1389
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
1390
 
1391
  <p>So, let’s say we simply spread the layers across several devices, e.g. a first GPU will take the first few layers, a second GPU will take the second part of the model, and so on. The forward pass through our model now simply involves sequentially passing the batch of data along the model and thus successively using each compute device.</p>
1392
 
1393
+ <p>We have a direct first advantage: the required interconnect bandwidth stays quite low as we only send moderate-sized activations at a handful of locations along the model depth. This can make a huge difference versus e.g. the communications in Tensor Parallelism, which happen several times within each layer.</p>
1394
 
1395
+ <p>But maybe you are starting to catch a glimpse of the troubles to come: <strong>“sequentially”</strong> and <strong>“successively”</strong>?!? This doesn’t sound very efficient in the world of parallel computations, especially after our discussion on computation and communication overlap.</p>
1396
 
1397
+ <p>Indeed, reader! The main challenge in pipeline parallelism will be how to efficiently circumvent the sequential nature of PP to keep our GPUs busy at all times and avoid having one GPU computing while the others are waiting. Here is how our GPU utilization looks when doing a naive and simple forward and backward pass through the model (here the numbers indicate the model layers):</p>
1398
 
1399
  <p><img alt="image.png" src="/assets/images/pp_afab.svg" /></p>
1400
+ <div class="figure-legend"><p>An example of Pipeline parallelism for a model with 16 layers distributed across 4 GPUs. The numbers correspond to the layer IDs.</p>
1401
+ </div>
1402
  <p>The remaining idle time is indicated in grey and usually called the “bubble”, and the sight of it probably breaks your heart after we spent so much time optimizing throughput.</p>
1403
 
1404
  <p>We can quantify how efficient a pipeline setup is by looking at how much time we lose because of the bubble. Let’s say <d-math>t_f</d-math> and <d-math>t_b</d-math> are the times for the forward and backward pass, respectively, as measured for one microbatch and one stage of the pipeline (a simple assumption is often to have <d-math>t_b \approx 2 \times t_f</d-math> which you can see on the above graph). If we could perfectly parallelize, the ideal total time would be <d-math>t_{id}=t_f + t_b</d-math>. However, we can see on the graph that due to the pipeline bubble there is additional time of <d-math>t_{pb}=(p-1)*(t_f+t_b)</d-math> (where <d-math>p</d-math> is the degree of pipeline parallelism, i.e. the number of GPUs on the above graph), i.e. the time during which each GPU is waiting while the other GPUs are computing. We can compute the ratio of this additional bubble time over the ideal time:</p>
 
1410
  <d-math block="">
  r_{bubble} = \frac{(p-1)*(t_f+t_b)}{t_f+t_b} = p-1
1411
  </d-math>
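<p>For instance, with <d-math>p = 4</d-math> stages and a single batch, each GPU computes for <d-math>t_f + t_b</d-math> but waits for <d-math>3 \times (t_f + t_b)</d-math> while the other stages work, so only a quarter of the available compute is actually used.</p>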
1412
 
1413
+ <p>As we add more stages the bubble time thus increases and the utilization drops. As we can see, the bubble can be very large in a naive implementation!</p>
1414
+ <p>Thankfully, various pipeline parallelism schemes have been designed to <strong>reduce the size of the bubble</strong>.</p>
1415
 
1416
  <p>Let’s take a first tool out of our toolbox and think about splitting our batch into smaller bite-sized portions which can be processed in parallel, or almost, like we did before with data parallelism for instance. Now when the second GPU is busy processing micro-batch 1, the first GPU can already start processing micro-batch 2. Here is a schedule using 8 micro-batches:</p>
1417
 
 
1419
 
1420
  <aside>In the previous figure the numbers indicated model layers, but from now on (this plot included) the numbers in all pipeline-parallel plots indicate microbatches. You can think of each square here as containing several layers, as seen in the previous figure.</aside>
1421
 
1422
+ <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all the forward passes and only then all the backward passes. The advantage is that forward and backward steps remain generally sequential, so we preserve the overall organization of our model training code, which makes this one of the simplest flavors of pipeline parallelism to implement.</p>
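<p>Schematically, an AFAB step on one pipeline stage looks like the following sketch (the <code>recv_forward</code>/<code>send_forward</code>/<code>recv_backward</code>/<code>send_backward</code> and <code>compute_loss</code> helpers are hypothetical placeholders for point-to-point communication and loss computation, not picotron's actual functions):</p>
<d-code block language="python">
import torch

def train_step_afab(model_stage, microbatches, targets, is_first_stage, is_last_stage):
    # Sketch of one all-forward-all-backward (AFAB) step on a single pipeline stage.
    inputs, outputs = [], []

    # Phase 1: all forward passes, one per microbatch.
    for mb in microbatches:
        x = mb if is_first_stage else recv_forward().requires_grad_(True)
        y = model_stage(x)
        if not is_last_stage:
            send_forward(y.detach())
        inputs.append(x)   # activations for *all* microbatches stay alive until phase 2
        outputs.append(y)

    # Phase 2: all backward passes.
    for i, (x, y) in enumerate(zip(inputs, outputs)):
        if is_last_stage:
            compute_loss(y, targets[i]).backward()
        else:
            torch.autograd.backward(y, grad_tensors=recv_backward())
        if not is_first_stage:
            send_backward(x.grad)
</d-code>
<p>Note how the activations of all the microbatches are kept in memory until the backward phase starts - this is the activation memory pressure that the next schedule will address.</p>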
1423
 
1424
  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1425
 
 
1450
 
1451
  <p><img alt="image.png" src="/assets/images/pp_1f1b.svg" /></p>
1452
 
1453
+ <p>If you count carefully you'll see that the bubble still has the same size, so our training efficiency is not significantly improved. However, we now only need to store activations for <d-math>p</d-math> micro-batches (where <d-math>p</d-math> is the degree of pipeline parallelism) instead of <d-math>m</d-math> (where <d-math>m</d-math> is the number of microbatches), which reduces the activation memory explosion we had in the AFAB schedule. As a consequence we can add more microbatches, which will then actually reduce the bubble.</p>
1454
 
1455
+ <p>A major complexity of this setup, visible on the above graph, is that forward and backward passes are no longer cleanly sequential but are instead performed in parallel and interleaved across devices. This means each device has to schedule its own switches between forward and backward passes instead of following a simple, common central training loop as usual.</p>
1456
 
1457
  <p>This is one of the reasons why implementing Pipeline Parallelism usually requires rather extensive modifications to the training code as well as the modeling code.</p>
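<p>The scheduling logic itself can still be sketched compactly. Here is a rough outline of a one-forward-one-backward (1F1B) step on one stage, assuming hypothetical <code>forward_microbatch</code>/<code>backward_microbatch</code> helpers that wrap the forward/backward passes and the associated point-to-point communications (again, not picotron's actual API):</p>
<d-code block language="python">
def train_step_1f1b(model_stage, microbatches, pp_rank, pp_size):
    # Sketch of the one-forward-one-backward (1F1B) schedule on a single pipeline stage.
    num_microbatches = len(microbatches)
    num_warmup = min(pp_size - pp_rank - 1, num_microbatches)  # forwards needed to fill the pipeline
    pending = []  # activations waiting for their backward pass (at most ~pp_size entries)

    # Warmup phase: forward passes only, until the pipeline is full.
    for i in range(num_warmup):
        pending.append(forward_microbatch(model_stage, microbatches[i]))

    # Steady state: one forward immediately followed by one backward.
    for i in range(num_warmup, num_microbatches):
        pending.append(forward_microbatch(model_stage, microbatches[i]))
        backward_microbatch(model_stage, pending.pop(0))

    # Cooldown phase: drain the remaining backward passes.
    while pending:
        backward_microbatch(model_stage, pending.pop(0))
</d-code>
<p>The <code>pending</code> list never holds more than roughly <d-math>p</d-math> sets of activations, which is exactly where the memory saving over AFAB comes from, while each stage now follows its own interleaved view of the schedule.</p>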
1458
 
1459
+ <p>You can find a full implementation of 1F1B in picotron as well:</p>
 
 
1460
 
1461
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1462
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
1467
  </div>
1468
  </details>
1469
 
1470
+ <p>Let's take a look at how the 1F1B Pipeline Parallelism schedule scales in practice with some benchmarks on our cluster:</p>
1471
 
1472
  <p><img alt="Throughput scaling of Pipeline Parallelism with varying microbatch sizes" src="/assets/images/pp_1f1b_scaling.png" /></p>
1473
 
1474
+ <p>On the left, with a number of microbatches equal to –or less than– the PP degree minus one (<d-math>m = p - 1</d-math>), we see how detrimental the pipeline bubble can be - performance is low and even drops as we scale PP. The right plot shows that using many more microbatches than the PP degree (<d-math>m = 32 \gg p - 1</d-math>) helps improve performance at low PP degrees, while it still stays limited at very large PP degrees. In practice it's not possible to arbitrarily increase the number of microbatches to maintain the ratio <d-math>m \gg p - 1</d-math> since we're ultimately constrained by the target global batch size. With the number of microbatches capped, adding more PP degree will thus ultimately grow the bubble according to <d-math>r_{bubble} = \frac{p - 1}{m}</d-math>.</p>
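<p>To put a number on this: with <d-math>m = 32</d-math> microbatches, going to <d-math>p = 16</d-math> already gives <d-math>r_{bubble} = \frac{15}{32} \approx 0.47</d-math>, i.e. close to 50% of bubble overhead relative to the ideal compute time.</p>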
1475
 
1476
+ <p>Interestingly, at a small number of micro-batches the performance only drops by 14% when scaling from one node (<d-math>p = 8</d-math>) to two nodes (<d-math>p = 16</d-math>) - a much better scaling than Tensor Parallelism, which typically sees around 43% performance degradation in similar cross-node scenarios. This behavior when hitting the lower-bandwidth inter-node network makes Pipeline Parallelism particularly attractive for distributed training across multiple nodes.</p>
1477
 
1478
+ <p>While 1F1B significantly reduces our activation memory footprint, we see on this last graph that the pipeline bubble remains a major efficiency bottleneck. With the bubble size still proportional to the number of pipeline stages, we're leaving valuable GPU compute idle. Can we design an even smarter schedule to minimize this wasted computation time?</p>
1479
 
1480
  <h3>Interleaving stages</h3>
1481
 
1482
+ <p>The 1F1B schedule has let us improve memory usage but did not do much for the size of the idle bubble. Is there any way we can still push this frontier?</p>
1483
 
1484
+ <p>Well it turns out this is possible if we are willing to bring in a few additional communication operations. Time to talk about <strong><em>interleaved stages</em></strong>.</p>
1485
 
1486
+ <p>Up to now we’ve sliced our model naively along the model-depth dimension, hosting for instance layers 1-4 on the first GPU and layers 5-8 on the second GPU. But there are other ways we could think about slicing our layers, e.g. having odd layers 1, 3, 5, 7 on the first GPU and even layers 2, 4, 6, 8 on the second GPU.</p>
1487
 
1488
+ <p>This can be seen in general as a kind of “looping pipeline” where a micro-batch will move in circles from one GPU to the next as it progresses through the forward pass of the model. Let's take a graphical look at how this works:</p>
1489
 
1490
  <p><img alt="pp_1f1b_interleaved.svg" src="/assets/images/pp_1f1b_interleaved.svg" /></p>
1491
 
1492
+ <div class="figure-legend"><p>An example of interleaved pipeline parallelism for a model with layers distributed across 4 GPUs. Numbers still correspond to the microbatch IDs, but for clarity we've colored the first and the last layers of the model differently to illustrate how layers are spread across GPUs.</p>
1493
+ </div>
1494
+
1495
  <p>As a consequence we see additional communications happening as the model goes several times through each GPU for the same computation that previously took just one pass. However, each forward and backward pass is divided by a factor of <d-math>v</d-math>, where <d-math>v</d-math> is the number of stages or model chunks per GPU, as we are able to better interleave forward and backward passes. </p>
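<p>To make the layer-to-GPU mapping concrete, here is a small sketch of the kind of round-robin assignment used with interleaved stages (the exact mapping is an implementation choice, this is just an illustration):</p>
<d-code block language="python">
def interleaved_layer_assignment(num_layers: int, pp_size: int, v: int) -> dict:
    # Map each layer index to a (gpu_rank, model_chunk) pair.
    # E.g. with num_layers=16, pp_size=4 and v=2 chunks per GPU, GPU 0 hosts
    # layers [0, 1] (chunk 0) and [8, 9] (chunk 1), GPU 1 hosts [2, 3] and [10, 11], etc.
    layers_per_chunk = num_layers // (pp_size * v)
    assignment = {}
    for layer_id in range(num_layers):
        chunk = layer_id // (layers_per_chunk * pp_size)     # which "loop" around the pipeline
        gpu = (layer_id // layers_per_chunk) % pp_size       # which GPU inside that loop
        assignment[layer_id] = (gpu, chunk)
    return assignment
</d-code>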
1496
 
1497
 
 
1516
  <!-- <p><img alt="pp_bubblesize.png" src="/assets/images/pp_bubblesize.png" /></p> -->
1517
 
1518
 
1519
+ <p>Scheduling also becomes more complex here, as we have to decide on a given GPU and at a given moment whether we prioritize earlier micro-batches going through later layers – meaning that we close the forward and backward loops as fast as possible (so-called “depth-first”, i.e. prioritizing getting batches out of the model as fast as possible) – or whether we prioritize later micro-batches going through earlier layers (so-called “breadth-first”, i.e. prioritizing filling in the pipeline as much as possible). This choice is explained in detail in the nice “Breadth-First Pipeline Parallelism” paper<d-cite bibtex-key="lamypoirier2023breadthfirstpipelineparallelism"></d-cite>.</p>
1520
 
1521
+ <p>You now have all the elements to understand the pipeline parallelism approach in Llama 3.1, which uses a one-forward-one-backward setup with interleaved stages and a priority setting tunable between depth-first and breadth-first.</p>
1522
 
1523
  <p><img alt="pp_llama3.1_schedule.png" src="/assets/images/pp_llama3.1_schedule.png" /></p>
1524
 
1525
+ <p>However, we haven’t reached the end of possible pipeline schedules and recently some methods have been proposed to <strong>reduce the bubble to virtually zero</strong>! These techniques were for instance used in the DeepSeek V3/R1 implementation<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. Piqued your curiosity? Let’s have a final quick look at these magical schedules before we leave the world of Pipeline Parallelism!</p>
1526
 
1527
  <h3>Zero Bubble and DualPipe</h3>
1528
 
1529
+ <p>Even more sophisticated ways to reduce the bubble have recently been proposed which reach close to a “zero bubble” regime. The secret here is to split the operations involved at an even finer-grained level in order to interleave them in the most efficient way. For instance the pipeline implementation approach in DeepSeek V3/R1, called DualPipe, reaches close to a zero-bubble regime.</p>
1530
+
1531
+ <aside>The ultimate "flex" can be found in the DeepSeek V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, where the authors indicate that their setup "achiev[ed] a near-zero all-to-all communication overhead".</aside>
1532
+
1533
+ <p>Let’s briefly see how this can work by summarizing the ZeroBubble<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> work, which is a precursor to DualPipe. The base observation of ZeroBubble is that the backward pass through a matrix multiplication actually involves two separate operations: the backward operation for the inputs (B) and the backward operation for the weights (W):</p>
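<p>Concretely, for a linear layer computing <d-math>Y = XW</d-math>, these two pieces of the backward pass are:</p>
<d-math block="">
\underbrace{\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top}_{\text{B: needed by the earlier layers}} \qquad \underbrace{\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}}_{\text{W: only needed for the optimizer step}}
</d-math>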
1534
 
1535
+ <p>While the output of B, the backward pass for the input, is necessary for performing the backward pass of the lower layers, the backward pass of the weights, W, is not necessary for the rest of the backward pass and generally only needs to be performed before the optimizer step. We can see this in the following diagram:</p>
1536
+
1537
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_compgraph.png" /></p>
 
1538
 
1539
+ <p>This means W can be flexibly scheduled anywhere after the corresponding B of the same stage. This allows for strategic placement of W to fill the pipeline bubbles. The ZB-H2 schedule (the last one in the figure below) is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1540
 
1541
+ <p><img alt="image.png" src="/assets/images/pp_zerobubble_ppschedule.png" /></p>
1542
+
1543
+ <div class="figure-legend"><p>On the top (Figure 2 from the ZeroBubble paper): the classical 1F1B schedule, interleaving forward and backward passes but keeping a coarse-grained backward pass. On the bottom two graphs (Figure 3 from the ZeroBubble paper): two variants of the ZeroBubble schedule, splitting the backward operation into finer-grained “B” and “W” operations. The last schedule, the so-called “ZB-H2”, is an example of a (theoretical) schedule with zero bubble taking advantage of this fine-grained decomposition.</p>
1544
+ </div>
1545
+
1546
+ <p>DeepSeek’s DualPipe, introduced with its V3 technical report<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, extends this decomposed approach to the additional case of two streams propagating from both ends of the PP dimension, with these streams interleaved to minimize idle time in the GPUs even further. This schedule is displayed in the following scheduling graph and is even more complex than the previous ones:</p>
1547
 
1548
  <p><img alt="image.png" src="/assets/images/pp_zerobubble_dualpipe.png" /></p>
1549
 
1550
+ <p>In general, fully optimizing such complex schedules involves carefully measuring the duration of the various fine-grained operations and solving an ILP (Integer Linear Program) to minimize the final bubble time. See for instance the ZeroBubble paper<d-cite bibtex-key="qi2023zerobubblepipelineparallelism"></d-cite> for a discussion of the heuristics and algorithms used to perform such scheduling. As a result, the ZeroBubble and DualPipe schedules are too complex for us to give code snippets here, but you should start to have a general idea of the concepts involved. </p>
1551
 
1552
+ <p>This concludes our tour of the world of pipeline schedules and bubbles. We hope you enjoyed it!</p>
1553
+
1554
+ <p>It's now time to turn to the last parallelism method we'll detail and which we can use to train large models efficiently: <strong>Expert parallelism</strong>.</p>
1555
 
1556
  <h2>Expert parallelism</h2>
1557
  <p>One more <s>thing</s> parallelism.</p>
src/style.css CHANGED
@@ -414,6 +414,13 @@ d-article {
414
  font-size: 1.0em;
415
  }
416
 
417
+ .figure-legend {
418
+ font-size: 0.9em;
419
+ font-style: italic;
420
+ color: var(--distill-gray);
421
+ line-height: 1.5em;
422
+ }
423
+
424
  d-code {
425
  font-size: 12px;
426
  }