Commit 5f43090 · verified · committed by thomwolf (HF staff)
1 Parent(s): 2464340

continue fixes (#38)

- update font size (a01f87e0cd7b4806233755e49f46b08316363a33)

Files changed (4)
  1. dist/index.html +41 -39
  2. dist/style.css +11 -1
  3. src/index.html +41 -39
  4. src/style.css +11 -1
dist/index.html CHANGED
@@ -311,14 +311,15 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
+ <div class="note-box-content">
+ <p>
  You would think for a model you could compute the memory requirements exactly but there are a few additional memory occupants that makes it hard to be exact:
  <ul>
  <li>CUDA Kernels typically require 1-2 GB of GPU memory, which you can quickly verify by running <code>import torch; torch.ones((1, 1)).to("cuda")</code> and then checking the GPU memory with <code>nvidia-smi</code>.</li>
  <li>Some rest memory usage from buffers, intermediate results and some memory that can’t be used due to fragmentation</li>
  </ul>
  We’ll neglect these last two contributors as they are typically small and constant factors.
- </p>
+ </p></div>
  </div>
 
  <p>These items are stored as tensors which come in different <em>shapes</em> and <em>precisions</em>. The <em>shapes</em> are determined by hyper-parameters such as batch size, sequence length, model hidden dimensions, attention heads, vocabulary size, and potential model sharding as we’ll see later. <em>Precision</em> refers to formats like FP32, BF16, or FP8, which respectively require 4, 2, or 1 byte to store each single value in the tensor.</p>
@@ -388,16 +389,16 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- Some libraries store grads in fp32 which would require an additional <d-math>m_{params\_fp32} = 4 * N</d-math> memory. This is done for example in nanotron, because <code>bf16</code> is lossy for smaller values and we always prioritize stability. See <a href="https://github.com/microsoft/DeepSpeed/issues/1773">this DeepSpeed issue</a> for more information.
- </p>
+ <div class="note-box-content">
+ <p>Some libraries store grads in fp32 which would require an additional <d-math>m_{params\_fp32} = 4 * N</d-math> memory. This is done for example in nanotron, because <code>bf16</code> is lossy for smaller values and we always prioritize stability. See <a href="https://github.com/microsoft/DeepSpeed/issues/1773">this DeepSpeed issue</a> for more information.</p>
+ </div>
  </div>
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- The FP32 copy of parameters (<d-math>m_{params\_fp32}</d-math>) is sometimes called "master weights" in the literature and codebases.
- </p>
+ <div class="note-box-content">
+ <p>The FP32 copy of parameters (<d-math>m_{params\_fp32}</d-math>) is sometimes called "master weights" in the literature and codebases.</p>
+ </div>
  </div>
 
  <p>Interestingly, mixed precision itself doesn’t save overall memory as it just distributes the memory differently across the three components, and in fact adds another 4 bytes over full precision training if we accumulate gradients in FP32. It’s still advantageous as computing the forward/backward passes in half precision allows us to (1) use optimized lower precision operations on the GPU which are faster and (2) reduces the activation memory requirements during the forward pass which is a large part of the memory usage as we saw on the graph above and below.</p>
@@ -498,12 +499,13 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
+ <div class="note-box-content">
+ <p>
  When you’re measuring how efficient your training setup is at using your GPU/TPU/accelerator, you usually want to take recomputation into account to compute total FLOPS (Floating point operations per second) and compare it to theoretical maximum FLOPS of the GPU/TPU/accelerator. Taking recomputation into account when calculating FLOPS for a training step gives a value called “hardware FLOPS” which is the real number of operations performed on the accelerator. Dividing this number by the duration of the training step and the maximum accelerator FLOPS yields the <strong><em>Hardware FLOPS Utilization (HFU).</em></strong>
- <br>
- <br>
+ </p><p>
  However, what really matters at the end of the day is the start-to-end time needed to train a model on a given dataset. So when comparing various GPU/TPU/accelerator together, if one of these accelerator provide for instance enough memory to skip recomputation and thus perform less operation per second (lower HFU) but for a faster training, it should be rewarded not punished. Thus, an alternative is to compute what is called <strong><em>Model FLOPS Utilization (MFU)</em></strong> which, in contrast to HFU, only takes into account the required operations for the forward+backward passes through the model, and do not include recomputation in the measured FLOPs. This value is thus more specific to the model than the training implementation.
- </p>
+ </p>
+ </div>
  </div>
 
 
@@ -677,9 +679,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>When performing communication operations, tensors must be contiguous in memory to avoid redundant memory copies. To perform this optimally, we often pre-allocate continuous buffers of the size of activations or model parameters specifically for communication. While this speed up communication, it also contributes in part to the peak memory usage during training.
- </p>
+ <div class="note-box-content">
+ <p>When performing communication operations, tensors must be contiguous in memory to avoid redundant memory copies. To perform this optimally, we often pre-allocate continuous buffers of the size of activations or model parameters specifically for communication. While this speed up communication, it also contributes in part to the peak memory usage during training.</p>
+ </div>
  </div>
 
  <p>Now let's have a look what that means for the global batch size.</p>
@@ -720,9 +722,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Bear in mind that at the 512+ GPUs scale, depending on the network used, the communication operations will start to be bound by <em>ring latency</em> (time required for a signal to propagate once around the ring) which means we can no longer fully overlap the DP communications. This will decrease our compute efficiency and hit our throughput. In this case we should start exploring other dimensions to parallelize on.
- </p>
+ <div class="note-box-content">
+ <p>Bear in mind that at the 512+ GPUs scale, depending on the network used, the communication operations will start to be bound by <em>ring latency</em> (time required for a signal to propagate once around the ring) which means we can no longer fully overlap the DP communications. This will decrease our compute efficiency and hit our throughput. In this case we should start exploring other dimensions to parallelize on.</p>
+ </div>
  </div>
 
  <p>While data parallelism nicely overlaps the all-reduce gradient synchronization with backward computation to save time, this benefit starts to break down at large scales. Why? Because as we add more and more GPUs (hundreds or thousands), the overhead of coordinating between them grows significantly and the network requirements are becoming too large for the benefits. As a result, our setup will become less and less efficient which each additional GPU we add to the system.</p>
@@ -842,9 +844,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Unfortunately these techniques are not straightforward to implement and require sophisticated use of hooks/bucketing. In practice we can just use ZeRO-3/FSDP implementation where the FSDPUnit is the entire model, more details about this later.
- </p>
+ <div class="note-box-content">
+ <p>Unfortunately these techniques are not straightforward to implement and require sophisticated use of hooks/bucketing. In practice we can just use ZeRO-3/FSDP implementation where the FSDPUnit is the entire model, more details about this later.</p>
+ </div>
  </div>
 
  <p>In ZeRO-1 the optimizer states have been partitioned, which means that each replica only updates <d-math>\frac{1}{N_d}</d-math> of the optimizer states. The keen reader must have noticed that there is no real need to have all gradients on all DP ranks in the first place since only a subset is needed for the optimization step. Meet ZeRO-2!</p>
@@ -873,9 +875,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This stage is also called FSDP (Fully Shared Data Parallelism) in PyTorch native implementation. We’ll just refer to ZeRO-3 in this blogpost but you can think of FSDP wherever you see it.
- </p>
+ <div class="note-box-content">
+ <p>This stage is also called FSDP (Fully Shared Data Parallelism) in PyTorch native implementation. We’ll just refer to ZeRO-3 in this blogpost but you can think of FSDP wherever you see it.</p>
+ </div>
  </div>
 
  <p>So how do we do a forward or backward pass in practice if all parts of the model are distributed? Quite simply we gather them on-demand when we need them. In the forward pass this looks as follows:</p>
@@ -1030,9 +1032,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>One interesting note about layer normalization in tensor parallel training - since each TP rank sees the same activations after the all-gather, the layer norm weights don't actually need an all-reduce to sync their gradients after the backward pass. They naturally stay in sync across ranks. However, for dropout operations, we must make sure to sync the random seed across TP ranks to maintain deterministic behavior.
- </p>
+ <div class="note-box-content">
+ <p>One interesting note about layer normalization in tensor parallel training - since each TP rank sees the same activations after the all-gather, the layer norm weights don't actually need an all-reduce to sync their gradients after the backward pass. They naturally stay in sync across ranks. However, for dropout operations, we must make sure to sync the random seed across TP ranks to maintain deterministic behavior.</p>
+ </div>
  </div>
 
  <p>This raises an interesting question - could we extend tensor parallelism to these remaining operations as well? Indeed, it's possible to parallelize layer norm, dropout and other operations too, which we'll explore next.</p>
@@ -1045,9 +1047,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>The term Sequence Parallelism is a bit overloaded: the Sequence Parallelism in this section is tightly coupled to Tensor Parallelism and applies to dropout and layer norm operation. However, when we will move to longer sequences the attention computation will become a bottleneck, which calls for techniques such as Ring-Attention, which are sometimes also called <em>Sequence Parallelism</em> but we’ll refer to them as <em>Context Parallelism</em> to differentiate the two approaches. So each time you see sequence parallelism, remember that it is used together with tensor parallelism (in contrast to context parallelism, which can be used independently).
- </p>
+ <div class="note-box-content">
+ <p>The term Sequence Parallelism is a bit overloaded: the Sequence Parallelism in this section is tightly coupled to Tensor Parallelism and applies to dropout and layer norm operation. However, when we will move to longer sequences the attention computation will become a bottleneck, which calls for techniques such as Ring-Attention, which are sometimes also called <em>Sequence Parallelism</em> but we’ll refer to them as <em>Context Parallelism</em> to differentiate the two approaches. So each time you see sequence parallelism, remember that it is used together with tensor parallelism (in contrast to context parallelism, which can be used independently).</p>
+ </div>
  </div>
 
  <p>Sequence parallelism (SP) involves splitting the activations and computations for the parts of the model not handled by tensor parallelism (TP) such as Dropout and LayerNorm, but along the input sequence dimension rather than across hidden dimension. This is needed because these operations require access to the full hidden dimension to compute correctly. For example, LayerNorm needs the full hidden dimension to compute mean and variance:</p>
@@ -1228,9 +1230,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Since LayerNorms in the SP region operate on different portions of the sequence, their gradients will differ across TP ranks. To ensure the weights stay synchronized, we need to all-reduce their gradients during the backward pass, similar to how DP ensures weights stay in sync. This is a small communication overhead since LayerNorm has relatively few parameters.
- </p>
+ <div class="note-box-content">
+ <p>Since LayerNorms in the SP region operate on different portions of the sequence, their gradients will differ across TP ranks. To ensure the weights stay synchronized, we need to all-reduce their gradients during the backward pass, similar to how DP ensures weights stay in sync. This is a small communication overhead since LayerNorm has relatively few parameters.</p>
+ </div>
  </div>
 
  <p>However, there are two limits to TP and SP: 1) if we scale the sequence length the activation memory will still blow up in the TP region and 2) if the model is too big to fit with TP=8 then we will see a massive slow-down due to the inter-node connectivity.</p>
@@ -1272,9 +1274,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Context Parallelism shares some conceptual similarities with Flash Attention (see later for more details) - both techniques rely on online softmax computation to reduce memory usage. While Flash Attention focuses on optimizing the attention computation itself on a single GPU, Context Parallelism achieves memory reduction by distributing the sequence across multiple GPUs.
- </p>
+ <div class="note-box-content">
+ <p>Context Parallelism shares some conceptual similarities with Flash Attention (see later for more details) - both techniques rely on online softmax computation to reduce memory usage. While Flash Attention focuses on optimizing the attention computation itself on a single GPU, Context Parallelism achieves memory reduction by distributing the sequence across multiple GPUs.</p>
+ </div>
  </div>
 
  <h3>Discovering Ring Attention</h3>
@@ -1635,9 +1637,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
- </p>
+ <div class="note-box-content">
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
+ </div>
  </div>
 
 
 
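The HFU/MFU note touched in the hunk above boils down to two ratios. Here is a minimal Python sketch of that calculation; the model size, tokens per step, step time, and peak-FLOPS figures are hypothetical placeholders, and the 6·N·T forward+backward estimate is the usual rule of thumb rather than anything taken from this diff.

# Minimal sketch of the HFU/MFU definitions quoted in the note above.
def flops_utilization(model_flops_per_step, recompute_flops_per_step,
                      step_time_s, peak_flops_per_s):
    """Return (HFU, MFU): HFU counts every operation actually executed on the
    accelerator (including activation recomputation); MFU counts only the FLOPs
    the forward+backward passes of the model require."""
    hardware_flops = model_flops_per_step + recompute_flops_per_step
    hfu = hardware_flops / (step_time_s * peak_flops_per_s)
    mfu = model_flops_per_step / (step_time_s * peak_flops_per_s)
    return hfu, mfu

# Hypothetical per-GPU numbers: an 8B-parameter model, 32k tokens per step,
# roughly one extra forward pass of recomputation, 3 s per step, ~989 TFLOPS BF16 peak.
N, T = 8e9, 32_768
model_flops = 6 * N * T          # common forward+backward FLOPs estimate
recompute_flops = 2 * N * T      # roughly one additional forward pass
hfu, mfu = flops_utilization(model_flops, recompute_flops,
                             step_time_s=3.0, peak_flops_per_s=989e12)
print(f"HFU = {hfu:.1%}, MFU = {mfu:.1%}")   # HFU > MFU whenever recomputation is used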
dist/style.css CHANGED
@@ -197,6 +197,10 @@ toggle-icon.collapsed {
  margin-top: 0;
  }
 
+ d-article {
+ font-size: 1.04em;
+ }
+
  @media (min-width: 1200px) {
  d-article {
  /* Ensure d-article does not prevent sticky positioning */
@@ -385,12 +389,14 @@ d-contents nav > ul > li > a:hover {
  margin: 0;
  color: #444444;
  font-weight: 600;
+ font-size: 12px;
  }
 
  .note-box-content {
  margin-top: 0.5rem;
  margin-bottom: 0; /* Ensure no bottom margin */
  color: #24292f;
+ font-size: 12px;
  }
 
  /* For dark mode support */
@@ -405,4 +411,8 @@ d-contents nav > ul > li > a:hover {
  .note-box-content {
  color: #d4d4d4;
  }
- }
+ }
+
+ d-code {
+ font-size: 12px;
+ }
src/index.html CHANGED
@@ -311,14 +311,15 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
+ <div class="note-box-content">
+ <p>
  You would think for a model you could compute the memory requirements exactly but there are a few additional memory occupants that makes it hard to be exact:
  <ul>
  <li>CUDA Kernels typically require 1-2 GB of GPU memory, which you can quickly verify by running <code>import torch; torch.ones((1, 1)).to("cuda")</code> and then checking the GPU memory with <code>nvidia-smi</code>.</li>
  <li>Some rest memory usage from buffers, intermediate results and some memory that can’t be used due to fragmentation</li>
  </ul>
  We’ll neglect these last two contributors as they are typically small and constant factors.
- </p>
+ </p></div>
  </div>
 
  <p>These items are stored as tensors which come in different <em>shapes</em> and <em>precisions</em>. The <em>shapes</em> are determined by hyper-parameters such as batch size, sequence length, model hidden dimensions, attention heads, vocabulary size, and potential model sharding as we’ll see later. <em>Precision</em> refers to formats like FP32, BF16, or FP8, which respectively require 4, 2, or 1 byte to store each single value in the tensor.</p>
@@ -388,16 +389,16 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- Some libraries store grads in fp32 which would require an additional <d-math>m_{params\_fp32} = 4 * N</d-math> memory. This is done for example in nanotron, because <code>bf16</code> is lossy for smaller values and we always prioritize stability. See <a href="https://github.com/microsoft/DeepSpeed/issues/1773">this DeepSpeed issue</a> for more information.
- </p>
+ <div class="note-box-content">
+ <p>Some libraries store grads in fp32 which would require an additional <d-math>m_{params\_fp32} = 4 * N</d-math> memory. This is done for example in nanotron, because <code>bf16</code> is lossy for smaller values and we always prioritize stability. See <a href="https://github.com/microsoft/DeepSpeed/issues/1773">this DeepSpeed issue</a> for more information.</p>
+ </div>
  </div>
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- The FP32 copy of parameters (<d-math>m_{params\_fp32}</d-math>) is sometimes called "master weights" in the literature and codebases.
- </p>
+ <div class="note-box-content">
+ <p>The FP32 copy of parameters (<d-math>m_{params\_fp32}</d-math>) is sometimes called "master weights" in the literature and codebases.</p>
+ </div>
  </div>
 
  <p>Interestingly, mixed precision itself doesn’t save overall memory as it just distributes the memory differently across the three components, and in fact adds another 4 bytes over full precision training if we accumulate gradients in FP32. It’s still advantageous as computing the forward/backward passes in half precision allows us to (1) use optimized lower precision operations on the GPU which are faster and (2) reduces the activation memory requirements during the forward pass which is a large part of the memory usage as we saw on the graph above and below.</p>
@@ -498,12 +499,13 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
+ <div class="note-box-content">
+ <p>
  When you’re measuring how efficient your training setup is at using your GPU/TPU/accelerator, you usually want to take recomputation into account to compute total FLOPS (Floating point operations per second) and compare it to theoretical maximum FLOPS of the GPU/TPU/accelerator. Taking recomputation into account when calculating FLOPS for a training step gives a value called “hardware FLOPS” which is the real number of operations performed on the accelerator. Dividing this number by the duration of the training step and the maximum accelerator FLOPS yields the <strong><em>Hardware FLOPS Utilization (HFU).</em></strong>
- <br>
- <br>
+ </p><p>
  However, what really matters at the end of the day is the start-to-end time needed to train a model on a given dataset. So when comparing various GPU/TPU/accelerator together, if one of these accelerator provide for instance enough memory to skip recomputation and thus perform less operation per second (lower HFU) but for a faster training, it should be rewarded not punished. Thus, an alternative is to compute what is called <strong><em>Model FLOPS Utilization (MFU)</em></strong> which, in contrast to HFU, only takes into account the required operations for the forward+backward passes through the model, and do not include recomputation in the measured FLOPs. This value is thus more specific to the model than the training implementation.
- </p>
+ </p>
+ </div>
  </div>
 
 
@@ -677,9 +679,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>When performing communication operations, tensors must be contiguous in memory to avoid redundant memory copies. To perform this optimally, we often pre-allocate continuous buffers of the size of activations or model parameters specifically for communication. While this speed up communication, it also contributes in part to the peak memory usage during training.
- </p>
+ <div class="note-box-content">
+ <p>When performing communication operations, tensors must be contiguous in memory to avoid redundant memory copies. To perform this optimally, we often pre-allocate continuous buffers of the size of activations or model parameters specifically for communication. While this speed up communication, it also contributes in part to the peak memory usage during training.</p>
+ </div>
  </div>
 
  <p>Now let's have a look what that means for the global batch size.</p>
@@ -720,9 +722,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Bear in mind that at the 512+ GPUs scale, depending on the network used, the communication operations will start to be bound by <em>ring latency</em> (time required for a signal to propagate once around the ring) which means we can no longer fully overlap the DP communications. This will decrease our compute efficiency and hit our throughput. In this case we should start exploring other dimensions to parallelize on.
- </p>
+ <div class="note-box-content">
+ <p>Bear in mind that at the 512+ GPUs scale, depending on the network used, the communication operations will start to be bound by <em>ring latency</em> (time required for a signal to propagate once around the ring) which means we can no longer fully overlap the DP communications. This will decrease our compute efficiency and hit our throughput. In this case we should start exploring other dimensions to parallelize on.</p>
+ </div>
  </div>
 
  <p>While data parallelism nicely overlaps the all-reduce gradient synchronization with backward computation to save time, this benefit starts to break down at large scales. Why? Because as we add more and more GPUs (hundreds or thousands), the overhead of coordinating between them grows significantly and the network requirements are becoming too large for the benefits. As a result, our setup will become less and less efficient which each additional GPU we add to the system.</p>
@@ -842,9 +844,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Unfortunately these techniques are not straightforward to implement and require sophisticated use of hooks/bucketing. In practice we can just use ZeRO-3/FSDP implementation where the FSDPUnit is the entire model, more details about this later.
- </p>
+ <div class="note-box-content">
+ <p>Unfortunately these techniques are not straightforward to implement and require sophisticated use of hooks/bucketing. In practice we can just use ZeRO-3/FSDP implementation where the FSDPUnit is the entire model, more details about this later.</p>
+ </div>
  </div>
 
  <p>In ZeRO-1 the optimizer states have been partitioned, which means that each replica only updates <d-math>\frac{1}{N_d}</d-math> of the optimizer states. The keen reader must have noticed that there is no real need to have all gradients on all DP ranks in the first place since only a subset is needed for the optimization step. Meet ZeRO-2!</p>
@@ -873,9 +875,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This stage is also called FSDP (Fully Shared Data Parallelism) in PyTorch native implementation. We’ll just refer to ZeRO-3 in this blogpost but you can think of FSDP wherever you see it.
- </p>
+ <div class="note-box-content">
+ <p>This stage is also called FSDP (Fully Shared Data Parallelism) in PyTorch native implementation. We’ll just refer to ZeRO-3 in this blogpost but you can think of FSDP wherever you see it.</p>
+ </div>
  </div>
 
  <p>So how do we do a forward or backward pass in practice if all parts of the model are distributed? Quite simply we gather them on-demand when we need them. In the forward pass this looks as follows:</p>
@@ -1030,9 +1032,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>One interesting note about layer normalization in tensor parallel training - since each TP rank sees the same activations after the all-gather, the layer norm weights don't actually need an all-reduce to sync their gradients after the backward pass. They naturally stay in sync across ranks. However, for dropout operations, we must make sure to sync the random seed across TP ranks to maintain deterministic behavior.
- </p>
+ <div class="note-box-content">
+ <p>One interesting note about layer normalization in tensor parallel training - since each TP rank sees the same activations after the all-gather, the layer norm weights don't actually need an all-reduce to sync their gradients after the backward pass. They naturally stay in sync across ranks. However, for dropout operations, we must make sure to sync the random seed across TP ranks to maintain deterministic behavior.</p>
+ </div>
  </div>
 
  <p>This raises an interesting question - could we extend tensor parallelism to these remaining operations as well? Indeed, it's possible to parallelize layer norm, dropout and other operations too, which we'll explore next.</p>
@@ -1045,9 +1047,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>The term Sequence Parallelism is a bit overloaded: the Sequence Parallelism in this section is tightly coupled to Tensor Parallelism and applies to dropout and layer norm operation. However, when we will move to longer sequences the attention computation will become a bottleneck, which calls for techniques such as Ring-Attention, which are sometimes also called <em>Sequence Parallelism</em> but we’ll refer to them as <em>Context Parallelism</em> to differentiate the two approaches. So each time you see sequence parallelism, remember that it is used together with tensor parallelism (in contrast to context parallelism, which can be used independently).
- </p>
+ <div class="note-box-content">
+ <p>The term Sequence Parallelism is a bit overloaded: the Sequence Parallelism in this section is tightly coupled to Tensor Parallelism and applies to dropout and layer norm operation. However, when we will move to longer sequences the attention computation will become a bottleneck, which calls for techniques such as Ring-Attention, which are sometimes also called <em>Sequence Parallelism</em> but we’ll refer to them as <em>Context Parallelism</em> to differentiate the two approaches. So each time you see sequence parallelism, remember that it is used together with tensor parallelism (in contrast to context parallelism, which can be used independently).</p>
+ </div>
  </div>
 
  <p>Sequence parallelism (SP) involves splitting the activations and computations for the parts of the model not handled by tensor parallelism (TP) such as Dropout and LayerNorm, but along the input sequence dimension rather than across hidden dimension. This is needed because these operations require access to the full hidden dimension to compute correctly. For example, LayerNorm needs the full hidden dimension to compute mean and variance:</p>
@@ -1228,9 +1230,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Since LayerNorms in the SP region operate on different portions of the sequence, their gradients will differ across TP ranks. To ensure the weights stay synchronized, we need to all-reduce their gradients during the backward pass, similar to how DP ensures weights stay in sync. This is a small communication overhead since LayerNorm has relatively few parameters.
- </p>
+ <div class="note-box-content">
+ <p>Since LayerNorms in the SP region operate on different portions of the sequence, their gradients will differ across TP ranks. To ensure the weights stay synchronized, we need to all-reduce their gradients during the backward pass, similar to how DP ensures weights stay in sync. This is a small communication overhead since LayerNorm has relatively few parameters.</p>
+ </div>
  </div>
 
  <p>However, there are two limits to TP and SP: 1) if we scale the sequence length the activation memory will still blow up in the TP region and 2) if the model is too big to fit with TP=8 then we will see a massive slow-down due to the inter-node connectivity.</p>
@@ -1272,9 +1274,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>Context Parallelism shares some conceptual similarities with Flash Attention (see later for more details) - both techniques rely on online softmax computation to reduce memory usage. While Flash Attention focuses on optimizing the attention computation itself on a single GPU, Context Parallelism achieves memory reduction by distributing the sequence across multiple GPUs.
- </p>
+ <div class="note-box-content">
+ <p>Context Parallelism shares some conceptual similarities with Flash Attention (see later for more details) - both techniques rely on online softmax computation to reduce memory usage. While Flash Attention focuses on optimizing the attention computation itself on a single GPU, Context Parallelism achieves memory reduction by distributing the sequence across multiple GPUs.</p>
+ </div>
  </div>
 
  <h3>Discovering Ring Attention</h3>
@@ -1635,9 +1637,9 @@
 
  <div class="note-box">
  <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
- </p>
+ <div class="note-box-content">
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</p>
+ </div>
  </div>
 
 
 
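The notes above about fp32 gradients and fp32 "master weights" imply a simple per-parameter memory budget for BF16 mixed-precision training with Adam. The sketch below works it out; the byte split and the 8B-parameter example are illustrative assumptions, not figures taken from this diff.

# Rough per-parameter memory budget for BF16 mixed-precision training with Adam,
# following the fp32-gradient / master-weights notes above. Byte counts are the
# usual textbook assumptions; the 8B-parameter example is purely illustrative.
def training_memory_gb(n_params: float, fp32_grad_accum: bool = True) -> float:
    bytes_per_param = (
        2      # BF16 parameters
        + 2    # BF16 gradients
        + 4    # FP32 master weights (m_params_fp32)
        + 4    # Adam first moment (FP32)
        + 4    # Adam second moment (FP32)
    )
    if fp32_grad_accum:
        bytes_per_param += 4   # extra FP32 gradient copy kept for stability (as in the note)
    return n_params * bytes_per_param / 1e9

print(training_memory_gb(8e9, fp32_grad_accum=False))  # 128.0 GB
print(training_memory_gb(8e9, fp32_grad_accum=True))   # 160.0 GB, i.e. +4 bytes per parameter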
src/style.css CHANGED
@@ -197,6 +197,10 @@ toggle-icon.collapsed {
  margin-top: 0;
  }
 
+ d-article {
+ font-size: 1.04em;
+ }
+
  @media (min-width: 1200px) {
  d-article {
  /* Ensure d-article does not prevent sticky positioning */
@@ -385,12 +389,14 @@ d-contents nav > ul > li > a:hover {
  margin: 0;
  color: #444444;
  font-weight: 600;
+ font-size: 12px;
  }
 
  .note-box-content {
  margin-top: 0.5rem;
  margin-bottom: 0; /* Ensure no bottom margin */
  color: #24292f;
+ font-size: 12px;
  }
 
  /* For dark mode support */
@@ -405,4 +411,8 @@ d-contents nav > ul > li > a:hover {
  .note-box-content {
  color: #d4d4d4;
  }
- }
+ }
+
+ d-code {
+ font-size: 12px;
+ }