Commit 68cc8e2
Parent(s): e87aa99

some few more changes

- dist/index.html  +17 -3
- src/index.html  +17 -3
dist/index.html  CHANGED

@@ -1484,7 +1484,14 @@
 </script> -->
 <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
 
-<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU!
+<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! which means we don't save any activation memory with this approach.</p>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+</div>
+</div>
 
 <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
 
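The note added in this hunk reduces to a one-line accounting argument. Here is a minimal Python sketch of it, assuming a GPipe-style all-forward-all-backward schedule in which each stage holds activations for PP in-flight micro-batches; the sizes are made up for illustration and do not come from the commit.

# Minimal sketch of the activation-memory argument in the note above
# (GPipe-style all-forward-all-backward schedule, made-up sizes).

def stage_peak_activation_bytes(total_activation_bytes: float, pp: int) -> float:
    per_stage_share = total_activation_bytes / pp  # each GPU holds 1/PP of the layers...
    in_flight_microbatches = pp                    # ...but PP micro-batches before the first backward
    return per_stage_share * in_flight_microbatches

total_activs = 80 * 2**30  # pretend the full model's activations for one batch take 80 GiB

for pp in (1, 2, 4, 8):
    gib = stage_peak_activation_bytes(total_activs, pp) / 2**30
    print(f"PP={pp}: ~{gib:.0f} GiB per GPU")  # every PP prints ~80 GiB: PP * (activs/PP) ≈ activs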
@@ -3557,7 +3564,7 @@
 
 <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>
 
-<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \
+<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>
 
 <li><strong>Total model parameters:</strong> Each transformer block will store:
 <ul>
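A byte-level sketch of the optimizer-state accounting restored in this hunk: it assumes bf16 weights and gradients with FP32 Adam momentum, variance, and master weights, and counts FP32 values in 2-byte (bf16-sized) units so the total lines up with the 6h² figure above; the hidden size and names are illustrative, not from the commit.

# Sketch of the optimizer-state accounting above (illustrative, not from the commit).
# Assumes bf16 weights/gradients and FP32 Adam momentum, variance and master weights.

BF16_BYTES, FP32_BYTES = 2, 4

def per_weight_matrix_bytes(h: int) -> dict:
    n = h * h  # elements in one h x h weight matrix
    return {
        "weights_bf16":  BF16_BYTES * n,
        "grads_bf16":    BF16_BYTES * n,
        "adam_momentum": FP32_BYTES * n,
        "adam_variance": FP32_BYTES * n,
        "master_fp32":   FP32_BYTES * n,
    }

h = 4096  # hypothetical hidden size
mem = per_weight_matrix_bytes(h)
optimizer_bytes = mem["adam_momentum"] + mem["adam_variance"] + mem["master_fp32"]

# 12 * h^2 bytes of optimizer state per weight matrix; counted in 2-byte (bf16-sized)
# units this is the 6 * h^2 figure used in the hunk above.
print(optimizer_bytes, optimizer_bytes // (BF16_BYTES * h * h))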
@@ -3586,6 +3593,13 @@
 </li>
 
 <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math> which accounts for the quadratic scaling from attention operations across the entire sequence, but to simplify the math, we assume that <d-math>seq\_len^2 << h</d-math>.</p>
+</div>
+</div>
 </ul>
 
 <h3>A3: Math for Compute/Communication Overlap</h3>
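To see how much the quadratic attention term in the new note box matters, here is a small sketch that evaluates both estimates for a hypothetical configuration; the layer count, hidden size, and the ~16h² params-per-block approximation are assumptions, not values from the commit.

# Compare the rough 6 * tokens * params estimate with the more precise formula
# from the note box above (all shapes below are hypothetical).

def flops_rough(num_tokens: int, num_params: int) -> float:
    # forward ~ 2 * tokens * params, backward ~ 4 * tokens * params
    return 6 * num_tokens * num_params

def flops_with_attention(seq_len: int, num_params: int, num_layers: int, h: int) -> float:
    # adds the quadratic attention term 12 * num_layers * h * seq_len^2
    return 6 * seq_len * num_params + 12 * num_layers * h * seq_len ** 2

num_layers, h, seq_len = 32, 4096, 4096   # assumed config
num_params = num_layers * 16 * h ** 2     # ~16 h^2 params per decoder layer, as in the t_compute formula below

print(f"rough:          {flops_rough(seq_len, num_params):.3e} FLOPs per sequence")
print(f"with attention: {flops_with_attention(seq_len, num_params, num_layers, h):.3e} FLOPs per sequence")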
@@ -3654,7 +3668,7 @@
 
 <p>The computation time for the forward pass of one decoder layer is:</p>
 <d-math block>
-t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
 </d-math>
 
 <p>For effective overlap between computation and communication, we need:</p>
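Plugging hypothetical numbers into the corrected t_compute formula gives a feel for the scale; the sequence length, micro-batch size, hidden size, and the ~989 TFLOPS bf16 peak assumed for an H100-class GPU are illustrative choices, not part of the commit.

# Evaluate the corrected t_compute formula with assumed values (not from the commit).

def t_compute_seconds(seq_len: int, mbs: int, h: int, peak_flops: float) -> float:
    # 2 * seq_len * mbs * (16 * h^2) / peak_flops = 32 * seq_len * mbs * h^2 / peak_flops
    return 2 * seq_len * mbs * (16 * h ** 2) / peak_flops

# assumed: seq_len 4096, micro-batch size 1, h 4096, ~989e12 FLOPS bf16 peak (H100-class)
t = t_compute_seconds(seq_len=4096, mbs=1, h=4096, peak_flops=989e12)
print(f"{t * 1e3:.2f} ms for one decoder-layer forward pass")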
src/index.html  CHANGED

@@ -1484,7 +1484,14 @@
 </script> -->
 <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
 
-<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU!
+<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! which means we don't save any activation memory with this approach.</p>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+</div>
+</div>
 
 <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
 
@@ -3557,7 +3564,7 @@
 
 <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>
 
-<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \
+<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>
 
 <li><strong>Total model parameters:</strong> Each transformer block will store:
 <ul>
@@ -3586,6 +3593,13 @@
 </li>
 
 <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math> which accounts for the quadratic scaling from attention operations across the entire sequence, but to simplify the math, we assume that <d-math>seq\_len^2 << h</d-math>.</p>
+</div>
+</div>
 </ul>
 
 <h3>A3: Math for Compute/Communication Overlap</h3>
@@ -3654,7 +3668,7 @@
 
 <p>The computation time for the forward pass of one decoder layer is:</p>
 <d-math block>
-t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
 </d-math>
 
 <p>For effective overlap between computation and communication, we need:</p>