Commit 68cc8e2
Parent(s): e87aa99

some few more changes

- dist/index.html  +17 -3
- src/index.html  +17 -3
dist/index.html  CHANGED

@@ -1484,7 +1484,14 @@
 </script> -->
 <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
 
-<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU!
+<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! which means we don't save any activation memory with this approach.</p>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+</div>
+</div>
 
 <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
 
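The note added in this hunk reduces to a one-line accounting argument. Here is a minimal Python sketch of it, assuming a GPipe-style all-forward-all-backward schedule in which each stage holds activations for PP in-flight micro-batches; the sizes are made up for illustration and do not come from the commit.

# Minimal sketch of the activation-memory argument in the note above
# (GPipe-style all-forward-all-backward schedule, made-up sizes).

def stage_peak_activation_bytes(total_activation_bytes: float, pp: int) -> float:
    per_stage_share = total_activation_bytes / pp  # each GPU holds 1/PP of the layers...
    in_flight_microbatches = pp                    # ...but PP micro-batches before the first backward
    return per_stage_share * in_flight_microbatches

total_activs = 80 * 2**30  # pretend the full model's activations for one batch take 80 GiB

for pp in (1, 2, 4, 8):
    gib = stage_peak_activation_bytes(total_activs, pp) / 2**30
    print(f"PP={pp}: ~{gib:.0f} GiB per GPU")  # every PP prints ~80 GiB: PP * (activs/PP) ≈ activs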
@@ -3557,7 +3564,7 @@
 
 <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>
 
-<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \
+<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>
 
 <li><strong>Total model parameters:</strong> Each transformer block will store:
 <ul>
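A byte-level sketch of the optimizer-state accounting restored in this hunk: it assumes bf16 weights and gradients with FP32 Adam momentum, variance, and master weights, and counts FP32 values in 2-byte (bf16-sized) units so the total lines up with the 6h² figure above; the hidden size and names are illustrative, not from the commit.

# Sketch of the optimizer-state accounting above (illustrative, not from the commit).
# Assumes bf16 weights/gradients and FP32 Adam momentum, variance and master weights.

BF16_BYTES, FP32_BYTES = 2, 4

def per_weight_matrix_bytes(h: int) -> dict:
    n = h * h  # elements in one h x h weight matrix
    return {
        "weights_bf16":  BF16_BYTES * n,
        "grads_bf16":    BF16_BYTES * n,
        "adam_momentum": FP32_BYTES * n,
        "adam_variance": FP32_BYTES * n,
        "master_fp32":   FP32_BYTES * n,
    }

h = 4096  # hypothetical hidden size
mem = per_weight_matrix_bytes(h)
optimizer_bytes = mem["adam_momentum"] + mem["adam_variance"] + mem["master_fp32"]

# 12 * h^2 bytes of optimizer state per weight matrix; counted in 2-byte (bf16-sized)
# units this is the 6 * h^2 figure used in the hunk above.
print(optimizer_bytes, optimizer_bytes // (BF16_BYTES * h * h))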
@@ -3586,6 +3593,13 @@
 </li>
 
 <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math> which accounts for the quadratic scaling from attention operations across the entire sequence, but to simplify the math, we assume that <d-math>seq\_len^2 << h</d-math>.</p>
+</div>
+</div>
 </ul>
 
 <h3>A3: Math for Compute/Communication Overlap</h3>
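To see how much the quadratic attention term in the new note box matters, here is a small sketch that evaluates both estimates for a hypothetical configuration; the layer count, hidden size, and the ~16h² params-per-block approximation are assumptions, not values from the commit.

# Compare the rough 6 * tokens * params estimate with the more precise formula
# from the note box above (all shapes below are hypothetical).

def flops_rough(num_tokens: int, num_params: int) -> float:
    # forward ~ 2 * tokens * params, backward ~ 4 * tokens * params
    return 6 * num_tokens * num_params

def flops_with_attention(seq_len: int, num_params: int, num_layers: int, h: int) -> float:
    # adds the quadratic attention term 12 * num_layers * h * seq_len^2
    return 6 * seq_len * num_params + 12 * num_layers * h * seq_len ** 2

num_layers, h, seq_len = 32, 4096, 4096   # assumed config
num_params = num_layers * 16 * h ** 2     # ~16 h^2 params per decoder layer, as in the t_compute formula below

print(f"rough:          {flops_rough(seq_len, num_params):.3e} FLOPs per sequence")
print(f"with attention: {flops_with_attention(seq_len, num_params, num_layers, h):.3e} FLOPs per sequence")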
@@ -3654,7 +3668,7 @@
 
 <p>The computation time for the forward pass of one decoder layer is:</p>
 <d-math block>
-t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
 </d-math>
 
 <p>For effective overlap between computation and communication, we need:</p>
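Plugging hypothetical numbers into the corrected t_compute formula gives a feel for the scale; the sequence length, micro-batch size, hidden size, and the ~989 TFLOPS bf16 peak assumed for an H100-class GPU are illustrative choices, not part of the commit.

# Evaluate the corrected t_compute formula with assumed values (not from the commit).

def t_compute_seconds(seq_len: int, mbs: int, h: int, peak_flops: float) -> float:
    # 2 * seq_len * mbs * (16 * h^2) / peak_flops = 32 * seq_len * mbs * h^2 / peak_flops
    return 2 * seq_len * mbs * (16 * h ** 2) / peak_flops

# assumed: seq_len 4096, micro-batch size 1, h 4096, ~989e12 FLOPS bf16 peak (H100-class)
t = t_compute_seconds(seq_len=4096, mbs=1, h=4096, peak_flops=989e12)
print(f"{t * 1e3:.2f} ms for one decoder-layer forward pass")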
src/index.html  CHANGED

@@ -1484,7 +1484,14 @@
 </script> -->
 <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
 
-<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU!
+<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! which means we don't save any activation memory with this approach.</p>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+</div>
+</div>
 
 <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
 
@@ -3557,7 +3564,7 @@
 
 <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>
 
-<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \
+<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>
 
 <li><strong>Total model parameters:</strong> Each transformer block will store:
 <ul>
@@ -3586,6 +3593,13 @@
 </li>
 
 <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math> which accounts for the quadratic scaling from attention operations across the entire sequence, but to simplify the math, we assume that <d-math>seq\_len^2 << h</d-math>.</p>
+</div>
+</div>
 </ul>
 
 <h3>A3: Math for Compute/Communication Overlap</h3>
@@ -3654,7 +3668,7 @@
 
 <p>The computation time for the forward pass of one decoder layer is:</p>
 <d-math block>
-t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
 </d-math>
 
 <p>For effective overlap between computation and communication, we need:</p>