Commit · 1b676ed
Parent(s): f3c3c8f
assets/images/activation_recomputation.png
ADDED (stored via Git LFS)
assets/images/gradaccumulation_diag.png
ADDED (stored via Git LFS)
src/index.html
CHANGED
@@ -429,7 +429,7 @@
 
 <p>The general idea behind <strong><em>activation recomputation</em></strong> (also called <em>gradient checkpointing</em> or <em>rematerialization</em>) is to discard some activations during the forward pass to save memory, and to spend some extra compute to recompute them on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. FF, LayerNorm, etc.) so that we can use them during the backward pass to compute gradients. With recomputation, we typically store activations only at a few key points along the model architecture, discard the rest, and recompute them on the fly during the backward pass from the nearest saved activations, essentially re-running a sub-part of the forward pass to trade off memory for compute. It generally looks like this:</p>
 
-<p><img alt="image.png" src="/assets/images/
+<p><img alt="image.png" src="/assets/images/activation_recomputation.png" /></p>
 
 <p>There are several strategies to select key activations to store:</p>
 
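For context on the paragraph changed in this hunk: the recomputation pattern it describes maps onto PyTorch's torch.utils.checkpoint API. The sketch below is a minimal illustration only; the module, its sizes, and all names are assumptions, not code from the playbook.

# Minimal sketch of activation recomputation (gradient checkpointing) in PyTorch.
# The block wrapped in torch.utils.checkpoint.checkpoint does not keep its
# intermediate activations; they are recomputed during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):  # illustrative module, not from the playbook
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden, 4 * hidden),
            nn.GELU(),
            nn.Linear(4 * hidden, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the input of this block is saved; the hidden states inside
        # self.ff are discarded and recomputed during the backward pass.
        return checkpoint(self.ff, x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)
model = CheckpointedMLP()
model(x).sum().backward()  # triggers the extra forward recomputation

The memory saved is the hidden states inside the wrapped block; the cost is one extra forward pass over that block during backward, which is exactly the memory-for-compute trade described above.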
@@ -489,7 +489,7 @@
 
 <p>Gradient accumulation allows us to effectively increase our batch size up to infinity (and beyond!) while the memory footprint stays constant. Gradient accumulation is also compatible with activation recomputation for further memory reduction. One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
 
-<p><img alt="image.png" src="/assets/images/
+<p><img alt="image.png" src="/assets/images/gradaccumulation_diag.png" /></p>
 
 <aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients, which persist throughout a training step. Without gradient accumulation, by contrast, gradients are computed in the backward pass while the activation memory is freed, which means a lower peak memory.</aside>
 
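Similarly, the gradient accumulation loop discussed in this hunk can be sketched as below. The toy model, synthetic micro-batches, and the grad_accum_steps value are all illustrative assumptions; dividing the loss by the number of accumulation steps keeps the accumulated gradient equal to the average over the full effective batch.

# Minimal gradient accumulation sketch: gradients from several micro-batches
# accumulate in the .grad buffers before a single optimizer step, so the
# effective batch size grows while per-pass activation memory stays that of
# one micro-batch.
import torch

model = torch.nn.Linear(512, 512)                      # toy model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
grad_accum_steps = 4                                   # illustrative value

# Synthetic micro-batches standing in for a real dataloader (assumption).
loader = [(torch.randn(16, 512), torch.randn(16, 512)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / grad_accum_steps).backward()               # gradients add up across micro-batches
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()                               # one optimizer step per accumulated batch
        optimizer.zero_grad()                          # clear the accumulation buffers

As the aside notes, the .grad buffers persist across the accumulation steps of a training step, which is why peak memory is higher than in the non-accumulated case where gradients are consumed as activations are freed.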