nouamanetazi (HF staff) committed
Commit 1b676ed · 1 parent: f3c3c8f
assets/images/activation_recomputation.png ADDED

Git LFS Details

  • SHA256: 322496303f8133466e128f152e8cb2248bc2a0d5665a57b7894d80048612e64f
  • Pointer size: 130 Bytes
  • Size of remote file: 74.5 kB
assets/images/gradaccumulation_diag.png ADDED

Git LFS Details

  • SHA256: 0a7acb4c1e4832272beb247588f2a154a703d5b6f468b5e0b7dcffbcda41bbdc
  • Pointer size: 131 Bytes
  • Size of remote file: 116 kB
src/index.html CHANGED
@@ -429,7 +429,7 @@
 
    <p>The general idea behind <strong><em>activation recomputation</em></strong> – also called <em>gradient checkpointing</em> or <em>rematerialization</em> – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute them on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. FF, LayerNorm, etc.) so that we can use them during the backward pass to compute gradients. When we use recomputation, we typically store activations only at a few key points along the model architecture, discard the rest, and recompute them on the fly during the backward pass from the nearest saved activations, essentially performing a sub-part of the forward pass again to trade off memory for compute. It generally looks like this:</p>
 
-   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+   <p><img alt="image.png" src="/assets/images/activation_recomputation.png" /></p>
 
    <p>There are several strategies to select key activations to store:</p>
 
@@ -489,7 +489,7 @@
 
    <p>Gradient accumulation allows us to effectively increase our batch size up to infinity (and beyond!) while the memory footprint stays constant. Gradient accumulation is also compatible with activation recomputation for further memory reduction. One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
 
-   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+   <p><img alt="image.png" src="/assets/images/gradaccumulation_diag.png" /></p>
 
    <aside>Using gradient accumulation means we need to keep buffers in which gradients are accumulated and which persist throughout a training step. Without gradient accumulation, gradients are computed in the backward pass while activation memory is freed along the way, which means a lower peak memory.</aside>
 
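
To make the recomputation idea from the first hunk concrete, here is a minimal PyTorch sketch built on torch.utils.checkpoint. It is not the code behind src/index.html; the ToyBlock/ToyModel names and sizes are made up for illustration. Only the input of each checkpointed block is saved during the forward pass; everything computed inside the block is discarded and recomputed when backward reaches it.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    """A small LayerNorm + FF block standing in for a transformer sub-layer."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))


class ToyModel(nn.Module):
    def __init__(self, dim=256, n_blocks=8, recompute=True):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_blocks))
        self.recompute = recompute

    def forward(self, x):
        for block in self.blocks:
            if self.recompute and self.training:
                # Only the block input is stored; activations inside the block
                # are discarded and recomputed during the backward pass.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


model = ToyModel()
x = torch.randn(4, 128, 256)   # (batch, sequence, hidden)
loss = model(x).pow(2).mean()
loss.backward()                # backward triggers the recomputation

Checkpointing every block as above costs roughly one extra forward pass per training step; the selective strategies listed after the image keep more activations and recompute less.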
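
Likewise, the gradient accumulation diagram added in the second hunk corresponds to a training loop of roughly this shape. This is a minimal sketch with a made-up model, data, and the assumed name accum_steps, not the article's actual loop: gradients from several micro-batch backward passes are summed into the parameter .grad buffers, and the optimizer steps only once per accumulation window.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

accum_steps = 4    # one optimizer step per accum_steps micro-batches
micro_batches = [(torch.randn(8, 32), torch.randn(8, 1)) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):
    loss = loss_fn(model(inputs), targets)
    # Divide by accum_steps so the accumulated gradient equals the average over
    # the whole global batch rather than the sum of micro-batch gradients.
    (loss / accum_steps).backward()   # adds into the persistent .grad buffers

    if (step + 1) % accum_steps == 0:
        optimizer.step()       # single optimizer update for the accumulated gradients
        optimizer.zero_grad()  # clear the accumulation buffers

The .grad buffers filled by the repeated backward() calls are the persistent accumulation buffers mentioned in the aside, which is why peak memory is somewhat higher than in a plain single forward/backward step.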