Commit · 1b676ed
Parent(s): f3c3c8f
assets/images/activation_recomputation.png
ADDED (stored via Git LFS)
assets/images/gradaccumulation_diag.png
ADDED (stored via Git LFS)
src/index.html
CHANGED
@@ -429,7 +429,7 @@
 
 <p>The general idea behind <strong><em>activation recomputation</em></strong> (also called <em>gradient checkpointing</em> or <em>rematerialization</em>) is to discard some activations during the forward pass to save memory, and to spend some extra compute to recompute them on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. FF, LayerNorm, etc.) so that we can use them during the backward pass to compute gradients. With recomputation, we typically store activations only at a few key points along the model architecture, discard the rest, and recompute them on the fly during the backward pass from the nearest saved activations, essentially re-running a sub-part of the forward pass to trade off memory for compute. It generally looks like this:</p>
 
-<p><img alt="image.png" src="/assets/images/
+<p><img alt="image.png" src="/assets/images/activation_recomputation.png" /></p>
 
 <p>There are several strategies to select key activations to store:</p>
 
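For context on the paragraph changed in this hunk: the recomputation pattern it describes maps onto PyTorch's torch.utils.checkpoint API. The sketch below is a minimal illustration only; the module, its sizes, and all names are assumptions, not code from the playbook.

# Minimal sketch of activation recomputation (gradient checkpointing) in PyTorch.
# The block wrapped in torch.utils.checkpoint.checkpoint does not keep its
# intermediate activations; they are recomputed during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):  # illustrative module, not from the playbook
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden, 4 * hidden),
            nn.GELU(),
            nn.Linear(4 * hidden, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the input of this block is saved; the hidden states inside
        # self.ff are discarded and recomputed during the backward pass.
        return checkpoint(self.ff, x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)
model = CheckpointedMLP()
model(x).sum().backward()  # triggers the extra forward recomputation

The memory saved is the hidden states inside the wrapped block; the cost is one extra forward pass over that block during backward, which is exactly the memory-for-compute trade described above.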
@@ -489,7 +489,7 @@
 
 <p>Gradient accumulation allows us to effectively increase our batch size up to infinity (and beyond!) while the memory footprint stays constant. Gradient accumulation is also compatible with activation recomputation for further memory reduction. One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
 
-<p><img alt="image.png" src="/assets/images/
+<p><img alt="image.png" src="/assets/images/gradaccumulation_diag.png" /></p>
 
 <aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients, which persist throughout a training step. Without gradient accumulation, by contrast, gradients are computed in the backward pass while the activation memory is freed, which means a lower peak memory.</aside>
 
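Similarly, the gradient accumulation loop discussed in this hunk can be sketched as below. The toy model, synthetic micro-batches, and the grad_accum_steps value are all illustrative assumptions; dividing the loss by the number of accumulation steps keeps the accumulated gradient equal to the average over the full effective batch.

# Minimal gradient accumulation sketch: gradients from several micro-batches
# accumulate in the .grad buffers before a single optimizer step, so the
# effective batch size grows while per-pass activation memory stays that of
# one micro-batch.
import torch

model = torch.nn.Linear(512, 512)                      # toy model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
grad_accum_steps = 4                                   # illustrative value

# Synthetic micro-batches standing in for a real dataloader (assumption).
loader = [(torch.randn(16, 512), torch.randn(16, 512)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / grad_accum_steps).backward()               # gradients add up across micro-batches
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()                               # one optimizer step per accumulated batch
        optimizer.zero_grad()                          # clear the accumulation buffers

As the aside notes, the .grad buffers persist across the accumulation steps of a training step, which is why peak memory is higher than in the non-accumulated case where gradients are consumed as activations are freed.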