xrsrke/add_interactive_fp8_loss_curve

#43 opened by neuralink (HF staff)
assets/data/fp8/.DS_Store ADDED
Binary file (6.15 kB). View file
 
assets/data/fp8/fp8_training_loss_curves.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/data/fp8/fp8_training_loss_curves.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -2275,6 +2275,8 @@
 
 <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
 
+<iframe class="l-body-outset" id="plotFP8Loss" src="assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
+
 <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
 
 <p><img alt="image.png" src="/assets/images/fp8_diagram.png" /></p>
src/index.html CHANGED
@@ -2275,6 +2275,8 @@
 
 <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
 
+<iframe class="l-body-outset" id="plotFP8Loss" src="/assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
+
 <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
 
 <p><img alt="image.png" src="/assets/images/fp8_diagram.png" /></p>
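
For readers of this PR, a minimal sketch of the idea in the paragraph the new iframe sits above: keep a higher-precision master copy of the weights and cast to FP8 only for the compute. This is purely illustrative (per-tensor scaling plus a dequantized BF16 matmul), assumes PyTorch >= 2.1 for the `torch.float8_e4m3fn` dtype, and is not DeepSeek-V3's or nanotron's actual FP8 recipe, which uses FP8 GEMM kernels and finer-grained scaling.

```python
import torch

# FP32 "master" copy of a weight, as in BF16/FP8 mixed-precision schemes
master_weight = torch.randn(1024, 1024, dtype=torch.float32)
x = torch.randn(16, 1024, dtype=torch.bfloat16)

# Per-tensor scale so values fit E4M3's narrow range (max normal value ~448)
scale = master_weight.abs().max() / 448.0
w_fp8 = (master_weight / scale).to(torch.float8_e4m3fn)  # low-precision copy used for compute

# Real FP8 kernels consume w_fp8 and scale directly; here we dequantize to BF16
# just to show that the master weight itself never leaves FP32.
y = x @ (w_fp8.to(torch.bfloat16) * scale.to(torch.bfloat16)).t()
print(y.shape)  # torch.Size([16, 1024])
```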