Commit 6e2c10a (verified), committed by nouamanetazi (HF staff)
Parent(s): b64bcbe
Files changed (2):
  1. dist/index.html (+19 -7)
  2. src/index.html (+19 -7)
dist/index.html CHANGED

@@ -2728,18 +2728,23 @@

  <h3>Training Frameworks</h3>
  <div>
- <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
- <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
+ <a href="https://github.com/huggingface/nanotron"><strong>Nanotron</strong></a>
+ <p>Our framework for training large language models featuring various parallelism strategies</p>
  </div>
-
+
  <div>
  <a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
- <p>NVIDIA's framework for training large language models with model and data parallelism.</p>
+ <p>NVIDIA's framework for training large language models featuring various parallelism strategies.</p>
  </div>

  <div>
  <a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
- <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism techniques.</p>
+ <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism strategies.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+ <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
  </div>

  <div>

@@ -2932,7 +2937,7 @@

  <div>
  <a href="https://www.thonking.ai/"><strong>thonking.ai</strong></a>
- <p>Some of Horace He's blogposts</p>
+ <p>Some of Horace He's blogposts - Making GPUs go BRRR..</p>
  </div>

  <div>

@@ -3546,12 +3551,19 @@
  <li>Gradients = Parameters ≈ <d-math>num\_layers \cdot 16h^2</d-math></li>
  </ul>

- <p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time for each bucket is:</p>
+ <p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time to all-reduce each bucket is:</p>

  <d-math block>
  t_{comm} = t_{comm\_bucket} = \frac{bucket\_size \cdot 2(DP-1)}{DP \cdot peak\_bw}
  </d-math>

+ <div class="note-box">
+ <p class="note-box-title">📝 Note</p>
+ <div class="note-box-content">
+ <p>For bandwidth calculations, we use the bus bandwidth formulas from the <a href="https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#summary">NCCL documentation</a>. These formulas account for the specific communication patterns when calculating effective bandwidth between GPUs.</p>
+ </div>
+ </div>
+
  <p>The computation time for backward pass is:</p>

  <d-math block>
src/index.html CHANGED

@@ -2728,18 +2728,23 @@

  <h3>Training Frameworks</h3>
  <div>
- <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
- <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
+ <a href="https://github.com/huggingface/nanotron"><strong>Nanotron</strong></a>
+ <p>Our framework for training large language models featuring various parallelism strategies</p>
  </div>
-
+
  <div>
  <a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
- <p>NVIDIA's framework for training large language models with model and data parallelism.</p>
+ <p>NVIDIA's framework for training large language models featuring various parallelism strategies.</p>
  </div>

  <div>
  <a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
- <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism techniques.</p>
+ <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism strategies.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+ <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
  </div>

  <div>

@@ -2932,7 +2937,7 @@

  <div>
  <a href="https://www.thonking.ai/"><strong>thonking.ai</strong></a>
- <p>Some of Horace He's blogposts</p>
+ <p>Some of Horace He's blogposts - Making GPUs go BRRR..</p>
  </div>

  <div>

@@ -3546,12 +3551,19 @@
  <li>Gradients = Parameters ≈ <d-math>num\_layers \cdot 16h^2</d-math></li>
  </ul>

- <p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time for each bucket is:</p>
+ <p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time to all-reduce each bucket is:</p>

  <d-math block>
  t_{comm} = t_{comm\_bucket} = \frac{bucket\_size \cdot 2(DP-1)}{DP \cdot peak\_bw}
  </d-math>

+ <div class="note-box">
+ <p class="note-box-title">📝 Note</p>
+ <div class="note-box-content">
+ <p>For bandwidth calculations, we use the bus bandwidth formulas from the <a href="https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#summary">NCCL documentation</a>. These formulas account for the specific communication patterns when calculating effective bandwidth between GPUs.</p>
+ </div>
+ </div>
+
  <p>The computation time for backward pass is:</p>

  <d-math block>
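As a sanity check on the per-bucket all-reduce formula touched by this commit, here is a minimal Python sketch that evaluates t_comm for a few data-parallel sizes. The function name, the 25 MB bucket conversion, and the ~450 GB/s peak bus bandwidth figure are illustrative assumptions, not values from the commit; only the 2(DP-1)/DP ring factor follows the formula and the linked NCCL performance notes.

# Minimal sketch (not part of the commit) of the bucket all-reduce time estimate:
#   t_comm = bucket_size * 2 * (DP - 1) / (DP * peak_bw)
# The 2(DP-1)/DP factor is the ring all-reduce correction from the linked NCCL
# performance notes; the bandwidth figure below is an assumption.

def allreduce_bucket_time(bucket_size_bytes: float, dp: int, peak_bw_bytes_per_s: float) -> float:
    """Estimated seconds to all-reduce one gradient bucket across dp data-parallel ranks."""
    if dp < 2:
        return 0.0  # a single rank has nothing to communicate
    return bucket_size_bytes * 2 * (dp - 1) / (dp * peak_bw_bytes_per_s)

if __name__ == "__main__":
    bucket = 25 * 1024 ** 2   # default 25 MB gradient bucket
    peak_bw = 450e9           # assumed ~450 GB/s peak bus bandwidth (illustrative)
    for dp in (2, 8, 64, 512):
        t = allreduce_bucket_time(bucket, dp, peak_bw)
        print(f"DP={dp:<4d}  t_comm per bucket ≈ {t * 1e6:.1f} µs")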