app4 #66 by nouamanetazi - opened

- dist/index.html +19 -7
- src/index.html +19 -7
dist/index.html CHANGED
@@ -2728,18 +2728,23 @@
 
     <h3>Training Frameworks</h3>
     <div>
-        <a href="https://github.com/
-        <p>
+        <a href="https://github.com/huggingface/nanotron"><strong>Nanotron</strong></a>
+        <p>Our framework for training large language models featuring various parallelism strategies</p>
     </div>
-
+
     <div>
         <a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
-        <p>NVIDIA's framework for training large language models
+        <p>NVIDIA's framework for training large language models featuring various parallelism strategies.</p>
     </div>
 
     <div>
         <a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
-        <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism
+        <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism strategies.</p>
+    </div>
+
+    <div>
+        <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+        <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
     </div>
 
     <div>
@@ -2932,7 +2937,7 @@
 
     <div>
         <a href="https://www.thonking.ai/"><strong>thonking.ai</strong></a>
-        <p>Some of Horace He's blogposts
+        <p>Some of Horace He's blogposts - Making GPUs go BRRR..</p>
     </div>
 
     <div>
@@ -3546,12 +3551,19 @@
     <li>Gradients = Parameters ≈ <d-math>num\_layers \cdot 16h^2</d-math></li>
 </ul>
 
-<p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time
+<p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time to all-reduce each bucket is:</p>
 
 <d-math block>
     t_{comm} = t_{comm\_bucket} = \frac{bucket\_size \cdot 2(DP-1)}{DP \cdot peak\_bw}
 </d-math>
 
+<div class="note-box">
+    <p class="note-box-title">📝 Note</p>
+    <div class="note-box-content">
+        <p>For bandwidth calculations, we use the bus bandwidth formulas from the <a href="https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#summary">NCCL documentation</a>. These formulas account for the specific communication patterns when calculating effective bandwidth between GPUs.</p>
+    </div>
+</div>
+
 <p>The computation time for backward pass is:</p>
 
 <d-math block>
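The per-bucket all-reduce estimate in the last hunk is easy to sanity-check numerically. Below is a minimal Python sketch of that formula; the data-parallel degree of 8 and the 450 GB/s peak bus bandwidth are illustrative assumptions, not values taken from the PR.

def allreduce_bucket_time(bucket_bytes: float, dp: int, peak_bw: float) -> float:
    """Estimated seconds to all-reduce one gradient bucket.

    Implements t_comm_bucket = bucket_size * 2 * (DP - 1) / (DP * peak_bw);
    the 2*(DP-1)/DP factor is the data-volume multiplier of a ring
    all-reduce (reduce-scatter followed by all-gather).
    """
    return bucket_bytes * 2 * (dp - 1) / (dp * peak_bw)

bucket_bytes = 25e6   # DDP's default gradient bucket of ~25 MB
dp = 8                # assumed data-parallel degree
peak_bw = 450e9       # assumed peak bus bandwidth, bytes/s
print(f"{allreduce_bucket_time(bucket_bytes, dp, peak_bw) * 1e6:.1f} us")  # ~97.2 us

At these assumed numbers each 25 MB bucket costs on the order of 100 µs, which is why DDP overlaps the bucketed all-reduces with the backward computation discussed next.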
src/index.html CHANGED
@@ -2728,18 +2728,23 @@
 
     <h3>Training Frameworks</h3>
    <div>
-        <a href="https://github.com/
-        <p>
+        <a href="https://github.com/huggingface/nanotron"><strong>Nanotron</strong></a>
+        <p>Our framework for training large language models featuring various parallelism strategies</p>
     </div>
-
+
     <div>
         <a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
-        <p>NVIDIA's framework for training large language models
+        <p>NVIDIA's framework for training large language models featuring various parallelism strategies.</p>
     </div>
 
     <div>
         <a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
-        <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism
+        <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism strategies.</p>
+    </div>
+
+    <div>
+        <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+        <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
     </div>
 
     <div>
@@ -2932,7 +2937,7 @@
 
     <div>
         <a href="https://www.thonking.ai/"><strong>thonking.ai</strong></a>
-        <p>Some of Horace He's blogposts
+        <p>Some of Horace He's blogposts - Making GPUs go BRRR..</p>
     </div>
 
     <div>
@@ -3546,12 +3551,19 @@
     <li>Gradients = Parameters ≈ <d-math>num\_layers \cdot 16h^2</d-math></li>
 </ul>
 
-<p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time
+<p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time to all-reduce each bucket is:</p>
 
 <d-math block>
     t_{comm} = t_{comm\_bucket} = \frac{bucket\_size \cdot 2(DP-1)}{DP \cdot peak\_bw}
 </d-math>
 
+<div class="note-box">
+    <p class="note-box-title">📝 Note</p>
+    <div class="note-box-content">
+        <p>For bandwidth calculations, we use the bus bandwidth formulas from the <a href="https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#summary">NCCL documentation</a>. These formulas account for the specific communication patterns when calculating effective bandwidth between GPUs.</p>
+    </div>
+</div>
+
 <p>The computation time for backward pass is:</p>
 
 <d-math block>
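The note box added in both files cites the NCCL performance documentation, which defines bus bandwidth as the algorithm bandwidth (bytes moved divided by time) scaled by a per-collective factor; for all-reduce that factor is 2(n-1)/n. A short sketch of that conversion in Python, with illustrative numbers matching the example above:

def allreduce_bus_bw(data_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth per the NCCL convention: busBw = algBw * 2*(n-1)/n."""
    alg_bw = data_bytes / time_s  # algorithm bandwidth seen by the caller
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

# A 25 MB all-reduce over 8 ranks finishing in ~97.2 us implies ~450 GB/s bus bandwidth.
print(f"{allreduce_bus_bw(25e6, 97.2e-6, 8) / 1e9:.0f} GB/s")

This is the correction the note refers to: quoting bus bandwidth rather than raw algorithm bandwidth makes the measured number comparable to the hardware's peak link bandwidth.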