Commit 6ab569e (verified) by nouamanetazi · 1 Parent(s): 2cb31db
assets/images/ep_moe.png ADDED

Git LFS Details

  • SHA256: 3de739419df02726a0fc56a0099cb36db2fbb00f2924c48b08552d452009effc
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
dist/assets/images/ep_moe.png ADDED

Git LFS Details

  • SHA256: 3de739419df02726a0fc56a0099cb36db2fbb00f2924c48b08552d452009effc
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
dist/bibliography.bib CHANGED
@@ -510,4 +510,13 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+}
+@misc{fedus2022switchtransformersscalingtrillion,
+      title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
+      author={William Fedus and Barret Zoph and Noam Shazeer},
+      year={2022},
+      eprint={2101.03961},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2101.03961},
 }
dist/index.html CHANGED
@@ -1088,7 +1088,7 @@
   <tbody>
   <tr>
   <td>Embedding Layer (Row Linear sharded on vocab)</td>
-  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
   <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
   </tr>
   </tbody>
@@ -1447,19 +1447,28 @@
   <h2>Expert parallelism</h2>
   <p>One more <s>thing</s> parallelism.</p>
 
-  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
+  <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
+  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context.</p>
+
+  <p><img alt="ep_moe.png" src="/assets/images/ep_moe.png" /></p>
+  <p>MoE layer from the Switch Transformers paper<d-cite bibtex-key="fedus2022switchtransformersscalingtrillion"></d-cite></p>
+
+
+  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert.</p>
+
+  <p>In practice, EP is typically used in conjunction with another form of parallelism - usually Data Parallelism. This is because EP only affects the MoE layers and doesn't shard the input tokens (unlike Context Parallelism which shards tokens along the sequence length dimension). This means our GPUs would be doing redundant compute for all the non-MoE blocks if we used EP alone. By combining EP with DP, we can efficiently shard both the experts and the input batches across our GPUs, as we can see in the simplified diagram below:</p>
 
   <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
   <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
-
-  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+  <p>But let's not get ahead of ourselves - we've reserved a specific section to talk about interactions between different parallelism strategies, so look forward to that to better understand the previous diagram.</p>
 
-  <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
+  <p>There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead. While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction.</p>
 
   <p>Congratulation reader, with this brief overview of Expert parallelism you have now seen all 5 parallelism strategies to scale model training: </p>
   <ul>
-  <li>Data Parallelism – along the batch dimension including ZeRO</li>
-  <li>Tensor Parallelism - along the hidden-state dimension</li>
+  <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
+  <li>Tensor Parallelism - along the hidden dimension</li>
   <li>Sequence and Context Parallelism - along the sequence dimension</li>
   <li>Pipeline Parallelism - along the model layers</li>
   <li>Expert Parallelism - along the model experts</li>
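The paragraphs added above describe the MoE idea: several independent feedforward "experts" per layer, with a small router deciding which expert each token's hidden state goes through. As a rough illustration only (a minimal top-1, Switch-style router sketched in PyTorch; the class name and all sizes are made up for this example and are not part of the commit):

```python
# Minimal top-1 MoE layer: a router picks one expert per token.
# All names and sizes here are illustrative, not taken from the commit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, ffn_size: int):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        probs = F.softmax(self.router(x), dim=-1)   # [num_tokens, num_experts]
        weight, expert_idx = probs.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                  # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

# Example: 8 tokens of hidden size 16 routed across 4 experts.
layer = ToyMoELayer(hidden_size=16, num_experts=4, ffn_size=64)
y = layer(torch.randn(8, 16))
```

Expert parallelism then amounts to placing each entry of the experts list on a different worker and sending each token's hidden state to the worker that owns its expert, as the added text explains.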
src/bibliography.bib CHANGED
@@ -510,4 +510,13 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+}
+@misc{fedus2022switchtransformersscalingtrillion,
+      title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
+      author={William Fedus and Barret Zoph and Noam Shazeer},
+      year={2022},
+      eprint={2101.03961},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2101.03961},
 }
src/index.html CHANGED
@@ -1088,7 +1088,7 @@
   <tbody>
   <tr>
   <td>Embedding Layer (Row Linear sharded on vocab)</td>
-  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
   <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
   </tr>
   </tbody>
@@ -1447,19 +1447,28 @@
   <h2>Expert parallelism</h2>
   <p>One more <s>thing</s> parallelism.</p>
 
-  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
+  <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
+  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context.</p>
+
+  <p><img alt="ep_moe.png" src="/assets/images/ep_moe.png" /></p>
+  <p>MoE layer from the Switch Transformers paper<d-cite bibtex-key="fedus2022switchtransformersscalingtrillion"></d-cite></p>
+
+
+  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert.</p>
+
+  <p>In practice, EP is typically used in conjunction with another form of parallelism - usually Data Parallelism. This is because EP only affects the MoE layers and doesn't shard the input tokens (unlike Context Parallelism which shards tokens along the sequence length dimension). This means our GPUs would be doing redundant compute for all the non-MoE blocks if we used EP alone. By combining EP with DP, we can efficiently shard both the experts and the input batches across our GPUs, as we can see in the simplified diagram below:</p>
 
   <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
   <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
-
-  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+  <p>But let's not get ahead of ourselves - we've reserved a specific section to talk about interactions between different parallelism strategies, so look forward to that to better understand the previous diagram.</p>
 
-  <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
+  <p>There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead. While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction.</p>
 
   <p>Congratulation reader, with this brief overview of Expert parallelism you have now seen all 5 parallelism strategies to scale model training: </p>
   <ul>
-  <li>Data Parallelism – along the batch dimension including ZeRO</li>
-  <li>Tensor Parallelism - along the hidden-state dimension</li>
+  <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
+  <li>Tensor Parallelism - along the hidden dimension</li>
   <li>Sequence and Context Parallelism - along the sequence dimension</li>
   <li>Pipeline Parallelism - along the model layers</li>
   <li>Expert Parallelism - along the model experts</li>
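The added paragraphs also describe the EP mechanics themselves: each expert's feedforward lives on a different worker, and token hidden states are routed to whichever worker owns their expert, typically alongside data parallelism. Below is a hedged sketch of that dispatch with one expert per rank, built on torch.distributed all-to-all collectives; `router` and `expert` are hypothetical stand-ins, the process group is assumed to be initialized already (e.g. NCCL with CUDA tensors), and this is an illustration of the pattern rather than the implementation referenced in this diff:

```python
# Hedged sketch of expert-parallel dispatch: one expert per rank, tokens are
# sent to the rank that owns their expert and results are sent back.
# Assumes an already-initialized torch.distributed process group whose backend
# supports all-to-all (e.g. NCCL with CUDA tensors); `router` and `expert` are
# hypothetical modules, not objects from this repository.
import torch
import torch.distributed as dist

def expert_parallel_forward(x: torch.Tensor, router, expert) -> torch.Tensor:
    world_size = dist.get_world_size()
    dest = router(x).argmax(dim=-1)                    # destination rank for each local token
    send = [x[dest == r].contiguous() for r in range(world_size)]

    # Exchange per-rank token counts so receive buffers can be sized correctly.
    counts = torch.tensor([t.shape[0] for t in send], device=x.device)
    recv_counts = torch.empty_like(counts)
    dist.all_to_all_single(recv_counts, counts)

    recv = [x.new_empty(int(n), x.shape[-1]) for n in recv_counts.tolist()]
    dist.all_to_all(recv, send)                        # tokens travel to their expert's rank

    processed = [expert(t) if t.shape[0] > 0 else t for t in recv]

    back = [x.new_empty(int(n), x.shape[-1]) for n in counts.tolist()]
    dist.all_to_all(back, processed)                   # results return to the originating ranks

    out = torch.empty_like(x)
    for r in range(world_size):                        # restore the original token order
        out[dest == r] = back[r]
    return out
```

In a top-k variant of this routing, the DeepSeek-V3 constraint mentioned in the added text would additionally cap the number of distinct nodes a token's chosen experts may live on (at most 4 in their case), which is what keeps the cross-node traffic of these all-to-all exchanges in check.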