Commit 6ab569e (verified) by nouamanetazi · 1 Parent(s): 2cb31db
assets/images/ep_moe.png ADDED

Git LFS Details

  • SHA256: 3de739419df02726a0fc56a0099cb36db2fbb00f2924c48b08552d452009effc
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
dist/assets/images/ep_moe.png ADDED

Git LFS Details

  • SHA256: 3de739419df02726a0fc56a0099cb36db2fbb00f2924c48b08552d452009effc
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
dist/bibliography.bib CHANGED
@@ -510,4 +510,13 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+}
+@misc{fedus2022switchtransformersscalingtrillion,
+      title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
+      author={William Fedus and Barret Zoph and Noam Shazeer},
+      year={2022},
+      eprint={2101.03961},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2101.03961},
 }
dist/index.html CHANGED
@@ -1088,7 +1088,7 @@
   <tbody>
   <tr>
   <td>Embedding Layer (Row Linear sharded on vocab)</td>
-  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
   <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
   </tr>
   </tbody>
@@ -1447,19 +1447,28 @@
   <h2>Expert parallelism</h2>
   <p>One more <s>thing</s> parallelism.</p>
 
-  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
+  <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
+  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context.</p>
+
+  <p><img alt="ep_moe.png" src="/assets/images/ep_moe.png" /></p>
+  <p>MoE layer from the Switch Transformers paper<d-cite bibtex-key="fedus2022switchtransformersscalingtrillion"></d-cite></p>
+
+
+  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert.</p>
+
+  <p>In practice, EP is typically used in conjunction with another form of parallelism - usually Data Parallelism. This is because EP only affects the MoE layers and doesn't shard the input tokens (unlike Context Parallelism which shards tokens along the sequence length dimension). This means our GPUs would be doing redundant compute for all the non-MoE blocks if we used EP alone. By combining EP with DP, we can efficiently shard both the experts and the input batches across our GPUs, as we can see in the simplified diagram below:</p>
 
   <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
   <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
-
-  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+  <p>But let's not get ahead of ourselves - we've reserved a specific section to talk about interactions between different parallelism strategies, so look forward to that to better understand the previous diagram.</p>
 
-  <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
+  <p>There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead. While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction.</p>
 
   <p>Congratulation reader, with this brief overview of Expert parallelism you have now seen all 5 parallelism strategies to scale model training: </p>
   <ul>
-  <li>Data Parallelism – along the batch dimension including ZeRO</li>
-  <li>Tensor Parallelism - along the hidden-state dimension</li>
+  <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
+  <li>Tensor Parallelism - along the hidden dimension</li>
   <li>Sequence and Context Parallelism - along the sequence dimension</li>
   <li>Pipeline Parallelism - along the model layers</li>
   <li>Expert Parallelism - along the model experts</li>
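The paragraphs added above describe the MoE idea: several independent feedforward "experts" per layer, with a small router deciding which expert each token's hidden state goes through. As a rough illustration only (a minimal top-1, Switch-style router sketched in PyTorch; the class name and all sizes are made up for this example and are not part of the commit):

```python
# Minimal top-1 MoE layer: a router picks one expert per token.
# All names and sizes here are illustrative, not taken from the commit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, ffn_size: int):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        probs = F.softmax(self.router(x), dim=-1)   # [num_tokens, num_experts]
        weight, expert_idx = probs.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                  # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

# Example: 8 tokens of hidden size 16 routed across 4 experts.
layer = ToyMoELayer(hidden_size=16, num_experts=4, ffn_size=64)
y = layer(torch.randn(8, 16))
```

Expert parallelism then amounts to placing each entry of the experts list on a different worker and sending each token's hidden state to the worker that owns its expert, as the added text explains.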
src/bibliography.bib CHANGED
@@ -510,4 +510,13 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+}
+@misc{fedus2022switchtransformersscalingtrillion,
+      title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
+      author={William Fedus and Barret Zoph and Noam Shazeer},
+      year={2022},
+      eprint={2101.03961},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2101.03961},
 }
src/index.html CHANGED
@@ -1088,7 +1088,7 @@
   <tbody>
   <tr>
   <td>Embedding Layer (Row Linear sharded on vocab)</td>
-  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+  <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
   <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
   </tr>
   </tbody>
@@ -1447,19 +1447,28 @@
   <h2>Expert parallelism</h2>
   <p>One more <s>thing</s> parallelism.</p>
 
-  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
+  <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
+  <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context.</p>
+
+  <p><img alt="ep_moe.png" src="/assets/images/ep_moe.png" /></p>
+  <p>MoE layer from the Switch Transformers paper<d-cite bibtex-key="fedus2022switchtransformersscalingtrillion"></d-cite></p>
+
+
+  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert.</p>
+
+  <p>In practice, EP is typically used in conjunction with another form of parallelism - usually Data Parallelism. This is because EP only affects the MoE layers and doesn't shard the input tokens (unlike Context Parallelism which shards tokens along the sequence length dimension). This means our GPUs would be doing redundant compute for all the non-MoE blocks if we used EP alone. By combining EP with DP, we can efficiently shard both the experts and the input batches across our GPUs, as we can see in the simplified diagram below:</p>
 
   <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
   <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
-
-  <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+  <p>But let's not get ahead of ourselves - we've reserved a specific section to talk about interactions between different parallelism strategies, so look forward to that to better understand the previous diagram.</p>
 
-  <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
+  <p>There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead. While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction.</p>
 
   <p>Congratulation reader, with this brief overview of Expert parallelism you have now seen all 5 parallelism strategies to scale model training: </p>
   <ul>
-  <li>Data Parallelism – along the batch dimension including ZeRO</li>
-  <li>Tensor Parallelism - along the hidden-state dimension</li>
+  <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
+  <li>Tensor Parallelism - along the hidden dimension</li>
   <li>Sequence and Context Parallelism - along the sequence dimension</li>
   <li>Pipeline Parallelism - along the model layers</li>
   <li>Expert Parallelism - along the model experts</li>
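The added paragraphs also describe the EP mechanics themselves: each expert's feedforward lives on a different worker, and token hidden states are routed to whichever worker owns their expert, typically alongside data parallelism. Below is a hedged sketch of that dispatch with one expert per rank, built on torch.distributed all-to-all collectives; `router` and `expert` are hypothetical stand-ins, the process group is assumed to be initialized already (e.g. NCCL with CUDA tensors), and this is an illustration of the pattern rather than the implementation referenced in this diff:

```python
# Hedged sketch of expert-parallel dispatch: one expert per rank, tokens are
# sent to the rank that owns their expert and results are sent back.
# Assumes an already-initialized torch.distributed process group whose backend
# supports all-to-all (e.g. NCCL with CUDA tensors); `router` and `expert` are
# hypothetical modules, not objects from this repository.
import torch
import torch.distributed as dist

def expert_parallel_forward(x: torch.Tensor, router, expert) -> torch.Tensor:
    world_size = dist.get_world_size()
    dest = router(x).argmax(dim=-1)                    # destination rank for each local token
    send = [x[dest == r].contiguous() for r in range(world_size)]

    # Exchange per-rank token counts so receive buffers can be sized correctly.
    counts = torch.tensor([t.shape[0] for t in send], device=x.device)
    recv_counts = torch.empty_like(counts)
    dist.all_to_all_single(recv_counts, counts)

    recv = [x.new_empty(int(n), x.shape[-1]) for n in recv_counts.tolist()]
    dist.all_to_all(recv, send)                        # tokens travel to their expert's rank

    processed = [expert(t) if t.shape[0] > 0 else t for t in recv]

    back = [x.new_empty(int(n), x.shape[-1]) for n in counts.tolist()]
    dist.all_to_all(back, processed)                   # results return to the originating ranks

    out = torch.empty_like(x)
    for r in range(world_size):                        # restore the original token order
        out[dest == r] = back[r]
    return out
```

In a top-k variant of this routing, the DeepSeek-V3 constraint mentioned in the added text would additionally cap the number of distinct nodes a token's chosen experts may live on (at most 4 in their case), which is what keeps the cross-node traffic of these all-to-all exchanges in check.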