nouamanetazi (HF staff) committed on
Commit 0a5bbcd · 1 Parent(s): 1d7bb53
assets/images/ep_moe.png ADDED

Git LFS Details

  • SHA256: 3de739419df02726a0fc56a0099cb36db2fbb00f2924c48b08552d452009effc
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
dist/assets/images/ep_moe.png ADDED

Git LFS Details

  • SHA256: 3de739419df02726a0fc56a0099cb36db2fbb00f2924c48b08552d452009effc
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
dist/bibliography.bib CHANGED
@@ -510,4 +510,13 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
-}
+}
+@misc{fedus2022switchtransformersscalingtrillion,
+  title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
+  author={William Fedus and Barret Zoph and Noam Shazeer},
+  year={2022},
+  eprint={2101.03961},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2101.03961},
+}
dist/index.html CHANGED
@@ -1438,19 +1438,26 @@
 
     <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
 
-    <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
+    <p>Mixture-of-experts models have gained traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> and, more recently, DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context.</p>
+
+    <p><img alt="ep_moe.png" src="/assets/images/ep_moe.png" /></p>
+    <p>MoE layer from the Switch Transformers paper<d-cite bibtex-key="fedus2022switchtransformersscalingtrillion"></d-cite></p>
+
+
+    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent, we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight: we don't need to split the matrix multiplication, we just need to route the hidden states of each token to the right expert.</p>
+
+    <p>In practice, EP is typically used in conjunction with another form of parallelism, usually Data Parallelism. This is because EP only affects the MoE layers and doesn't shard the input tokens (unlike Context Parallelism, which shards tokens along the sequence dimension). This means our GPUs would be doing redundant compute for all the non-MoE blocks if we used EP alone. By combining EP with DP, we can efficiently shard both the experts and the input batches across our GPUs, as we can see in the simplified diagram below:</p>
 
     <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
     <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite></p>
-
-    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+    <p>But let's not get ahead of ourselves - we've reserved a dedicated section for the interactions between different parallelism strategies, which will make the diagram above clearer.</p>
 
-    <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction.</p>
+    <p>There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead. While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite>, it is only now gaining real momentum as MoE architectures become more popular.</p>
 
     <p>Congratulations, reader - with this brief overview of Expert parallelism you have now seen all 5 parallelism strategies to scale model training:</p>
     <ul>
-        <li>Data Parallelism – along the batch dimension including ZeRO</li>
-        <li>Tensor Parallelism - along the hidden-state dimension</li>
+        <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
+        <li>Tensor Parallelism - along the hidden dimension</li>
         <li>Sequence and Context Parallelism - along the sequence dimension</li>
         <li>Pipeline Parallelism - along the model layers</li>
         <li>Expert Parallelism - along the model experts</li>
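To make the routing idea in the new paragraphs above concrete, here is a minimal sketch of a Switch-style MoE layer with top-1 routing, in the spirit of the Switch Transformers figure added by this commit. It is an illustration only: the class and argument names (`SimpleMoELayer`, `num_experts`, ...) are made up for this example and do not come from the blog post or its codebase.

```python
# Toy Switch-style MoE layer with top-1 routing (illustrative sketch, not the article's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int):
        super().__init__()
        # The router produces one logit per expert for every token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is an ordinary feedforward block; only the routing makes the layer sparse.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_dim], with batch and sequence dimensions already flattened.
        probs = F.softmax(self.router(x), dim=-1)   # [num_tokens, num_experts]
        gate, expert_idx = probs.max(dim=-1)        # top-1: each token picks a single expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                  # tokens routed to expert e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Each token only runs through one expert's feedforward block, which is what later makes it natural to place different experts on different workers.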
src/bibliography.bib CHANGED
@@ -510,4 +510,13 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
-}
+}
+@misc{fedus2022switchtransformersscalingtrillion,
+  title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
+  author={William Fedus and Barret Zoph and Noam Shazeer},
+  year={2022},
+  eprint={2101.03961},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2101.03961},
+}
src/index.html CHANGED
@@ -1438,19 +1438,26 @@
 
     <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
 
-    <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
+    <p>Mixture-of-experts models have gained traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> and, more recently, DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context.</p>
+
+    <p><img alt="ep_moe.png" src="/assets/images/ep_moe.png" /></p>
+    <p>MoE layer from the Switch Transformers paper<d-cite bibtex-key="fedus2022switchtransformersscalingtrillion"></d-cite></p>
+
+
+    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent, we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight: we don't need to split the matrix multiplication, we just need to route the hidden states of each token to the right expert.</p>
+
+    <p>In practice, EP is typically used in conjunction with another form of parallelism, usually Data Parallelism. This is because EP only affects the MoE layers and doesn't shard the input tokens (unlike Context Parallelism, which shards tokens along the sequence dimension). This means our GPUs would be doing redundant compute for all the non-MoE blocks if we used EP alone. By combining EP with DP, we can efficiently shard both the experts and the input batches across our GPUs, as we can see in the simplified diagram below:</p>
 
     <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
     <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite></p>
-
-    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+    <p>But let's not get ahead of ourselves - we've reserved a dedicated section for the interactions between different parallelism strategies, which will make the diagram above clearer.</p>
 
-    <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction.</p>
+    <p>There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead. While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite>, it is only now gaining real momentum as MoE architectures become more popular.</p>
 
     <p>Congratulations, reader - with this brief overview of Expert parallelism you have now seen all 5 parallelism strategies to scale model training:</p>
     <ul>
-        <li>Data Parallelism – along the batch dimension including ZeRO</li>
-        <li>Tensor Parallelism - along the hidden-state dimension</li>
+        <li>Data Parallelism – along the batch dimension (including ZeRO)</li>
+        <li>Tensor Parallelism - along the hidden dimension</li>
         <li>Sequence and Context Parallelism - along the sequence dimension</li>
         <li>Pipeline Parallelism - along the model layers</li>
        <li>Expert Parallelism - along the model experts</li>
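In Expert Parallelism, the "route the hidden states of a token to the right expert" step described in the diff above is typically implemented with an all-to-all exchange over the expert-parallel process group. Below is a hedged sketch of that dispatch, assuming one expert per EP rank, tensors already on the device required by the communication backend, and an initialized `torch.distributed` process group; the helper name `dispatch_to_experts` and its return values are illustrative, not the API of nanotron or any other framework.

```python
# Illustrative expert-parallel token dispatch via all-to-all (one expert per EP rank assumed).
import torch
import torch.distributed as dist

def dispatch_to_experts(hidden_states: torch.Tensor, expert_idx: torch.Tensor, ep_group):
    # hidden_states: [num_local_tokens, hidden_dim]; expert_idx: [num_local_tokens] destination expert/rank.
    ep_size = dist.get_world_size(ep_group)
    # Sort local tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(expert_idx)
    sorted_tokens = hidden_states[order]
    # How many tokens we send to each EP rank.
    input_splits = torch.bincount(expert_idx, minlength=ep_size)
    # Exchange split sizes so every rank knows how many tokens it will receive from each peer.
    output_splits = torch.empty_like(input_splits)
    dist.all_to_all_single(output_splits, input_splits, group=ep_group)
    # Exchange the tokens themselves.
    recv = sorted_tokens.new_empty((int(output_splits.sum()), hidden_states.shape[-1]))
    dist.all_to_all_single(
        recv, sorted_tokens,
        output_split_sizes=output_splits.tolist(),
        input_split_sizes=input_splits.tolist(),
        group=ep_group,
    )
    # recv now holds every token routed to this rank's expert; order and the split sizes are kept
    # so the expert outputs can be sent back with a mirrored all-to-all and the sort undone.
    return recv, order, input_splits, output_splits
```

After the local expert's feedforward runs on `recv`, a second all-to-all with the split sizes swapped returns the results to the ranks that own the original tokens; pairing this EP group with a DP group for the non-MoE blocks gives the EP + DP layout sketched in the diagram.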
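The DeepSeek-V3 router constraint mentioned above (each token is sent to at most M nodes) can be pictured as group-limited top-k routing: first pick the best M nodes for a token, then select the top-k experts only among the experts that live on those nodes. The sketch below is a simplified illustration of that idea under an assumed contiguous expert-to-node placement, not DeepSeek-V3's exact algorithm, and all names are hypothetical.

```python
# Simplified node-limited top-k routing: restrict each token's experts to at most `max_nodes` nodes.
import torch

def node_limited_topk(scores: torch.Tensor, experts_per_node: int, max_nodes: int, top_k: int):
    # scores: [num_tokens, num_experts] router affinities; experts are assumed to be laid out
    # contiguously, `experts_per_node` per node.
    num_tokens, num_experts = scores.shape
    num_nodes = num_experts // experts_per_node
    # Score each node by the best expert affinity it contains (a simplification).
    node_scores = scores.view(num_tokens, num_nodes, experts_per_node).max(dim=-1).values
    top_nodes = node_scores.topk(max_nodes, dim=-1).indices   # [num_tokens, max_nodes]
    # Mask out experts on non-selected nodes, then take the usual top-k among the rest.
    node_of_expert = torch.arange(num_experts, device=scores.device) // experts_per_node
    allowed = (node_of_expert.view(1, 1, -1) == top_nodes.unsqueeze(-1)).any(dim=1)
    masked = scores.masked_fill(~allowed, float("-inf"))
    topk_scores, topk_idx = masked.topk(top_k, dim=-1)
    return topk_scores, topk_idx
```

With, say, 64 experts spread over 8 nodes and `node_limited_topk(scores, experts_per_node=8, max_nodes=4, top_k=8)`, every token's selected experts live on at most 4 nodes, which is the communication-bounding effect the paragraph describes (the concrete numbers here are made up for the example).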