lvwerra (HF staff) committed
Commit 787ae8e · verified · 1 parent: 6e2c10a
dist/assets/images/memorycoalescing.png CHANGED

Git LFS Details (previous version)

  • SHA256: 1094fe9aeb953c743791445ee6d7e73a5a89fa85fe60f4312266d1265e7c591a
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB

Git LFS Details (new version)

  • SHA256: 088cd848100ab26abbffdcc7c0e8f18a83facd0a8637c460e3ac88d483b04b46
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB
dist/bibliography.bib CHANGED
@@ -361,7 +361,7 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
  }
  @misc{deepseekai2024deepseekv3technicalreport,
  title={DeepSeek-V3 Technical Report},
- author={DeepSeek-AI and Aixin Liu and Bei Feng and Bing Xue and Bingxuan Wang and Bochao Wu and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenyu Zhang and Chong Ruan and Damai Dai and Daya Guo and Dejian Yang and Deli Chen and Dongjie Ji and Erhang Li and Fangyun Lin and Fucong Dai and Fuli Luo and Guangbo Hao and Guanting Chen and Guowei Li and H. Zhang and Han Bao and Hanwei Xu and Haocheng Wang and Haowei Zhang and Honghui Ding and Huajian Xin and Huazuo Gao and Hui Li and Hui Qu and J. L. Cai and Jian Liang and Jianzhong Guo and Jiaqi Ni and Jiashi Li and Jiawei Wang and Jin Chen and Jingchang Chen and Jingyang Yuan and Junjie Qiu and Junlong Li and Junxiao Song and Kai Dong and Kai Hu and Kaige Gao and Kang Guan and Kexin Huang and Kuai Yu and Lean Wang and Lecong Zhang and Lei Xu and Leyi Xia and Liang Zhao and Litong Wang and Liyue Zhang and Meng Li and Miaojun Wang and Mingchuan Zhang and Minghua Zhang and Minghui Tang and Mingming Li and Ning Tian and Panpan Huang and Peiyi Wang and Peng Zhang and Qiancheng Wang and Qihao Zhu and Qinyu Chen and Qiushi Du and R. J. Chen and R. L. Jin and Ruiqi Ge and Ruisong Zhang and Ruizhe Pan and Runji Wang and Runxin Xu and Ruoyu Zhang and Ruyi Chen and S. S. Li and Shanghao Lu and Shangyan Zhou and Shanhuang Chen and Shaoqing Wu and Shengfeng Ye and Shengfeng Ye and Shirong Ma and Shiyu Wang and Shuang Zhou and Shuiping Yu and Shunfeng Zhou and Shuting Pan and T. Wang and Tao Yun and Tian Pei and Tianyu Sun and W. L. Xiao and Wangding Zeng and Wanjia Zhao and Wei An and Wen Liu and Wenfeng Liang and Wenjun Gao and Wenqin Yu and Wentao Zhang and X. Q. Li and Xiangyue Jin and Xianzu Wang and Xiao Bi and Xiaodong Liu and Xiaohan Wang and Xiaojin Shen and Xiaokang Chen and Xiaokang Zhang and Xiaosha Chen and Xiaotao Nie and Xiaowen Sun and Xiaoxiang Wang and Xin Cheng and Xin Liu and Xin Xie and Xingchao Liu and Xingkai Yu and Xinnan Song and Xinxia Shan and Xinyi Zhou and Xinyu Yang and Xinyuan Li and Xuecheng Su and Xuheng Lin and Y. K. Li and Y. Q. Wang and Y. X. Wei and Y. X. Zhu and Yang Zhang and Yanhong Xu and Yanhong Xu and Yanping Huang and Yao Li and Yao Zhao and Yaofeng Sun and Yaohui Li and Yaohui Wang and Yi Yu and Yi Zheng and Yichao Zhang and Yifan Shi and Yiliang Xiong and Ying He and Ying Tang and Yishi Piao and Yisong Wang and Yixuan Tan and Yiyang Ma and Yiyuan Liu and Yongqiang Guo and Yu Wu and Yuan Ou and Yuchen Zhu and Yuduan Wang and Yue Gong and Yuheng Zou and Yujia He and Yukun Zha and Yunfan Xiong and Yunxian Ma and Yuting Yan and Yuxiang Luo and Yuxiang You and Yuxuan Liu and Yuyang Zhou and Z. F. Wu and Z. Z. Ren and Zehui Ren and Zhangli Sha and Zhe Fu and Zhean Xu and Zhen Huang and Zhen Zhang and Zhenda Xie and Zhengyan Zhang and Zhewen Hao and Zhibin Gou and Zhicheng Ma and Zhigang Yan and Zhihong Shao and Zhipeng Xu and Zhiyu Wu and Zhongyu Zhang and Zhuoshu Li and Zihui Gu and Zijia Zhu and Zijun Liu and Zilin Li and Ziwei Xie and Ziyang Song and Ziyi Gao and Zizheng Pan},
+ author={DeepSeek-AI and others},
  year={2024},
  eprint={2412.19437},
  archivePrefix={arXiv},
dist/fragments/benchmarks_interactive.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -861,7 +861,7 @@
 
  <p>In ZeRO-1, the optimizer states are partitioned into <d-math>N_d</d-math> equal parts where <d-math>N_d</d-math> is the DP degree. This means that each model replica distributed on each DP rank only keeps track of <d-math>\frac{1}{N_d}</d-math> of the optimizer states. During the optimization step only <d-math>\frac{1}{N_d}</d-math> of the float32 weights are updated.</p>
 
- <p>However during the forward pass, each replica@ need all the parameters, we thus need to add an additional <strong><em>all-gather</em></strong> (the second type of collective communication primitive we encounter!) after the optimizer step so that each model replica has the full set of updated weights.</p>
+ <p>However during the forward pass, each replica need all the parameters, we thus need to add an additional <strong><em>all-gather</em></strong> (the second type of collective communication primitive we encounter!) after the optimizer step so that each model replica has the full set of updated weights.</p>
 
  <p>This explains the memory formula of <d-math>2\Psi + 2\Psi + \frac{k\Psi}{N_d}</d-math> that we saw on the above graph! Here’s a summary of the sequence of operations for a single training step</p>
 
@@ -1000,7 +1000,7 @@
 
  <p>In practice a small example of the operation looks like this:</p>
 
- <p><img class="l-body" width="500px" alt="TP diagram" src="/assets/images/tp_diagram.svg" /></p>
+ <p style="text-align: center"><img width="300px" alt="TP diagram" src="/assets/images/tp_diagram.svg" /></p>
 
  <p>Let’s see how we can parallelise this operation! In tensor parallelism, tensors will be split into N shards along a particular dimension and distributed across N GPUs. Matrices can be split either on the column part or row part leading to row and column parallelism. One thing we’ll see in the following is that choosing row or column sharding will require different communications primitives.</p>
 
@@ -1377,7 +1377,7 @@
 
  <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
 
- <p>We need a better way to distribute the input sequences. This can be achieved by assigning the tokens not purely sequential to the GPUs but by mixing the ordering a bit such that we have a good mix of early and late tokens on each GPU. This approach is called Zig-Zag attention<d-cite bibtex-key="attention brandon2023fasterring"></d-cite> and in this new arrangement, the attention mask will show an even distribution of computation but if you count the number of colored squares, you’ll see that the computation is now balanced across all GPUs.</p>
+ <p>We need a better way to distribute the input sequences. This can be achieved by assigning the tokens not purely sequential to the GPUs but by mixing the ordering a bit such that we have a good mix of early and late tokens on each GPU. This approach is called Zig-Zag attention<d-cite bibtex-key="brandon2023fasterring"></d-cite> and in this new arrangement, the attention mask will show an even distribution of computation but if you count the number of colored squares, you’ll see that the computation is now balanced across all GPUs.</p>
 
  <p><img alt="cp_zigzagmask.svg" src="/assets/images/cp_zigzagmask.svg" /></p>
 
@@ -2862,8 +2862,9 @@
  </div>
 
  <div>
- <a href="https://main-horse.github.io/posts/visualizing-6d/"><strong>@main_horse blog</strong></a>
- <p>Visualizing 6D Mesh Parallelism</p>
+ <a href="https://main-horse.github.io/posts/visualizing-6d/"><strong>Visualizing 6D Mesh Parallelism
+ </strong></a>
+ <p>Explains the collective communication involved in a 6D parallel mesh.</p>
  </div>
 
  <h3>Hardware</h3>
 
src/bibliography.bib CHANGED
@@ -361,7 +361,7 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
  }
  @misc{deepseekai2024deepseekv3technicalreport,
  title={DeepSeek-V3 Technical Report},
- author={DeepSeek-AI and Aixin Liu and Bei Feng and Bing Xue and Bingxuan Wang and Bochao Wu and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenyu Zhang and Chong Ruan and Damai Dai and Daya Guo and Dejian Yang and Deli Chen and Dongjie Ji and Erhang Li and Fangyun Lin and Fucong Dai and Fuli Luo and Guangbo Hao and Guanting Chen and Guowei Li and H. Zhang and Han Bao and Hanwei Xu and Haocheng Wang and Haowei Zhang and Honghui Ding and Huajian Xin and Huazuo Gao and Hui Li and Hui Qu and J. L. Cai and Jian Liang and Jianzhong Guo and Jiaqi Ni and Jiashi Li and Jiawei Wang and Jin Chen and Jingchang Chen and Jingyang Yuan and Junjie Qiu and Junlong Li and Junxiao Song and Kai Dong and Kai Hu and Kaige Gao and Kang Guan and Kexin Huang and Kuai Yu and Lean Wang and Lecong Zhang and Lei Xu and Leyi Xia and Liang Zhao and Litong Wang and Liyue Zhang and Meng Li and Miaojun Wang and Mingchuan Zhang and Minghua Zhang and Minghui Tang and Mingming Li and Ning Tian and Panpan Huang and Peiyi Wang and Peng Zhang and Qiancheng Wang and Qihao Zhu and Qinyu Chen and Qiushi Du and R. J. Chen and R. L. Jin and Ruiqi Ge and Ruisong Zhang and Ruizhe Pan and Runji Wang and Runxin Xu and Ruoyu Zhang and Ruyi Chen and S. S. Li and Shanghao Lu and Shangyan Zhou and Shanhuang Chen and Shaoqing Wu and Shengfeng Ye and Shengfeng Ye and Shirong Ma and Shiyu Wang and Shuang Zhou and Shuiping Yu and Shunfeng Zhou and Shuting Pan and T. Wang and Tao Yun and Tian Pei and Tianyu Sun and W. L. Xiao and Wangding Zeng and Wanjia Zhao and Wei An and Wen Liu and Wenfeng Liang and Wenjun Gao and Wenqin Yu and Wentao Zhang and X. Q. Li and Xiangyue Jin and Xianzu Wang and Xiao Bi and Xiaodong Liu and Xiaohan Wang and Xiaojin Shen and Xiaokang Chen and Xiaokang Zhang and Xiaosha Chen and Xiaotao Nie and Xiaowen Sun and Xiaoxiang Wang and Xin Cheng and Xin Liu and Xin Xie and Xingchao Liu and Xingkai Yu and Xinnan Song and Xinxia Shan and Xinyi Zhou and Xinyu Yang and Xinyuan Li and Xuecheng Su and Xuheng Lin and Y. K. Li and Y. Q. Wang and Y. X. Wei and Y. X. Zhu and Yang Zhang and Yanhong Xu and Yanhong Xu and Yanping Huang and Yao Li and Yao Zhao and Yaofeng Sun and Yaohui Li and Yaohui Wang and Yi Yu and Yi Zheng and Yichao Zhang and Yifan Shi and Yiliang Xiong and Ying He and Ying Tang and Yishi Piao and Yisong Wang and Yixuan Tan and Yiyang Ma and Yiyuan Liu and Yongqiang Guo and Yu Wu and Yuan Ou and Yuchen Zhu and Yuduan Wang and Yue Gong and Yuheng Zou and Yujia He and Yukun Zha and Yunfan Xiong and Yunxian Ma and Yuting Yan and Yuxiang Luo and Yuxiang You and Yuxuan Liu and Yuyang Zhou and Z. F. Wu and Z. Z. Ren and Zehui Ren and Zhangli Sha and Zhe Fu and Zhean Xu and Zhen Huang and Zhen Zhang and Zhenda Xie and Zhengyan Zhang and Zhewen Hao and Zhibin Gou and Zhicheng Ma and Zhigang Yan and Zhihong Shao and Zhipeng Xu and Zhiyu Wu and Zhongyu Zhang and Zhuoshu Li and Zihui Gu and Zijia Zhu and Zijun Liu and Zilin Li and Ziwei Xie and Ziyang Song and Ziyi Gao and Zizheng Pan},
+ author={DeepSeek-AI and others},
  year={2024},
  eprint={2412.19437},
  archivePrefix={arXiv},
src/fragments/benchmarks_interactive.html CHANGED
The diff for this file is too large to render. See raw diff
 
src/index.html CHANGED
@@ -861,7 +861,7 @@
 
  <p>In ZeRO-1, the optimizer states are partitioned into <d-math>N_d</d-math> equal parts where <d-math>N_d</d-math> is the DP degree. This means that each model replica distributed on each DP rank only keeps track of <d-math>\frac{1}{N_d}</d-math> of the optimizer states. During the optimization step only <d-math>\frac{1}{N_d}</d-math> of the float32 weights are updated.</p>
 
- <p>However during the forward pass, each replica@ need all the parameters, we thus need to add an additional <strong><em>all-gather</em></strong> (the second type of collective communication primitive we encounter!) after the optimizer step so that each model replica has the full set of updated weights.</p>
+ <p>However during the forward pass, each replica need all the parameters, we thus need to add an additional <strong><em>all-gather</em></strong> (the second type of collective communication primitive we encounter!) after the optimizer step so that each model replica has the full set of updated weights.</p>
 
  <p>This explains the memory formula of <d-math>2\Psi + 2\Psi + \frac{k\Psi}{N_d}</d-math> that we saw on the above graph! Here’s a summary of the sequence of operations for a single training step</p>
 
@@ -1000,7 +1000,7 @@
 
  <p>In practice a small example of the operation looks like this:</p>
 
- <p><img class="l-body" width="500px" alt="TP diagram" src="/assets/images/tp_diagram.svg" /></p>
+ <p style="text-align: center"><img width="300px" alt="TP diagram" src="/assets/images/tp_diagram.svg" /></p>
 
  <p>Let’s see how we can parallelise this operation! In tensor parallelism, tensors will be split into N shards along a particular dimension and distributed across N GPUs. Matrices can be split either on the column part or row part leading to row and column parallelism. One thing we’ll see in the following is that choosing row or column sharding will require different communications primitives.</p>
 
@@ -1377,7 +1377,7 @@
 
  <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
 
- <p>We need a better way to distribute the input sequences. This can be achieved by assigning the tokens not purely sequential to the GPUs but by mixing the ordering a bit such that we have a good mix of early and late tokens on each GPU. This approach is called Zig-Zag attention<d-cite bibtex-key="attention brandon2023fasterring"></d-cite> and in this new arrangement, the attention mask will show an even distribution of computation but if you count the number of colored squares, you’ll see that the computation is now balanced across all GPUs.</p>
+ <p>We need a better way to distribute the input sequences. This can be achieved by assigning the tokens not purely sequential to the GPUs but by mixing the ordering a bit such that we have a good mix of early and late tokens on each GPU. This approach is called Zig-Zag attention<d-cite bibtex-key="brandon2023fasterring"></d-cite> and in this new arrangement, the attention mask will show an even distribution of computation but if you count the number of colored squares, you’ll see that the computation is now balanced across all GPUs.</p>
 
  <p><img alt="cp_zigzagmask.svg" src="/assets/images/cp_zigzagmask.svg" /></p>
 
@@ -1874,7 +1874,7 @@
 
  <p>Clearly, none of these techniques is a silver bullet for magical scaling and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point to choose among –and combine– them? This will be the topic of our next section.</p>
 
- <h2>How to Find the Best Training Configuration</h2>
+ <h2>Finding the Best Training Configuration</h2>
 
  <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models as well as how and why they can be combined together. There remain a general question: which ones should we choose in the end and how to decide on a specific combination?</p>
 
@@ -2862,8 +2862,9 @@
  </div>
 
  <div>
- <a href="https://main-horse.github.io/posts/visualizing-6d/"><strong>@main_horse blog</strong></a>
- <p>Visualizing 6D Mesh Parallelism</p>
+ <a href="https://main-horse.github.io/posts/visualizing-6d/"><strong>Visualizing 6D Mesh Parallelism
+ </strong></a>
+ <p>Explains the collective communication involved in a 6D parallel mesh.</p>
  </div>
 
  <h3>Hardware</h3>