Commit 5f11eab by lvwerra (HF staff) · 1 Parent(s): e906818

remove a bit of whitespace and center image

Files changed (2)
  1. dist/index.html +11 -11
  2. src/index.html +11 -11
dist/index.html CHANGED
@@ -1089,31 +1089,31 @@
 
  <p>So what is actually happening here? As a famous LLM would say, let’s take it step-by-step:</p>
 
- <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr;">
  <div>
- <p><strong>Initial LayerNorm (SP Region)</strong></p>
- <ul>
  <li>Input tensors X1<em> and X2</em> (b,s/2,h) enter LayerNorm, already split across sequence dimension</li>
  <li>Each GPU computes LayerNorm independently on its sequence chunk and give Y1<em> and Y2</em></li>
  </ul>
- <p><strong>First Transition (SP → TP)</strong></p>
- <ul>
  <li>"g" operation (all-gather) combines Y1<em> and Y2</em> back to full sequence length</li>
  <li> Restores Y (b,s,h) since column linear layer needs full hidden dimension h</li>
  </ul>
- <p><strong>First Linear Layer (TP Region)</strong></p>
- <ul>
  <li>A1 is a column-linear layer, so it splits Y along the hidden dimension</li>
  <li>GeLU is applied independently on each GPU</li>
  <li>Z1* is (b,s,h/2)</li>
  </ul>
- <p><strong>Second Linear Layer (TP Region)</strong></p>
- <ul>
  <li>B1 is a row-linear layer, so it restores the hidden dimension</li>
  <li>W1 is (b,s,h)</li>
  </ul>
- <p><strong>Final Transition (TP → SP)</strong></p>
- <ul>
  <li>"g*" operation (reduce-scatter) which reduces for previous row-linear correctness while scattering along sequence dimension</li>
  <li>W1* is (b,s/2,h)</li>
  </ul>

 
  <p>So what is actually happening here? As a famous LLM would say, let’s take it step-by-step:</p>
 
+ <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
  <div>
+ <p style="margin-bottom: 0;"><strong>Initial LayerNorm (SP Region)</strong></p>
+ <ul style="margin-top: 0;">
  <li>Input tensors X1<em> and X2</em> (b,s/2,h) enter LayerNorm, already split across sequence dimension</li>
  <li>Each GPU computes LayerNorm independently on its sequence chunk and give Y1<em> and Y2</em></li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>First Transition (SP → TP)</strong></p>
+ <ul style="margin-top: 0;">
  <li>"g" operation (all-gather) combines Y1<em> and Y2</em> back to full sequence length</li>
  <li> Restores Y (b,s,h) since column linear layer needs full hidden dimension h</li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>First Linear Layer (TP Region)</strong></p>
+ <ul style="margin-top: 0;">
  <li>A1 is a column-linear layer, so it splits Y along the hidden dimension</li>
  <li>GeLU is applied independently on each GPU</li>
  <li>Z1* is (b,s,h/2)</li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>Second Linear Layer (TP Region)</strong></p>
+ <ul style="margin-top: 0;">
  <li>B1 is a row-linear layer, so it restores the hidden dimension</li>
  <li>W1 is (b,s,h)</li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>Final Transition (TP → SP)</strong></p>
+ <ul style="margin-top: 0;">
  <li>"g*" operation (reduce-scatter) which reduces for previous row-linear correctness while scattering along sequence dimension</li>
  <li>W1* is (b,s/2,h)</li>
  </ul>
src/index.html CHANGED
@@ -1089,31 +1089,31 @@
 
  <p>So what is actually happening here? As a famous LLM would say, let’s take it step-by-step:</p>
 
- <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr;">
  <div>
- <p><strong>Initial LayerNorm (SP Region)</strong></p>
- <ul>
  <li>Input tensors X1<em> and X2</em> (b,s/2,h) enter LayerNorm, already split across sequence dimension</li>
  <li>Each GPU computes LayerNorm independently on its sequence chunk and give Y1<em> and Y2</em></li>
  </ul>
- <p><strong>First Transition (SP → TP)</strong></p>
- <ul>
  <li>"g" operation (all-gather) combines Y1<em> and Y2</em> back to full sequence length</li>
  <li> Restores Y (b,s,h) since column linear layer needs full hidden dimension h</li>
  </ul>
- <p><strong>First Linear Layer (TP Region)</strong></p>
- <ul>
  <li>A1 is a column-linear layer, so it splits Y along the hidden dimension</li>
  <li>GeLU is applied independently on each GPU</li>
  <li>Z1* is (b,s,h/2)</li>
  </ul>
- <p><strong>Second Linear Layer (TP Region)</strong></p>
- <ul>
  <li>B1 is a row-linear layer, so it restores the hidden dimension</li>
  <li>W1 is (b,s,h)</li>
  </ul>
- <p><strong>Final Transition (TP → SP)</strong></p>
- <ul>
  <li>"g*" operation (reduce-scatter) which reduces for previous row-linear correctness while scattering along sequence dimension</li>
  <li>W1* is (b,s/2,h)</li>
  </ul>

 
  <p>So what is actually happening here? As a famous LLM would say, let’s take it step-by-step:</p>
 
+ <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
  <div>
+ <p style="margin-bottom: 0;"><strong>Initial LayerNorm (SP Region)</strong></p>
+ <ul style="margin-top: 0;">
  <li>Input tensors X1<em> and X2</em> (b,s/2,h) enter LayerNorm, already split across sequence dimension</li>
  <li>Each GPU computes LayerNorm independently on its sequence chunk and give Y1<em> and Y2</em></li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>First Transition (SP → TP)</strong></p>
+ <ul style="margin-top: 0;">
  <li>"g" operation (all-gather) combines Y1<em> and Y2</em> back to full sequence length</li>
  <li> Restores Y (b,s,h) since column linear layer needs full hidden dimension h</li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>First Linear Layer (TP Region)</strong></p>
+ <ul style="margin-top: 0;">
  <li>A1 is a column-linear layer, so it splits Y along the hidden dimension</li>
  <li>GeLU is applied independently on each GPU</li>
  <li>Z1* is (b,s,h/2)</li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>Second Linear Layer (TP Region)</strong></p>
+ <ul style="margin-top: 0;">
  <li>B1 is a row-linear layer, so it restores the hidden dimension</li>
  <li>W1 is (b,s,h)</li>
  </ul>
+ <p style="margin-bottom: 0;"><strong>Final Transition (TP → SP)</strong></p>
+ <ul style="margin-top: 0;">
  <li>"g*" operation (reduce-scatter) which reduces for previous row-linear correctness while scattering along sequence dimension</li>
  <li>W1* is (b,s/2,h)</li>
  </ul>
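
Reviewer note: the five steps described in the diffed text (LayerNorm in the SP region, all-gather, column-linear plus GeLU, row-linear, reduce-scatter) can be checked numerically with a small single-process simulation. The sketch below is only an illustration under stated assumptions, not the article's or any library's implementation: the batch dimension b is dropped, LayerNorm has no learned scale or bias, both weight matrices are h x h so the shapes match the text's (s, h/2) and (s, h), and "GPUs" are plain Python lists. All function names (`matmul`, `layer_norm`, `gelu`, `mlp_reference`, `mlp_sp_tp`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) over its hidden dimension (no scale or bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def gelu(x):
    """Exact (erf-based) GeLU, applied elementwise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in x]


def mlp_reference(x, a, b):
    """Unsharded baseline: LayerNorm -> linear A -> GeLU -> linear B."""
    return matmul(gelu(matmul(layer_norm(x), a)), b)


def mlp_sp_tp(x, a, b, n_gpus=2):
    """Simulate the SP -> TP -> SP flow; returns one (s/n, h) output shard per GPU."""
    s, h = len(x), len(x[0])
    assert s % n_gpus == 0 and h % n_gpus == 0, "s and h must divide evenly"
    chunk, cols = s // n_gpus, h // n_gpus

    # Initial LayerNorm (SP region): each GPU normalizes its own sequence chunk.
    x_shards = [x[i * chunk:(i + 1) * chunk] for i in range(n_gpus)]
    y_shards = [layer_norm(xs) for xs in x_shards]

    # "g" (all-gather): every GPU ends up with the full-sequence Y of shape (s, h).
    y_full = [row for ys in y_shards for row in ys]

    # TP region: column-linear A_i, local GeLU, then row-linear B_i.
    partials = []
    for i in range(n_gpus):
        a_i = [row[i * cols:(i + 1) * cols] for row in a]  # (h, h/n) column shard
        b_i = b[i * cols:(i + 1) * cols]                   # (h/n, h) row shard
        z_i = gelu(matmul(y_full, a_i))                    # (s, h/n)
        partials.append(matmul(z_i, b_i))                  # (s, h) partial sum

    # "g*" (reduce-scatter): sum the partial results, keep one sequence chunk per GPU.
    reduced = [[sum(p[r][c] for p in partials) for c in range(h)] for r in range(s)]
    return [reduced[i * chunk:(i + 1) * chunk] for i in range(n_gpus)]


if __name__ == "__main__":
    random.seed(0)
    s, h = 4, 6
    x = [[random.gauss(0, 1) for _ in range(h)] for _ in range(s)]
    a = [[random.gauss(0, 1) for _ in range(h)] for _ in range(h)]
    b = [[random.gauss(0, 1) for _ in range(h)] for _ in range(h)]
    shards = mlp_sp_tp(x, a, b)
    gathered = [row for shard in shards for row in shard]
    ref = mlp_reference(x, a, b)
    err = max(abs(u - v) for r1, r2 in zip(gathered, ref) for u, v in zip(r1, r2))
    print("max abs difference vs. unsharded reference:", err)
```

Concatenating the returned shards along the sequence dimension should reproduce the unsharded result up to floating-point rounding. The sum in the reduce-scatter step is what makes the row-linear output correct, since each GPU only holds a partial product over its slice of the hidden dimension.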