remove a bit of whitespace and center image
- dist/index.html +11 -11
- src/index.html +11 -11
dist/index.html
CHANGED
@@ -1089,31 +1089,31 @@
 
 <p>So what is actually happening here? As a famous LLM would say, let’s take it step-by-step:</p>
 
-<div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr;">
+<div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
 <div>
-<p><strong>Initial LayerNorm (SP Region)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>Initial LayerNorm (SP Region)</strong></p>
+<ul style="margin-top: 0;">
 <li>Input tensors X1<em> and X2</em> (b,s/2,h) enter LayerNorm, already split across sequence dimension</li>
 <li>Each GPU computes LayerNorm independently on its sequence chunk and give Y1<em> and Y2</em></li>
 </ul>
-<p><strong>First Transition (SP → TP)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>First Transition (SP → TP)</strong></p>
+<ul style="margin-top: 0;">
 <li>"g" operation (all-gather) combines Y1<em> and Y2</em> back to full sequence length</li>
 <li> Restores Y (b,s,h) since column linear layer needs full hidden dimension h</li>
 </ul>
-<p><strong>First Linear Layer (TP Region)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>First Linear Layer (TP Region)</strong></p>
+<ul style="margin-top: 0;">
 <li>A1 is a column-linear layer, so it splits Y along the hidden dimension</li>
 <li>GeLU is applied independently on each GPU</li>
 <li>Z1* is (b,s,h/2)</li>
 </ul>
-<p><strong>Second Linear Layer (TP Region)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>Second Linear Layer (TP Region)</strong></p>
+<ul style="margin-top: 0;">
 <li>B1 is a row-linear layer, so it restores the hidden dimension</li>
 <li>W1 is (b,s,h)</li>
 </ul>
-<p><strong>Final Transition (TP → SP)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>Final Transition (TP → SP)</strong></p>
+<ul style="margin-top: 0;">
 <li>"g*" operation (reduce-scatter) which reduces for previous row-linear correctness while scattering along sequence dimension</li>
 <li>W1* is (b,s/2,h)</li>
 </ul>
src/index.html
CHANGED
@@ -1089,31 +1089,31 @@
 
 <p>So what is actually happening here? As a famous LLM would say, let’s take it step-by-step:</p>
 
-<div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr;">
+<div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
 <div>
-<p><strong>Initial LayerNorm (SP Region)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>Initial LayerNorm (SP Region)</strong></p>
+<ul style="margin-top: 0;">
 <li>Input tensors X1<em> and X2</em> (b,s/2,h) enter LayerNorm, already split across sequence dimension</li>
 <li>Each GPU computes LayerNorm independently on its sequence chunk and give Y1<em> and Y2</em></li>
 </ul>
-<p><strong>First Transition (SP → TP)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>First Transition (SP → TP)</strong></p>
+<ul style="margin-top: 0;">
 <li>"g" operation (all-gather) combines Y1<em> and Y2</em> back to full sequence length</li>
 <li> Restores Y (b,s,h) since column linear layer needs full hidden dimension h</li>
 </ul>
-<p><strong>First Linear Layer (TP Region)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>First Linear Layer (TP Region)</strong></p>
+<ul style="margin-top: 0;">
 <li>A1 is a column-linear layer, so it splits Y along the hidden dimension</li>
 <li>GeLU is applied independently on each GPU</li>
 <li>Z1* is (b,s,h/2)</li>
 </ul>
-<p><strong>Second Linear Layer (TP Region)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>Second Linear Layer (TP Region)</strong></p>
+<ul style="margin-top: 0;">
 <li>B1 is a row-linear layer, so it restores the hidden dimension</li>
 <li>W1 is (b,s,h)</li>
 </ul>
-<p><strong>Final Transition (TP → SP)</strong></p>
-<ul>
+<p style="margin-bottom: 0;"><strong>Final Transition (TP → SP)</strong></p>
+<ul style="margin-top: 0;">
 <li>"g*" operation (reduce-scatter) which reduces for previous row-linear correctness while scattering along sequence dimension</li>
 <li>W1* is (b,s/2,h)</li>
 </ul>
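For readers following the walkthrough text touched by this change, the SP → TP → SP flow it describes can be sketched in plain Python. This is only an illustrative simulation under several assumptions that are not in the commit: nested lists stand in for per-GPU tensors, the batch dimension is dropped, LayerNorm has no learned scale or bias, both weight matrices are h x h, and every function name (`all_gather`, `reduce_scatter`, `sp_tp_mlp`, `reference`) is hypothetical rather than a real library call.

```python
import math

def matmul(a, b):
    # Plain (rows x inner) @ (inner x cols) product on nested lists.
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(rows, eps=1e-5):
    # Normalizes each token (row) over the hidden dimension; no learned parameters.
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def gelu(rows):
    # Exact (erf-based) GeLU, applied elementwise.
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in rows]

def all_gather(chunks):
    # "g": concatenate the sequence chunks; every rank ends up with the full sequence.
    full = [row for chunk in chunks for row in chunk]
    return [full for _ in chunks]

def reduce_scatter(partials):
    # "g*": sum the per-rank partial results, then hand rank r its sequence slice.
    n = len(partials)
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    step = len(summed) // n
    return [summed[r * step:(r + 1) * step] for r in range(n)]

def sp_tp_mlp(x, a, b, n=2):
    s = len(x)
    # SP region: each rank holds (s/n, h) and runs LayerNorm on its own chunk.
    y_chunks = [layer_norm(x[r * s // n:(r + 1) * s // n]) for r in range(n)]
    # SP -> TP: all-gather restores the full (s, h) activation on every rank.
    y_full = all_gather(y_chunks)
    cols = len(a[0]) // n
    partials = []
    for r in range(n):
        a_r = [row[r * cols:(r + 1) * cols] for row in a]  # column-linear shard
        b_r = b[r * cols:(r + 1) * cols]                   # row-linear shard
        z_r = gelu(matmul(y_full[r], a_r))                 # (s, h/n)
        partials.append(matmul(z_r, b_r))                  # (s, h), partial sum
    # TP -> SP: reduce-scatter returns one (s/n, h) chunk per rank.
    return reduce_scatter(partials)

def reference(x, a, b):
    # Same computation on a single device, for comparison.
    return matmul(gelu(matmul(layer_norm(x), a)), b)

# Small deterministic example: s = 4 tokens, h = 4 hidden units, 2 ranks.
x = [[math.sin(4 * i + j) for j in range(4)] for i in range(4)]
a = [[math.cos(0.3 * (i + 2 * j)) for j in range(4)] for i in range(4)]
b = [[math.sin(0.7 * (2 * i + j)) for j in range(4)] for i in range(4)]
```

Calling `sp_tp_mlp(x, a, b)` returns two chunks of shape (2, 4); concatenated along the sequence dimension they should match `reference(x, a, b)` up to floating-point rounding, which is the property the reduce-scatter step is there to preserve.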