5d conclusion

Files changed:
- dist/index.html (+269 -2)
- src/index.html (+269 -2, same changes as dist/index.html)

dist/index.html
@@ -1215,17 +1215,284 @@

<h2>5D parallelism in a nutshell</h2>

- <p
<p>Let’s start with Pipeline Parallelism, as ZeRO-3 and Pipeline Parallelism have interesting similarities and differences.</p>

<p>Both methods partition the model weights over several GPUs and perform communication/computation along the model depth axis (for example, in ZeRO-3 we prefetch the next layer while computing the current one). In the following we say “a layer” to simplify what should in general be called “a set of layers” (the basic sharding unit of the model). This means that in both cases full layers are computed on-device, as opposed to TP, where the layers themselves are sharded for the computation.</p>

<p>However, there are a few major differences between the two:</p>

<table>
<thead>
<tr>
<th><strong>ZeRO-3</strong></th>
<th><strong>Pipeline parallel</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Each compute unit only stores a fraction of a layer</td>
<td>Each compute unit stores a full layer</td>
</tr>
<tr>
<td>Communication is used to transfer weights</td>
<td>Communication is used to transfer activations</td>
</tr>
<tr>
<td>Model-agnostic orchestration</td>
<td>Model-agnostic orchestration</td>
</tr>
<tr>
<td>Complex implementation to handle model partitioning and communications</td>
<td>Complex implementation to handle efficient PP schedules</td>
</tr>
<tr>
<td>Prefers large <d-math>mbs</d-math> and <d-math>seq\_len</d-math> to hide comms</td>
<td>Prefers large <d-math>\text{grad\_acc}</d-math> to hide bubble</td>
</tr>
</tbody>
</table>

<p>Clearly, ZeRO-3 and PP are distinctly different approaches to distributing the model layers, focusing communication either on weights or on activations. While they can be combined, doing so requires increasing the global batch size significantly to amortize the communication costs, creating a tradeoff between global batch size, model size, network bandwidth, and training efficiency. If they are combined, ZeRO-3 should be configured to keep the weights in memory for each micro-batch in PP to minimize the communication overhead.</p>

<p>Note that ZeRO-1 and ZeRO-2, on the other hand, are interesting to combine with Pipeline Parallelism: they focus on gradients and optimizer states instead of parameters and are thus complementary. For instance, DeepSeek-v3 used PP with ZeRO-1!</p>

<p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication, which allows weights and activations to be sharded and computed independently before being combined. However, TP has two important limitations: first, since its communication operations are part of the critical path of computation, it doesn't scale well beyond a certain point, as communication overhead begins to dominate. Second, unlike ZeRO and PP, which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more complex to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
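<p>To see the property TP relies on in isolation, here is a tiny NumPy sketch (illustrative only, with no actual communication): a weight matrix is split column-wise across two hypothetical ranks, each computes its slice of the output, and concatenating the slices reproduces the unsharded result.</p>

<pre><code class="language-python">
# Column-parallel linear layer: each "rank" holds a column shard of the weight
# and computes its share of the output independently; an all-gather (here a
# simple concatenate) recovers the full result.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # (batch, hidden)
w = rng.standard_normal((8, 6))   # (hidden, out)

w0, w1 = np.split(w, 2, axis=1)   # column shards living on rank 0 and rank 1
y_sharded = np.concatenate([x @ w0, x @ w1], axis=1)

assert np.allclose(y_sharded, x @ w)
print("column-parallel result matches the unsharded linear layer")
</code></pre>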

<p><img alt="image.png" src="/assets/images/placeholder.png" /></p>


<p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications, while ZeRO-3 or PP can use parallelism groups spanning lower-speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
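<p>To make this concrete, here is a minimal sketch in plain Python (the <code>rank_to_coords</code> helper is hypothetical, not from any particular framework) of the usual layout: the TP dimension varies fastest across global ranks, so each TP group fits inside a single 8-GPU node, while PP and DP groups span nodes.</p>

<pre><code class="language-python">
# Map a global rank to its (dp, pp, tp) coordinates, with TP fastest-varying so
# that TP groups stay inside a node and use the fast intra-node links.
def rank_to_coords(rank: int, dp: int, pp: int, tp: int):
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

DP, PP, TP = 2, 2, 8          # 32 GPUs, i.e. 4 nodes of 8 GPUs each
for rank in range(DP * PP * TP):
    dp_rank, pp_rank, tp_rank = rank_to_coords(rank, DP, PP, TP)
    node = rank // 8          # ranks 0-7 on node 0, 8-15 on node 1, ...
    print(f"rank {rank:2d} on node {node}: dp={dp_rank} pp={pp_rank} tp={tp_rank}")
</code></pre>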

<p><strong>Context Parallelism</strong> and <strong>Expert Parallelism</strong> also help us shard activations, and can be seen as complementary to TP: the former handles long sequences while the latter enables distributed Mixture of Experts training.</p>

<p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations, like MLPs and LayerNorm, can process these sharded sequences independently, attention layers require communication, since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens), where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
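<p>The ring pattern itself is easy to see in a toy, single-process sketch (no real communication, just the schedule): each CP rank keeps its own query shard and, at every step, works on the key/value shard that originated on a different rank, passing shards around the ring until everyone has seen the full sequence.</p>

<pre><code class="language-python">
# Toy schedule for ring attention: after cp_size steps, every rank has attended
# to every KV shard exactly once. In a real implementation the partial outputs
# are merged with an online-softmax rescaling, and the KV send/recv is
# overlapped with the attention computation of the current step.
cp_size = 4  # number of context-parallel ranks (an arbitrary example)

for rank in range(cp_size):
    kv_order = [(rank - step) % cp_size for step in range(cp_size)]
    print(f"rank {rank} processes KV shards in order {kv_order}")
</code></pre>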

<p><img alt="image.png" src="/assets/images/placeholder.png" /></p>


<p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to the relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly, since each token only needs to be computed by a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek-v3, which uses 256 experts.</p>
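<p>A toy, single-process sketch of this routing step (a random top-1 gate stands in for a real router; in practice the buckets below would be exchanged with an all-to-all, and a second all-to-all would return the expert outputs):</p>

<pre><code class="language-python">
# Simulate Expert Parallelism token routing: tokens are bucketed by the expert
# their (toy) router picked; each EP rank hosts one expert, receives its bucket,
# runs the expert, and the results are then scattered back to the owners.
import random

num_experts = 4            # one expert per EP rank in this toy setup
tokens = list(range(16))   # token ids standing in for real activations

random.seed(0)
routing = {tok: random.randrange(num_experts) for tok in tokens}  # top-1 gating

buckets = {e: [tok for tok, choice in routing.items() if choice == e]
           for e in range(num_experts)}
for expert, bucket in buckets.items():
    print(f"EP rank {expert} (expert {expert}) receives tokens {bucket}")
</code></pre>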

<p>It's worth noting the scope of impact of these different parallelism strategies:</p>

<ul>
<li>Tensor Parallelism (with Sequence Parallelism) affects computation throughout the entire model by sharding both weights and activations.</li>
<li>Context Parallelism primarily impacts attention layers, since that's where cross-sequence communication is required, with other layers operating independently on sharded sequences.</li>
<li>Expert Parallelism primarily affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged.</li>
</ul>

<aside>In terms of input handling, EP works much like DP (each rank still processes its own slice of the batch), which is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.</aside>

<table>
<thead>
<tr>
<th><strong>Tensor + Sequence Parallel</strong></th>
<th><strong>Context Parallel</strong></th>
<th><strong>Expert Parallel</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Shards weights and activations along the hidden/sequence dim</td>
<td>Shards activations along the sequence dim</td>
<td>Shards specialized expert weights and activations</td>
</tr>
<tr>
<td>Communication for matrix multiply operations (column/row linears)</td>
<td>Communication for attention keys/values</td>
<td>Communication for token routing to experts</td>
</tr>
<tr>
<td>Model-specific implementation needed</td>
<td>Model-agnostic except for attention</td>
<td>Model-agnostic except for MoE layers</td>
</tr>
<tr>
<td>Prefers high-bandwidth intra-node communication</td>
<td>Prefers large sequence lengths</td>
<td>Requires MoEs</td>
</tr>
</tbody>
</table>

<p>Which leads us to this beautiful diagram summarizing everything we’ve seen:</p>

<p><img alt="image.png" src="/assets/images/placeholder.png" /></p>

<p>And to get an idea of the memory benefits of each parallelism:</p>

<p><img alt="image.png" src="/assets/images/placeholder.png" /></p>

<h2>How to Find the Best Training Configuration</h2>

<p>We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models. There remains a general question: which ones should we choose, and how are they best combined? We touched on this a little at the end of the last section, but in this section we will walk through the decision process step by step.</p>

<p>First, let's have a quick look at each parallel strategy, what it helps with, and at what cost it comes:</p>

<table>
<thead>
<tr>
<th><strong>Method</strong></th>
<th><strong>Memory savings</strong></th>
<th><strong>Parallel/sharding dimension</strong></th>
<th><strong>Disadvantage</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>DP</td>
<td>None (replicates everything)</td>
<td>Batch</td>
<td>Limited by max batch size</td>
</tr>
<tr>
<td>ZeRO-1</td>
<td>Optimizer states</td>
<td>Batch</td>
<td>Params communication overhead</td>
</tr>
<tr>
<td>ZeRO-2</td>
<td>Optimizer states and gradients</td>
<td>Batch</td>
<td>Params communication overhead</td>
</tr>
<tr>
<td>ZeRO-3</td>
<td>Optimizer states, gradients, and model parameters</td>
<td>Batch and model params</td>
<td>Params communication overhead</td>
</tr>
<tr>
<td>PP</td>
<td>Model</td>
<td>Model layers</td>
<td>Idle bubble and complex schedules</td>
</tr>
<tr>
<td>TP/SP</td>
<td>Model and activations</td>
<td>Hidden dimension / sequence length</td>
<td>Requires high-bandwidth communication</td>
</tr>
<tr>
<td>CP</td>
<td>Activations</td>
<td>Sequence length</td>
<td>Communication overhead in attention</td>
</tr>
<tr>
<td>EP</td>
<td>Expert parameters</td>
<td>Expert dimension</td>
<td>Requires MoE layers, routing overhead</td>
</tr>
</tbody>
</table>

<p>Clearly, there is no free lunch for any of these methods, but we can come up with a few rules that help find a good starting point. To find the definitive optimal setup you'll have to run a few experiments in any case.</p>

<h3>Step 1: Fitting a Training Step in Memory</h3>

<p>First, we need to figure out how we can fit a single model instance on our GPUs. There are two general cases.</p>
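<p>As a rough first check before touching any parallelism config, a back-of-the-envelope estimate of the model state alone is useful. The sketch below (the helper names are ours, not from any library) assumes mixed-precision training with Adam, i.e. roughly 2 + 2 + 4 + 4 + 4 = 16 bytes per parameter for bf16 weights, bf16 gradients, fp32 master weights, and the two fp32 optimizer moments, and deliberately ignores activations, which depend on batch size, sequence length, and recomputation:</p>

<pre><code class="language-python">
# Back-of-the-envelope model-state memory, and the minimum number of GPUs a
# ZeRO-3-style full sharding of that state would need. Activations are ignored.
import math

def model_state_gib(num_params: float, bytes_per_param: int = 16) -> float:
    return num_params * bytes_per_param / 2**30

def min_gpus_for_state(num_params: float, gpu_mem_gib: float = 80.0) -> int:
    return math.ceil(model_state_gib(num_params) / gpu_mem_gib)

for n in (8e9, 70e9, 405e9):
    print(f"{n / 1e9:.0f}B params: ~{model_state_gib(n):,.0f} GiB of model state, "
          f"at least {min_gpus_for_state(n)} x 80 GiB GPUs just for the state")
</code></pre>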

<p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
<ul>
<li>For models under 10B parameters, you can use either Tensor Parallelism or Data Parallelism with ZeRO-3 and full recompute across 8 GPUs</li>
<li>For models between 10B and 100B parameters requiring more than 8 GPUs, you have several options:
<ul>
<li>Tensor Parallelism (TP=8) combined with Pipeline Parallelism</li>
<li>Tensor Parallelism (TP=8) with Data Parallelism (ZeRO-3)</li>
<li>Pure Data Parallelism with ZeRO-3</li>
</ul>
</li>
<li>At 512+ GPU scale, pure Data Parallelism becomes inefficient - better to combine DP with either Tensor or Pipeline Parallelism</li>
<li>At 1024+ GPU scale, the recommended setup is TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
</ul>

<p>Special considerations:</p>
<ul>
<li>For very long sequences, add Context Parallelism (CP) across nodes</li>
<li>For Mixture of Experts architectures, use Expert Parallelism (EP) across nodes</li>
</ul>

<p>GPU-poor case 😭 - when you are short on GPU resources:</p>
<ul>
<li>Enable full activation recomputation to trade compute for memory</li>
<li>Use gradient accumulation to process larger batches with limited memory</li>
</ul>

<p>Now that we have a single model instance training, we need to make sure we have the right batch size.</p>

<h3>Step 2: Achieving Target Global Batch Size</h3>

<p>Depending on how we set things up in step 1 in terms of micro-batch size and DP, our current global batch size might be too small or too big.</p>
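<p>The quantity we are adjusting here is simply the product of the data-parallel size, the micro-batch size, and the number of gradient accumulation steps (times the sequence length if you count in tokens). A minimal sketch with made-up example values:</p>

<pre><code class="language-python">
# gbs = dp * mbs * grad_acc (in samples); multiply by seq_len to count tokens.
dp, mbs, grad_acc, seq_len = 64, 2, 4, 4096   # example values, not a recommendation

gbs_samples = dp * mbs * grad_acc
gbs_tokens = gbs_samples * seq_len
print(f"global batch size: {gbs_samples} samples = {gbs_tokens / 1e6:.1f}M tokens")

# e.g. to hit a ~4M-token target while keeping dp and mbs fixed, adjust grad_acc:
target_tokens = 4 * 2**20
needed_grad_acc = round(target_tokens / (dp * mbs * seq_len))
print(f"grad_acc for ~4M tokens at dp={dp}, mbs={mbs}: {needed_grad_acc}")
</code></pre>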

<p>To increase the global batch size:</p>
<ul>
<li>Scale up Data Parallelism or gradient accumulation steps</li>
<li>For long sequences, leverage Context Parallelism</li>
</ul>

<p>To decrease the global batch size:</p>
<ul>
<li>Reduce Data Parallelism in favor of other parallelization strategies</li>
<li>For long sequences, reduce Context Parallelism</li>
</ul>

<p>Ok, now we have the model running in the configuration we want, but is it the fastest way to train it? Let's optimize throughput next.</p>

<h3>Step 3: Optimizing Training Throughput</h3>

<p>We want to make sure the training runs as fast as possible, so that all our precious GPUs are well utilized at all times. As long as memory and communication aren't bottlenecks, we can try the following (a quick utilization check is sketched after the list):</p>

<ul>
<li>Scale up Tensor Parallelism within the node to reduce the other parallelism requirements</li>
<li>Increase Data Parallelism with ZeRO-3 while maintaining the target batch size</li>
<li>When Data Parallelism communication becomes a bottleneck, transition to Pipeline Parallelism</li>
<li>Try scaling up the different parallelisms and fitting the max micro-batch size (mbs) to find the optimal balance between max GBS, model size, compute, and communication</li>
</ul>
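<p>A common way to check whether the GPUs are indeed well utilized at each step of this loop is to estimate the Model FLOPs Utilization (MFU). The sketch below uses the standard approximation of about 6 FLOPs per parameter per token for a forward and backward pass; the peak-throughput default is an assumption for H100 bf16 dense tensor-core performance and should be adjusted for your hardware:</p>

<pre><code class="language-python">
# Rough MFU estimate: achieved training FLOPs (about 6 FLOPs per parameter per
# token for forward + backward) divided by the theoretical peak of the cluster.
def mfu(num_params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float = 989e12) -> float:   # assumed H100 bf16 peak
    achieved = 6 * num_params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)

# example: a 70B-parameter model processing 3M tokens/s on 2048 GPUs
print(f"MFU ~ {mfu(70e9, 3.0e6, 2048):.1%}")
</code></pre>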

<p>We can roughly summarize the journey to the best configuration in the following diagram:</p>

<p><img alt="image.png" src="/assets/images/placeholder.png" /></p>


<p>This concludes our very deep dive into the distribution methods of 5D parallelism. However, besides scaling our model efficiently across GPUs, there is another way to improve model throughput and memory management.</p>

<p>Time to turn the lights off and activate CUDA mode!</p>

<h2>Diving in the GPUs – fusing, threading, mixing</h2>

- <
<p>Up to now our discussion has been focused on the high-level organization of our model operations. We’ve moved computations around between various accelerators, taking into account general memory constraints and the high-level scheduling of the compute units.</p>

<p>But this ignores all the optimizations we can do at a much lower level by carefully understanding how our model operations are scheduled and performed on each GPU.</p>

<p>This section will dive into many more details of GPU architecture, in particular NVIDIA’s, but the general ideas, as often, can be reused on similar accelerators.</p>

<p>We’ll briefly explain how GPUs are organized before covering the Flash-Attention revolution, how to efficiently schedule workloads on GPUs, and finally how various precisions can be used efficiently on GPUs.</p>


<h3>A primer on GPUs</h3>

<p>Generally, GPUs have a very hierarchical organization. In this primer we’ll keep the discussion at the level of the concepts that are necessary for the rest of our presentation.</p>

<p>On the compute side, GPUs consist of an array of compute units called <strong>Streaming Multiprocessors</strong> (SMs). Each SM contains and controls a set of streaming processors, also known as cores. For example, an NVIDIA H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see the <a href="https://resources.nvidia.com/en-us-tensor-core">docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
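<p>If you want to check these numbers on whatever device you are actually running on, here is a small sketch (assuming PyTorch and a visible NVIDIA GPU; the 128-cores-per-SM figure is specific to the H100 and is not exposed by this API):</p>

<pre><code class="language-python">
# Query the SM count and memory of the local GPU via PyTorch.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.multi_processor_count} SMs, "
          f"{props.total_memory / 2**30:.0f} GiB of memory")
else:
    print("no CUDA device visible")
</code></pre>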

<p></p>

<p></p>

<p></p>

<p></p>

<p></p>

<p><img alt="image.png" src="/assets/images/placeholder.png" /></p>


<h3>How to improve performance with Kernels ?</h3>

<h4>Memory Coalescing</h4>