add references
- dist/index.html +181 -7
- src/index.html +181 -7
dist/index.html
CHANGED
@@ -2313,18 +2313,192 @@
<h2>References</h2>

<h3>Landmark LLM Scaling Papers</h3>
-
+
+<div>
+<a href="https://arxiv.org/abs/1909.08053"><strong>Megatron-LM</strong></a>
+<p>Introduces tensor parallelism and efficient model parallelism techniques for training large language models.</p>
+</div>
+
+<div>
+<a href="https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/"><strong>Megatron-Turing NLG 530B</strong></a>
+<p>Describes the training of a 530B parameter model using a combination of DeepSpeed and Megatron-LM frameworks.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2204.02311"><strong>PaLM</strong></a>
+<p>Introduces Google's Pathways Language Model, demonstrating strong performance across hundreds of language tasks and reasoning capabilities.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2312.11805"><strong>Gemini</strong></a>
+<p>Presents Google's multimodal model architecture capable of processing text, images, audio, and video inputs.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2412.19437v1"><strong>DeepSeek-V3</strong></a>
+<p>DeepSeek's report on the architecture and training of the DeepSeek-V3 model.</p>
+</div>
+
+
<h3>Training Frameworks</h3>
-
+
+<div>
+<a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+<p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
+</div>
+
+<div>
+<a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
+<p>NVIDIA's framework for training large language models with model and data parallelism.</p>
+</div>
+
+<div>
+<a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
+<p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism techniques.</p>
+</div>
+
+<div>
+<a href="https://colossalai.org/"><strong>ColossalAI</strong></a>
+<p>Integrated large-scale model training system with various optimization techniques.</p>
+</div>
+
+<div>
+<a href="https://github.com/pytorch/torchtitan"><strong>torchtitan</strong></a>
+<p>A PyTorch-native library for large model training.</p>
+</div>
+
+<div>
+<a href="https://github.com/EleutherAI/gpt-neox"><strong>GPT-NeoX</strong></a>
+<p>EleutherAI's framework for training large language models, used to train GPT-NeoX-20B.</p>
+</div>
+
+<div>
+<a href="https://github.com/Lightning-AI/litgpt"><strong>LitGPT</strong></a>
+<p>Lightning AI's implementation of state-of-the-art open-source LLMs with a focus on reproducibility.</p>
+</div>
+
+<div>
+<a href="https://github.com/PrimeIntellect-ai/OpenDiLoCo"><strong>DiLoCo</strong></a>
+<p>Training language models across compute clusters with DiLoCo.</p>
+</div>
+
<h3>Debugging</h3>
-
+
+<div>
+<a href="https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"><strong>Speed profiling</strong></a>
+<p>Official PyTorch tutorial on using the profiler to analyze model performance and bottlenecks.</p>
+</div>
+
+<div>
+<a href="https://pytorch.org/blog/understanding-gpu-memory-1/"><strong>Memory profiling</strong></a>
+<p>Comprehensive guide to understanding and optimizing GPU memory usage in PyTorch.</p>
+</div>
+
+<div>
+<a href="https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"><strong>TensorBoard Profiler Tutorial</strong></a>
+<p>Guide to using TensorBoard's profiling tools for PyTorch models.</p>
+</div>
+
<h3>Distribution Techniques</h3>
-
-<
-
+
+<div>
+<a href="https://siboehm.com/articles/22/data-parallel-training"><strong>Data parallelism</strong></a>
+<p>Comprehensive explanation of data-parallel training in deep learning.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/1910.02054"><strong>ZeRO</strong></a>
+<p>Introduces the Zero Redundancy Optimizer (ZeRO) for memory-efficient training of large models.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2304.11277"><strong>FSDP</strong></a>
+<p>Fully Sharded Data Parallel training implementation in PyTorch.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2205.05198"><strong>Tensor and Sequence Parallelism + Selective Recomputation</strong></a>
+<p>Advanced techniques for efficient large-scale model training combining different parallelism strategies.</p>
+</div>
+
+<div>
+<a href="https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/#pipeline_parallelism"><strong>Pipeline parallelism</strong></a>
+<p>NVIDIA's guide to implementing pipeline parallelism for large model training.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2211.05953"><strong>Breadth-First Pipeline Parallelism</strong></a>
+<p>Includes a broad discussion of pipeline parallelism (PP) schedules.</p>
+</div>
+
+<div>
+<a href="https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/"><strong>All-reduce</strong></a>
+<p>Detailed explanation of the ring all-reduce algorithm used in distributed training.</p>
+</div>
+
+<div>
+<a href="https://github.com/zhuzilin/ring-flash-attention"><strong>Ring-flash-attention</strong></a>
+<p>Implementation of the ring attention mechanism combined with flash attention for efficient training.</p>
+</div>
+
+<div>
+<a href="https://coconut-mode.com/posts/ring-attention/"><strong>Ring attention tutorial</strong></a>
+<p>Tutorial explaining the concepts and implementation of ring attention.</p>
+</div>
+
+<div>
+<a href="https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/#understanding-performance-tradeoff-between-zero-and-3d-parallelism"><strong>ZeRO and 3D</strong></a>
+<p>DeepSpeed's guide to understanding the tradeoffs between ZeRO and 3D parallelism strategies.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/1710.03740"><strong>Mixed precision training</strong></a>
+<p>Introduces mixed precision training techniques for deep learning models.</p>
+</div>
+
<h3>Hardware</h3>
-
+
+<div>
+<a href="https://www.arxiv.org/abs/2408.14158"><strong>Fire-Flyer - a 10,000 PCI chips cluster</strong></a>
+<p>DeepSeek's report on designing a cluster with 10k PCIe GPUs.</p>
+</div>
+
+<div>
+<a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/"><strong>Meta's 24k H100 Pods</strong></a>
+<p>Meta's detailed overview of their massive AI infrastructure built with NVIDIA H100 GPUs.</p>
+</div>
+
+<div>
+<a href="https://www.semianalysis.com/p/100000-h100-clusters-power-network"><strong>SemiAnalysis - 100k H100 cluster</strong></a>
+<p>Analysis of large-scale H100 GPU clusters and their implications for AI infrastructure.</p>
+</div>
+
<h3>Others</h3>
+
+<div>
+<a href="https://github.com/stas00/ml-engineering"><strong>Stas Bekman's Handbook</strong></a>
+<p>Comprehensive handbook covering various aspects of training LLMs.</p>
+</div>
+
+<div>
+<a href="https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md"><strong>BLOOM training chronicles</strong></a>
+<p>Detailed documentation of the BLOOM model training process and challenges.</p>
+</div>
+
+<div>
+<a href="https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf"><strong>OPT logbook</strong></a>
+<p>Meta's detailed logbook documenting the training process of the OPT-175B model.</p>
+</div>
+
+<div>
+<a href="https://www.harmdevries.com/post/model-size-vs-compute-overhead/"><strong>Harm's law for training smol models longer</strong></a>
+<p>Investigation into the relationship between model size and training overhead.</p>
+</div>
+
+<div>
+<a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
+<p>Investigation into long-context training in terms of data and training cost.</p>
+</div>

<h2>Appendix</h2>

src/index.html
CHANGED
@@ -2313,18 +2313,192 @@
(identical hunk to dist/index.html above)