lvwerra (HF staff) committed
Commit d7e734d · 1 Parent(s): d652b2f

add references

Files changed (2):
  1. dist/index.html +181 -7
  2. src/index.html +181 -7
dist/index.html CHANGED
@@ -2313,18 +2313,192 @@
   <h2>References</h2>

   <h3>Landmark LLM Scaling Papers</h3>
-
+
+ <div>
+ <a href="https://arxiv.org/abs/1909.08053"><strong>Megatron-LM</strong></a>
+ <p>Introduces tensor parallelism and efficient model parallelism techniques for training large language models.</p>
+ </div>
+
+ <div>
+ <a href="https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/"><strong>Megatron-Turing NLG 530B</strong></a>
+ <p>Describes the training of a 530B parameter model using a combination of DeepSpeed and Megatron-LM frameworks.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/2204.02311"><strong>PaLM</strong></a>
+ <p>Introduces Google's Pathways Language Model, demonstrating strong performance across hundreds of language tasks and reasoning capabilities.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/2312.11805"><strong>Gemini</strong></a>
+ <p>Presents Google's multimodal model architecture capable of processing text, images, audio, and video inputs.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/2412.19437v1"><strong>DeepSeek-V3</strong></a>
+ <p>DeepSeek's report on architecture and training of the DeepSeek-V3 model.</p>
+ </div>
+
+
   <h3>Training Frameworks</h3>
-
+
+ <div>
+ <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+ <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
+ <p>NVIDIA's framework for training large language models with model and data parallelism.</p>
+ </div>
+
+ <div>
+ <a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
+ <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism techniques.</p>
+ </div>
+
+ <div>
+ <a href="https://colossalai.org/"><strong>ColossalAI</strong></a>
+ <p>Integrated large-scale model training system with various optimization techniques.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/pytorch/torchtitan"><strong>torchtitan</strong></a>
+ <p>A PyTorch native library for large model training.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/EleutherAI/gpt-neox"><strong>GPT-NeoX</strong></a>
+ <p>EleutherAI's framework for training large language models, used to train GPT-NeoX-20B.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/Lightning-AI/litgpt"><strong>LitGPT</strong></a>
+ <p>Lightning AI's implementation of state-of-the-art open-source LLMs with focus on reproducibility.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/PrimeIntellect-ai/OpenDiLoCo"><strong>DiLoco</strong></a>
+ <p>Training language models across compute clusters with DiLoCo.</p>
+ </div>
+
   <h3>Debugging</h3>
-
+
+ <div>
+ <a href="https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"><strong>Speed profiling</strong></a>
+ <p>Official PyTorch tutorial on using the profiler to analyze model performance and bottlenecks.</p>
+ </div>
+
+ <div>
+ <a href="https://pytorch.org/blog/understanding-gpu-memory-1/"><strong>Memory profiling</strong></a>
+ <p>Comprehensive guide to understanding and optimizing GPU memory usage in PyTorch.</p>
+ </div>
+
+ <div>
+ <a href="https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"><strong>TensorBoard Profiler Tutorial</strong></a>
+ <p>Guide to using TensorBoard's profiling tools for PyTorch models.</p>
+ </div>
+
   <h3>Distribution Techniques</h3>
-
- <h3>CUDA Kernels</h3>
-
+
+ <div>
+ <a href="https://siboehm.com/articles/22/data-parallel-training"><strong>Data parallelism</strong></a>
+ <p>Comprehensive explanation of data parallel training in deep learning.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/1910.02054"><strong>ZeRO</strong></a>
+ <p>Introduces Zero Redundancy Optimizer for training large models with memory optimization.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/2304.11277"><strong>FSDP</strong></a>
+ <p>Fully Sharded Data Parallel training implementation in PyTorch.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/2205.05198"><strong>Tensor and Sequence Parallelism + Selective Recomputation</strong></a>
+ <p>Advanced techniques for efficient large-scale model training combining different parallelism strategies.</p>
+ </div>
+
+ <div>
+ <a href="https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/#pipeline_parallelism"><strong>Pipeline parallelism</strong></a>
+ <p>NVIDIA's guide to implementing pipeline parallelism for large model training.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/2211.05953"><strong>Breadth first Pipeline Parallelism</strong></a>
+ <p>Includes broad discussions around PP schedules.</p>
+ </div>
+
+ <div>
+ <a href="https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/"><strong>All-reduce</strong></a>
+ <p>Detailed explanation of the ring all-reduce algorithm used in distributed training.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/zhuzilin/ring-flash-attention"><strong>Ring-flash-attention</strong></a>
+ <p>Implementation of ring attention mechanism combined with flash attention for efficient training.</p>
+ </div>
+
+ <div>
+ <a href="https://coconut-mode.com/posts/ring-attention/"><strong>Ring attention tutorial</strong></a>
+ <p>Tutorial explaining the concepts and implementation of ring attention.</p>
+ </div>
+
+ <div>
+ <a href="https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/#understanding-performance-tradeoff-between-zero-and-3d-parallelism"><strong>ZeRO and 3D</strong></a>
+ <p>DeepSpeed's guide to understanding tradeoffs between ZeRO and 3D parallelism strategies.</p>
+ </div>
+
+ <div>
+ <a href="https://arxiv.org/abs/1710.03740"><strong>Mixed precision training</strong></a>
+ <p>Introduces mixed precision training techniques for deep learning models.</p>
+ </div>
+
   <h3>Hardware</h3>
-
+
+ <div>
+ <a href="https://www.arxiv.org/abs/2408.14158"><strong>Fire-Flyer - a 10,000 PCI chips cluster</strong></a>
+ <p>DeepSeek's report on designing a cluster with 10k PCI GPUs.</p>
+ </div>
+
+ <div>
+ <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/"><strong>Meta's 24k H100 Pods</strong></a>
+ <p>Meta's detailed overview of their massive AI infrastructure built with NVIDIA H100 GPUs.</p>
+ </div>
+
+ <div>
+ <a href="https://www.semianalysis.com/p/100000-h100-clusters-power-network"><strong>Semianalysis - 100k H100 cluster</strong></a>
+ <p>Analysis of large-scale H100 GPU clusters and their implications for AI infrastructure.</p>
+ </div>
+
   <h3>Others</h3>
+
+ <div>
+ <a href="https://github.com/stas00/ml-engineering"><strong>Stas Bekman's Handbook</strong></a>
+ <p>Comprehensive handbook covering various aspects of training LLMs.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md"><strong>Bloom training chronicles</strong></a>
+ <p>Detailed documentation of the BLOOM model training process and challenges.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf"><strong>OPT logbook</strong></a>
+ <p>Meta's detailed logbook documenting the training process of the OPT-175B model.</p>
+ </div>
+
+ <div>
+ <a href="https://www.harmdevries.com/post/model-size-vs-compute-overhead/"><strong>Harm's law for training smol models longer</strong></a>
+ <p>Investigation into the relationship between model size and training overhead.</p>
+ </div>
+
+ <div>
+ <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
+ <p>Investigation into long context training in terms of data and training cost.</p>
+ </div>

   <h2>Appendix</h2>

src/index.html CHANGED
@@ -2313,18 +2313,192 @@
The changes to src/index.html are identical to those shown above for dist/index.html.