Activity Feed

AI & ML interests

Large scale distributed AI model training, model parallelisation, low-level GPU acceleration, make GPUs go brrrrr

Recent Activity

eliebak posted an update 9 days ago
Motif 2.6B tech report is pretty insane, first time I see a model with differential attention and PolyNorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "Simple moving average", averaging the last 6 checkpoints every 8B tokens (rough sketch below).
> They trained on FineMath, FineWeb2, DCLM, TxT360.
> Lots of detail on the finetuning data they used; for instance, they used EvolKit and did some "dataset fusion" to pack more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm and Cross-Layer Attention.
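
For intuition, here is a minimal sketch of what that checkpoint moving average could look like. This is my own illustration, not Motif's code: only the 6-checkpoint window and the 8B-token interval come from the notes above; the class and function names are made up.

```python
# Sketch only: keep a simple moving average of the last 6 checkpoints,
# snapshotting weights every 8B training tokens (values from the report notes;
# names are invented for illustration).
from collections import deque
import torch

def average_state_dicts(state_dicts):
    """Uniform average of a list of model state_dicts (simple moving average)."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

class CheckpointSMA:
    def __init__(self, window=6, interval_tokens=8_000_000_000):
        self.snapshots = deque(maxlen=window)      # last `window` checkpoints
        self.interval_tokens = interval_tokens
        self.tokens_since_snapshot = 0

    def step(self, model, tokens_in_batch):
        self.tokens_since_snapshot += tokens_in_batch
        if self.tokens_since_snapshot >= self.interval_tokens:
            self.tokens_since_snapshot = 0
            # Snapshot current weights on CPU to bound GPU memory usage.
            self.snapshots.append({k: v.detach().cpu() for k, v in model.state_dict().items()})

    def averaged_weights(self):
        return average_state_dicts(list(self.snapshots)) if self.snapshots else None
```

The averaged weights would typically be used for evaluation or as the final released checkpoint, while training itself continues from the raw weights.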

Motif-Technologies/Motif-2.6B
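
And a toy sketch of the "data mixture schedule" idea from the notes above. The domain names match the datasets they mention, but every weight and the linear interpolation are placeholders I made up; the report notes only say the mixture is adjusted continuously over training.

```python
# Hypothetical data mixture schedule: linearly interpolate per-domain sampling
# weights between a start and an end mixture as training progresses.
def mixture_at(progress, start_mix, end_mix):
    """progress in [0, 1] -> normalized per-domain sampling probabilities."""
    mix = {d: (1 - progress) * start_mix[d] + progress * end_mix[d] for d in start_mix}
    total = sum(mix.values())
    return {d: w / total for d, w in mix.items()}

# Invented weights, for illustration only.
start_mix = {"fineweb2": 0.60, "dclm": 0.25, "finemath": 0.10, "txt360": 0.05}
end_mix   = {"fineweb2": 0.40, "dclm": 0.25, "finemath": 0.25, "txt360": 0.10}

for tokens_seen in (0, 1.25e12, 2.5e12):   # a few points over a 2.5T-token run
    print(tokens_seen, mixture_at(tokens_seen / 2.5e12, start_mix, end_mix))
```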

dark-mode
#82 opened 6 months ago by serhany

typos
#119 opened 12 days ago by kashif

order button
#118 opened 23 days ago by lvwerra

lvwerra updated a Space about 1 month ago
julien-c updated a Space about 1 month ago
lvwerra in nanotron/book about 1 month ago

Update README.md
#1 opened about 1 month ago by lvwerra

Update README.md
#2 opened about 1 month ago by lvwerra

Update README.md
#3 opened about 1 month ago by lvwerra
julien-c published a dataset about 1 month ago
eliebak posted an update about 1 month ago
Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: Pretty crazy how after 70k the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale, but with an aggressive threshold); see the rough sketch after these notes. There's also a cool explanation in Appendix E of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher).
> Sparsity scaling laws to justify their sparsity ratio: they have a very solid training infra that lets the model be trained at this sparsity level. They could have pushed sparsity even further, but as sparsity increases, training becomes less efficient.
> They reduce the number of attention heads to make the model more efficient at long context, since attention is a big bottleneck there. They also remove 2 of the 3 "first dense" layers of the DeepSeek-V3 arch.
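
Here is a rough sketch of my reading of the QK-clip step (not Kimi's actual code; tensor shapes, names, and the threshold value are assumptions): after an optimizer step, any head whose max attention logit exceeded a threshold gets its Q/K projection weights rescaled so the logits drop back under it.

```python
# Sketch of QK-clip as I understand it from the report notes above.
import torch

@torch.no_grad()
def qk_clip_(w_q, w_k, max_logit_per_head, tau=100.0):
    """
    w_q, w_k: [num_heads, head_dim, hidden] per-head Q/K projection weights (assumed layout)
    max_logit_per_head: [num_heads] max pre-softmax attention logit seen this step
    tau: clipping threshold (placeholder value)
    """
    for h, s_max in enumerate(max_logit_per_head.tolist()):
        if s_max > tau:
            gamma = tau / s_max            # shrink factor for this head
            # Split the rescaling between Q and K so the q.k logits scale by gamma overall.
            w_q[h] *= gamma ** 0.5
            w_k[h] *= gamma ** 0.5
```

Once the max logits settle below tau (which per the report happens fairly early in training), this clip becomes a no-op.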

With the sparsity and the attention heads divided by 2, they achieve an 83% FLOPs improvement over the DeepSeek-V3 arch at 128k.

> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus into different styles; for longer documents they do it chunk by chunk (rough sketch below). I'm (half) surprised that ONLY 1 epoch of the data rephrased 10 ways gets better accuracy than 10 epochs of the same data rephrased once (assuming the same number of training tokens, I think?).
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal tooling/evals to test it; as always, it's still a bit unclear to me how to measure that properly.
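
A hypothetical sketch of the "rephrase by chunk" idea: `rephrase` below is a stand-in for whatever LLM call does the rewriting, and nothing here comes from the report beyond the chunk-wise processing and the multiple-styles-instead-of-multiple-epochs idea.

```python
# Illustrative only: split a long document into chunks, rephrase each chunk in
# a target style, and stitch the results back together.
from typing import Callable, List

def rephrase_document(text: str,
                      rephrase: Callable[[str, str], str],
                      style: str = "learning note",
                      chunk_chars: int = 4000) -> str:
    chunks: List[str] = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Rephrase each chunk independently so long documents fit in the context window.
    rewritten = [rephrase(chunk, style) for chunk in chunks]
    return "\n".join(rewritten)

def make_corpus_variants(text: str, rephrase, styles):
    """One source document -> several stylistic variants (vs. repeating epochs)."""
    return [rephrase_document(text, rephrase, style=style) for style in styles]
```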

The infra is also very nice; quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 compute, but FP8 storage for specific layers; selective recomputation of inexpensive blocks; activation offloading to CPU (rough sketch below).
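
To make those last two tricks concrete, here is a small PyTorch sketch of selective recomputation and CPU activation offloading in general; it is illustrative only and has nothing to do with Kimi's actual training stack.

```python
# Illustrative only (not Kimi's infra): activation checkpointing for a cheap
# block, plus offloading saved activations to CPU for an expensive one.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
cheap_block = Block().to(device)
expensive_block = Block().to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# Selective recomputation: don't keep the cheap block's intermediate
# activations; recompute them during the backward pass instead.
y = checkpoint(cheap_block, x, use_reentrant=False)

# Activation offloading: park the expensive block's saved activations in CPU
# memory and copy them back to the device only when backward needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    y = expensive_block(y)

y.sum().backward()
```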