Meta researchers: How many compute hours should we use to train Llama 3.1? Mr. Zuck: Yes!
Good folks at @AIatMeta did not just release the models but also published a detailed 92-page paper on their findings and the technical aspects of the models and their training process!
Generally, we just gobble up these weights and forget about the compute infrastructure used to train these models.
Here are some interesting findings about the computing infrastructure of the Llamas:
- Llama 1 and 2 were trained on @Meta's AI Research SuperCluster. Llama 3 training was migrated to Meta's production clusters!
- That's 16,000 H100 GPUs, each with a 700W TDP and 80GB of HBM3, housed in Meta's Grand Teton AI server platform.
- What about storing checkpoints? They used Tectonic, Meta's distributed file system, with 240 PB of capacity and 7 TB/s peak throughput.
- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging.
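The specs above invite some back-of-envelope arithmetic. This is a hedged sketch using only the figures quoted in the bullets (16,000 GPUs, 700W TDP, 80GB HBM3, 1 MB–4 GB checkpoint state per GPU); it is not from the paper itself:

```python
# Back-of-envelope arithmetic for the Llama 3 training cluster,
# using only the figures quoted above.

NUM_GPUS = 16_000
TDP_W = 700        # watts per GPU
HBM_GB = 80        # HBM3 per GPU

total_power_mw = NUM_GPUS * TDP_W / 1e6   # megawatts (GPU silicon only)
total_hbm_pb = NUM_GPUS * HBM_GB / 1e6    # petabytes of aggregate HBM

# Aggregate checkpoint burst if every GPU writes its state at once
min_ckpt_tb = NUM_GPUS * 1e-6   # 1 MB per GPU -> 0.016 TB total
max_ckpt_tb = NUM_GPUS * 4e-3   # 4 GB per GPU -> 64 TB total

print(f"GPU power draw:   {total_power_mw:.1f} MW")   # 11.2 MW
print(f"Aggregate HBM:    {total_hbm_pb:.2f} PB")     # 1.28 PB
print(f"Checkpoint burst: up to {max_ckpt_tb:.0f} TB")
```

A 64 TB worst-case burst against a 7 TB/s peak storage fabric makes the saturation complaint below very believable.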
If this sounds big, well, they also document the humongous challenges that come with it:
- In the 54-day training period, there were 466 job interruptions.
- About 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues. Mostly GPUs!
- Saving all checkpoints is cool until you do it for a 405B-parameter model. The bursty checkpoint writes, essential for saving state during training, periodically saturated the storage fabric and hurt performance.
- Despite all this, effective training time (time spent on useful training divided by elapsed time) was over 90%.
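A quick sanity check on those reliability numbers. This is illustrative arithmetic only, using the 54-day window, 466 interruptions, and the >90% effective-time figure quoted above:

```python
# Illustrative arithmetic on the reliability figures quoted above.

DAYS = 54
INTERRUPTIONS = 466
EFFECTIVE_FRACTION = 0.90   # lower bound on effective training time

# Average gap between job interruptions, in hours
hours_between = DAYS * 24 / INTERRUPTIONS

# Upper bound on non-useful time implied by >90% effective training
max_days_lost = DAYS * (1 - EFFECTIVE_FRACTION)

print(f"Mean time between interruptions: {hours_between:.1f} h")   # ~2.8 h
print(f"Non-useful time bounded by:      {max_days_lost:.1f} days")  # 5.4 days
```

Keeping a job productive when something breaks roughly every three hours is exactly why the automated recovery and checkpointing machinery matters.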
I think this is the stuff movies are made of!