Ketan
Ketansomewhere
1 follower · 1 following
KetanMann
AI & ML interests
Diffusion Models
Recent Activity
- updated a model about 1 month ago: Ketansomewhere/FER_2013_Conditional_Diffusion
- liked a dataset 8 months ago: m1guelpf/nouns
- reacted to singhsidhukuldeep's post 8 months ago:
Meta Researchers: How many compute hours should we use to train Llama 3.1? Mr. Zuck: Yes!

Good folks at @AIatMeta did not just release the models; they also published a detailed 92-page paper on their findings and the technical aspects of the models and their training process. Generally, we just gobble up the weights and forget the compute infrastructure used to train these models.

Here are some interesting findings about the computing infrastructure behind the Llamas:
- Llama 1 and 2 were trained on @Meta's AI Research SuperCluster; Llama 3 was migrated to Meta's production clusters.
- That's 16,000 H100 GPUs, each with a 700 W TDP and 80 GB of HBM3, arranged in Meta's Grand Teton AI server platform.
- What about storing checkpoints? They used Tectonic, a distributed file system, with capacity reaching 240 PB and peak throughput of 7 TB/s.
- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging.

If this sounds big, well, they also document the humongous challenges that come with it:
- Over the 54-day training period, there were 466 job interruptions.
- About 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues, mostly GPUs.
- Saving all checkpoints is cool until you do it for a 300B+ parameter model. The bursty nature of checkpoint writes, essential for state-saving during training, periodically saturated the storage fabric, impacting performance.
- With all this, effective training time (measured as time spent on useful training over elapsed time) was still higher than 90%.

I think this is the stuff movies can be made about!

Paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
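The figures in the post are enough for a rough sanity check of the checkpoint-burst and interruption claims. Here is a minimal back-of-envelope sketch (my own arithmetic from the numbers quoted above, not a calculation from the paper itself):

```python
# Back-of-envelope estimates using figures quoted in the post above.
# These are illustrative, not reproductions of the paper's methodology.

NUM_GPUS = 16_000           # H100 GPUs in the cluster
STATE_PER_GPU_GB = 4        # worst case of the 1 MB - 4 GB per-GPU state range
PEAK_THROUGHPUT_TB_S = 7    # Tectonic's quoted peak throughput

# Worst-case simultaneous checkpoint write across all GPUs
burst_tb = NUM_GPUS * STATE_PER_GPU_GB / 1000
seconds_at_peak = burst_tb / PEAK_THROUGHPUT_TB_S

# Interruption rate over the 54-day run
interruptions_per_day = 466 / 54

print(f"worst-case checkpoint burst: {burst_tb:.0f} TB")
print(f"time to drain at peak throughput: ~{seconds_at_peak:.1f} s")
print(f"interruptions per day: ~{interruptions_per_day:.1f}")
```

A ~64 TB burst that needs the storage fabric's entire peak throughput for several seconds makes it intuitive why periodic checkpoint writes could saturate it.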
Organizations
None yet
Ketansomewhere's activity
liked a dataset 8 months ago: m1guelpf/nouns · Updated Sep 25, 2022 · 49.9k · 264 · 32