Meta researchers: How many compute hours should we use to train Llama 3.1? Mr. Zuck: Yes!
Good folks at @AIatMeta did not just release the models but also published a detailed 92-page paper on their findings and the technical aspects of the models and their training process!
Generally, we just gobble up these weights and forget about the compute infrastructure used to train these models.
Here are some interesting findings about the computing infrastructure of the Llamas:
- Llama 1 and 2 were trained on @Meta's AI Research SuperCluster. Llama 3 training was migrated to Meta's production clusters!
- That's 16,000 H100 GPUs, each with a 700W TDP and 80GB of HBM3, housed in Meta's Grand Teton AI server platform.
- What about storing checkpoints? They used Tectonic, Meta's distributed file system, with 240 PB of capacity and 7 TB/s peak throughput.
- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging.
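The specs above invite some back-of-envelope arithmetic. This is a hedged sketch using only the figures quoted in the bullets (16,000 GPUs, 700W TDP, 80GB HBM3, 1 MB–4 GB checkpoint state per GPU); it is not from the paper itself:

```python
# Back-of-envelope arithmetic for the Llama 3 training cluster,
# using only the figures quoted above.

NUM_GPUS = 16_000
TDP_W = 700        # watts per GPU
HBM_GB = 80        # HBM3 per GPU

total_power_mw = NUM_GPUS * TDP_W / 1e6   # megawatts (GPU silicon only)
total_hbm_pb = NUM_GPUS * HBM_GB / 1e6    # petabytes of aggregate HBM

# Aggregate checkpoint burst if every GPU writes its state at once
min_ckpt_tb = NUM_GPUS * 1e-6   # 1 MB per GPU -> 0.016 TB total
max_ckpt_tb = NUM_GPUS * 4e-3   # 4 GB per GPU -> 64 TB total

print(f"GPU power draw:   {total_power_mw:.1f} MW")   # 11.2 MW
print(f"Aggregate HBM:    {total_hbm_pb:.2f} PB")     # 1.28 PB
print(f"Checkpoint burst: up to {max_ckpt_tb:.0f} TB")
```

A 64 TB worst-case burst against a 7 TB/s peak storage fabric makes the saturation complaint below very believable.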
If this sounds big, well, they also document the humongous challenges that come with it:
- In the 54-day training period, there were 466 job interruptions.
- About 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues. Mostly GPUs!
- Saving all checkpoints is cool until you do it for a 405B-parameter model. The bursty checkpoint writes, essential for saving state during training, periodically saturated the storage fabric and hurt performance.
- Despite all this, effective training time (time spent on useful training divided by elapsed time) was over 90%.
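A quick sanity check on those reliability numbers. This is illustrative arithmetic only, using the 54-day window, 466 interruptions, and the >90% effective-time figure quoted above:

```python
# Illustrative arithmetic on the reliability figures quoted above.

DAYS = 54
INTERRUPTIONS = 466
EFFECTIVE_FRACTION = 0.90   # lower bound on effective training time

# Average gap between job interruptions, in hours
hours_between = DAYS * 24 / INTERRUPTIONS

# Upper bound on non-useful time implied by >90% effective training
max_days_lost = DAYS * (1 - EFFECTIVE_FRACTION)

print(f"Mean time between interruptions: {hours_between:.1f} h")   # ~2.8 h
print(f"Non-useful time bounded by:      {max_days_lost:.1f} days")  # 5.4 days
```

Keeping a job productive when something breaks roughly every three hours is exactly why the automated recovery and checkpointing machinery matters.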
I think this is the stuff movies are made of!