Pretraining Time Cost?
#30 · opened by fov223
Do you mind sharing how many hours you spent on 512 TPU v5e chips to pretrain the Gemma 2B model? Also, how many for Gemma 7B on 4096 TPU v5e chips? Thx.
Hi @fov223,
Sorry for the late response. The pretraining time for these models depends on several factors, including optimization strategies, model parallelism, batch sizes, the dataset, and hyperparameter choices, all of which affect the total time spent during pretraining.
For a 2B-parameter model on 512 TPU v5e chips, assuming an optimized setup, training could take roughly 5–7 days.
For the 7B-parameter model on 4096 TPU v5e chips, pretraining would naturally take longer due to the increased model size and complexity, as well as the overhead of the larger distributed training setup.
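If it helps, you can get a rough back-of-the-envelope estimate from the common ~6·N·D FLOPs approximation for dense transformer training, combined with the cluster's sustained throughput. The sketch below is only illustrative: the token counts, the ~197 TFLOP/s bf16 per-chip peak, and the 50% MFU are assumptions I'm plugging in, not official Gemma training figures.

```python
# Back-of-the-envelope pretraining time estimate (all inputs are assumptions).
# Uses the common ~6 * N * D FLOPs approximation for dense transformer training.

def estimate_days(params, tokens, num_chips, peak_flops_per_chip=197e12, mfu=0.5):
    """Estimate wall-clock training time in days.

    params              -- model parameter count
    tokens              -- number of training tokens (assumed, not official)
    num_chips           -- number of accelerator chips
    peak_flops_per_chip -- peak bf16 FLOP/s per chip (~197 TFLOP/s for TPU v5e)
    mfu                 -- assumed model FLOPs utilization (0.4-0.6 is typical)
    """
    total_flops = 6 * params * tokens                 # ~6*N*D rule of thumb
    sustained = num_chips * peak_flops_per_chip * mfu # effective cluster FLOP/s
    return total_flops / sustained / 86_400           # seconds -> days

# Illustrative token counts only -- swap in the real numbers if you have them.
print(f"2B on 512 chips:  ~{estimate_days(2e9, 2e12, 512):.1f} days")
print(f"7B on 4096 chips: ~{estimate_days(7e9, 6e12, 4096):.1f} days")
```

With those example inputs, the 2B case lands in the same ~5–7 day range mentioned above; the actual numbers will shift with the real token count, utilization, and interconnect overhead.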
Thank you.