Training in progress, step 4000
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:93bc62ba34f7e7a80168b1e5c1b2cff4630e3fcf60ebb0046e78af5fe6a11945
 size 1527847357
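The file changed above is a Git LFS pointer rather than the model weights themselves: three `key value` lines giving the spec version, the content hash, and the size in bytes. As a minimal sketch, assuming only those three keys are present, such a pointer can be read as follows. The names `LfsPointer` and `parse_lfs_pointer` are hypothetical helpers for illustration and are not part of Git LFS or any Hugging Face library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LfsPointer:
    """Fields of a Git LFS pointer file (hypothetical container for illustration)."""
    version: str
    oid_algo: str
    oid: str
    size: int


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a pointer file into an LfsPointer."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return LfsPointer(fields["version"], algo, digest, int(fields["size"]))


# The new pointer contents from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:93bc62ba34f7e7a80168b1e5c1b2cff4630e3fcf60ebb0046e78af5fe6a11945\n"
    "size 1527847357\n"
)
print(pointer.oid_algo, pointer.size)
```

The size is identical on both sides of the diff (1527847357 bytes), so only the content hash changed between checkpoints.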
run.log
CHANGED
@@ -1237,3 +1237,254 @@ Rank: 0 partition count [1] and sizes[(763857920, False)]
|
|
1237 |
[2022-12-20 01:14:33,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-3000/global_step3000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
|
1238 |
[2022-12-20 01:14:33,082] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-3000/global_step3000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
1239 |
[2022-12-20 01:14:33,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now!
|
1240 |
+
[2022-12-20 01:18:29,011] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
|
1241 |
+
[2022-12-20 01:18:41,053] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
|
1242 |
+
[2022-12-20 01:19:34,609] [INFO] [logging.py:68:log_dist] [Rank 0] step=3010, skipped=6, lr=[4.437777777777778e-06], mom=[[0.9, 0.999]]
|
1243 |
+
[2022-12-20 01:19:34,611] [INFO] [timer.py:196:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=5.010784671686443, CurrSamplesPerSec=5.7881583152353295, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1244 |
+
[2022-12-20 01:22:00,245] [INFO] [logging.py:68:log_dist] [Rank 0] step=3020, skipped=6, lr=[4.415555555555556e-06], mom=[[0.9, 0.999]]
|
1245 |
+
[2022-12-20 01:22:00,246] [INFO] [timer.py:196:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=5.011494782563396, CurrSamplesPerSec=5.182315133261153, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1246 |
+
{'loss': 0.0001, 'learning_rate': 4.404444444444445e-06, 'epoch': 43.21}
|
1247 |
+
[2022-12-20 01:24:28,024] [INFO] [logging.py:68:log_dist] [Rank 0] step=3030, skipped=6, lr=[4.393333333333334e-06], mom=[[0.9, 0.999]]
|
1248 |
+
[2022-12-20 01:24:28,025] [INFO] [timer.py:196:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=5.011850955905219, CurrSamplesPerSec=5.017293001245616, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1249 |
+
[2022-12-20 01:26:56,035] [INFO] [logging.py:68:log_dist] [Rank 0] step=3040, skipped=6, lr=[4.371111111111112e-06], mom=[[0.9, 0.999]]
|
1250 |
+
[2022-12-20 01:26:56,037] [INFO] [timer.py:196:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=5.012175038461916, CurrSamplesPerSec=4.827043409240733, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1251 |
+
[2022-12-20 01:29:29,556] [INFO] [logging.py:68:log_dist] [Rank 0] step=3050, skipped=6, lr=[4.348888888888889e-06], mom=[[0.9, 0.999]]
|
1252 |
+
[2022-12-20 01:29:29,558] [INFO] [timer.py:196:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=5.011714249674453, CurrSamplesPerSec=5.013898113078914, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1253 |
+
{'loss': 0.0001, 'learning_rate': 4.348888888888889e-06, 'epoch': 43.57}
|
1254 |
+
[2022-12-20 01:31:59,512] [INFO] [logging.py:68:log_dist] [Rank 0] step=3060, skipped=6, lr=[4.326666666666667e-06], mom=[[0.9, 0.999]]
|
1255 |
+
[2022-12-20 01:31:59,514] [INFO] [timer.py:196:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=5.0118575458496135, CurrSamplesPerSec=5.7010847557137065, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1256 |
+
[2022-12-20 01:34:29,706] [INFO] [logging.py:68:log_dist] [Rank 0] step=3070, skipped=6, lr=[4.304444444444445e-06], mom=[[0.9, 0.999]]
|
1257 |
+
[2022-12-20 01:34:29,708] [INFO] [timer.py:196:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=5.011864060021234, CurrSamplesPerSec=4.510793917369373, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1258 |
+
{'loss': 0.0001, 'learning_rate': 4.2933333333333334e-06, 'epoch': 43.93}
|
1259 |
+
[2022-12-20 01:37:10,981] [INFO] [logging.py:68:log_dist] [Rank 0] step=3080, skipped=6, lr=[4.282222222222222e-06], mom=[[0.9, 0.999]]
|
1260 |
+
[2022-12-20 01:37:10,982] [INFO] [timer.py:196:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=5.0104830321936324, CurrSamplesPerSec=4.9590286848937275, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1261 |
+
[2022-12-20 01:39:46,952] [INFO] [logging.py:68:log_dist] [Rank 0] step=3090, skipped=6, lr=[4.26e-06], mom=[[0.9, 0.999]]
|
1262 |
+
[2022-12-20 01:39:46,954] [INFO] [timer.py:196:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=5.009856571118483, CurrSamplesPerSec=4.942705625853167, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1263 |
+
[2022-12-20 01:42:15,154] [INFO] [logging.py:68:log_dist] [Rank 0] step=3100, skipped=6, lr=[4.2377777777777775e-06], mom=[[0.9, 0.999]]
|
1264 |
+
[2022-12-20 01:42:15,155] [INFO] [timer.py:196:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=5.010175794540146, CurrSamplesPerSec=5.249672686832617, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1265 |
+
{'loss': 0.0001, 'learning_rate': 4.2377777777777775e-06, 'epoch': 44.29}
|
1266 |
+
[2022-12-20 01:44:45,522] [INFO] [logging.py:68:log_dist] [Rank 0] step=3110, skipped=6, lr=[4.215555555555556e-06], mom=[[0.9, 0.999]]
|
1267 |
+
[2022-12-20 01:44:45,524] [INFO] [timer.py:196:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=5.010168572741285, CurrSamplesPerSec=4.82225035907906, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1268 |
+
[2022-12-20 01:47:20,205] [INFO] [logging.py:68:log_dist] [Rank 0] step=3120, skipped=6, lr=[4.1933333333333336e-06], mom=[[0.9, 0.999]]
|
1269 |
+
[2022-12-20 01:47:20,207] [INFO] [timer.py:196:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=5.009661435535492, CurrSamplesPerSec=5.020791655184806, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1270 |
+
{'loss': 0.0001, 'learning_rate': 4.182222222222222e-06, 'epoch': 44.64}
|
1271 |
+
[2022-12-20 01:49:49,881] [INFO] [logging.py:68:log_dist] [Rank 0] step=3130, skipped=6, lr=[4.171111111111111e-06], mom=[[0.9, 0.999]]
|
1272 |
+
[2022-12-20 01:49:49,882] [INFO] [timer.py:196:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=5.009877115926111, CurrSamplesPerSec=5.246828135694998, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1273 |
+
[2022-12-20 01:52:15,719] [INFO] [logging.py:68:log_dist] [Rank 0] step=3140, skipped=6, lr=[4.148888888888889e-06], mom=[[0.9, 0.999]]
|
1274 |
+
[2022-12-20 01:52:15,720] [INFO] [timer.py:196:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=5.010416027361077, CurrSamplesPerSec=5.122332311953563, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1275 |
+
[2022-12-20 01:54:43,394] [INFO] [logging.py:68:log_dist] [Rank 0] step=3150, skipped=6, lr=[4.126666666666667e-06], mom=[[0.9, 0.999]]
|
1276 |
+
[2022-12-20 01:54:43,396] [INFO] [timer.py:196:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=5.0107946397971945, CurrSamplesPerSec=5.1281769053064306, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1277 |
+
{'loss': 0.0001, 'learning_rate': 4.126666666666667e-06, 'epoch': 45.0}
|
1278 |
+
[2022-12-20 01:57:10,843] [INFO] [logging.py:68:log_dist] [Rank 0] step=3160, skipped=6, lr=[4.104444444444445e-06], mom=[[0.9, 0.999]]
|
1279 |
+
[2022-12-20 01:57:10,845] [INFO] [timer.py:196:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=5.0111903358827785, CurrSamplesPerSec=5.168605423919754, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1280 |
+
[2022-12-20 01:59:37,211] [INFO] [logging.py:68:log_dist] [Rank 0] step=3170, skipped=6, lr=[4.0822222222222225e-06], mom=[[0.9, 0.999]]
|
1281 |
+
[2022-12-20 01:59:37,212] [INFO] [timer.py:196:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=5.011832383634813, CurrSamplesPerSec=5.17065105423624, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1282 |
+
{'loss': 0.0001, 'learning_rate': 4.071111111111111e-06, 'epoch': 45.36}
|
1283 |
+
[2022-12-20 02:02:02,760] [INFO] [logging.py:68:log_dist] [Rank 0] step=3180, skipped=6, lr=[4.060000000000001e-06], mom=[[0.9, 0.999]]
|
1284 |
+
[2022-12-20 02:02:02,761] [INFO] [timer.py:196:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=5.012414169129676, CurrSamplesPerSec=5.239538770686272, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1285 |
+
[2022-12-20 02:04:28,303] [INFO] [logging.py:68:log_dist] [Rank 0] step=3190, skipped=6, lr=[4.0377777777777786e-06], mom=[[0.9, 0.999]]
|
1286 |
+
[2022-12-20 02:04:28,304] [INFO] [timer.py:196:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=5.013003951326583, CurrSamplesPerSec=5.178411107945444, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1287 |
+
[2022-12-20 02:06:56,787] [INFO] [logging.py:68:log_dist] [Rank 0] step=3200, skipped=6, lr=[4.015555555555556e-06], mom=[[0.9, 0.999]]
|
1288 |
+
[2022-12-20 02:06:56,789] [INFO] [timer.py:196:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=5.013173386209827, CurrSamplesPerSec=4.8972863458813185, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1289 |
+
{'loss': 0.0001, 'learning_rate': 4.015555555555556e-06, 'epoch': 45.71}
|
1290 |
+
[2022-12-20 02:09:29,989] [INFO] [logging.py:68:log_dist] [Rank 0] step=3210, skipped=6, lr=[3.993333333333334e-06], mom=[[0.9, 0.999]]
|
1291 |
+
[2022-12-20 02:09:29,990] [INFO] [timer.py:196:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=5.012791790383584, CurrSamplesPerSec=4.8984594609760315, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1292 |
+
[2022-12-20 02:12:03,668] [INFO] [logging.py:68:log_dist] [Rank 0] step=3220, skipped=6, lr=[3.971111111111111e-06], mom=[[0.9, 0.999]]
|
1293 |
+
[2022-12-20 02:12:03,669] [INFO] [timer.py:196:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=5.012371857760337, CurrSamplesPerSec=5.132404001446214, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1294 |
+
{'loss': 0.0001, 'learning_rate': 3.96e-06, 'epoch': 46.07}
|
1295 |
+
[2022-12-20 02:14:35,956] [INFO] [logging.py:68:log_dist] [Rank 0] step=3230, skipped=6, lr=[3.948888888888889e-06], mom=[[0.9, 0.999]]
|
1296 |
+
[2022-12-20 02:14:35,957] [INFO] [timer.py:196:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=5.012138963050575, CurrSamplesPerSec=4.897640088990262, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1297 |
+
[2022-12-20 02:17:09,501] [INFO] [logging.py:68:log_dist] [Rank 0] step=3240, skipped=6, lr=[3.926666666666667e-06], mom=[[0.9, 0.999]]
|
1298 |
+
[2022-12-20 02:17:09,503] [INFO] [timer.py:196:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=5.011795806427403, CurrSamplesPerSec=4.918891450300068, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1299 |
+
[2022-12-20 02:19:27,538] [INFO] [logging.py:68:log_dist] [Rank 0] step=3250, skipped=6, lr=[3.904444444444444e-06], mom=[[0.9, 0.999]]
|
1300 |
+
[2022-12-20 02:19:27,540] [INFO] [timer.py:196:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=5.013307950894662, CurrSamplesPerSec=5.518680249132624, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1301 |
+
{'loss': 0.0001, 'learning_rate': 3.904444444444444e-06, 'epoch': 46.43}
|
1302 |
+
[2022-12-20 02:21:50,333] [INFO] [logging.py:68:log_dist] [Rank 0] step=3260, skipped=6, lr=[3.882222222222223e-06], mom=[[0.9, 0.999]]
|
1303 |
+
[2022-12-20 02:21:50,334] [INFO] [timer.py:196:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=5.014276951656808, CurrSamplesPerSec=5.244702226717013, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1304 |
+
[2022-12-20 02:24:15,987] [INFO] [logging.py:68:log_dist] [Rank 0] step=3270, skipped=6, lr=[3.86e-06], mom=[[0.9, 0.999]]
|
1305 |
+
[2022-12-20 02:24:15,989] [INFO] [timer.py:196:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=5.014870559660052, CurrSamplesPerSec=4.903460097765076, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1306 |
+
{'loss': 0.0001, 'learning_rate': 3.848888888888889e-06, 'epoch': 46.79}
|
1307 |
+
[2022-12-20 02:26:44,569] [INFO] [logging.py:68:log_dist] [Rank 0] step=3280, skipped=6, lr=[3.837777777777778e-06], mom=[[0.9, 0.999]]
|
1308 |
+
[2022-12-20 02:26:44,571] [INFO] [timer.py:196:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=5.0151254618912855, CurrSamplesPerSec=5.13210128825685, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1309 |
+
[2022-12-20 02:29:10,914] [INFO] [logging.py:68:log_dist] [Rank 0] step=3290, skipped=6, lr=[3.8155555555555555e-06], mom=[[0.9, 0.999]]
|
1310 |
+
[2022-12-20 02:29:10,915] [INFO] [timer.py:196:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=5.0155869116890734, CurrSamplesPerSec=5.244405589391571, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1311 |
+
[2022-12-20 02:31:34,598] [INFO] [logging.py:68:log_dist] [Rank 0] step=3300, skipped=6, lr=[3.793333333333334e-06], mom=[[0.9, 0.999]]
|
1312 |
+
[2022-12-20 02:31:34,599] [INFO] [timer.py:196:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=5.016568403527747, CurrSamplesPerSec=5.4285716076894746, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1313 |
+
{'loss': 0.0001, 'learning_rate': 3.793333333333334e-06, 'epoch': 47.14}
|
1314 |
+
[2022-12-20 02:34:02,010] [INFO] [logging.py:68:log_dist] [Rank 0] step=3310, skipped=6, lr=[3.7711111111111116e-06], mom=[[0.9, 0.999]]
|
1315 |
+
[2022-12-20 02:34:02,011] [INFO] [timer.py:196:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=5.016859073104873, CurrSamplesPerSec=5.228046642973993, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1316 |
+
[2022-12-20 02:36:36,456] [INFO] [logging.py:68:log_dist] [Rank 0] step=3320, skipped=6, lr=[3.7488888888888892e-06], mom=[[0.9, 0.999]]
|
1317 |
+
[2022-12-20 02:36:36,458] [INFO] [timer.py:196:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=5.016446215770842, CurrSamplesPerSec=4.811722676247159, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1318 |
+
{'loss': 0.0001, 'learning_rate': 3.737777777777778e-06, 'epoch': 47.5}
|
1319 |
+
[2022-12-20 02:39:04,992] [INFO] [logging.py:68:log_dist] [Rank 0] step=3330, skipped=6, lr=[3.726666666666667e-06], mom=[[0.9, 0.999]]
|
1320 |
+
[2022-12-20 02:39:04,994] [INFO] [timer.py:196:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=5.016684021365297, CurrSamplesPerSec=5.210960829388422, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1321 |
+
[2022-12-20 02:41:33,573] [INFO] [logging.py:68:log_dist] [Rank 0] step=3340, skipped=6, lr=[3.704444444444445e-06], mom=[[0.9, 0.999]]
|
1322 |
+
[2022-12-20 02:41:33,575] [INFO] [timer.py:196:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=5.016909309171439, CurrSamplesPerSec=5.07443598071175, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1323 |
+
[2022-12-20 02:44:00,851] [INFO] [logging.py:68:log_dist] [Rank 0] step=3350, skipped=6, lr=[3.6822222222222225e-06], mom=[[0.9, 0.999]]
|
1324 |
+
[2022-12-20 02:44:00,852] [INFO] [timer.py:196:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=5.017265218865273, CurrSamplesPerSec=5.0747266527004085, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1325 |
+
{'loss': 0.0001, 'learning_rate': 3.6822222222222225e-06, 'epoch': 47.86}
|
1326 |
+
[2022-12-20 02:46:27,651] [INFO] [logging.py:68:log_dist] [Rank 0] step=3360, skipped=6, lr=[3.66e-06], mom=[[0.9, 0.999]]
|
1327 |
+
[2022-12-20 02:46:27,653] [INFO] [timer.py:196:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=5.017696493100695, CurrSamplesPerSec=5.445819859803268, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1328 |
+
[2022-12-20 02:48:54,300] [INFO] [logging.py:68:log_dist] [Rank 0] step=3370, skipped=6, lr=[3.6377777777777777e-06], mom=[[0.9, 0.999]]
|
1329 |
+
[2022-12-20 02:48:54,301] [INFO] [timer.py:196:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=5.018120277766622, CurrSamplesPerSec=5.119012007093434, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1330 |
+
{'loss': 0.0001, 'learning_rate': 3.6266666666666674e-06, 'epoch': 48.21}
|
1331 |
+
[2022-12-20 02:51:20,831] [INFO] [logging.py:68:log_dist] [Rank 0] step=3380, skipped=6, lr=[3.615555555555556e-06], mom=[[0.9, 0.999]]
|
1332 |
+
[2022-12-20 02:51:20,832] [INFO] [timer.py:196:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=5.018567967339129, CurrSamplesPerSec=5.209512158887764, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1333 |
+
[2022-12-20 02:53:51,344] [INFO] [logging.py:68:log_dist] [Rank 0] step=3390, skipped=6, lr=[3.593333333333334e-06], mom=[[0.9, 0.999]]
|
1334 |
+
[2022-12-20 02:53:51,346] [INFO] [timer.py:196:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=5.018556949270108, CurrSamplesPerSec=4.892052921959464, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1335 |
+
[2022-12-20 02:56:19,431] [INFO] [logging.py:68:log_dist] [Rank 0] step=3400, skipped=6, lr=[3.5711111111111114e-06], mom=[[0.9, 0.999]]
|
1336 |
+
[2022-12-20 02:56:19,432] [INFO] [timer.py:196:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=5.018898400832044, CurrSamplesPerSec=5.22343017008523, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1337 |
+
{'loss': 0.0001, 'learning_rate': 3.5711111111111114e-06, 'epoch': 48.57}
|
1338 |
+
[2022-12-20 02:58:47,138] [INFO] [logging.py:68:log_dist] [Rank 0] step=3410, skipped=6, lr=[3.548888888888889e-06], mom=[[0.9, 0.999]]
|
1339 |
+
[2022-12-20 02:58:47,139] [INFO] [timer.py:196:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=5.019209831341465, CurrSamplesPerSec=5.115540313699374, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1340 |
+
[2022-12-20 03:01:14,393] [INFO] [logging.py:68:log_dist] [Rank 0] step=3420, skipped=6, lr=[3.526666666666667e-06], mom=[[0.9, 0.999]]
|
1341 |
+
[2022-12-20 03:01:14,395] [INFO] [timer.py:196:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=5.019558125349715, CurrSamplesPerSec=5.199672750564973, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1342 |
+
{'loss': 0.0001, 'learning_rate': 3.515555555555556e-06, 'epoch': 48.93}
|
1343 |
+
[2022-12-20 03:03:40,216] [INFO] [logging.py:68:log_dist] [Rank 0] step=3430, skipped=6, lr=[3.5044444444444447e-06], mom=[[0.9, 0.999]]
|
1344 |
+
[2022-12-20 03:03:40,218] [INFO] [timer.py:196:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=5.020104358212663, CurrSamplesPerSec=5.294184786664645, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1345 |
+
[2022-12-20 03:06:08,019] [INFO] [logging.py:68:log_dist] [Rank 0] step=3440, skipped=6, lr=[3.4822222222222223e-06], mom=[[0.9, 0.999]]
|
1346 |
+
[2022-12-20 03:06:08,020] [INFO] [timer.py:196:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=5.020483455479745, CurrSamplesPerSec=5.126783095607625, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1347 |
+
[2022-12-20 03:08:34,247] [INFO] [logging.py:68:log_dist] [Rank 0] step=3450, skipped=6, lr=[3.46e-06], mom=[[0.9, 0.999]]
|
1348 |
+
[2022-12-20 03:08:34,248] [INFO] [timer.py:196:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=5.020923426859924, CurrSamplesPerSec=5.169087042205798, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1349 |
+
{'loss': 0.0001, 'learning_rate': 3.46e-06, 'epoch': 49.29}
|
1350 |
+
[2022-12-20 03:11:00,700] [INFO] [logging.py:68:log_dist] [Rank 0] step=3460, skipped=6, lr=[3.4377777777777784e-06], mom=[[0.9, 0.999]]
|
1351 |
+
[2022-12-20 03:11:00,701] [INFO] [timer.py:196:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=5.021448536640391, CurrSamplesPerSec=5.173228938874232, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1352 |
+
[2022-12-20 03:13:27,986] [INFO] [logging.py:68:log_dist] [Rank 0] step=3470, skipped=6, lr=[3.415555555555556e-06], mom=[[0.9, 0.999]]
|
1353 |
+
[2022-12-20 03:13:27,987] [INFO] [timer.py:196:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=5.021813255367895, CurrSamplesPerSec=5.241449759871188, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1354 |
+
{'loss': 0.0001, 'learning_rate': 3.404444444444445e-06, 'epoch': 49.64}
|
1355 |
+
[2022-12-20 03:15:58,218] [INFO] [logging.py:68:log_dist] [Rank 0] step=3480, skipped=6, lr=[3.3933333333333336e-06], mom=[[0.9, 0.999]]
|
1356 |
+
[2022-12-20 03:15:58,220] [INFO] [timer.py:196:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=5.021783181313551, CurrSamplesPerSec=4.697068445288297, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1357 |
+
[2022-12-20 03:18:32,363] [INFO] [logging.py:68:log_dist] [Rank 0] step=3490, skipped=6, lr=[3.371111111111111e-06], mom=[[0.9, 0.999]]
|
1358 |
+
[2022-12-20 03:18:32,364] [INFO] [timer.py:196:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=5.0213357193232415, CurrSamplesPerSec=4.850845469838286, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1359 |
+
[2022-12-20 03:21:05,658] [INFO] [logging.py:68:log_dist] [Rank 0] step=3500, skipped=6, lr=[3.3488888888888892e-06], mom=[[0.9, 0.999]]
|
1360 |
+
[2022-12-20 03:21:05,659] [INFO] [timer.py:196:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=5.021104853037733, CurrSamplesPerSec=5.175201005614641, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1361 |
+
{'loss': 0.0001, 'learning_rate': 3.3488888888888892e-06, 'epoch': 50.0}
|
1362 |
+
[2022-12-20 03:23:34,614] [INFO] [logging.py:68:log_dist] [Rank 0] step=3510, skipped=6, lr=[3.326666666666667e-06], mom=[[0.9, 0.999]]
|
1363 |
+
[2022-12-20 03:23:34,616] [INFO] [timer.py:196:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=5.021260020015854, CurrSamplesPerSec=5.109705019563074, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1364 |
+
[2022-12-20 03:26:02,060] [INFO] [logging.py:68:log_dist] [Rank 0] step=3520, skipped=6, lr=[3.3044444444444445e-06], mom=[[0.9, 0.999]]
|
1365 |
+
[2022-12-20 03:26:02,062] [INFO] [timer.py:196:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=5.021535536224096, CurrSamplesPerSec=4.963441904576861, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1366 |
+
{'loss': 0.0001, 'learning_rate': 3.2933333333333333e-06, 'epoch': 50.36}
|
1367 |
+
[2022-12-20 03:28:35,024] [INFO] [logging.py:68:log_dist] [Rank 0] step=3530, skipped=6, lr=[3.282222222222223e-06], mom=[[0.9, 0.999]]
|
1368 |
+
[2022-12-20 03:28:35,026] [INFO] [timer.py:196:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=5.021174952910811, CurrSamplesPerSec=4.771169222501317, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1369 |
+
[2022-12-20 03:31:03,479] [INFO] [logging.py:68:log_dist] [Rank 0] step=3540, skipped=6, lr=[3.2600000000000006e-06], mom=[[0.9, 0.999]]
|
1370 |
+
[2022-12-20 03:31:03,480] [INFO] [timer.py:196:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=5.021356333675896, CurrSamplesPerSec=5.1336201202148795, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1371 |
+
[2022-12-20 03:33:37,593] [INFO] [logging.py:68:log_dist] [Rank 0] step=3550, skipped=6, lr=[3.237777777777778e-06], mom=[[0.9, 0.999]]
|
1372 |
+
[2022-12-20 03:33:37,594] [INFO] [timer.py:196:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=5.0209923301417545, CurrSamplesPerSec=4.8032228159450945, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1373 |
+
{'loss': 0.0001, 'learning_rate': 3.237777777777778e-06, 'epoch': 50.71}
|
1374 |
+
[2022-12-20 03:36:08,250] [INFO] [logging.py:68:log_dist] [Rank 0] step=3560, skipped=6, lr=[3.2155555555555558e-06], mom=[[0.9, 0.999]]
|
1375 |
+
[2022-12-20 03:36:08,252] [INFO] [timer.py:196:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=5.020966895146841, CurrSamplesPerSec=5.417848791827928, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1376 |
+
[2022-12-20 03:38:35,914] [INFO] [logging.py:68:log_dist] [Rank 0] step=3570, skipped=6, lr=[3.193333333333334e-06], mom=[[0.9, 0.999]]
|
1377 |
+
[2022-12-20 03:38:35,916] [INFO] [timer.py:196:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=5.021199745560539, CurrSamplesPerSec=5.189035747598663, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1378 |
+
{'loss': 0.0001, 'learning_rate': 3.1822222222222226e-06, 'epoch': 51.07}
|
1379 |
+
[2022-12-20 03:41:09,887] [INFO] [logging.py:68:log_dist] [Rank 0] step=3580, skipped=6, lr=[3.1711111111111114e-06], mom=[[0.9, 0.999]]
|
1380 |
+
[2022-12-20 03:41:09,889] [INFO] [timer.py:196:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=5.020780373418236, CurrSamplesPerSec=4.836567258368214, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1381 |
+
[2022-12-20 03:43:43,941] [INFO] [logging.py:68:log_dist] [Rank 0] step=3590, skipped=6, lr=[3.148888888888889e-06], mom=[[0.9, 0.999]]
|
1382 |
+
[2022-12-20 03:43:43,943] [INFO] [timer.py:196:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=5.020349086789647, CurrSamplesPerSec=4.811152887097264, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1383 |
+
[2022-12-20 03:46:14,664] [INFO] [logging.py:68:log_dist] [Rank 0] step=3600, skipped=6, lr=[3.1266666666666667e-06], mom=[[0.9, 0.999]]
|
1384 |
+
[2022-12-20 03:46:14,666] [INFO] [timer.py:196:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=5.020307607070033, CurrSamplesPerSec=5.033031110111137, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1385 |
+
{'loss': 0.0001, 'learning_rate': 3.1266666666666667e-06, 'epoch': 51.43}
|
1386 |
+
[2022-12-20 03:48:44,894] [INFO] [logging.py:68:log_dist] [Rank 0] step=3610, skipped=6, lr=[3.104444444444445e-06], mom=[[0.9, 0.999]]
|
1387 |
+
[2022-12-20 03:48:44,895] [INFO] [timer.py:196:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=5.020289021142404, CurrSamplesPerSec=5.1586213756487345, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1388 |
+
[2022-12-20 03:51:12,552] [INFO] [logging.py:68:log_dist] [Rank 0] step=3620, skipped=6, lr=[3.0822222222222227e-06], mom=[[0.9, 0.999]]
|
1389 |
+
[2022-12-20 03:51:12,554] [INFO] [timer.py:196:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=5.020543813282841, CurrSamplesPerSec=5.158831748899703, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1390 |
+
{'loss': 0.0001, 'learning_rate': 3.0711111111111115e-06, 'epoch': 51.79}
|
1391 |
+
[2022-12-20 03:53:38,443] [INFO] [logging.py:68:log_dist] [Rank 0] step=3630, skipped=6, lr=[3.0600000000000003e-06], mom=[[0.9, 0.999]]
|
1392 |
+
[2022-12-20 03:53:38,445] [INFO] [timer.py:196:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=5.021014302786805, CurrSamplesPerSec=5.244716982610335, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1393 |
+
[2022-12-20 03:56:03,659] [INFO] [logging.py:68:log_dist] [Rank 0] step=3640, skipped=6, lr=[3.037777777777778e-06], mom=[[0.9, 0.999]]
|
1394 |
+
[2022-12-20 03:56:03,660] [INFO] [timer.py:196:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=5.021504442933683, CurrSamplesPerSec=5.442504694249898, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1395 |
+
[2022-12-20 03:58:30,311] [INFO] [logging.py:68:log_dist] [Rank 0] step=3650, skipped=6, lr=[3.015555555555556e-06], mom=[[0.9, 0.999]]
|
1396 |
+
[2022-12-20 03:58:30,312] [INFO] [timer.py:196:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=5.021887378964869, CurrSamplesPerSec=5.044828980616264, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1397 |
+
{'loss': 0.0001, 'learning_rate': 3.015555555555556e-06, 'epoch': 52.14}
|
1398 |
+
[2022-12-20 04:00:57,179] [INFO] [logging.py:68:log_dist] [Rank 0] step=3660, skipped=6, lr=[2.9933333333333336e-06], mom=[[0.9, 0.999]]
|
1399 |
+
[2022-12-20 04:00:57,181] [INFO] [timer.py:196:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=5.02226319641877, CurrSamplesPerSec=5.175864284088572, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1400 |
+
[2022-12-20 04:03:22,926] [INFO] [logging.py:68:log_dist] [Rank 0] step=3670, skipped=6, lr=[2.9711111111111112e-06], mom=[[0.9, 0.999]]
|
1401 |
+
[2022-12-20 04:03:22,927] [INFO] [timer.py:196:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=5.022750665852945, CurrSamplesPerSec=5.266301154382003, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1402 |
+
{'loss': 0.0001, 'learning_rate': 2.96e-06, 'epoch': 52.5}
|
1403 |
+
[2022-12-20 04:05:48,717] [INFO] [logging.py:68:log_dist] [Rank 0] step=3680, skipped=6, lr=[2.948888888888889e-06], mom=[[0.9, 0.999]]
|
1404 |
+
[2022-12-20 04:05:48,718] [INFO] [timer.py:196:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=5.023179933908105, CurrSamplesPerSec=5.156948914181069, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1405 |
+
[2022-12-20 04:08:17,201] [INFO] [logging.py:68:log_dist] [Rank 0] step=3690, skipped=6, lr=[2.9266666666666673e-06], mom=[[0.9, 0.999]]
|
1406 |
+
[2022-12-20 04:08:17,202] [INFO] [timer.py:196:stop] epoch=0/micro_step=3690/global_step=3690, RunningAvgSamplesPerSec=5.023450580454851, CurrSamplesPerSec=5.209431279570267, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1407 |
+
[2022-12-20 04:10:41,168] [INFO] [logging.py:68:log_dist] [Rank 0] step=3700, skipped=6, lr=[2.904444444444445e-06], mom=[[0.9, 0.999]]
|
1408 |
+
[2022-12-20 04:10:41,170] [INFO] [timer.py:196:stop] epoch=0/micro_step=3700/global_step=3700, RunningAvgSamplesPerSec=5.024155608459135, CurrSamplesPerSec=5.400806545363619, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1409 |
+
{'loss': 0.0001, 'learning_rate': 2.904444444444445e-06, 'epoch': 52.86}
+[2022-12-20 04:13:01,639] [INFO] [logging.py:68:log_dist] [Rank 0] step=3710, skipped=6, lr=[2.8822222222222225e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:13:01,640] [INFO] [timer.py:196:stop] epoch=0/micro_step=3710/global_step=3710, RunningAvgSamplesPerSec=5.025176566364872, CurrSamplesPerSec=5.572944957849582, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:15:28,880] [INFO] [logging.py:68:log_dist] [Rank 0] step=3720, skipped=6, lr=[2.86e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:15:28,882] [INFO] [timer.py:196:stop] epoch=0/micro_step=3720/global_step=3720, RunningAvgSamplesPerSec=5.025486477194859, CurrSamplesPerSec=5.093894625605406, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.8488888888888894e-06, 'epoch': 53.21}
+[2022-12-20 04:17:49,939] [INFO] [logging.py:68:log_dist] [Rank 0] step=3730, skipped=6, lr=[2.837777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:17:49,941] [INFO] [timer.py:196:stop] epoch=0/micro_step=3730/global_step=3730, RunningAvgSamplesPerSec=5.02646865664269, CurrSamplesPerSec=5.477750536846597, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:20:15,258] [INFO] [logging.py:68:log_dist] [Rank 0] step=3740, skipped=6, lr=[2.815555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:20:15,259] [INFO] [timer.py:196:stop] epoch=0/micro_step=3740/global_step=3740, RunningAvgSamplesPerSec=5.026993730402327, CurrSamplesPerSec=5.186795143590519, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:22:41,188] [INFO] [logging.py:68:log_dist] [Rank 0] step=3750, skipped=6, lr=[2.7933333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:22:41,190] [INFO] [timer.py:196:stop] epoch=0/micro_step=3750/global_step=3750, RunningAvgSamplesPerSec=5.027447785923126, CurrSamplesPerSec=5.177586687999467, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.7933333333333334e-06, 'epoch': 53.57}
+[2022-12-20 04:25:07,312] [INFO] [logging.py:68:log_dist] [Rank 0] step=3760, skipped=6, lr=[2.771111111111111e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:25:07,314] [INFO] [timer.py:196:stop] epoch=0/micro_step=3760/global_step=3760, RunningAvgSamplesPerSec=5.027896698630529, CurrSamplesPerSec=5.194247237040942, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:27:32,511] [INFO] [logging.py:68:log_dist] [Rank 0] step=3770, skipped=6, lr=[2.748888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:27:32,513] [INFO] [timer.py:196:stop] epoch=0/micro_step=3770/global_step=3770, RunningAvgSamplesPerSec=5.028361731968657, CurrSamplesPerSec=5.233565275989457, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.7377777777777783e-06, 'epoch': 53.93}
+[2022-12-20 04:29:58,724] [INFO] [logging.py:68:log_dist] [Rank 0] step=3780, skipped=6, lr=[2.726666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:29:58,725] [INFO] [timer.py:196:stop] epoch=0/micro_step=3780/global_step=3780, RunningAvgSamplesPerSec=5.028746024654157, CurrSamplesPerSec=5.256519744739989, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:32:25,797] [INFO] [logging.py:68:log_dist] [Rank 0] step=3790, skipped=6, lr=[2.7044444444444447e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:32:25,798] [INFO] [timer.py:196:stop] epoch=0/micro_step=3790/global_step=3790, RunningAvgSamplesPerSec=5.029074062073731, CurrSamplesPerSec=5.087535000375166, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:34:52,804] [INFO] [logging.py:68:log_dist] [Rank 0] step=3800, skipped=6, lr=[2.6822222222222223e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:34:52,805] [INFO] [timer.py:196:stop] epoch=0/micro_step=3800/global_step=3800, RunningAvgSamplesPerSec=5.029438184394117, CurrSamplesPerSec=5.30480101511469, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.6822222222222223e-06, 'epoch': 54.29}
+[2022-12-20 04:37:17,870] [INFO] [logging.py:68:log_dist] [Rank 0] step=3810, skipped=6, lr=[2.6600000000000004e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:37:17,871] [INFO] [timer.py:196:stop] epoch=0/micro_step=3810/global_step=3810, RunningAvgSamplesPerSec=5.029967397170501, CurrSamplesPerSec=5.061086466083689, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:39:46,226] [INFO] [logging.py:68:log_dist] [Rank 0] step=3820, skipped=6, lr=[2.637777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:39:46,227] [INFO] [timer.py:196:stop] epoch=0/micro_step=3820/global_step=3820, RunningAvgSamplesPerSec=5.030143967156818, CurrSamplesPerSec=5.040433735500809, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.6266666666666668e-06, 'epoch': 54.64}
+[2022-12-20 04:42:13,269] [INFO] [logging.py:68:log_dist] [Rank 0] step=3830, skipped=6, lr=[2.6155555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:42:13,270] [INFO] [timer.py:196:stop] epoch=0/micro_step=3830/global_step=3830, RunningAvgSamplesPerSec=5.030496972716335, CurrSamplesPerSec=5.0892962707836595, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:44:34,950] [INFO] [logging.py:68:log_dist] [Rank 0] step=3840, skipped=6, lr=[2.5933333333333336e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:44:34,951] [INFO] [timer.py:196:stop] epoch=0/micro_step=3840/global_step=3840, RunningAvgSamplesPerSec=5.031346325629816, CurrSamplesPerSec=5.6023436575508425, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:46:57,172] [INFO] [logging.py:68:log_dist] [Rank 0] step=3850, skipped=6, lr=[2.5711111111111112e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:46:57,173] [INFO] [timer.py:196:stop] epoch=0/micro_step=3850/global_step=3850, RunningAvgSamplesPerSec=5.032155181340912, CurrSamplesPerSec=5.437804914413287, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.5711111111111112e-06, 'epoch': 55.0}
+[2022-12-20 04:49:21,798] [INFO] [logging.py:68:log_dist] [Rank 0] step=3860, skipped=6, lr=[2.5488888888888893e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:49:21,800] [INFO] [timer.py:196:stop] epoch=0/micro_step=3860/global_step=3860, RunningAvgSamplesPerSec=5.032747341875238, CurrSamplesPerSec=5.111648800094207, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:51:46,829] [INFO] [logging.py:68:log_dist] [Rank 0] step=3870, skipped=6, lr=[2.526666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:51:46,830] [INFO] [timer.py:196:stop] epoch=0/micro_step=3870/global_step=3870, RunningAvgSamplesPerSec=5.033401435078496, CurrSamplesPerSec=5.408632181237298, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.5155555555555557e-06, 'epoch': 55.36}
+[2022-12-20 04:54:07,001] [INFO] [logging.py:68:log_dist] [Rank 0] step=3880, skipped=6, lr=[2.504444444444445e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:54:07,002] [INFO] [timer.py:196:stop] epoch=0/micro_step=3880/global_step=3880, RunningAvgSamplesPerSec=5.034569704497929, CurrSamplesPerSec=5.424155049886138, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:56:26,657] [INFO] [logging.py:68:log_dist] [Rank 0] step=3890, skipped=6, lr=[2.4822222222222225e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:56:26,659] [INFO] [timer.py:196:stop] epoch=0/micro_step=3890/global_step=3890, RunningAvgSamplesPerSec=5.035784002353498, CurrSamplesPerSec=5.575746512850949, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 04:58:46,813] [INFO] [logging.py:68:log_dist] [Rank 0] step=3900, skipped=6, lr=[2.46e-06], mom=[[0.9, 0.999]]
+[2022-12-20 04:58:46,815] [INFO] [timer.py:196:stop] epoch=0/micro_step=3900/global_step=3900, RunningAvgSamplesPerSec=5.036970868267857, CurrSamplesPerSec=5.571676606976653, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.46e-06, 'epoch': 55.71}
+[2022-12-20 05:01:08,974] [INFO] [logging.py:68:log_dist] [Rank 0] step=3910, skipped=6, lr=[2.437777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:01:08,975] [INFO] [timer.py:196:stop] epoch=0/micro_step=3910/global_step=3910, RunningAvgSamplesPerSec=5.038019236661488, CurrSamplesPerSec=5.640549183284285, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 05:03:28,700] [INFO] [logging.py:68:log_dist] [Rank 0] step=3920, skipped=6, lr=[2.415555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:03:28,701] [INFO] [timer.py:196:stop] epoch=0/micro_step=3920/global_step=3920, RunningAvgSamplesPerSec=5.039116343902467, CurrSamplesPerSec=5.5876312844457, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.4044444444444446e-06, 'epoch': 56.07}
+[2022-12-20 05:05:52,728] [INFO] [logging.py:68:log_dist] [Rank 0] step=3930, skipped=6, lr=[2.3933333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:05:52,730] [INFO] [timer.py:196:stop] epoch=0/micro_step=3930/global_step=3930, RunningAvgSamplesPerSec=5.039683225300292, CurrSamplesPerSec=5.160122717807519, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 05:08:18,747] [INFO] [logging.py:68:log_dist] [Rank 0] step=3940, skipped=6, lr=[2.371111111111111e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:08:18,749] [INFO] [timer.py:196:stop] epoch=0/micro_step=3940/global_step=3940, RunningAvgSamplesPerSec=5.040075715280884, CurrSamplesPerSec=5.238380926506598, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 05:10:44,757] [INFO] [logging.py:68:log_dist] [Rank 0] step=3950, skipped=6, lr=[2.348888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:10:44,759] [INFO] [timer.py:196:stop] epoch=0/micro_step=3950/global_step=3950, RunningAvgSamplesPerSec=5.040560949881336, CurrSamplesPerSec=5.267823659463463, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.348888888888889e-06, 'epoch': 56.43}
+[2022-12-20 05:13:08,651] [INFO] [logging.py:68:log_dist] [Rank 0] step=3960, skipped=6, lr=[2.3266666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:13:08,652] [INFO] [timer.py:196:stop] epoch=0/micro_step=3960/global_step=3960, RunningAvgSamplesPerSec=5.0412194992522865, CurrSamplesPerSec=5.200243387288677, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 05:15:36,139] [INFO] [logging.py:68:log_dist] [Rank 0] step=3970, skipped=6, lr=[2.3044444444444447e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:15:36,141] [INFO] [timer.py:196:stop] epoch=0/micro_step=3970/global_step=3970, RunningAvgSamplesPerSec=5.0415317780884195, CurrSamplesPerSec=4.9707619477102325, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.2933333333333335e-06, 'epoch': 56.79}
+[2022-12-20 05:18:09,664] [INFO] [logging.py:68:log_dist] [Rank 0] step=3980, skipped=6, lr=[2.2822222222222223e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:18:09,665] [INFO] [timer.py:196:stop] epoch=0/micro_step=3980/global_step=3980, RunningAvgSamplesPerSec=5.041339650106527, CurrSamplesPerSec=4.984928729113024, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 05:20:42,295] [INFO] [logging.py:68:log_dist] [Rank 0] step=3990, skipped=6, lr=[2.2600000000000004e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:20:42,297] [INFO] [timer.py:196:stop] epoch=0/micro_step=3990/global_step=3990, RunningAvgSamplesPerSec=5.041122736363477, CurrSamplesPerSec=5.142247817170418, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 05:23:18,112] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=6, lr=[2.237777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-20 05:23:18,114] [INFO] [timer.py:196:stop] epoch=0/micro_step=4000/global_step=4000, RunningAvgSamplesPerSec=5.04069535036469, CurrSamplesPerSec=4.9986659756607175, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.237777777777778e-06, 'epoch': 57.14}
+{'eval_loss': 0.469970703125, 'eval_wer': 23.39362208472156, 'eval_runtime': 827.6996, 'eval_samples_per_second': 2.739, 'eval_steps_per_second': 0.086, 'epoch': 57.14}
+[2022-12-20 05:37:09,346] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step4000 is begin to save!
+[2022-12-20 05:37:09,358] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt
+[2022-12-20 05:37:09,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt...
+[2022-12-20 05:37:12,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/mp_rank_00_model_states.pt.
+[2022-12-20 05:37:12,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+[2022-12-20 05:37:30,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+[2022-12-20 05:37:30,734] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+[2022-12-20 05:37:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
runs/Dec19_11-14-29_fe2747a042f0/events.out.tfevents.1671479623.fe2747a042f0.2334566.0
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:91be7231abeb47f145d7b3919b33df051b5c43d8052fa219a01869232132efa2
+size 30655