Training in progress, step 5000
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 1527847357
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:df4b119a6ba413fcd75e9151268d9b1bd9224a59cb7a6093c08491baee8c87fa
|
3 |
size 1527847357
|
run.log
CHANGED
@@ -1488,3 +1488,254 @@ Rank: 0 partition count [1] and sizes[(763857920, False)]
|
|
1488 |
[2022-12-20 05:37:30,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
|
1489 |
[2022-12-20 05:37:30,734] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
1490 |
[2022-12-20 05:37:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
|
1491 |
+
[2022-12-20 05:41:55,491] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
|
1492 |
+
[2022-12-20 05:42:08,160] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
|
1493 |
+
[2022-12-20 05:42:35,713] [INFO] [logging.py:68:log_dist] [Rank 0] step=4010, skipped=8, lr=[2.2200000000000003e-06], mom=[[0.9, 0.999]]
|
1494 |
+
[2022-12-20 05:42:35,714] [INFO] [timer.py:196:stop] epoch=0/micro_step=4010/global_step=4010, RunningAvgSamplesPerSec=5.042979461159848, CurrSamplesPerSec=5.696590514613029, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1495 |
+
[2022-12-20 05:45:04,141] [INFO] [logging.py:68:log_dist] [Rank 0] step=4020, skipped=8, lr=[2.197777777777778e-06], mom=[[0.9, 0.999]]
|
1496 |
+
[2022-12-20 05:45:04,143] [INFO] [timer.py:196:stop] epoch=0/micro_step=4020/global_step=4020, RunningAvgSamplesPerSec=5.04338680465847, CurrSamplesPerSec=5.146485423364737, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1497 |
+
{'loss': 0.0001, 'learning_rate': 2.1866666666666668e-06, 'epoch': 57.5}
|
1498 |
+
[2022-12-20 05:47:36,085] [INFO] [logging.py:68:log_dist] [Rank 0] step=4030, skipped=8, lr=[2.1755555555555556e-06], mom=[[0.9, 0.999]]
|
1499 |
+
[2022-12-20 05:47:36,086] [INFO] [timer.py:196:stop] epoch=0/micro_step=4030/global_step=4030, RunningAvgSamplesPerSec=5.0433669842670845, CurrSamplesPerSec=5.079788047157066, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1500 |
+
[2022-12-20 05:50:10,634] [INFO] [logging.py:68:log_dist] [Rank 0] step=4040, skipped=8, lr=[2.153333333333333e-06], mom=[[0.9, 0.999]]
|
1501 |
+
[2022-12-20 05:50:10,636] [INFO] [timer.py:196:stop] epoch=0/micro_step=4040/global_step=4040, RunningAvgSamplesPerSec=5.04310395870082, CurrSamplesPerSec=5.0191439155807736, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1502 |
+
[2022-12-20 05:52:46,940] [INFO] [logging.py:68:log_dist] [Rank 0] step=4050, skipped=8, lr=[2.1311111111111112e-06], mom=[[0.9, 0.999]]
|
1503 |
+
[2022-12-20 05:52:46,941] [INFO] [timer.py:196:stop] epoch=0/micro_step=4050/global_step=4050, RunningAvgSamplesPerSec=5.042768502101079, CurrSamplesPerSec=4.842702171084141, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1504 |
+
{'loss': 0.0001, 'learning_rate': 2.1311111111111112e-06, 'epoch': 57.86}
|
1505 |
+
[2022-12-20 05:55:21,002] [INFO] [logging.py:68:log_dist] [Rank 0] step=4060, skipped=8, lr=[2.108888888888889e-06], mom=[[0.9, 0.999]]
|
1506 |
+
[2022-12-20 05:55:21,004] [INFO] [timer.py:196:stop] epoch=0/micro_step=4060/global_step=4060, RunningAvgSamplesPerSec=5.042487681984381, CurrSamplesPerSec=4.99908525471156, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1507 |
+
[2022-12-20 05:57:52,840] [INFO] [logging.py:68:log_dist] [Rank 0] step=4070, skipped=8, lr=[2.086666666666667e-06], mom=[[0.9, 0.999]]
|
1508 |
+
[2022-12-20 05:57:52,842] [INFO] [timer.py:196:stop] epoch=0/micro_step=4070/global_step=4070, RunningAvgSamplesPerSec=5.042344945988743, CurrSamplesPerSec=5.111629527255383, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1509 |
+
{'loss': 0.0001, 'learning_rate': 2.0755555555555557e-06, 'epoch': 58.21}
|
1510 |
+
[2022-12-20 06:00:27,903] [INFO] [logging.py:68:log_dist] [Rank 0] step=4080, skipped=8, lr=[2.064444444444445e-06], mom=[[0.9, 0.999]]
|
1511 |
+
[2022-12-20 06:00:27,905] [INFO] [timer.py:196:stop] epoch=0/micro_step=4080/global_step=4080, RunningAvgSamplesPerSec=5.0419804402842665, CurrSamplesPerSec=4.930916151588083, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1512 |
+
[2022-12-20 06:02:52,754] [INFO] [logging.py:68:log_dist] [Rank 0] step=4090, skipped=8, lr=[2.0422222222222225e-06], mom=[[0.9, 0.999]]
|
1513 |
+
[2022-12-20 06:02:52,756] [INFO] [timer.py:196:stop] epoch=0/micro_step=4090/global_step=4090, RunningAvgSamplesPerSec=5.042556557819047, CurrSamplesPerSec=5.363131248975815, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1514 |
+
[2022-12-20 06:05:17,984] [INFO] [logging.py:68:log_dist] [Rank 0] step=4100, skipped=8, lr=[2.02e-06], mom=[[0.9, 0.999]]
|
1515 |
+
[2022-12-20 06:05:17,986] [INFO] [timer.py:196:stop] epoch=0/micro_step=4100/global_step=4100, RunningAvgSamplesPerSec=5.0431467991488805, CurrSamplesPerSec=5.085720797651708, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1516 |
+
{'loss': 0.0001, 'learning_rate': 2.02e-06, 'epoch': 58.57}
|
1517 |
+
[2022-12-20 06:07:41,649] [INFO] [logging.py:68:log_dist] [Rank 0] step=4110, skipped=8, lr=[1.9977777777777778e-06], mom=[[0.9, 0.999]]
|
1518 |
+
[2022-12-20 06:07:41,650] [INFO] [timer.py:196:stop] epoch=0/micro_step=4110/global_step=4110, RunningAvgSamplesPerSec=5.043948241474721, CurrSamplesPerSec=5.412140246565025, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1519 |
+
[2022-12-20 06:10:10,464] [INFO] [logging.py:68:log_dist] [Rank 0] step=4120, skipped=8, lr=[1.975555555555556e-06], mom=[[0.9, 0.999]]
|
1520 |
+
[2022-12-20 06:10:10,465] [INFO] [timer.py:196:stop] epoch=0/micro_step=4120/global_step=4120, RunningAvgSamplesPerSec=5.044459483834623, CurrSamplesPerSec=5.313749224322798, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1521 |
+
{'loss': 0.0001, 'learning_rate': 1.9644444444444446e-06, 'epoch': 58.93}
|
1522 |
+
[2022-12-20 06:12:39,140] [INFO] [logging.py:68:log_dist] [Rank 0] step=4130, skipped=8, lr=[1.9533333333333334e-06], mom=[[0.9, 0.999]]
|
1523 |
+
[2022-12-20 06:12:39,141] [INFO] [timer.py:196:stop] epoch=0/micro_step=4130/global_step=4130, RunningAvgSamplesPerSec=5.0447642989840675, CurrSamplesPerSec=5.2478424910556045, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1524 |
+
[2022-12-20 06:15:02,533] [INFO] [logging.py:68:log_dist] [Rank 0] step=4140, skipped=8, lr=[1.9311111111111114e-06], mom=[[0.9, 0.999]]
|
1525 |
+
[2022-12-20 06:15:02,535] [INFO] [timer.py:196:stop] epoch=0/micro_step=4140/global_step=4140, RunningAvgSamplesPerSec=5.045575952263303, CurrSamplesPerSec=5.5830596404364705, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1526 |
+
[2022-12-20 06:17:27,095] [INFO] [logging.py:68:log_dist] [Rank 0] step=4150, skipped=8, lr=[1.908888888888889e-06], mom=[[0.9, 0.999]]
|
1527 |
+
[2022-12-20 06:17:27,101] [INFO] [timer.py:196:stop] epoch=0/micro_step=4150/global_step=4150, RunningAvgSamplesPerSec=5.046241123558999, CurrSamplesPerSec=5.299716771107708, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1528 |
+
{'loss': 0.0001, 'learning_rate': 1.908888888888889e-06, 'epoch': 59.29}
|
1529 |
+
[2022-12-20 06:20:25,241] [INFO] [logging.py:68:log_dist] [Rank 0] step=4160, skipped=8, lr=[1.8866666666666669e-06], mom=[[0.9, 0.999]]
|
1530 |
+
[2022-12-20 06:20:25,244] [INFO] [timer.py:196:stop] epoch=0/micro_step=4160/global_step=4160, RunningAvgSamplesPerSec=5.043822466629399, CurrSamplesPerSec=5.56617254195275, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1531 |
+
[2022-12-20 06:22:50,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=4170, skipped=8, lr=[1.8644444444444445e-06], mom=[[0.9, 0.999]]
|
1532 |
+
[2022-12-20 06:22:50,984] [INFO] [timer.py:196:stop] epoch=0/micro_step=4170/global_step=4170, RunningAvgSamplesPerSec=5.044478496015336, CurrSamplesPerSec=5.2003204555095985, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1533 |
+
{'loss': 0.0001, 'learning_rate': 1.8533333333333333e-06, 'epoch': 59.64}
|
1534 |
+
[2022-12-20 06:25:17,347] [INFO] [logging.py:68:log_dist] [Rank 0] step=4180, skipped=8, lr=[1.8422222222222225e-06], mom=[[0.9, 0.999]]
|
1535 |
+
[2022-12-20 06:25:17,348] [INFO] [timer.py:196:stop] epoch=0/micro_step=4180/global_step=4180, RunningAvgSamplesPerSec=5.045012622651632, CurrSamplesPerSec=5.299783527305324, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1536 |
+
[2022-12-20 06:27:42,485] [INFO] [logging.py:68:log_dist] [Rank 0] step=4190, skipped=8, lr=[1.8200000000000002e-06], mom=[[0.9, 0.999]]
|
1537 |
+
[2022-12-20 06:27:42,487] [INFO] [timer.py:196:stop] epoch=0/micro_step=4190/global_step=4190, RunningAvgSamplesPerSec=5.045662826265351, CurrSamplesPerSec=5.721540380522979, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1538 |
+
[2022-12-20 06:30:02,274] [INFO] [logging.py:68:log_dist] [Rank 0] step=4200, skipped=8, lr=[1.797777777777778e-06], mom=[[0.9, 0.999]]
|
1539 |
+
[2022-12-20 06:30:02,276] [INFO] [timer.py:196:stop] epoch=0/micro_step=4200/global_step=4200, RunningAvgSamplesPerSec=5.046791681237073, CurrSamplesPerSec=5.624625885538937, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1540 |
+
{'loss': 0.0001, 'learning_rate': 1.797777777777778e-06, 'epoch': 60.0}
|
1541 |
+
[2022-12-20 06:32:26,649] [INFO] [logging.py:68:log_dist] [Rank 0] step=4210, skipped=8, lr=[1.7755555555555556e-06], mom=[[0.9, 0.999]]
|
1542 |
+
[2022-12-20 06:32:26,651] [INFO] [timer.py:196:stop] epoch=0/micro_step=4210/global_step=4210, RunningAvgSamplesPerSec=5.047454122770001, CurrSamplesPerSec=5.141653397468583, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1543 |
+
[2022-12-20 06:34:52,412] [INFO] [logging.py:68:log_dist] [Rank 0] step=4220, skipped=8, lr=[1.7533333333333336e-06], mom=[[0.9, 0.999]]
|
1544 |
+
[2022-12-20 06:34:52,414] [INFO] [timer.py:196:stop] epoch=0/micro_step=4220/global_step=4220, RunningAvgSamplesPerSec=5.047994245909812, CurrSamplesPerSec=5.319056367099886, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1545 |
+
{'loss': 0.0001, 'learning_rate': 1.7422222222222224e-06, 'epoch': 60.36}
|
1546 |
+
[2022-12-20 06:37:15,987] [INFO] [logging.py:68:log_dist] [Rank 0] step=4230, skipped=8, lr=[1.7311111111111112e-06], mom=[[0.9, 0.999]]
|
1547 |
+
[2022-12-20 06:37:15,989] [INFO] [timer.py:196:stop] epoch=0/micro_step=4230/global_step=4230, RunningAvgSamplesPerSec=5.04866732774536, CurrSamplesPerSec=5.388713208948523, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1548 |
+
[2022-12-20 06:39:39,883] [INFO] [logging.py:68:log_dist] [Rank 0] step=4240, skipped=8, lr=[1.708888888888889e-06], mom=[[0.9, 0.999]]
|
1549 |
+
[2022-12-20 06:39:39,884] [INFO] [timer.py:196:stop] epoch=0/micro_step=4240/global_step=4240, RunningAvgSamplesPerSec=5.049283859017793, CurrSamplesPerSec=5.132029172229295, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1550 |
+
[2022-12-20 06:42:06,810] [INFO] [logging.py:68:log_dist] [Rank 0] step=4250, skipped=8, lr=[1.6866666666666667e-06], mom=[[0.9, 0.999]]
|
1551 |
+
[2022-12-20 06:42:06,811] [INFO] [timer.py:196:stop] epoch=0/micro_step=4250/global_step=4250, RunningAvgSamplesPerSec=5.049603812984208, CurrSamplesPerSec=5.245656612106342, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1552 |
+
{'loss': 0.0001, 'learning_rate': 1.6866666666666667e-06, 'epoch': 60.71}
|
1553 |
+
[2022-12-20 06:44:32,622] [INFO] [logging.py:68:log_dist] [Rank 0] step=4260, skipped=8, lr=[1.6644444444444447e-06], mom=[[0.9, 0.999]]
|
1554 |
+
[2022-12-20 06:44:32,623] [INFO] [timer.py:196:stop] epoch=0/micro_step=4260/global_step=4260, RunningAvgSamplesPerSec=5.05001146839292, CurrSamplesPerSec=5.207611653499008, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1555 |
+
[2022-12-20 06:46:59,516] [INFO] [logging.py:68:log_dist] [Rank 0] step=4270, skipped=8, lr=[1.6422222222222223e-06], mom=[[0.9, 0.999]]
|
1556 |
+
[2022-12-20 06:46:59,518] [INFO] [timer.py:196:stop] epoch=0/micro_step=4270/global_step=4270, RunningAvgSamplesPerSec=5.050310918921533, CurrSamplesPerSec=5.3487399204314645, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1557 |
+
{'loss': 0.0001, 'learning_rate': 1.6311111111111114e-06, 'epoch': 61.07}
|
1558 |
+
[2022-12-20 06:49:25,738] [INFO] [logging.py:68:log_dist] [Rank 0] step=4280, skipped=8, lr=[1.6200000000000002e-06], mom=[[0.9, 0.999]]
|
1559 |
+
[2022-12-20 06:49:25,739] [INFO] [timer.py:196:stop] epoch=0/micro_step=4280/global_step=4280, RunningAvgSamplesPerSec=5.050636049742784, CurrSamplesPerSec=5.267817766999322, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1560 |
+
[2022-12-20 06:51:52,463] [INFO] [logging.py:68:log_dist] [Rank 0] step=4290, skipped=8, lr=[1.5977777777777778e-06], mom=[[0.9, 0.999]]
|
1561 |
+
[2022-12-20 06:51:52,464] [INFO] [timer.py:196:stop] epoch=0/micro_step=4290/global_step=4290, RunningAvgSamplesPerSec=5.050927321793066, CurrSamplesPerSec=5.091219627389776, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1562 |
+
[2022-12-20 06:54:20,387] [INFO] [logging.py:68:log_dist] [Rank 0] step=4300, skipped=8, lr=[1.5755555555555558e-06], mom=[[0.9, 0.999]]
|
1563 |
+
[2022-12-20 06:54:20,389] [INFO] [timer.py:196:stop] epoch=0/micro_step=4300/global_step=4300, RunningAvgSamplesPerSec=5.051116554864754, CurrSamplesPerSec=5.131412687451037, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1564 |
+
{'loss': 0.0001, 'learning_rate': 1.5755555555555558e-06, 'epoch': 61.43}
|
1565 |
+
[2022-12-20 06:56:49,899] [INFO] [logging.py:68:log_dist] [Rank 0] step=4310, skipped=8, lr=[1.5533333333333334e-06], mom=[[0.9, 0.999]]
|
1566 |
+
[2022-12-20 06:56:49,901] [INFO] [timer.py:196:stop] epoch=0/micro_step=4310/global_step=4310, RunningAvgSamplesPerSec=5.051115029415557, CurrSamplesPerSec=5.207083942120238, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1567 |
+
[2022-12-20 06:59:24,979] [INFO] [logging.py:68:log_dist] [Rank 0] step=4320, skipped=8, lr=[1.5311111111111113e-06], mom=[[0.9, 0.999]]
|
1568 |
+
[2022-12-20 06:59:24,980] [INFO] [timer.py:196:stop] epoch=0/micro_step=4320/global_step=4320, RunningAvgSamplesPerSec=5.050576559473915, CurrSamplesPerSec=4.720912328409414, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1569 |
+
{'loss': 0.0001, 'learning_rate': 1.52e-06, 'epoch': 61.79}
|
1570 |
+
[2022-12-20 07:02:01,098] [INFO] [logging.py:68:log_dist] [Rank 0] step=4330, skipped=8, lr=[1.5088888888888889e-06], mom=[[0.9, 0.999]]
|
1571 |
+
[2022-12-20 07:02:01,099] [INFO] [timer.py:196:stop] epoch=0/micro_step=4330/global_step=4330, RunningAvgSamplesPerSec=5.0499269186777225, CurrSamplesPerSec=4.822616218008508, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1572 |
+
[2022-12-20 07:04:34,365] [INFO] [logging.py:68:log_dist] [Rank 0] step=4340, skipped=8, lr=[1.486666666666667e-06], mom=[[0.9, 0.999]]
|
1573 |
+
[2022-12-20 07:04:34,366] [INFO] [timer.py:196:stop] epoch=0/micro_step=4340/global_step=4340, RunningAvgSamplesPerSec=5.049580360469089, CurrSamplesPerSec=5.267244917024575, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1574 |
+
[2022-12-20 07:07:06,150] [INFO] [logging.py:68:log_dist] [Rank 0] step=4350, skipped=8, lr=[1.4644444444444445e-06], mom=[[0.9, 0.999]]
|
1575 |
+
[2022-12-20 07:07:06,152] [INFO] [timer.py:196:stop] epoch=0/micro_step=4350/global_step=4350, RunningAvgSamplesPerSec=5.04936238047649, CurrSamplesPerSec=4.829278603413176, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1576 |
+
{'loss': 0.0001, 'learning_rate': 1.4644444444444445e-06, 'epoch': 62.14}
|
1577 |
+
[2022-12-20 07:09:42,081] [INFO] [logging.py:68:log_dist] [Rank 0] step=4360, skipped=8, lr=[1.4422222222222223e-06], mom=[[0.9, 0.999]]
|
1578 |
+
[2022-12-20 07:09:42,082] [INFO] [timer.py:196:stop] epoch=0/micro_step=4360/global_step=4360, RunningAvgSamplesPerSec=5.048784732951024, CurrSamplesPerSec=4.751107522016899, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1579 |
+
[2022-12-20 07:12:19,735] [INFO] [logging.py:68:log_dist] [Rank 0] step=4370, skipped=8, lr=[1.42e-06], mom=[[0.9, 0.999]]
|
1580 |
+
[2022-12-20 07:12:19,736] [INFO] [timer.py:196:stop] epoch=0/micro_step=4370/global_step=4370, RunningAvgSamplesPerSec=5.048028850314156, CurrSamplesPerSec=4.923453726629133, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1581 |
+
{'loss': 0.0001, 'learning_rate': 1.4088888888888892e-06, 'epoch': 62.5}
|
1582 |
+
[2022-12-20 07:14:51,218] [INFO] [logging.py:68:log_dist] [Rank 0] step=4380, skipped=8, lr=[1.397777777777778e-06], mom=[[0.9, 0.999]]
|
1583 |
+
[2022-12-20 07:14:51,219] [INFO] [timer.py:196:stop] epoch=0/micro_step=4380/global_step=4380, RunningAvgSamplesPerSec=5.047867689289667, CurrSamplesPerSec=4.890229593461905, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1584 |
+
[2022-12-20 07:17:28,428] [INFO] [logging.py:68:log_dist] [Rank 0] step=4390, skipped=8, lr=[1.3755555555555556e-06], mom=[[0.9, 0.999]]
|
1585 |
+
[2022-12-20 07:17:28,430] [INFO] [timer.py:196:stop] epoch=0/micro_step=4390/global_step=4390, RunningAvgSamplesPerSec=5.047159814157629, CurrSamplesPerSec=4.948499742634576, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1586 |
+
[2022-12-20 07:20:06,640] [INFO] [logging.py:68:log_dist] [Rank 0] step=4400, skipped=8, lr=[1.3533333333333334e-06], mom=[[0.9, 0.999]]
|
1587 |
+
[2022-12-20 07:20:06,642] [INFO] [timer.py:196:stop] epoch=0/micro_step=4400/global_step=4400, RunningAvgSamplesPerSec=5.046366139085683, CurrSamplesPerSec=4.623666005682293, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1588 |
+
{'loss': 0.0001, 'learning_rate': 1.3533333333333334e-06, 'epoch': 62.86}
|
1589 |
+
[2022-12-20 07:22:42,638] [INFO] [logging.py:68:log_dist] [Rank 0] step=4410, skipped=8, lr=[1.3311111111111113e-06], mom=[[0.9, 0.999]]
|
1590 |
+
[2022-12-20 07:22:42,640] [INFO] [timer.py:196:stop] epoch=0/micro_step=4410/global_step=4410, RunningAvgSamplesPerSec=5.045758620454289, CurrSamplesPerSec=5.076388247096151, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1591 |
+
[2022-12-20 07:25:17,610] [INFO] [logging.py:68:log_dist] [Rank 0] step=4420, skipped=8, lr=[1.308888888888889e-06], mom=[[0.9, 0.999]]
|
1592 |
+
[2022-12-20 07:25:17,612] [INFO] [timer.py:196:stop] epoch=0/micro_step=4420/global_step=4420, RunningAvgSamplesPerSec=5.045221775215709, CurrSamplesPerSec=4.847442318721658, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1593 |
+
{'loss': 0.0001, 'learning_rate': 1.2977777777777779e-06, 'epoch': 63.21}
|
1594 |
+
[2022-12-20 07:27:55,202] [INFO] [logging.py:68:log_dist] [Rank 0] step=4430, skipped=8, lr=[1.286666666666667e-06], mom=[[0.9, 0.999]]
|
1595 |
+
[2022-12-20 07:27:55,203] [INFO] [timer.py:196:stop] epoch=0/micro_step=4430/global_step=4430, RunningAvgSamplesPerSec=5.0444825303875485, CurrSamplesPerSec=4.71763327003565, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1596 |
+
[2022-12-20 07:30:32,300] [INFO] [logging.py:68:log_dist] [Rank 0] step=4440, skipped=8, lr=[1.2644444444444445e-06], mom=[[0.9, 0.999]]
|
1597 |
+
[2022-12-20 07:30:32,302] [INFO] [timer.py:196:stop] epoch=0/micro_step=4440/global_step=4440, RunningAvgSamplesPerSec=5.043765816300071, CurrSamplesPerSec=4.756459465106135, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1598 |
+
[2022-12-20 07:33:08,239] [INFO] [logging.py:68:log_dist] [Rank 0] step=4450, skipped=8, lr=[1.2422222222222224e-06], mom=[[0.9, 0.999]]
|
1599 |
+
[2022-12-20 07:33:08,241] [INFO] [timer.py:196:stop] epoch=0/micro_step=4450/global_step=4450, RunningAvgSamplesPerSec=5.0432365230532445, CurrSamplesPerSec=5.085449771164148, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1600 |
+
{'loss': 0.0001, 'learning_rate': 1.2422222222222224e-06, 'epoch': 63.57}
|
1601 |
+
[2022-12-20 07:35:43,710] [INFO] [logging.py:68:log_dist] [Rank 0] step=4460, skipped=8, lr=[1.2200000000000002e-06], mom=[[0.9, 0.999]]
|
1602 |
+
[2022-12-20 07:35:43,711] [INFO] [timer.py:196:stop] epoch=0/micro_step=4460/global_step=4460, RunningAvgSamplesPerSec=5.042712521939606, CurrSamplesPerSec=4.778885074217907, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1603 |
+
[2022-12-20 07:38:17,792] [INFO] [logging.py:68:log_dist] [Rank 0] step=4470, skipped=8, lr=[1.1977777777777778e-06], mom=[[0.9, 0.999]]
|
1604 |
+
[2022-12-20 07:38:17,794] [INFO] [timer.py:196:stop] epoch=0/micro_step=4470/global_step=4470, RunningAvgSamplesPerSec=5.042281683097091, CurrSamplesPerSec=5.02905169577589, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1605 |
+
{'loss': 0.0001, 'learning_rate': 1.1866666666666668e-06, 'epoch': 63.93}
|
1606 |
+
[2022-12-20 07:40:49,311] [INFO] [logging.py:68:log_dist] [Rank 0] step=4480, skipped=8, lr=[1.1755555555555556e-06], mom=[[0.9, 0.999]]
|
1607 |
+
[2022-12-20 07:40:49,312] [INFO] [timer.py:196:stop] epoch=0/micro_step=4480/global_step=4480, RunningAvgSamplesPerSec=5.042148780780479, CurrSamplesPerSec=5.1255290111304, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1608 |
+
[2022-12-20 07:43:11,925] [INFO] [logging.py:68:log_dist] [Rank 0] step=4490, skipped=8, lr=[1.1533333333333334e-06], mom=[[0.9, 0.999]]
|
1609 |
+
[2022-12-20 07:43:11,926] [INFO] [timer.py:196:stop] epoch=0/micro_step=4490/global_step=4490, RunningAvgSamplesPerSec=5.042788143087416, CurrSamplesPerSec=5.203591461607821, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1610 |
+
[2022-12-20 07:45:36,568] [INFO] [logging.py:68:log_dist] [Rank 0] step=4500, skipped=8, lr=[1.131111111111111e-06], mom=[[0.9, 0.999]]
|
1611 |
+
[2022-12-20 07:45:36,570] [INFO] [timer.py:196:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=5.043178181242045, CurrSamplesPerSec=5.263032480306624, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1612 |
+
{'loss': 0.0001, 'learning_rate': 1.131111111111111e-06, 'epoch': 64.29}
|
1613 |
+
[2022-12-20 07:48:03,486] [INFO] [logging.py:68:log_dist] [Rank 0] step=4510, skipped=8, lr=[1.1088888888888889e-06], mom=[[0.9, 0.999]]
|
1614 |
+
[2022-12-20 07:48:03,488] [INFO] [timer.py:196:stop] epoch=0/micro_step=4510/global_step=4510, RunningAvgSamplesPerSec=5.043365421470672, CurrSamplesPerSec=4.939414436159629, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1615 |
+
[2022-12-20 07:50:29,784] [INFO] [logging.py:68:log_dist] [Rank 0] step=4520, skipped=8, lr=[1.0866666666666667e-06], mom=[[0.9, 0.999]]
|
1616 |
+
[2022-12-20 07:50:29,785] [INFO] [timer.py:196:stop] epoch=0/micro_step=4520/global_step=4520, RunningAvgSamplesPerSec=5.043697590182427, CurrSamplesPerSec=5.449464959209118, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1617 |
+
{'loss': 0.0001, 'learning_rate': 1.0755555555555557e-06, 'epoch': 64.64}
|
1618 |
+
[2022-12-20 07:52:56,799] [INFO] [logging.py:68:log_dist] [Rank 0] step=4530, skipped=8, lr=[1.0644444444444445e-06], mom=[[0.9, 0.999]]
|
1619 |
+
[2022-12-20 07:52:56,801] [INFO] [timer.py:196:stop] epoch=0/micro_step=4530/global_step=4530, RunningAvgSamplesPerSec=5.043944921085424, CurrSamplesPerSec=4.848887789663811, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1620 |
+
[2022-12-20 07:55:33,034] [INFO] [logging.py:68:log_dist] [Rank 0] step=4540, skipped=8, lr=[1.0422222222222221e-06], mom=[[0.9, 0.999]]
|
1621 |
+
[2022-12-20 07:55:33,036] [INFO] [timer.py:196:stop] epoch=0/micro_step=4540/global_step=4540, RunningAvgSamplesPerSec=5.043375303491235, CurrSamplesPerSec=4.778829859739575, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1622 |
+
[2022-12-20 07:58:03,913] [INFO] [logging.py:68:log_dist] [Rank 0] step=4550, skipped=8, lr=[1.02e-06], mom=[[0.9, 0.999]]
|
1623 |
+
[2022-12-20 07:58:03,914] [INFO] [timer.py:196:stop] epoch=0/micro_step=4550/global_step=4550, RunningAvgSamplesPerSec=5.043328963013769, CurrSamplesPerSec=5.608318938320819, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1624 |
+
{'loss': 0.0001, 'learning_rate': 1.02e-06, 'epoch': 65.0}
|
1625 |
+
[2022-12-20 08:00:25,713] [INFO] [logging.py:68:log_dist] [Rank 0] step=4560, skipped=8, lr=[9.97777777777778e-07], mom=[[0.9, 0.999]]
|
1626 |
+
[2022-12-20 08:00:25,714] [INFO] [timer.py:196:stop] epoch=0/micro_step=4560/global_step=4560, RunningAvgSamplesPerSec=5.044098201656415, CurrSamplesPerSec=5.341246991766891, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1627 |
+
[2022-12-20 08:02:46,105] [INFO] [logging.py:68:log_dist] [Rank 0] step=4570, skipped=8, lr=[9.755555555555556e-07], mom=[[0.9, 0.999]]
|
1628 |
+
[2022-12-20 08:02:46,106] [INFO] [timer.py:196:stop] epoch=0/micro_step=4570/global_step=4570, RunningAvgSamplesPerSec=5.044950401998271, CurrSamplesPerSec=5.438711647812863, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1629 |
+
{'loss': 0.0001, 'learning_rate': 9.644444444444444e-07, 'epoch': 65.36}
|
1630 |
+
[2022-12-20 08:05:10,797] [INFO] [logging.py:68:log_dist] [Rank 0] step=4580, skipped=8, lr=[9.533333333333335e-07], mom=[[0.9, 0.999]]
|
1631 |
+
[2022-12-20 08:05:10,798] [INFO] [timer.py:196:stop] epoch=0/micro_step=4580/global_step=4580, RunningAvgSamplesPerSec=5.045476504759206, CurrSamplesPerSec=5.188928621390404, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1632 |
+
[2022-12-20 08:07:36,800] [INFO] [logging.py:68:log_dist] [Rank 0] step=4590, skipped=8, lr=[9.311111111111113e-07], mom=[[0.9, 0.999]]
|
1633 |
+
[2022-12-20 08:07:36,802] [INFO] [timer.py:196:stop] epoch=0/micro_step=4590/global_step=4590, RunningAvgSamplesPerSec=5.045847427660788, CurrSamplesPerSec=5.12575734566362, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1634 |
+
[2022-12-20 08:10:02,459] [INFO] [logging.py:68:log_dist] [Rank 0] step=4600, skipped=8, lr=[9.08888888888889e-07], mom=[[0.9, 0.999]]
|
1635 |
+
[2022-12-20 08:10:02,460] [INFO] [timer.py:196:stop] epoch=0/micro_step=4600/global_step=4600, RunningAvgSamplesPerSec=5.046269260328224, CurrSamplesPerSec=5.21262751846709, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1636 |
+
{'loss': 0.0001, 'learning_rate': 9.08888888888889e-07, 'epoch': 65.71}
|
1637 |
+
[2022-12-20 08:12:26,062] [INFO] [logging.py:68:log_dist] [Rank 0] step=4610, skipped=8, lr=[8.866666666666668e-07], mom=[[0.9, 0.999]]
|
1638 |
+
[2022-12-20 08:12:26,063] [INFO] [timer.py:196:stop] epoch=0/micro_step=4610/global_step=4610, RunningAvgSamplesPerSec=5.046797422211321, CurrSamplesPerSec=5.262579211303562, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1639 |
+
[2022-12-20 08:14:48,783] [INFO] [logging.py:68:log_dist] [Rank 0] step=4620, skipped=8, lr=[8.644444444444445e-07], mom=[[0.9, 0.999]]
|
1640 |
+
[2022-12-20 08:14:48,785] [INFO] [timer.py:196:stop] epoch=0/micro_step=4620/global_step=4620, RunningAvgSamplesPerSec=5.0474143889414895, CurrSamplesPerSec=5.478198811861338, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1641 |
+
{'loss': 0.0001, 'learning_rate': 8.533333333333334e-07, 'epoch': 66.07}
|
1642 |
+
[2022-12-20 08:17:14,091] [INFO] [logging.py:68:log_dist] [Rank 0] step=4630, skipped=8, lr=[8.422222222222224e-07], mom=[[0.9, 0.999]]
|
1643 |
+
[2022-12-20 08:17:14,092] [INFO] [timer.py:196:stop] epoch=0/micro_step=4630/global_step=4630, RunningAvgSamplesPerSec=5.047884003074243, CurrSamplesPerSec=5.288036006059478, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1644 |
+
[2022-12-20 08:19:37,655] [INFO] [logging.py:68:log_dist] [Rank 0] step=4640, skipped=8, lr=[8.200000000000001e-07], mom=[[0.9, 0.999]]
|
1645 |
+
[2022-12-20 08:19:37,657] [INFO] [timer.py:196:stop] epoch=0/micro_step=4640/global_step=4640, RunningAvgSamplesPerSec=5.048458691418149, CurrSamplesPerSec=5.257712499872688, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1646 |
+
[2022-12-20 08:22:00,831] [INFO] [logging.py:68:log_dist] [Rank 0] step=4650, skipped=8, lr=[7.977777777777779e-07], mom=[[0.9, 0.999]]
|
1647 |
+
[2022-12-20 08:22:00,833] [INFO] [timer.py:196:stop] epoch=0/micro_step=4650/global_step=4650, RunningAvgSamplesPerSec=5.049032023727786, CurrSamplesPerSec=5.335302655950034, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1648 |
+
{'loss': 0.0001, 'learning_rate': 7.977777777777779e-07, 'epoch': 66.43}
|
1649 |
+
[2022-12-20 08:24:23,381] [INFO] [logging.py:68:log_dist] [Rank 0] step=4660, skipped=8, lr=[7.755555555555556e-07], mom=[[0.9, 0.999]]
|
1650 |
+
[2022-12-20 08:24:23,382] [INFO] [timer.py:196:stop] epoch=0/micro_step=4660/global_step=4660, RunningAvgSamplesPerSec=5.049679844711924, CurrSamplesPerSec=5.473512879448466, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1651 |
+
[2022-12-20 08:26:48,255] [INFO] [logging.py:68:log_dist] [Rank 0] step=4670, skipped=8, lr=[7.533333333333335e-07], mom=[[0.9, 0.999]]
|
1652 |
+
[2022-12-20 08:26:48,257] [INFO] [timer.py:196:stop] epoch=0/micro_step=4670/global_step=4670, RunningAvgSamplesPerSec=5.050125782334478, CurrSamplesPerSec=5.377273288074986, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1653 |
+
{'loss': 0.0001, 'learning_rate': 7.422222222222223e-07, 'epoch': 66.79}
|
1654 |
+
[2022-12-20 08:29:15,132] [INFO] [logging.py:68:log_dist] [Rank 0] step=4680, skipped=8, lr=[7.311111111111112e-07], mom=[[0.9, 0.999]]
|
1655 |
+
[2022-12-20 08:29:15,134] [INFO] [timer.py:196:stop] epoch=0/micro_step=4680/global_step=4680, RunningAvgSamplesPerSec=5.050512474845127, CurrSamplesPerSec=5.138506888939374, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1656 |
+
[2022-12-20 08:31:41,228] [INFO] [logging.py:68:log_dist] [Rank 0] step=4690, skipped=8, lr=[7.08888888888889e-07], mom=[[0.9, 0.999]]
|
1657 |
+
[2022-12-20 08:31:41,230] [INFO] [timer.py:196:stop] epoch=0/micro_step=4690/global_step=4690, RunningAvgSamplesPerSec=5.050892558272724, CurrSamplesPerSec=5.392371818165035, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1658 |
+
[2022-12-20 08:34:05,021] [INFO] [logging.py:68:log_dist] [Rank 0] step=4700, skipped=8, lr=[6.866666666666667e-07], mom=[[0.9, 0.999]]
|
1659 |
+
[2022-12-20 08:34:05,022] [INFO] [timer.py:196:stop] epoch=0/micro_step=4700/global_step=4700, RunningAvgSamplesPerSec=5.051463391225271, CurrSamplesPerSec=5.1622127629198085, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
|
1660 |
+
{'loss': 0.0001, 'learning_rate': 6.866666666666667e-07, 'epoch': 67.14}
+[2022-12-20 08:36:30,437] [INFO] [logging.py:68:log_dist] [Rank 0] step=4710, skipped=8, lr=[6.644444444444446e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:36:30,438] [INFO] [timer.py:196:stop] epoch=0/micro_step=4710/global_step=4710, RunningAvgSamplesPerSec=5.051893912055331, CurrSamplesPerSec=5.445218800319529, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 08:38:59,891] [INFO] [logging.py:68:log_dist] [Rank 0] step=4720, skipped=8, lr=[6.422222222222223e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:38:59,893] [INFO] [timer.py:196:stop] epoch=0/micro_step=4720/global_step=4720, RunningAvgSamplesPerSec=5.052178801787182, CurrSamplesPerSec=5.115131684190571, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 6.311111111111112e-07, 'epoch': 67.5}
+[2022-12-20 08:41:26,544] [INFO] [logging.py:68:log_dist] [Rank 0] step=4730, skipped=8, lr=[6.200000000000001e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:41:26,545] [INFO] [timer.py:196:stop] epoch=0/micro_step=4730/global_step=4730, RunningAvgSamplesPerSec=5.05254451772132, CurrSamplesPerSec=5.304123565487315, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 08:43:52,849] [INFO] [logging.py:68:log_dist] [Rank 0] step=4740, skipped=8, lr=[5.977777777777778e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:43:52,851] [INFO] [timer.py:196:stop] epoch=0/micro_step=4740/global_step=4740, RunningAvgSamplesPerSec=5.053035885215057, CurrSamplesPerSec=5.329902208617103, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 08:46:17,184] [INFO] [logging.py:68:log_dist] [Rank 0] step=4750, skipped=8, lr=[5.755555555555555e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:46:17,185] [INFO] [timer.py:196:stop] epoch=0/micro_step=4750/global_step=4750, RunningAvgSamplesPerSec=5.053785097454139, CurrSamplesPerSec=5.481972930186799, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 5.755555555555555e-07, 'epoch': 67.86}
+[2022-12-20 08:48:33,665] [INFO] [logging.py:68:log_dist] [Rank 0] step=4760, skipped=8, lr=[5.533333333333334e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:48:33,666] [INFO] [timer.py:196:stop] epoch=0/micro_step=4760/global_step=4760, RunningAvgSamplesPerSec=5.055487957320154, CurrSamplesPerSec=6.146145092417891, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 08:50:52,761] [INFO] [logging.py:68:log_dist] [Rank 0] step=4770, skipped=8, lr=[5.311111111111111e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:50:52,762] [INFO] [timer.py:196:stop] epoch=0/micro_step=4770/global_step=4770, RunningAvgSamplesPerSec=5.056655527182182, CurrSamplesPerSec=5.561989132264451, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 5.2e-07, 'epoch': 68.21}
+[2022-12-20 08:53:12,380] [INFO] [logging.py:68:log_dist] [Rank 0] step=4780, skipped=8, lr=[5.088888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:53:12,382] [INFO] [timer.py:196:stop] epoch=0/micro_step=4780/global_step=4780, RunningAvgSamplesPerSec=5.057621824492988, CurrSamplesPerSec=5.652548091745247, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 08:55:31,207] [INFO] [logging.py:68:log_dist] [Rank 0] step=4790, skipped=8, lr=[4.866666666666666e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:55:31,208] [INFO] [timer.py:196:stop] epoch=0/micro_step=4790/global_step=4790, RunningAvgSamplesPerSec=5.058667203586027, CurrSamplesPerSec=5.664042371231427, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 08:57:50,720] [INFO] [logging.py:68:log_dist] [Rank 0] step=4800, skipped=8, lr=[4.6444444444444446e-07], mom=[[0.9, 0.999]]
+[2022-12-20 08:57:50,721] [INFO] [timer.py:196:stop] epoch=0/micro_step=4800/global_step=4800, RunningAvgSamplesPerSec=5.059570925383668, CurrSamplesPerSec=5.519459240106216, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 4.6444444444444446e-07, 'epoch': 68.57}
+[2022-12-20 09:00:12,990] [INFO] [logging.py:68:log_dist] [Rank 0] step=4810, skipped=8, lr=[4.422222222222223e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:00:12,992] [INFO] [timer.py:196:stop] epoch=0/micro_step=4810/global_step=4810, RunningAvgSamplesPerSec=5.060229523268901, CurrSamplesPerSec=5.224763032139902, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:02:36,045] [INFO] [logging.py:68:log_dist] [Rank 0] step=4820, skipped=8, lr=[4.2000000000000006e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:02:36,046] [INFO] [timer.py:196:stop] epoch=0/micro_step=4820/global_step=4820, RunningAvgSamplesPerSec=5.0608145350618585, CurrSamplesPerSec=5.335017418001835, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 4.0888888888888897e-07, 'epoch': 68.93}
+[2022-12-20 09:05:03,069] [INFO] [logging.py:68:log_dist] [Rank 0] step=4830, skipped=8, lr=[3.9777777777777783e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:05:03,071] [INFO] [timer.py:196:stop] epoch=0/micro_step=4830/global_step=4830, RunningAvgSamplesPerSec=5.061128322114403, CurrSamplesPerSec=5.27507112151907, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:07:29,555] [INFO] [logging.py:68:log_dist] [Rank 0] step=4840, skipped=8, lr=[3.755555555555556e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:07:29,557] [INFO] [timer.py:196:stop] epoch=0/micro_step=4840/global_step=4840, RunningAvgSamplesPerSec=5.061443297556752, CurrSamplesPerSec=5.330757008706792, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:09:54,208] [INFO] [logging.py:68:log_dist] [Rank 0] step=4850, skipped=8, lr=[3.533333333333334e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:09:54,210] [INFO] [timer.py:196:stop] epoch=0/micro_step=4850/global_step=4850, RunningAvgSamplesPerSec=5.061879695879943, CurrSamplesPerSec=5.24628147577871, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 3.533333333333334e-07, 'epoch': 69.29}
+[2022-12-20 09:12:18,737] [INFO] [logging.py:68:log_dist] [Rank 0] step=4860, skipped=8, lr=[3.3111111111111115e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:12:18,738] [INFO] [timer.py:196:stop] epoch=0/micro_step=4860/global_step=4860, RunningAvgSamplesPerSec=5.062324568211323, CurrSamplesPerSec=5.235211029343304, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:14:42,180] [INFO] [logging.py:68:log_dist] [Rank 0] step=4870, skipped=8, lr=[3.088888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:14:42,181] [INFO] [timer.py:196:stop] epoch=0/micro_step=4870/global_step=4870, RunningAvgSamplesPerSec=5.062819565868312, CurrSamplesPerSec=5.294168184906786, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.977777777777778e-07, 'epoch': 69.64}
+[2022-12-20 09:17:07,652] [INFO] [logging.py:68:log_dist] [Rank 0] step=4880, skipped=8, lr=[2.866666666666667e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:17:07,654] [INFO] [timer.py:196:stop] epoch=0/micro_step=4880/global_step=4880, RunningAvgSamplesPerSec=5.063114334971396, CurrSamplesPerSec=5.268436237516054, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:19:33,593] [INFO] [logging.py:68:log_dist] [Rank 0] step=4890, skipped=8, lr=[2.6444444444444447e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:19:33,595] [INFO] [timer.py:196:stop] epoch=0/micro_step=4890/global_step=4890, RunningAvgSamplesPerSec=5.063346113868746, CurrSamplesPerSec=5.148231572465386, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:21:59,702] [INFO] [logging.py:68:log_dist] [Rank 0] step=4900, skipped=8, lr=[2.4222222222222224e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:21:59,704] [INFO] [timer.py:196:stop] epoch=0/micro_step=4900/global_step=4900, RunningAvgSamplesPerSec=5.063642263622397, CurrSamplesPerSec=5.281941504872331, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2.4222222222222224e-07, 'epoch': 70.0}
+[2022-12-20 09:24:26,305] [INFO] [logging.py:68:log_dist] [Rank 0] step=4910, skipped=8, lr=[2.2e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:24:26,307] [INFO] [timer.py:196:stop] epoch=0/micro_step=4910/global_step=4910, RunningAvgSamplesPerSec=5.063868324555161, CurrSamplesPerSec=5.059670421090867, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:26:55,101] [INFO] [logging.py:68:log_dist] [Rank 0] step=4920, skipped=8, lr=[1.9777777777777778e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:26:55,103] [INFO] [timer.py:196:stop] epoch=0/micro_step=4920/global_step=4920, RunningAvgSamplesPerSec=5.063894210846909, CurrSamplesPerSec=4.78239546982397, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 1.866666666666667e-07, 'epoch': 70.36}
+[2022-12-20 09:29:28,157] [INFO] [logging.py:68:log_dist] [Rank 0] step=4930, skipped=8, lr=[1.7555555555555558e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:29:28,158] [INFO] [timer.py:196:stop] epoch=0/micro_step=4930/global_step=4930, RunningAvgSamplesPerSec=5.063701372391198, CurrSamplesPerSec=5.047858548350161, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:31:48,059] [INFO] [logging.py:68:log_dist] [Rank 0] step=4940, skipped=8, lr=[1.5333333333333333e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:31:48,061] [INFO] [timer.py:196:stop] epoch=0/micro_step=4940/global_step=4940, RunningAvgSamplesPerSec=5.064516563171399, CurrSamplesPerSec=5.5484346820325605, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:34:08,499] [INFO] [logging.py:68:log_dist] [Rank 0] step=4950, skipped=8, lr=[1.3111111111111113e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:34:08,501] [INFO] [timer.py:196:stop] epoch=0/micro_step=4950/global_step=4950, RunningAvgSamplesPerSec=5.065306002780262, CurrSamplesPerSec=5.450649164948605, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 1.3111111111111113e-07, 'epoch': 70.71}
+[2022-12-20 09:36:28,150] [INFO] [logging.py:68:log_dist] [Rank 0] step=4960, skipped=8, lr=[1.088888888888889e-07], mom=[[0.9, 0.999]]
+[2022-12-20 09:36:28,152] [INFO] [timer.py:196:stop] epoch=0/micro_step=4960/global_step=4960, RunningAvgSamplesPerSec=5.066117476082817, CurrSamplesPerSec=5.618326183471552, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:38:47,744] [INFO] [logging.py:68:log_dist] [Rank 0] step=4970, skipped=8, lr=[8.666666666666668e-08], mom=[[0.9, 0.999]]
+[2022-12-20 09:38:47,746] [INFO] [timer.py:196:stop] epoch=0/micro_step=4970/global_step=4970, RunningAvgSamplesPerSec=5.066950118398116, CurrSamplesPerSec=5.674350640600091, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 7.555555555555556e-08, 'epoch': 71.07}
+[2022-12-20 09:41:10,723] [INFO] [logging.py:68:log_dist] [Rank 0] step=4980, skipped=8, lr=[6.444444444444445e-08], mom=[[0.9, 0.999]]
+[2022-12-20 09:41:10,724] [INFO] [timer.py:196:stop] epoch=0/micro_step=4980/global_step=4980, RunningAvgSamplesPerSec=5.067469981862319, CurrSamplesPerSec=5.1892294486294315, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:43:36,226] [INFO] [logging.py:68:log_dist] [Rank 0] step=4990, skipped=8, lr=[4.222222222222222e-08], mom=[[0.9, 0.999]]
+[2022-12-20 09:43:36,228] [INFO] [timer.py:196:stop] epoch=0/micro_step=4990/global_step=4990, RunningAvgSamplesPerSec=5.067770417735178, CurrSamplesPerSec=5.121883309677963, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+[2022-12-20 09:46:01,735] [INFO] [logging.py:68:log_dist] [Rank 0] step=5000, skipped=8, lr=[2e-08], mom=[[0.9, 0.999]]
+[2022-12-20 09:46:01,736] [INFO] [timer.py:196:stop] epoch=0/micro_step=5000/global_step=5000, RunningAvgSamplesPerSec=5.068131732607621, CurrSamplesPerSec=5.1918555188372055, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+{'loss': 0.0001, 'learning_rate': 2e-08, 'epoch': 71.43}
+{'eval_loss': 0.475341796875, 'eval_wer': 23.453117563065206, 'eval_runtime': 788.365, 'eval_samples_per_second': 2.876, 'eval_steps_per_second': 0.09, 'epoch': 71.43}
+[2022-12-20 09:59:13,106] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save!
+[2022-12-20 09:59:13,118] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt
+[2022-12-20 09:59:13,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt...
+[2022-12-20 09:59:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt.
+[2022-12-20 09:59:15,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+[2022-12-20 09:59:28,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+[2022-12-20 09:59:28,048] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+[2022-12-20 09:59:28,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
runs/Dec19_11-14-29_fe2747a042f0/events.out.tfevents.1671479623.fe2747a042f0.2334566.0
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:448c7b16314fbb86a4ce12fcaf589d963384d5ecaa204735d799c10ddf30307d
+size 37253