INFO:__main__:Initializing DDP settings...
INFO:__main__: is_ddp = False
INFO:__main__:Initializing PyTorch settings...
INFO:__main__:Initializing models and optimizers...
INFO:__main__: Resuming training from `/home/bd4sur/ai/Nano/checkpoint/nano_56m_20241027_pt_99000.pt`...
INFO:__main__: block_size = 512
INFO:__main__: vocab_size = 16384
INFO:__main__: n_layer = 16
INFO:__main__: n_embd = 512
INFO:__main__: n_head = 16
INFO:__main__: n_kv_head = 8
INFO:__main__: n_hidden = 1408
INFO:__main__: Parameters
INFO:__main__: Total = 55,591,424 (0.055591424B)
INFO:__main__: Train = 55,591,424 (100.0%)
INFO:__main__:Loading dataset...
INFO:__main__: Train set 0 : 2,092,403 samples (1,071,310,336 tokens)
INFO:__main__: Valid set 0 : 104,642 samples (53,576,704 tokens)
INFO:__main__:2024-11-04 22:27:45 | Start training from iteration #99000
INFO:__main__:2024-11-04 22:27:49 | Epoch: 0 | Step: 99000 | Dataset: 0-64 | Loss: 3.276 | 3992 ms/step , 1575.56 GFLOP/s , 0.0 tokens/s
INFO:__main__:2024-11-04 22:27:50 | Validation | Step: 99000 | Val_loss: 4.331 | Best_val_loss: 19.4081
INFO:__main__:2024-11-04 22:27:59 | Epoch: 0 | Step: 99010 | Dataset: 0-384 | Loss: 3.181 | 914 ms/step , 6884.60 GFLOP/s , 15263.6 tokens/s
INFO:__main__:2024-11-04 22:28:09 | Epoch: 0 | Step: 99020 | Dataset: 0-704 | Loss: 3.035 | 911 ms/step , 6904.04 GFLOP/s , 17958.9 tokens/s
INFO:__main__:2024-11-04 22:28:18 | Epoch: 0 | Step: 99030 | Dataset: 0-1024 | Loss: 2.928 | 911 ms/step , 6900.40 GFLOP/s , 17959.8 tokens/s
INFO:__main__:2024-11-04 22:28:27 | Epoch: 0 | Step: 99040 | Dataset: 0-1344 | Loss: 3.366 | 912 ms/step , 6897.83 GFLOP/s , 17953.5 tokens/s
INFO:__main__:2024-11-04 22:28:36 | Epoch: 0 | Step: 99050 | Dataset: 0-1664 | Loss: 3.055 | 912 ms/step , 6895.49 GFLOP/s , 17953.1 tokens/s
INFO:__main__:2024-11-04 22:28:45 | Epoch: 0 | Step: 99060 | Dataset: 0-1984 | Loss: 2.679 | 912 ms/step , 6899.03 GFLOP/s , 17954.3 tokens/s
INFO:__main__:2024-11-04 22:28:54 | Epoch: 0 | Step: 99070 | Dataset: 0-2304 | Loss: 2.477 | 912 ms/step , 6894.12 GFLOP/s , 17945.0 tokens/s
INFO:__main__:2024-11-04 22:29:03 | Epoch: 0 | Step: 99080 | Dataset: 0-2624 | Loss: 2.809 | 912 ms/step , 6900.02 GFLOP/s , 17957.1 tokens/s
INFO:__main__:2024-11-04 22:29:12 | Epoch: 0 | Step: 99090 | Dataset: 0-2944 | Loss: 2.668 | 913 ms/step , 6891.59 GFLOP/s , 17953.8 tokens/s
INFO:__main__:2024-11-04 22:29:22 | Epoch: 0 | Step: 99100 | Dataset: 0-3264 | Loss: 2.872 | 912 ms/step , 6899.17 GFLOP/s , 17951.4 tokens/s
INFO:__main__:2024-11-04 22:29:23 | Validation | Step: 99100 | Val_loss: 4.112 | Best_val_loss: 19.4081
INFO:__main__:2024-11-04 22:29:23 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_222923_step_99100.pt`
INFO:__main__:2024-11-04 22:29:34 | Epoch: 0 | Step: 99110 | Dataset: 0-3584 | Loss: 2.765 | 912 ms/step , 6893.00 GFLOP/s , 13784.4 tokens/s
INFO:__main__:2024-11-04 22:29:43 | Epoch: 0 | Step: 99120 | Dataset: 0-3904 | Loss: 2.497 | 913 ms/step , 6891.72 GFLOP/s , 17950.2 tokens/s
INFO:__main__:2024-11-04 22:29:52 | Epoch: 0 | Step: 99130 | Dataset: 0-4224 | Loss: 2.535 | 912 ms/step , 6892.69 GFLOP/s , 17954.7 tokens/s
INFO:__main__:2024-11-04 22:30:01 | Epoch: 0 | Step: 99140 | Dataset: 0-4544 | Loss: 2.719 | 911 ms/step , 6901.17 GFLOP/s , 17951.6 tokens/s
INFO:__main__:2024-11-04 22:30:10 | Epoch: 0 | Step: 99150 | Dataset: 0-4864 | Loss: 2.716 | 912 ms/step , 6898.42 GFLOP/s , 17962.5 tokens/s
INFO:__main__:2024-11-04 22:30:19 | Epoch: 0 | Step: 99160 | Dataset: 0-5184 | Loss: 2.671 | 912 ms/step , 6895.57 GFLOP/s , 17959.8 tokens/s
INFO:__main__:2024-11-04 22:30:28 | Epoch: 0 | Step: 99170 | Dataset: 0-5504 | Loss: 2.698 | 916 ms/step , 6863.62 GFLOP/s , 17938.0 tokens/s
INFO:__main__:2024-11-04 22:30:37 | Epoch: 0 | Step: 99180 | Dataset: 0-5824 | Loss: 2.504 | 914 ms/step , 6877.62 GFLOP/s , 17895.1 tokens/s
INFO:__main__:2024-11-04 22:30:47 | Epoch: 0 | Step: 99190 | Dataset: 0-6144 | Loss: 2.929 | 913 ms/step , 6888.26 GFLOP/s , 17938.8 tokens/s
INFO:__main__:2024-11-04 22:30:56 | Epoch: 0 | Step: 99200 | Dataset: 0-6464 | Loss: 2.661 | 913 ms/step , 6885.56 GFLOP/s , 17936.6 tokens/s
INFO:__main__:2024-11-04 22:30:57 | Validation | Step: 99200 | Val_loss: 3.912 | Best_val_loss: 4.1120
INFO:__main__:2024-11-04 22:30:57 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_223057_step_99200.pt`
INFO:__main__:2024-11-04 22:31:08 | Epoch: 0 | Step: 99210 | Dataset: 0-6784 | Loss: 2.879 | 913 ms/step , 6887.08 GFLOP/s , 13800.9 tokens/s
INFO:__main__:2024-11-04 22:31:17 | Epoch: 0 | Step: 99220 | Dataset: 0-7104 | Loss: 2.547 | 913 ms/step , 6887.57 GFLOP/s , 17936.9 tokens/s
INFO:__main__:2024-11-04 22:31:26 | Epoch: 0 | Step: 99230 | Dataset: 0-7424 | Loss: 2.527 | 912 ms/step , 6894.44 GFLOP/s , 17941.1 tokens/s
INFO:__main__:2024-11-04 22:31:35 | Epoch: 0 | Step: 99240 | Dataset: 0-7744 | Loss: 2.863 | 913 ms/step , 6888.56 GFLOP/s , 17924.0 tokens/s
INFO:__main__:2024-11-04 22:31:44 | Epoch: 0 | Step: 99250 | Dataset: 0-8064 | Loss: 2.425 | 913 ms/step , 6891.11 GFLOP/s , 17936.5 tokens/s
INFO:__main__:2024-11-04 22:31:53 | Epoch: 0 | Step: 99260 | Dataset: 0-8384 | Loss: 2.723 | 913 ms/step , 6889.08 GFLOP/s , 17929.9 tokens/s
INFO:__main__:2024-11-04 22:32:02 | Epoch: 0 | Step: 99270 | Dataset: 0-8704 | Loss: 2.641 | 913 ms/step , 6887.08 GFLOP/s , 17930.0 tokens/s
INFO:__main__:2024-11-04 22:32:12 | Epoch: 0 | Step: 99280 | Dataset: 0-9024 | Loss: 2.495 | 915 ms/step , 6876.70 GFLOP/s , 17927.1 tokens/s
INFO:__main__:2024-11-04 22:32:21 | Epoch: 0 | Step: 99290 | Dataset: 0-9344 | Loss: 2.792 | 913 ms/step , 6890.64 GFLOP/s , 17928.3 tokens/s
INFO:__main__:2024-11-04 22:32:30 | Epoch: 0 | Step: 99300 | Dataset: 0-9664 | Loss: 2.762 | 913 ms/step , 6887.89 GFLOP/s , 17930.1 tokens/s
INFO:__main__:2024-11-04 22:32:31 | Validation | Step: 99300 | Val_loss: 3.852 | Best_val_loss: 3.9122
INFO:__main__:2024-11-04 22:32:31 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_223231_step_99300.pt`
INFO:__main__:2024-11-04 22:32:42 | Epoch: 0 | Step: 99310 | Dataset: 0-9984 | Loss: 2.299 | 914 ms/step , 6879.75 GFLOP/s , 13806.1 tokens/s
INFO:__main__:2024-11-04 22:32:51 | Epoch: 0 | Step: 99320 | Dataset: 0-10304 | Loss: 2.739 | 913 ms/step , 6890.95 GFLOP/s , 17935.9 tokens/s
INFO:__main__:2024-11-04 22:33:00 | Epoch: 0 | Step: 99330 | Dataset: 0-10624 | Loss: 2.534 | 912 ms/step , 6893.85 GFLOP/s , 17932.3 tokens/s
INFO:__main__:2024-11-04 22:33:09 | Epoch: 0 | Step: 99340 | Dataset: 0-10944 | Loss: 2.467 | 913 ms/step , 6892.58 GFLOP/s , 17917.4 tokens/s
INFO:__main__:2024-11-04 22:33:18 | Epoch: 0 | Step: 99350 | Dataset: 0-11264 | Loss: 2.264 | 914 ms/step , 6880.86 GFLOP/s , 17920.5 tokens/s
INFO:__main__:2024-11-04 22:33:27 | Epoch: 0 | Step: 99360 | Dataset: 0-11584 | Loss: 2.460 | 914 ms/step , 6884.41 GFLOP/s , 17931.2 tokens/s
INFO:__main__:2024-11-04 22:33:36 | Epoch: 0 | Step: 99370 | Dataset: 0-11904 | Loss: 2.152 | 913 ms/step , 6886.98 GFLOP/s , 17921.1 tokens/s
INFO:__main__:2024-11-04 22:33:46 | Epoch: 0 | Step: 99380 | Dataset: 0-12224 | Loss: 2.578 | 913 ms/step , 6886.96 GFLOP/s , 17928.4 tokens/s
INFO:__main__:2024-11-04 22:33:55 | Epoch: 0 | Step: 99390 | Dataset: 0-12544 | Loss: 2.354 | 913 ms/step , 6885.60 GFLOP/s , 17924.4 tokens/s
INFO:__main__:2024-11-04 22:34:04 | Epoch: 0 | Step: 99400 | Dataset: 0-12864 | Loss: 2.440 | 914 ms/step , 6881.11 GFLOP/s , 17926.9 tokens/s
INFO:__main__:2024-11-04 22:34:06 | Validation | Step: 99400 | Val_loss: 3.719 | Best_val_loss: 3.8518
INFO:__main__:2024-11-04 22:34:06 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_223406_step_99400.pt`
INFO:__main__:2024-11-04 22:34:16 | Epoch: 0 | Step: 99410 | Dataset: 0-13184 | Loss: 2.415 | 914 ms/step , 6882.04 GFLOP/s , 13782.6 tokens/s
INFO:__main__:2024-11-04 22:34:25 | Epoch: 0 | Step: 99420 | Dataset: 0-13504 | Loss: 2.268 | 914 ms/step , 6883.74 GFLOP/s , 17932.3 tokens/s
INFO:__main__:2024-11-04 22:34:34 | Epoch: 0 | Step: 99430 | Dataset: 0-13824 | Loss: 2.495 | 913 ms/step , 6887.86 GFLOP/s , 17926.8 tokens/s
INFO:__main__:2024-11-04 22:34:43 | Epoch: 0 | Step: 99440 | Dataset: 0-14144 | Loss: 2.431 | 913 ms/step , 6892.52 GFLOP/s , 17913.3 tokens/s
INFO:__main__:2024-11-04 22:34:52 | Epoch: 0 | Step: 99450 | Dataset: 0-14464 | Loss: 2.300 | 914 ms/step , 6883.05 GFLOP/s , 17931.5 tokens/s
INFO:__main__:2024-11-04 22:35:01 | Epoch: 0 | Step: 99460 | Dataset: 0-14784 | Loss: 2.630 | 916 ms/step , 6867.92 GFLOP/s , 17927.2 tokens/s
INFO:__main__:2024-11-04 22:35:11 | Epoch: 0 | Step: 99470 | Dataset: 0-15104 | Loss: 2.435 | 912 ms/step , 6893.36 GFLOP/s , 17932.4 tokens/s
INFO:__main__:2024-11-04 22:35:20 | Epoch: 0 | Step: 99480 | Dataset: 0-15424 | Loss: 2.148 | 914 ms/step , 6878.98 GFLOP/s , 17928.3 tokens/s
INFO:__main__:2024-11-04 22:35:29 | Epoch: 0 | Step: 99490 | Dataset: 0-15744 | Loss: 2.056 | 915 ms/step , 6875.19 GFLOP/s , 17922.0 tokens/s
INFO:__main__:2024-11-04 22:35:38 | Epoch: 0 | Step: 99500 | Dataset: 0-16064 | Loss: 2.312 | 913 ms/step , 6886.67 GFLOP/s , 17916.4 tokens/s
INFO:__main__:2024-11-04 22:35:40 | Validation | Step: 99500 | Val_loss: 3.568 | Best_val_loss: 3.7195
INFO:__main__:2024-11-04 22:35:40 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_223540_step_99500.pt`
INFO:__main__:2024-11-04 22:35:50 | Epoch: 0 | Step: 99510 | Dataset: 0-16384 | Loss: 2.123 | 914 ms/step , 6884.49 GFLOP/s , 13794.0 tokens/s
INFO:__main__:2024-11-04 22:35:59 | Epoch: 0 | Step: 99520 | Dataset: 0-16704 | Loss: 2.226 | 914 ms/step , 6884.37 GFLOP/s , 17932.5 tokens/s
INFO:__main__:2024-11-04 22:36:08 | Epoch: 0 | Step: 99530 | Dataset: 0-17024 | Loss: 2.416 | 912 ms/step , 6892.93 GFLOP/s , 17933.2 tokens/s
INFO:__main__:2024-11-04 22:36:17 | Epoch: 0 | Step: 99540 | Dataset: 0-17344 | Loss: 2.390 | 913 ms/step , 6885.25 GFLOP/s , 17923.9 tokens/s
INFO:__main__:2024-11-04 22:36:26 | Epoch: 0 | Step: 99550 | Dataset: 0-17664 | Loss: 2.248 | 913 ms/step , 6886.54 GFLOP/s , 17918.6 tokens/s
INFO:__main__:2024-11-04 22:36:36 | Epoch: 0 | Step: 99560 | Dataset: 0-17984 | Loss: 1.948 | 915 ms/step , 6877.06 GFLOP/s , 17919.2 tokens/s
INFO:__main__:2024-11-04 22:36:45 | Epoch: 0 | Step: 99570 | Dataset: 0-18304 | Loss: 2.182 | 914 ms/step , 6883.56 GFLOP/s , 17928.0 tokens/s
INFO:__main__:2024-11-04 22:36:54 | Epoch: 0 | Step: 99580 | Dataset: 0-18624 | Loss: 2.090 | 913 ms/step , 6890.68 GFLOP/s , 17914.5 tokens/s
INFO:__main__:2024-11-04 22:37:03 | Epoch: 0 | Step: 99590 | Dataset: 0-18944 | Loss: 2.322 | 913 ms/step , 6888.75 GFLOP/s , 17925.1 tokens/s
INFO:__main__:2024-11-04 22:37:12 | Epoch: 0 | Step: 99600 | Dataset: 0-19264 | Loss: 2.169 | 913 ms/step , 6891.05 GFLOP/s , 17920.0 tokens/s
INFO:__main__:2024-11-04 22:37:14 | Validation | Step: 99600 | Val_loss: 3.298 | Best_val_loss: 3.5681
INFO:__main__:2024-11-04 22:37:14 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_223714_step_99600.pt`
INFO:__main__:2024-11-04 22:37:24 | Epoch: 0 | Step: 99610 | Dataset: 0-19584 | Loss: 2.371 | 913 ms/step , 6891.25 GFLOP/s , 13788.5 tokens/s
INFO:__main__:2024-11-04 22:37:33 | Epoch: 0 | Step: 99620 | Dataset: 0-19904 | Loss: 1.994 | 913 ms/step , 6890.71 GFLOP/s , 17918.7 tokens/s
INFO:__main__:2024-11-04 22:37:42 | Epoch: 0 | Step: 99630 | Dataset: 0-20224 | Loss: 2.087 | 914 ms/step , 6880.93 GFLOP/s , 17920.9 tokens/s
INFO:__main__:2024-11-04 22:37:52 | Epoch: 0 | Step: 99640 | Dataset: 0-20544 | Loss: 2.242 | 914 ms/step , 6879.08 GFLOP/s , 17908.9 tokens/s
INFO:__main__:2024-11-04 22:38:01 | Epoch: 0 | Step: 99650 | Dataset: 0-20864 | Loss: 1.899 | 914 ms/step , 6878.09 GFLOP/s , 17923.5 tokens/s
INFO:__main__:2024-11-04 22:38:10 | Epoch: 0 | Step: 99660 | Dataset: 0-21184 | Loss: 2.267 | 914 ms/step , 6881.99 GFLOP/s , 17929.8 tokens/s
INFO:__main__:2024-11-04 22:38:19 | Epoch: 0 | Step: 99670 | Dataset: 0-21504 | Loss: 2.122 | 913 ms/step , 6889.02 GFLOP/s , 17921.6 tokens/s
INFO:__main__:2024-11-04 22:38:28 | Epoch: 0 | Step: 99680 | Dataset: 0-21824 | Loss: 2.073 | 914 ms/step , 6878.44 GFLOP/s , 17916.8 tokens/s
INFO:__main__:2024-11-04 22:38:37 | Epoch: 0 | Step: 99690 | Dataset: 0-22144 | Loss: 1.956 | 915 ms/step , 6874.04 GFLOP/s , 17924.2 tokens/s
INFO:__main__:2024-11-04 22:38:46 | Epoch: 0 | Step: 99700 | Dataset: 0-22464 | Loss: 2.028 | 914 ms/step , 6882.04 GFLOP/s , 17924.5 tokens/s
INFO:__main__:2024-11-04 22:38:48 | Validation | Step: 99700 | Val_loss: 3.298 | Best_val_loss: 3.2977
INFO:__main__:2024-11-04 22:38:57 | Epoch: 0 | Step: 99710 | Dataset: 0-22784 | Loss: 2.086 | 914 ms/step , 6884.59 GFLOP/s , 15271.8 tokens/s
INFO:__main__:2024-11-04 22:39:06 | Epoch: 0 | Step: 99720 | Dataset: 0-23104 | Loss: 2.156 | 912 ms/step , 6899.77 GFLOP/s , 17927.4 tokens/s
INFO:__main__:2024-11-04 22:39:15 | Epoch: 0 | Step: 99730 | Dataset: 0-23424 | Loss: 2.002 | 914 ms/step , 6884.70 GFLOP/s , 17924.5 tokens/s
INFO:__main__:2024-11-04 22:39:25 | Epoch: 0 | Step: 99740 | Dataset: 0-23744 | Loss: 1.965 | 914 ms/step , 6882.52 GFLOP/s , 17926.2 tokens/s
INFO:__main__:2024-11-04 22:39:34 | Epoch: 0 | Step: 99750 | Dataset: 0-24064 | Loss: 1.886 | 914 ms/step , 6881.22 GFLOP/s , 17925.4 tokens/s
INFO:__main__:2024-11-04 22:39:43 | Epoch: 0 | Step: 99760 | Dataset: 0-24384 | Loss: 1.811 | 914 ms/step , 6880.49 GFLOP/s , 17925.5 tokens/s
INFO:__main__:2024-11-04 22:39:52 | Epoch: 0 | Step: 99770 | Dataset: 0-24704 | Loss: 2.302 | 914 ms/step , 6884.91 GFLOP/s , 17919.6 tokens/s
INFO:__main__:2024-11-04 22:40:01 | Epoch: 0 | Step: 99780 | Dataset: 0-25024 | Loss: 2.028 | 914 ms/step , 6879.47 GFLOP/s , 17919.5 tokens/s
INFO:__main__:2024-11-04 22:40:10 | Epoch: 0 | Step: 99790 | Dataset: 0-25344 | Loss: 1.948 | 913 ms/step , 6887.90 GFLOP/s , 17928.8 tokens/s
INFO:__main__:2024-11-04 22:40:19 | Epoch: 0 | Step: 99800 | Dataset: 0-25664 | Loss: 2.098 | 914 ms/step , 6882.11 GFLOP/s , 17927.8 tokens/s
INFO:__main__:2024-11-04 22:40:21 | Validation | Step: 99800 | Val_loss: 2.948 | Best_val_loss: 3.2977
INFO:__main__:2024-11-04 22:40:21 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224021_step_99800.pt`
INFO:__main__:2024-11-04 22:40:31 | Epoch: 0 | Step: 99810 | Dataset: 0-25984 | Loss: 1.895 | 913 ms/step , 6886.26 GFLOP/s , 13818.3 tokens/s
INFO:__main__:2024-11-04 22:40:40 | Epoch: 0 | Step: 99820 | Dataset: 0-26304 | Loss: 1.971 | 914 ms/step , 6883.90 GFLOP/s , 17926.3 tokens/s
INFO:__main__:2024-11-04 22:40:49 | Epoch: 0 | Step: 99830 | Dataset: 0-26624 | Loss: 2.122 | 913 ms/step , 6886.00 GFLOP/s , 17925.6 tokens/s
INFO:__main__:2024-11-04 22:40:59 | Epoch: 0 | Step: 99840 | Dataset: 0-26944 | Loss: 1.893 | 914 ms/step , 6882.56 GFLOP/s , 17921.1 tokens/s
INFO:__main__:2024-11-04 22:41:08 | Epoch: 0 | Step: 99850 | Dataset: 0-27264 | Loss: 1.878 | 913 ms/step , 6889.47 GFLOP/s , 17931.7 tokens/s
INFO:__main__:2024-11-04 22:41:17 | Epoch: 0 | Step: 99860 | Dataset: 0-27584 | Loss: 2.181 | 913 ms/step , 6887.58 GFLOP/s , 17924.6 tokens/s
INFO:__main__:2024-11-04 22:41:26 | Epoch: 0 | Step: 99870 | Dataset: 0-27904 | Loss: 1.792 | 915 ms/step , 6871.60 GFLOP/s , 17919.7 tokens/s
INFO:__main__:2024-11-04 22:41:35 | Epoch: 0 | Step: 99880 | Dataset: 0-28224 | Loss: 1.904 | 915 ms/step , 6874.89 GFLOP/s , 17919.8 tokens/s
INFO:__main__:2024-11-04 22:41:44 | Epoch: 0 | Step: 99890 | Dataset: 0-28544 | Loss: 2.069 | 912 ms/step , 6895.27 GFLOP/s , 17931.0 tokens/s
INFO:__main__:2024-11-04 22:41:53 | Epoch: 0 | Step: 99900 | Dataset: 0-28864 | Loss: 1.673 | 915 ms/step , 6873.66 GFLOP/s , 17922.2 tokens/s
INFO:__main__:2024-11-04 22:41:55 | Validation | Step: 99900 | Val_loss: 2.872 | Best_val_loss: 2.9482
INFO:__main__:2024-11-04 22:41:55 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224155_step_99900.pt`
INFO:__main__:2024-11-04 22:42:05 | Epoch: 0 | Step: 99910 | Dataset: 0-29184 | Loss: 1.790 | 912 ms/step , 6893.59 GFLOP/s , 13768.8 tokens/s
INFO:__main__:2024-11-04 22:42:15 | Epoch: 0 | Step: 99920 | Dataset: 0-29504 | Loss: 1.805 | 914 ms/step , 6880.59 GFLOP/s , 17920.6 tokens/s
INFO:__main__:2024-11-04 22:42:24 | Epoch: 0 | Step: 99930 | Dataset: 0-29824 | Loss: 1.656 | 914 ms/step , 6881.41 GFLOP/s , 17925.3 tokens/s
INFO:__main__:2024-11-04 22:42:33 | Epoch: 0 | Step: 99940 | Dataset: 0-30144 | Loss: 1.909 | 913 ms/step , 6888.36 GFLOP/s , 17924.9 tokens/s
INFO:__main__:2024-11-04 22:42:42 | Epoch: 0 | Step: 99950 | Dataset: 0-30464 | Loss: 1.778 | 913 ms/step , 6885.40 GFLOP/s , 17921.1 tokens/s
INFO:__main__:2024-11-04 22:42:51 | Epoch: 0 | Step: 99960 | Dataset: 0-30784 | Loss: 1.706 | 916 ms/step , 6868.13 GFLOP/s , 17923.7 tokens/s
INFO:__main__:2024-11-04 22:43:00 | Epoch: 0 | Step: 99970 | Dataset: 0-31104 | Loss: 1.858 | 915 ms/step , 6872.34 GFLOP/s , 17924.9 tokens/s
INFO:__main__:2024-11-04 22:43:09 | Epoch: 0 | Step: 99980 | Dataset: 0-31424 | Loss: 1.852 | 913 ms/step , 6885.79 GFLOP/s , 17926.4 tokens/s
INFO:__main__:2024-11-04 22:43:19 | Epoch: 0 | Step: 99990 | Dataset: 0-31744 | Loss: 1.726 | 914 ms/step , 6883.99 GFLOP/s , 17921.8 tokens/s
INFO:__main__:2024-11-04 22:43:28 | Epoch: 0 | Step: 100000 | Dataset: 0-32064 | Loss: 1.866 | 914 ms/step , 6880.20 GFLOP/s , 17917.8 tokens/s
INFO:__main__:2024-11-04 22:43:29 | Validation | Step: 100000 | Val_loss: 2.695 | Best_val_loss: 2.8717
INFO:__main__:2024-11-04 22:43:29 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224329_step_100000.pt`
INFO:__main__:2024-11-04 22:43:40 | Epoch: 0 | Step: 100010 | Dataset: 0-32384 | Loss: 1.805 | 913 ms/step , 6892.17 GFLOP/s , 13804.8 tokens/s
INFO:__main__:2024-11-04 22:43:49 | Epoch: 0 | Step: 100020 | Dataset: 0-32704 | Loss: 1.749 | 915 ms/step , 6876.38 GFLOP/s , 17914.3 tokens/s
INFO:__main__:2024-11-04 22:43:58 | Epoch: 0 | Step: 100030 | Dataset: 0-33024 | Loss: 1.860 | 914 ms/step , 6884.00 GFLOP/s , 17920.4 tokens/s
INFO:__main__:2024-11-04 22:44:07 | Epoch: 0 | Step: 100040 | Dataset: 0-33344 | Loss: 1.895 | 914 ms/step , 6881.85 GFLOP/s , 17922.1 tokens/s
INFO:__main__:2024-11-04 22:44:16 | Epoch: 0 | Step: 100050 | Dataset: 0-33664 | Loss: 1.743 | 913 ms/step , 6886.48 GFLOP/s , 17921.2 tokens/s
INFO:__main__:2024-11-04 22:44:25 | Epoch: 0 | Step: 100060 | Dataset: 0-33984 | Loss: 1.525 | 913 ms/step , 6890.33 GFLOP/s , 17921.5 tokens/s
INFO:__main__:2024-11-04 22:44:34 | Epoch: 0 | Step: 100070 | Dataset: 0-34304 | Loss: 1.767 | 913 ms/step , 6887.40 GFLOP/s , 17924.2 tokens/s
INFO:__main__:2024-11-04 22:44:44 | Epoch: 0 | Step: 100080 | Dataset: 0-34624 | Loss: 1.669 | 914 ms/step , 6884.65 GFLOP/s , 17923.3 tokens/s
INFO:__main__:2024-11-04 22:44:53 | Epoch: 0 | Step: 100090 | Dataset: 0-34944 | Loss: 1.615 | 915 ms/step , 6875.05 GFLOP/s , 17915.9 tokens/s
INFO:__main__:2024-11-04 22:45:02 | Epoch: 0 | Step: 100100 | Dataset: 0-35264 | Loss: 1.647 | 913 ms/step , 6885.98 GFLOP/s , 17916.3 tokens/s
INFO:__main__:2024-11-04 22:45:03 | Validation | Step: 100100 | Val_loss: 2.495 | Best_val_loss: 2.6948
INFO:__main__:2024-11-04 22:45:03 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224503_step_100100.pt`
INFO:__main__:2024-11-04 22:45:14 | Epoch: 0 | Step: 100110 | Dataset: 0-35584 | Loss: 1.683 | 913 ms/step , 6888.44 GFLOP/s , 13772.3 tokens/s
INFO:__main__:2024-11-04 22:45:23 | Epoch: 0 | Step: 100120 | Dataset: 0-35904 | Loss: 1.635 | 914 ms/step , 6883.61 GFLOP/s , 17914.8 tokens/s
INFO:__main__:2024-11-04 22:45:32 | Epoch: 0 | Step: 100130 | Dataset: 0-36224 | Loss: 1.634 | 914 ms/step , 6878.93 GFLOP/s , 17919.1 tokens/s
INFO:__main__:2024-11-04 22:45:41 | Epoch: 0 | Step: 100140 | Dataset: 0-36544 | Loss: 1.586 | 915 ms/step , 6870.34 GFLOP/s , 17914.6 tokens/s
INFO:__main__:2024-11-04 22:45:50 | Epoch: 0 | Step: 100150 | Dataset: 0-36864 | Loss: 1.585 | 912 ms/step , 6893.21 GFLOP/s , 17924.2 tokens/s
INFO:__main__:2024-11-04 22:45:59 | Epoch: 0 | Step: 100160 | Dataset: 0-37184 | Loss: 1.517 | 913 ms/step , 6887.85 GFLOP/s , 17925.1 tokens/s
INFO:__main__:2024-11-04 22:46:09 | Epoch: 0 | Step: 100170 | Dataset: 0-37504 | Loss: 1.608 | 913 ms/step , 6888.53 GFLOP/s , 17922.4 tokens/s
INFO:__main__:2024-11-04 22:46:18 | Epoch: 0 | Step: 100180 | Dataset: 0-37824 | Loss: 1.619 | 914 ms/step , 6877.70 GFLOP/s , 17919.8 tokens/s
INFO:__main__:2024-11-04 22:46:27 | Epoch: 0 | Step: 100190 | Dataset: 0-38144 | Loss: 1.633 | 915 ms/step , 6876.84 GFLOP/s , 17918.2 tokens/s
INFO:__main__:2024-11-04 22:46:36 | Epoch: 0 | Step: 100200 | Dataset: 0-38464 | Loss: 1.650 | 914 ms/step , 6877.74 GFLOP/s , 17911.4 tokens/s
INFO:__main__:2024-11-04 22:46:38 | Validation | Step: 100200 | Val_loss: 2.200 | Best_val_loss: 2.4949
INFO:__main__:2024-11-04 22:46:38 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224638_step_100200.pt`
INFO:__main__:2024-11-04 22:46:48 | Epoch: 0 | Step: 100210 | Dataset: 0-38784 | Loss: 1.537 | 915 ms/step , 6872.73 GFLOP/s , 13826.0 tokens/s
INFO:__main__:2024-11-04 22:46:57 | Epoch: 0 | Step: 100220 | Dataset: 0-39104 | Loss: 1.621 | 916 ms/step , 6869.27 GFLOP/s , 17901.7 tokens/s
INFO:__main__:2024-11-04 22:47:06 | Epoch: 0 | Step: 100230 | Dataset: 0-39424 | Loss: 1.530 | 916 ms/step , 6869.04 GFLOP/s , 17911.7 tokens/s
INFO:__main__:2024-11-04 22:47:15 | Epoch: 0 | Step: 100240 | Dataset: 0-39744 | Loss: 1.641 | 914 ms/step , 6883.87 GFLOP/s , 17915.1 tokens/s
INFO:__main__:2024-11-04 22:47:24 | Epoch: 0 | Step: 100250 | Dataset: 0-40064 | Loss: 1.320 | 915 ms/step , 6874.84 GFLOP/s , 17919.7 tokens/s
INFO:__main__:2024-11-04 22:47:34 | Epoch: 0 | Step: 100260 | Dataset: 0-40384 | Loss: 1.492 | 914 ms/step , 6885.04 GFLOP/s , 17916.5 tokens/s
INFO:__main__:2024-11-04 22:47:43 | Epoch: 0 | Step: 100270 | Dataset: 0-40704 | Loss: 1.669 | 913 ms/step , 6890.99 GFLOP/s , 17922.9 tokens/s
INFO:__main__:2024-11-04 22:47:52 | Epoch: 0 | Step: 100280 | Dataset: 0-41024 | Loss: 1.485 | 916 ms/step , 6868.81 GFLOP/s , 17916.3 tokens/s
INFO:__main__:2024-11-04 22:48:01 | Epoch: 0 | Step: 100290 | Dataset: 0-41344 | Loss: 1.531 | 913 ms/step , 6885.42 GFLOP/s , 17922.2 tokens/s
INFO:__main__:2024-11-04 22:48:10 | Epoch: 0 | Step: 100300 | Dataset: 0-41664 | Loss: 1.483 | 913 ms/step , 6889.17 GFLOP/s , 17925.0 tokens/s
INFO:__main__:2024-11-04 22:48:12 | Validation | Step: 100300 | Val_loss: 1.797 | Best_val_loss: 2.2004
INFO:__main__:2024-11-04 22:48:12 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224812_step_100300.pt`
INFO:__main__:2024-11-04 22:48:22 | Epoch: 0 | Step: 100310 | Dataset: 0-41984 | Loss: 1.394 | 914 ms/step , 6879.66 GFLOP/s , 13802.6 tokens/s
INFO:__main__:2024-11-04 22:48:31 | Epoch: 0 | Step: 100320 | Dataset: 0-42304 | Loss: 1.485 | 913 ms/step , 6885.43 GFLOP/s , 17910.7 tokens/s
INFO:__main__:2024-11-04 22:48:40 | Epoch: 0 | Step: 100330 | Dataset: 0-42624 | Loss: 1.464 | 914 ms/step , 6881.61 GFLOP/s , 17924.8 tokens/s
INFO:__main__:2024-11-04 22:48:49 | Epoch: 0 | Step: 100340 | Dataset: 0-42944 | Loss: 1.460 | 913 ms/step , 6888.89 GFLOP/s , 17923.8 tokens/s
INFO:__main__:2024-11-04 22:48:59 | Epoch: 0 | Step: 100350 | Dataset: 0-43264 | Loss: 1.443 | 914 ms/step , 6882.81 GFLOP/s , 17924.5 tokens/s
INFO:__main__:2024-11-04 22:49:08 | Epoch: 0 | Step: 100360 | Dataset: 0-43584 | Loss: 1.607 | 912 ms/step , 6896.84 GFLOP/s , 17926.0 tokens/s
INFO:__main__:2024-11-04 22:49:17 | Epoch: 0 | Step: 100370 | Dataset: 0-43904 | Loss: 1.349 | 913 ms/step , 6889.45 GFLOP/s , 17925.1 tokens/s
INFO:__main__:2024-11-04 22:49:26 | Epoch: 0 | Step: 100380 | Dataset: 0-44224 | Loss: 1.524 | 914 ms/step , 6880.65 GFLOP/s , 17927.7 tokens/s
INFO:__main__:2024-11-04 22:49:35 | Epoch: 0 | Step: 100390 | Dataset: 0-44544 | Loss: 1.391 | 913 ms/step , 6888.51 GFLOP/s , 17924.7 tokens/s
INFO:__main__:2024-11-04 22:49:44 | Epoch: 0 | Step: 100400 | Dataset: 0-44864 | Loss: 1.362 | 914 ms/step , 6882.82 GFLOP/s , 17919.4 tokens/s
INFO:__main__:2024-11-04 22:49:46 | Validation | Step: 100400 | Val_loss: 1.621 | Best_val_loss: 1.7969
INFO:__main__:2024-11-04 22:49:46 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_224946_step_100400.pt`
INFO:__main__:2024-11-04 22:49:56 | Epoch: 0 | Step: 100410 | Dataset: 0-45184 | Loss: 1.396 | 913 ms/step , 6887.91 GFLOP/s , 13794.0 tokens/s
INFO:__main__:2024-11-04 22:50:05 | Epoch: 0 | Step: 100420 | Dataset: 0-45504 | Loss: 1.441 | 914 ms/step , 6880.13 GFLOP/s , 17922.2 tokens/s
INFO:__main__:2024-11-04 22:50:14 | Epoch: 0 | Step: 100430 | Dataset: 0-45824 | Loss: 1.412 | 915 ms/step , 6875.85 GFLOP/s , 17925.2 tokens/s
INFO:__main__:2024-11-04 22:50:24 | Epoch: 0 | Step: 100440 | Dataset: 0-46144 | Loss: 1.396 | 913 ms/step , 6886.09 GFLOP/s , 17908.6 tokens/s
INFO:__main__:2024-11-04 22:50:33 | Epoch: 0 | Step: 100450 | Dataset: 0-46464 | Loss: 1.398 | 913 ms/step , 6886.09 GFLOP/s , 17924.4 tokens/s
INFO:__main__:2024-11-04 22:50:42 | Epoch: 0 | Step: 100460 | Dataset: 0-46784 | Loss: 1.453 | 912 ms/step , 6896.82 GFLOP/s , 17924.6 tokens/s
INFO:__main__:2024-11-04 22:50:51 | Epoch: 0 | Step: 100470 | Dataset: 0-47104 | Loss: 1.240 | 915 ms/step , 6872.78 GFLOP/s , 17916.3 tokens/s
INFO:__main__:2024-11-04 22:51:00 | Epoch: 0 | Step: 100480 | Dataset: 0-47424 | Loss: 1.427 | 914 ms/step , 6877.72 GFLOP/s , 17929.3 tokens/s
INFO:__main__:2024-11-04 22:51:09 | Epoch: 0 | Step: 100490 | Dataset: 0-47744 | Loss: 1.391 | 914 ms/step , 6882.11 GFLOP/s , 17927.3 tokens/s
INFO:__main__:2024-11-04 22:51:18 | Epoch: 0 | Step: 100500 | Dataset: 0-48064 | Loss: 1.345 | 914 ms/step , 6884.40 GFLOP/s , 17929.3 tokens/s
INFO:__main__:2024-11-04 22:51:20 | Validation | Step: 100500 | Val_loss: 1.452 | Best_val_loss: 1.6212
INFO:__main__:2024-11-04 22:51:20 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_225120_step_100500.pt`
INFO:__main__:2024-11-04 22:51:30 | Epoch: 0 | Step: 100510 | Dataset: 0-48384 | Loss: 1.315 | 914 ms/step , 6884.86 GFLOP/s , 13817.7 tokens/s
INFO:__main__:2024-11-04 22:51:39 | Epoch: 0 | Step: 100520 | Dataset: 0-48704 | Loss: 1.271 | 913 ms/step , 6886.16 GFLOP/s , 17917.9 tokens/s
INFO:__main__:2024-11-04 22:51:49 | Epoch: 0 | Step: 100530 | Dataset: 0-49024 | Loss: 1.229 | 913 ms/step , 6889.05 GFLOP/s , 17928.3 tokens/s
INFO:__main__:2024-11-04 22:51:58 | Epoch: 0 | Step: 100540 | Dataset: 0-49344 | Loss: 1.282 | 914 ms/step , 6880.58 GFLOP/s , 17909.1 tokens/s
INFO:__main__:2024-11-04 22:52:07 | Epoch: 0 | Step: 100550 | Dataset: 0-49664 | Loss: 1.271 | 914 ms/step , 6883.88 GFLOP/s , 17920.5 tokens/s
INFO:__main__:2024-11-04 22:52:16 | Epoch: 0 | Step: 100560 | Dataset: 0-49984 | Loss: 1.226 | 914 ms/step , 6878.02 GFLOP/s , 17920.3 tokens/s
INFO:__main__:2024-11-04 22:52:25 | Epoch: 0 | Step: 100570 | Dataset: 0-50304 | Loss: 1.268 | 913 ms/step , 6889.93 GFLOP/s , 17923.5 tokens/s
INFO:__main__:2024-11-04 22:52:34 | Epoch: 0 | Step: 100580 | Dataset: 0-50624 | Loss: 1.259 | 913 ms/step , 6885.18 GFLOP/s , 17927.0 tokens/s
INFO:__main__:2024-11-04 22:52:43 | Epoch: 0 | Step: 100590 | Dataset: 0-50944 | Loss: 1.266 | 914 ms/step , 6882.79 GFLOP/s , 17918.3 tokens/s
INFO:__main__:2024-11-04 22:52:53 | Epoch: 0 | Step: 100600 | Dataset: 0-51264 | Loss: 1.217 | 916 ms/step , 6869.47 GFLOP/s , 17918.5 tokens/s
INFO:__main__:2024-11-04 22:52:54 | Validation | Step: 100600 | Val_loss: 1.408 | Best_val_loss: 1.4518
INFO:__main__:2024-11-04 22:52:54 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_225254_step_100600.pt`
INFO:__main__:2024-11-04 22:53:04 | Epoch: 0 | Step: 100610 | Dataset: 0-51584 | Loss: 1.324 | 914 ms/step , 6880.35 GFLOP/s , 13806.8 tokens/s
INFO:__main__:2024-11-04 22:53:14 | Epoch: 0 | Step: 100620 | Dataset: 0-51904 | Loss: 1.294 | 915 ms/step , 6876.91 GFLOP/s , 17918.8 tokens/s
INFO:__main__:2024-11-04 22:53:23 | Epoch: 0 | Step: 100630 | Dataset: 0-52224 | Loss: 1.196 | 913 ms/step , 6890.65 GFLOP/s , 17928.0 tokens/s
INFO:__main__:2024-11-04 22:53:32 | Epoch: 0 | Step: 100640 | Dataset: 0-52544 | Loss: 1.237 | 913 ms/step , 6887.42 GFLOP/s , 17906.8 tokens/s
INFO:__main__:2024-11-04 22:53:41 | Epoch: 0 | Step: 100650 | Dataset: 0-52864 | Loss: 1.223 | 914 ms/step , 6884.97 GFLOP/s , 17919.2 tokens/s
INFO:__main__:2024-11-04 22:53:50 | Epoch: 0 | Step: 100660 | Dataset: 0-53184 | Loss: 1.255 | 915 ms/step , 6877.31 GFLOP/s , 17915.3 tokens/s
INFO:__main__:2024-11-04 22:53:59 | Epoch: 0 | Step: 100670 | Dataset: 0-53504 | Loss: 1.190 | 914 ms/step , 6880.12 GFLOP/s , 17921.4 tokens/s
INFO:__main__:2024-11-04 22:54:08 | Epoch: 0 | Step: 100680 | Dataset: 0-53824 | Loss: 1.215 | 915 ms/step , 6877.28 GFLOP/s , 17921.6 tokens/s
INFO:__main__:2024-11-04 22:54:18 | Epoch: 0 | Step: 100690 | Dataset: 0-54144 | Loss: 1.110 | 915 ms/step , 6870.72 GFLOP/s , 17919.0 tokens/s
INFO:__main__:2024-11-04 22:54:27 | Epoch: 0 | Step: 100700 | Dataset: 0-54464 | Loss: 1.105 | 913 ms/step , 6885.97 GFLOP/s , 17921.4 tokens/s
INFO:__main__:2024-11-04 22:54:28 | Validation | Step: 100700 | Val_loss: 1.329 | Best_val_loss: 1.4081
INFO:__main__:2024-11-04 22:54:28 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_225428_step_100700.pt`
INFO:__main__:2024-11-04 22:54:39 | Epoch: 0 | Step: 100710 | Dataset: 0-54784 | Loss: 1.174 | 913 ms/step , 6890.79 GFLOP/s , 13807.0 tokens/s
INFO:__main__:2024-11-04 22:54:48 | Epoch: 0 | Step: 100720 | Dataset: 0-55104 | Loss: 1.179 | 913 ms/step , 6885.49 GFLOP/s , 17915.6 tokens/s
INFO:__main__:2024-11-04 22:54:57 | Epoch: 0 | Step: 100730 | Dataset: 0-55424 | Loss: 1.205 | 914 ms/step , 6882.50 GFLOP/s , 17920.0 tokens/s
INFO:__main__:2024-11-04 22:55:06 | Epoch: 0 | Step: 100740 | Dataset: 0-55744 | Loss: 1.129 | 913 ms/step , 6886.73 GFLOP/s , 17901.1 tokens/s
INFO:__main__:2024-11-04 22:55:15 | Epoch: 0 | Step: 100750 | Dataset: 0-56064 | Loss: 1.157 | 915 ms/step , 6877.34 GFLOP/s , 17919.9 tokens/s
INFO:__main__:2024-11-04 22:55:24 | Epoch: 0 | Step: 100760 | Dataset: 0-56384 | Loss: 1.184 | 914 ms/step , 6883.45 GFLOP/s , 17922.6 tokens/s
INFO:__main__:2024-11-04 22:55:33 | Epoch: 0 | Step: 100770 | Dataset: 0-56704 | Loss: 1.117 | 915 ms/step , 6877.03 GFLOP/s , 17921.1 tokens/s
INFO:__main__:2024-11-04 22:55:43 | Epoch: 0 | Step: 100780 | Dataset: 0-57024 | Loss: 1.215 | 915 ms/step , 6872.27 GFLOP/s , 17914.3 tokens/s
INFO:__main__:2024-11-04 22:55:52 | Epoch: 0 | Step: 100790 | Dataset: 0-57344 | Loss: 1.140 | 915 ms/step , 6875.87 GFLOP/s , 17915.5 tokens/s
INFO:__main__:2024-11-04 22:56:01 | Epoch: 0 | Step: 100800 | Dataset: 0-57664 | Loss: 1.113 | 914 ms/step , 6882.99 GFLOP/s , 17917.2 tokens/s
INFO:__main__:2024-11-04 22:56:02 | Validation | Step: 100800 | Val_loss: 1.221 | Best_val_loss: 1.3286
INFO:__main__:2024-11-04 22:56:02 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_225602_step_100800.pt`
INFO:__main__:2024-11-04 22:56:13 | Epoch: 0 | Step: 100810 | Dataset: 0-57984 | Loss: 1.136 | 914 ms/step , 6880.61 GFLOP/s , 13829.8 tokens/s
INFO:__main__:2024-11-04 22:56:22 | Epoch: 0 | Step: 100820 | Dataset: 0-58304 | Loss: 1.084 | 913 ms/step , 6886.12 GFLOP/s , 17922.9 tokens/s
INFO:__main__:2024-11-04 22:56:31 | Epoch: 0 | Step: 100830 | Dataset: 0-58624 | Loss: 1.124 | 913 ms/step , 6890.01 GFLOP/s , 17922.5 tokens/s
INFO:__main__:2024-11-04 22:56:40 | Epoch: 0 | Step: 100840 | Dataset: 0-58944 | Loss: 1.062 | 914 ms/step , 6879.51 GFLOP/s , 17911.5 tokens/s
INFO:__main__:2024-11-04 22:56:49 | Epoch: 0 | Step: 100850 | Dataset: 0-59264 | Loss: 1.101 | 914 ms/step , 6878.19 GFLOP/s , 17919.2 tokens/s
INFO:__main__:2024-11-04 22:56:58 | Epoch: 0 | Step: 100860 | Dataset: 0-59584 | Loss: 1.102 | 914 ms/step , 6879.62 GFLOP/s , 17924.5 tokens/s
INFO:__main__:2024-11-04 22:57:08 | Epoch: 0 | Step: 100870 | Dataset: 0-59904 | Loss: 1.114 | 913 ms/step , 6886.82 GFLOP/s , 17922.1 tokens/s
INFO:__main__:2024-11-04 22:57:17 | Epoch: 0 | Step: 100880 | Dataset: 0-60224 | Loss: 1.063 | 914 ms/step , 6882.36 GFLOP/s , 17920.4 tokens/s
INFO:__main__:2024-11-04 22:57:26 | Epoch: 0 | Step: 100890 | Dataset: 0-60544 | Loss: 1.055 | 913 ms/step , 6887.59 GFLOP/s , 17918.9 tokens/s
INFO:__main__:2024-11-04 22:57:35 | Epoch: 0 | Step: 100900 | Dataset: 0-60864 | Loss: 1.132 | 915 ms/step , 6876.76 GFLOP/s , 17918.3 tokens/s
INFO:__main__:2024-11-04 22:57:37 | Validation | Step: 100900 | Val_loss: 1.150 | Best_val_loss: 1.2214
INFO:__main__:2024-11-04 22:57:37 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_225737_step_100900.pt`
INFO:__main__:2024-11-04 22:57:47 | Epoch: 0 | Step: 100910 | Dataset: 0-61184 | Loss: 1.114 | 914 ms/step , 6882.50 GFLOP/s , 13839.4 tokens/s
INFO:__main__:2024-11-04 22:57:56 | Epoch: 0 | Step: 100920 | Dataset: 0-61504 | Loss: 1.178 | 913 ms/step , 6887.98 GFLOP/s , 17928.9 tokens/s
INFO:__main__:2024-11-04 22:58:05 | Epoch: 0 | Step: 100930 | Dataset: 0-61824 | Loss: 1.045 | 913 ms/step , 6886.18 GFLOP/s , 17920.8 tokens/s
INFO:__main__:2024-11-04 22:58:14 | Epoch: 0 | Step: 100940 | Dataset: 0-62144 | Loss: 0.994 | 914 ms/step , 6884.13 GFLOP/s , 17906.5 tokens/s
INFO:__main__:2024-11-04 22:58:23 | Epoch: 0 | Step: 100950 | Dataset: 0-62464 | Loss: 1.058 | 913 ms/step , 6885.18 GFLOP/s , 17919.2 tokens/s
INFO:__main__:2024-11-04 22:58:33 | Epoch: 0 | Step: 100960 | Dataset: 0-62784 | Loss: 1.073 | 914 ms/step , 6884.31 GFLOP/s , 17917.9 tokens/s
INFO:__main__:2024-11-04 22:58:42 | Epoch: 0 | Step: 100970 | Dataset: 0-63104 | Loss: 1.056 | 914 ms/step , 6883.61 GFLOP/s , 17919.7 tokens/s
INFO:__main__:2024-11-04 22:58:51 | Epoch: 0 | Step: 100980 | Dataset: 0-63424 | Loss: 1.078 | 914 ms/step , 6878.87 GFLOP/s , 17905.9 tokens/s
INFO:__main__:2024-11-04 22:59:00 | Epoch: 0 | Step: 100990 | Dataset: 0-63744 | Loss: 1.046 | 915 ms/step , 6876.27 GFLOP/s , 17917.0 tokens/s
INFO:__main__:2024-11-04 22:59:09 | Epoch: 0 | Step: 101000 | Dataset: 0-64064 | Loss: 1.016 | 914 ms/step , 6883.42 GFLOP/s , 17925.3 tokens/s
INFO:__main__:2024-11-04 22:59:11 | Validation | Step: 101000 | Val_loss: 1.094 | Best_val_loss: 1.1497
INFO:__main__:2024-11-04 22:59:11 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_225911_step_101000.pt`
INFO:__main__:2024-11-04 22:59:21 | Epoch: 0 | Step: 101010 | Dataset: 0-64384 | Loss: 1.101 | 916 ms/step , 6865.95 GFLOP/s , 13786.6 tokens/s
INFO:__main__:2024-11-04 22:59:30 | Epoch: 0 | Step: 101020 | Dataset: 0-64704 | Loss: 0.993 | 915 ms/step , 6872.85 GFLOP/s , 17886.3 tokens/s
INFO:__main__:2024-11-04 22:59:39 | Epoch: 0 | Step: 101030 | Dataset: 0-65024 | Loss: 1.085 | 915 ms/step , 6870.56 GFLOP/s , 17901.4 tokens/s
INFO:__main__:2024-11-04 22:59:49 | Epoch: 0 | Step: 101040 | Dataset: 0-65344 | Loss: 0.996 | 917 ms/step , 6861.77 GFLOP/s , 17886.2 tokens/s
INFO:__main__:2024-11-04 22:59:58 | Epoch: 0 | Step: 101050 | Dataset: 0-65664 | Loss: 1.012 | 915 ms/step , 6871.43 GFLOP/s , 17901.7 tokens/s
INFO:__main__:2024-11-04 23:00:07 | Epoch: 0 | Step: 101060 | Dataset: 0-65984 | Loss: 1.021 | 916 ms/step , 6869.96 GFLOP/s , 17904.8 tokens/s
INFO:__main__:2024-11-04 23:00:16 | Epoch: 0 | Step: 101070 | Dataset: 0-66304 | Loss: 1.034 | 915 ms/step , 6874.01 GFLOP/s , 17903.6 tokens/s
INFO:__main__:2024-11-04 23:00:25 | Epoch: 0 | Step: 101080 | Dataset: 0-66624 | Loss: 1.025 | 914 ms/step , 6878.58 GFLOP/s , 17910.1 tokens/s
INFO:__main__:2024-11-04 23:00:34 | Epoch: 0 | Step: 101090 | Dataset: 0-66944 | Loss: 0.974 | 913 ms/step , 6885.87 GFLOP/s , 17899.2 tokens/s
INFO:__main__:2024-11-04 23:00:43 | Epoch: 0 | Step: 101100 | Dataset: 0-67264 | Loss: 0.981 | 915 ms/step , 6872.77 GFLOP/s , 17859.0 tokens/s
INFO:__main__:2024-11-04 23:00:45 | Validation | Step: 101100 | Val_loss: 1.082 | Best_val_loss: 1.0945
INFO:__main__:2024-11-04 23:00:45 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_230045_step_101100.pt`
INFO:__main__:2024-11-04 23:00:55 | Epoch: 0 | Step: 101110 | Dataset: 0-67584 | Loss: 0.945 | 914 ms/step , 6883.20 GFLOP/s , 13750.9 tokens/s
INFO:__main__:2024-11-04 23:01:05 | Epoch: 0 | Step: 101120 | Dataset: 0-67904 | Loss: 0.969 | 915 ms/step , 6873.52 GFLOP/s , 17896.6 tokens/s
INFO:__main__:2024-11-04 23:01:14 | Epoch: 0 | Step: 101130 | Dataset: 0-68224 | Loss: 1.032 | 916 ms/step , 6868.32 GFLOP/s , 17903.7 tokens/s
INFO:__main__:2024-11-04 23:01:23 | Epoch: 0 | Step: 101140 | Dataset: 0-68544 | Loss: 1.014 | 914 ms/step , 6878.98 GFLOP/s , 17891.9 tokens/s
INFO:__main__:2024-11-04 23:01:32 | Epoch: 0 | Step: 101150 | Dataset: 0-68864 | Loss: 0.959 | 914 ms/step , 6878.13 GFLOP/s , 17897.6 tokens/s
INFO:__main__:2024-11-04 23:01:41 | Epoch: 0 | Step: 101160 | Dataset: 0-69184 | Loss: 1.011 | 915 ms/step , 6871.84 GFLOP/s , 17896.6 tokens/s
INFO:__main__:2024-11-04 23:01:50 | Epoch: 0 | Step: 101170 | Dataset: 0-69504 | Loss: 0.951 | 915 ms/step , 6875.15 GFLOP/s , 17894.7 tokens/s
INFO:__main__:2024-11-04 23:01:59 | Epoch: 0 | Step: 101180 | Dataset: 0-69824 | Loss: 1.049 | 915 ms/step , 6872.27 GFLOP/s , 17888.7 tokens/s
INFO:__main__:2024-11-04 23:02:09 | Epoch: 0 | Step: 101190 | Dataset: 0-70144 | Loss: 0.969 | 914 ms/step , 6879.92 GFLOP/s , 17899.0 tokens/s
INFO:__main__:2024-11-04 23:02:18 | Epoch: 0 | Step: 101200 | Dataset: 0-70464 | Loss: 0.935 | 915 ms/step , 6873.60 GFLOP/s , 17893.3 tokens/s
INFO:__main__:2024-11-04 23:02:19 | Validation | Step: 101200 | Val_loss: 1.032 | Best_val_loss: 1.0817
INFO:__main__:2024-11-04 23:02:19 | Saving full-param
checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_230219_step_101200.pt` INFO:__main__:2024-11-04 23:02:30 | Epoch: 0 | Step: 101210 | Dataset: 0-70784 | Loss: 0.900 | 915 ms/step , 6874.44 GFLOP/s , 13790.3 tokens/s INFO:__main__:2024-11-04 23:02:39 | Epoch: 0 | Step: 101220 | Dataset: 0-71104 | Loss: 0.995 | 914 ms/step , 6882.39 GFLOP/s , 17887.1 tokens/s INFO:__main__:2024-11-04 23:02:48 | Epoch: 0 | Step: 101230 | Dataset: 0-71424 | Loss: 0.910 | 916 ms/step , 6869.65 GFLOP/s , 17892.3 tokens/s INFO:__main__:2024-11-04 23:02:57 | Epoch: 0 | Step: 101240 | Dataset: 0-71744 | Loss: 0.868 | 916 ms/step , 6869.45 GFLOP/s , 17895.4 tokens/s INFO:__main__:2024-11-04 23:03:06 | Epoch: 0 | Step: 101250 | Dataset: 0-72064 | Loss: 0.926 | 915 ms/step , 6872.70 GFLOP/s , 17900.0 tokens/s INFO:__main__:2024-11-04 23:03:15 | Epoch: 0 | Step: 101260 | Dataset: 0-72384 | Loss: 1.018 | 914 ms/step , 6883.73 GFLOP/s , 17899.5 tokens/s INFO:__main__:2024-11-04 23:03:25 | Epoch: 0 | Step: 101270 | Dataset: 0-72704 | Loss: 0.923 | 915 ms/step , 6870.36 GFLOP/s , 17893.2 tokens/s INFO:__main__:2024-11-04 23:03:34 | Epoch: 0 | Step: 101280 | Dataset: 0-73024 | Loss: 0.965 | 916 ms/step , 6866.94 GFLOP/s , 17890.3 tokens/s INFO:__main__:2024-11-04 23:03:43 | Epoch: 0 | Step: 101290 | Dataset: 0-73344 | Loss: 0.845 | 917 ms/step , 6862.11 GFLOP/s , 17898.2 tokens/s INFO:__main__:2024-11-04 23:03:52 | Epoch: 0 | Step: 101300 | Dataset: 0-73664 | Loss: 1.041 | 914 ms/step , 6878.50 GFLOP/s , 17891.1 tokens/s INFO:__main__:2024-11-04 23:03:54 | Validation | Step: 101300 | Val_loss: 0.996 | Best_val_loss: 1.0319 INFO:__main__:2024-11-04 23:03:54 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_230354_step_101300.pt` INFO:__main__:2024-11-04 23:04:04 | Epoch: 0 | Step: 101310 | Dataset: 0-73984 | Loss: 1.034 | 915 ms/step , 6874.34 GFLOP/s , 13768.2 tokens/s INFO:__main__:2024-11-04 23:04:13 | Epoch: 0 | Step: 101320 | Dataset: 
0-74304 | Loss: 1.010 | 915 ms/step , 6875.11 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-04 23:04:22 | Epoch: 0 | Step: 101330 | Dataset: 0-74624 | Loss: 0.976 | 914 ms/step , 6878.86 GFLOP/s , 17901.6 tokens/s INFO:__main__:2024-11-04 23:04:31 | Epoch: 0 | Step: 101340 | Dataset: 0-74944 | Loss: 0.963 | 914 ms/step , 6878.61 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-04 23:04:41 | Epoch: 0 | Step: 101350 | Dataset: 0-75264 | Loss: 1.059 | 914 ms/step , 6880.58 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-04 23:04:50 | Epoch: 0 | Step: 101360 | Dataset: 0-75584 | Loss: 0.995 | 914 ms/step , 6880.95 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-04 23:04:59 | Epoch: 0 | Step: 101370 | Dataset: 0-75904 | Loss: 1.033 | 915 ms/step , 6870.41 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-04 23:05:08 | Epoch: 0 | Step: 101380 | Dataset: 0-76224 | Loss: 1.086 | 915 ms/step , 6874.53 GFLOP/s , 17907.6 tokens/s INFO:__main__:2024-11-04 23:05:17 | Epoch: 0 | Step: 101390 | Dataset: 0-76544 | Loss: 0.924 | 915 ms/step , 6877.39 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-04 23:05:26 | Epoch: 0 | Step: 101400 | Dataset: 0-76864 | Loss: 1.063 | 913 ms/step , 6886.25 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-04 23:05:28 | Validation | Step: 101400 | Val_loss: 0.963 | Best_val_loss: 0.9961 INFO:__main__:2024-11-04 23:05:28 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_230528_step_101400.pt` INFO:__main__:2024-11-04 23:05:38 | Epoch: 0 | Step: 101410 | Dataset: 0-77184 | Loss: 0.970 | 914 ms/step , 6880.59 GFLOP/s , 13788.1 tokens/s INFO:__main__:2024-11-04 23:05:47 | Epoch: 0 | Step: 101420 | Dataset: 0-77504 | Loss: 0.911 | 916 ms/step , 6868.68 GFLOP/s , 17884.1 tokens/s INFO:__main__:2024-11-04 23:05:56 | Epoch: 0 | Step: 101430 | Dataset: 0-77824 | Loss: 0.848 | 915 ms/step , 6871.76 GFLOP/s , 17878.6 tokens/s INFO:__main__:2024-11-04 23:06:06 | Epoch: 0 | Step: 101440 | Dataset: 0-78144 | Loss: 0.982 | 
916 ms/step , 6868.02 GFLOP/s , 17899.6 tokens/s INFO:__main__:2024-11-04 23:06:15 | Epoch: 0 | Step: 101450 | Dataset: 0-78464 | Loss: 0.974 | 914 ms/step , 6879.35 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-04 23:06:24 | Epoch: 0 | Step: 101460 | Dataset: 0-78784 | Loss: 0.948 | 914 ms/step , 6882.05 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-04 23:06:33 | Epoch: 0 | Step: 101470 | Dataset: 0-79104 | Loss: 0.903 | 914 ms/step , 6879.68 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-04 23:06:42 | Epoch: 0 | Step: 101480 | Dataset: 0-79424 | Loss: 1.093 | 916 ms/step , 6867.97 GFLOP/s , 17907.9 tokens/s INFO:__main__:2024-11-04 23:06:51 | Epoch: 0 | Step: 101490 | Dataset: 0-79744 | Loss: 1.027 | 915 ms/step , 6873.44 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-04 23:07:01 | Epoch: 0 | Step: 101500 | Dataset: 0-80064 | Loss: 1.006 | 914 ms/step , 6879.96 GFLOP/s , 17901.2 tokens/s INFO:__main__:2024-11-04 23:07:02 | Validation | Step: 101500 | Val_loss: 0.948 | Best_val_loss: 0.9625 INFO:__main__:2024-11-04 23:07:02 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_230702_step_101500.pt` INFO:__main__:2024-11-04 23:07:12 | Epoch: 0 | Step: 101510 | Dataset: 0-80384 | Loss: 0.915 | 914 ms/step , 6877.63 GFLOP/s , 13756.1 tokens/s INFO:__main__:2024-11-04 23:07:22 | Epoch: 0 | Step: 101520 | Dataset: 0-80704 | Loss: 1.068 | 916 ms/step , 6865.22 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-04 23:07:31 | Epoch: 0 | Step: 101530 | Dataset: 0-81024 | Loss: 1.029 | 914 ms/step , 6877.74 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-04 23:07:40 | Epoch: 0 | Step: 101540 | Dataset: 0-81344 | Loss: 0.992 | 916 ms/step , 6869.67 GFLOP/s , 17895.4 tokens/s INFO:__main__:2024-11-04 23:07:49 | Epoch: 0 | Step: 101550 | Dataset: 0-81664 | Loss: 0.922 | 914 ms/step , 6884.56 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-04 23:07:58 | Epoch: 0 | Step: 101560 | Dataset: 0-81984 | Loss: 0.978 | 914 ms/step , 6880.03 
GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-04 23:08:07 | Epoch: 0 | Step: 101570 | Dataset: 0-82304 | Loss: 0.955 | 914 ms/step , 6878.40 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-04 23:08:16 | Epoch: 0 | Step: 101580 | Dataset: 0-82624 | Loss: 1.026 | 917 ms/step , 6856.60 GFLOP/s , 17900.7 tokens/s INFO:__main__:2024-11-04 23:08:26 | Epoch: 0 | Step: 101590 | Dataset: 0-82944 | Loss: 0.788 | 914 ms/step , 6880.70 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-04 23:08:35 | Epoch: 0 | Step: 101600 | Dataset: 0-83264 | Loss: 0.998 | 916 ms/step , 6867.02 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-04 23:08:36 | Validation | Step: 101600 | Val_loss: 0.939 | Best_val_loss: 0.9479 INFO:__main__:2024-11-04 23:08:36 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_230836_step_101600.pt` INFO:__main__:2024-11-04 23:08:47 | Epoch: 0 | Step: 101610 | Dataset: 0-83584 | Loss: 1.022 | 912 ms/step , 6892.87 GFLOP/s , 13797.0 tokens/s INFO:__main__:2024-11-04 23:08:56 | Epoch: 0 | Step: 101620 | Dataset: 0-83904 | Loss: 1.037 | 914 ms/step , 6880.22 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-04 23:09:05 | Epoch: 0 | Step: 101630 | Dataset: 0-84224 | Loss: 0.975 | 914 ms/step , 6877.54 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-04 23:09:14 | Epoch: 0 | Step: 101640 | Dataset: 0-84544 | Loss: 0.802 | 914 ms/step , 6880.33 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-04 23:09:23 | Epoch: 0 | Step: 101650 | Dataset: 0-84864 | Loss: 0.958 | 916 ms/step , 6867.28 GFLOP/s , 17904.9 tokens/s INFO:__main__:2024-11-04 23:09:32 | Epoch: 0 | Step: 101660 | Dataset: 0-85184 | Loss: 0.972 | 916 ms/step , 6866.22 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-04 23:09:42 | Epoch: 0 | Step: 101670 | Dataset: 0-85504 | Loss: 0.762 | 915 ms/step , 6874.52 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-04 23:09:51 | Epoch: 0 | Step: 101680 | Dataset: 0-85824 | Loss: 0.886 | 915 ms/step , 6874.13 GFLOP/s , 17918.9 
tokens/s INFO:__main__:2024-11-04 23:10:00 | Epoch: 0 | Step: 101690 | Dataset: 0-86144 | Loss: 0.951 | 915 ms/step , 6874.22 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-04 23:10:09 | Epoch: 0 | Step: 101700 | Dataset: 0-86464 | Loss: 0.952 | 914 ms/step , 6883.80 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-04 23:10:11 | Validation | Step: 101700 | Val_loss: 0.911 | Best_val_loss: 0.9391 INFO:__main__:2024-11-04 23:10:11 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_231011_step_101700.pt` INFO:__main__:2024-11-04 23:10:21 | Epoch: 0 | Step: 101710 | Dataset: 0-86784 | Loss: 0.927 | 915 ms/step , 6874.84 GFLOP/s , 13699.6 tokens/s INFO:__main__:2024-11-04 23:10:30 | Epoch: 0 | Step: 101720 | Dataset: 0-87104 | Loss: 0.830 | 915 ms/step , 6876.14 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-04 23:10:39 | Epoch: 0 | Step: 101730 | Dataset: 0-87424 | Loss: 0.934 | 916 ms/step , 6867.46 GFLOP/s , 17895.0 tokens/s INFO:__main__:2024-11-04 23:10:48 | Epoch: 0 | Step: 101740 | Dataset: 0-87744 | Loss: 0.923 | 915 ms/step , 6874.80 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-04 23:10:58 | Epoch: 0 | Step: 101750 | Dataset: 0-88064 | Loss: 0.639 | 914 ms/step , 6879.72 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-04 23:11:07 | Epoch: 0 | Step: 101760 | Dataset: 0-88384 | Loss: 0.882 | 914 ms/step , 6881.15 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-04 23:11:16 | Epoch: 0 | Step: 101770 | Dataset: 0-88704 | Loss: 0.793 | 912 ms/step , 6894.90 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-04 23:11:25 | Epoch: 0 | Step: 101780 | Dataset: 0-89024 | Loss: 0.880 | 913 ms/step , 6887.33 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-04 23:11:34 | Epoch: 0 | Step: 101790 | Dataset: 0-89344 | Loss: 0.758 | 914 ms/step , 6882.41 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-04 23:11:43 | Epoch: 0 | Step: 101800 | Dataset: 0-89664 | Loss: 0.911 | 912 ms/step , 6896.39 GFLOP/s , 17916.1 tokens/s 
INFO:__main__:2024-11-04 23:11:45 | Validation | Step: 101800 | Val_loss: 0.885 | Best_val_loss: 0.9114 INFO:__main__:2024-11-04 23:11:45 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_231145_step_101800.pt` INFO:__main__:2024-11-04 23:11:55 | Epoch: 0 | Step: 101810 | Dataset: 0-89984 | Loss: 0.840 | 914 ms/step , 6880.24 GFLOP/s , 13774.3 tokens/s INFO:__main__:2024-11-04 23:12:04 | Epoch: 0 | Step: 101820 | Dataset: 0-90304 | Loss: 0.904 | 916 ms/step , 6869.01 GFLOP/s , 17891.9 tokens/s INFO:__main__:2024-11-04 23:12:13 | Epoch: 0 | Step: 101830 | Dataset: 0-90624 | Loss: 0.896 | 914 ms/step , 6883.52 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-04 23:12:23 | Epoch: 0 | Step: 101840 | Dataset: 0-90944 | Loss: 0.890 | 913 ms/step , 6889.12 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-04 23:12:32 | Epoch: 0 | Step: 101850 | Dataset: 0-91264 | Loss: 0.879 | 915 ms/step , 6873.39 GFLOP/s , 17910.4 tokens/s INFO:__main__:2024-11-04 23:12:41 | Epoch: 0 | Step: 101860 | Dataset: 0-91584 | Loss: 0.970 | 913 ms/step , 6887.21 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-04 23:12:50 | Epoch: 0 | Step: 101870 | Dataset: 0-91904 | Loss: 0.925 | 914 ms/step , 6879.93 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-04 23:12:59 | Epoch: 0 | Step: 101880 | Dataset: 0-92224 | Loss: 0.871 | 914 ms/step , 6880.78 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-04 23:13:08 | Epoch: 0 | Step: 101890 | Dataset: 0-92544 | Loss: 0.876 | 912 ms/step , 6897.30 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-04 23:13:17 | Epoch: 0 | Step: 101900 | Dataset: 0-92864 | Loss: 0.971 | 914 ms/step , 6884.41 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-04 23:13:19 | Validation | Step: 101900 | Val_loss: 0.912 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:13:28 | Epoch: 0 | Step: 101910 | Dataset: 0-93184 | Loss: 0.989 | 915 ms/step , 6873.98 GFLOP/s , 15257.7 tokens/s INFO:__main__:2024-11-04 23:13:37 | Epoch: 0 | Step: 101920 | 
Dataset: 0-93504 | Loss: 1.076 | 913 ms/step , 6885.09 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-04 23:13:46 | Epoch: 0 | Step: 101930 | Dataset: 0-93824 | Loss: 0.823 | 914 ms/step , 6878.88 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-04 23:13:56 | Epoch: 0 | Step: 101940 | Dataset: 0-94144 | Loss: 1.010 | 914 ms/step , 6878.30 GFLOP/s , 17912.4 tokens/s INFO:__main__:2024-11-04 23:14:05 | Epoch: 0 | Step: 101950 | Dataset: 0-94464 | Loss: 0.714 | 915 ms/step , 6874.34 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-04 23:14:14 | Epoch: 0 | Step: 101960 | Dataset: 0-94784 | Loss: 1.007 | 914 ms/step , 6878.53 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-04 23:14:23 | Epoch: 0 | Step: 101970 | Dataset: 0-95104 | Loss: 1.000 | 914 ms/step , 6879.93 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-04 23:14:32 | Epoch: 0 | Step: 101980 | Dataset: 0-95424 | Loss: 0.966 | 914 ms/step , 6879.73 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-04 23:14:41 | Epoch: 0 | Step: 101990 | Dataset: 0-95744 | Loss: 0.872 | 913 ms/step , 6886.07 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-04 23:14:51 | Epoch: 0 | Step: 102000 | Dataset: 0-96064 | Loss: 0.916 | 914 ms/step , 6882.40 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-04 23:14:52 | Validation | Step: 102000 | Val_loss: 0.962 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:14:52 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_231452_step_102000.pt` INFO:__main__:2024-11-04 23:15:02 | Epoch: 0 | Step: 102010 | Dataset: 0-96384 | Loss: 0.845 | 920 ms/step , 6836.46 GFLOP/s , 13804.1 tokens/s INFO:__main__:2024-11-04 23:15:12 | Epoch: 0 | Step: 102020 | Dataset: 0-96704 | Loss: 0.946 | 914 ms/step , 6879.80 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-04 23:15:21 | Epoch: 0 | Step: 102030 | Dataset: 0-97024 | Loss: 0.928 | 914 ms/step , 6883.31 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-04 23:15:30 | Epoch: 0 | Step: 102040 | Dataset: 0-97344 | 
Loss: 0.864 | 913 ms/step , 6885.72 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-04 23:15:39 | Epoch: 0 | Step: 102050 | Dataset: 0-97664 | Loss: 0.971 | 914 ms/step , 6880.59 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-04 23:15:48 | Epoch: 0 | Step: 102060 | Dataset: 0-97984 | Loss: 0.873 | 913 ms/step , 6891.71 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-04 23:15:57 | Epoch: 0 | Step: 102070 | Dataset: 0-98304 | Loss: 0.846 | 912 ms/step , 6895.42 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-04 23:16:06 | Epoch: 0 | Step: 102080 | Dataset: 0-98624 | Loss: 0.828 | 914 ms/step , 6880.49 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-04 23:16:16 | Epoch: 0 | Step: 102090 | Dataset: 0-98944 | Loss: 0.809 | 913 ms/step , 6887.75 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-04 23:16:25 | Epoch: 0 | Step: 102100 | Dataset: 0-99264 | Loss: 0.928 | 915 ms/step , 6874.87 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-04 23:16:26 | Validation | Step: 102100 | Val_loss: 0.953 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:16:35 | Epoch: 0 | Step: 102110 | Dataset: 0-99584 | Loss: 0.807 | 914 ms/step , 6884.61 GFLOP/s , 15270.9 tokens/s INFO:__main__:2024-11-04 23:16:45 | Epoch: 0 | Step: 102120 | Dataset: 0-99904 | Loss: 0.834 | 913 ms/step , 6891.59 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-04 23:16:54 | Epoch: 0 | Step: 102130 | Dataset: 0-100224 | Loss: 0.975 | 914 ms/step , 6881.24 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-04 23:17:03 | Epoch: 0 | Step: 102140 | Dataset: 0-100544 | Loss: 0.950 | 914 ms/step , 6884.11 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-04 23:17:12 | Epoch: 0 | Step: 102150 | Dataset: 0-100864 | Loss: 0.823 | 913 ms/step , 6886.46 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-04 23:17:21 | Epoch: 0 | Step: 102160 | Dataset: 0-101184 | Loss: 0.899 | 914 ms/step , 6882.89 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-04 23:17:30 | Epoch: 0 | Step: 102170 | Dataset: 0-101504 | Loss: 0.827 | 914 
ms/step , 6881.25 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-04 23:17:39 | Epoch: 0 | Step: 102180 | Dataset: 0-101824 | Loss: 0.848 | 914 ms/step , 6884.03 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-04 23:17:49 | Epoch: 0 | Step: 102190 | Dataset: 0-102144 | Loss: 0.771 | 914 ms/step , 6884.08 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-04 23:17:58 | Epoch: 0 | Step: 102200 | Dataset: 0-102464 | Loss: 0.812 | 915 ms/step , 6875.17 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-04 23:17:59 | Validation | Step: 102200 | Val_loss: 0.995 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:18:08 | Epoch: 0 | Step: 102210 | Dataset: 0-102784 | Loss: 0.796 | 914 ms/step , 6881.97 GFLOP/s , 15276.1 tokens/s INFO:__main__:2024-11-04 23:18:18 | Epoch: 0 | Step: 102220 | Dataset: 0-103104 | Loss: 0.895 | 913 ms/step , 6890.99 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-04 23:18:27 | Epoch: 0 | Step: 102230 | Dataset: 0-103424 | Loss: 0.897 | 913 ms/step , 6888.36 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-04 23:18:36 | Epoch: 0 | Step: 102240 | Dataset: 0-103744 | Loss: 0.988 | 914 ms/step , 6882.83 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-04 23:18:45 | Epoch: 0 | Step: 102250 | Dataset: 0-104064 | Loss: 0.862 | 914 ms/step , 6884.70 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-04 23:18:54 | Epoch: 0 | Step: 102260 | Dataset: 0-104384 | Loss: 0.934 | 913 ms/step , 6891.65 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-04 23:19:03 | Epoch: 0 | Step: 102270 | Dataset: 0-104704 | Loss: 0.932 | 913 ms/step , 6885.22 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-04 23:19:12 | Epoch: 0 | Step: 102280 | Dataset: 0-105024 | Loss: 0.941 | 914 ms/step , 6883.41 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-04 23:19:21 | Epoch: 0 | Step: 102290 | Dataset: 0-105344 | Loss: 0.845 | 914 ms/step , 6882.77 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-04 23:19:31 | Epoch: 0 | Step: 102300 | Dataset: 0-105664 | Loss: 0.947 | 913 ms/step , 
6890.30 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-04 23:19:32 | Validation | Step: 102300 | Val_loss: 0.963 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:19:41 | Epoch: 0 | Step: 102310 | Dataset: 0-105984 | Loss: 0.998 | 913 ms/step , 6889.70 GFLOP/s , 15267.0 tokens/s INFO:__main__:2024-11-04 23:19:51 | Epoch: 0 | Step: 102320 | Dataset: 0-106304 | Loss: 0.895 | 913 ms/step , 6890.51 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-04 23:20:00 | Epoch: 0 | Step: 102330 | Dataset: 0-106624 | Loss: 0.878 | 913 ms/step , 6891.81 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-04 23:20:09 | Epoch: 0 | Step: 102340 | Dataset: 0-106944 | Loss: 0.911 | 913 ms/step , 6888.39 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-04 23:20:18 | Epoch: 0 | Step: 102350 | Dataset: 0-107264 | Loss: 0.874 | 914 ms/step , 6883.53 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-04 23:20:27 | Epoch: 0 | Step: 102360 | Dataset: 0-107584 | Loss: 1.020 | 913 ms/step , 6887.47 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-04 23:20:36 | Epoch: 0 | Step: 102370 | Dataset: 0-107904 | Loss: 0.974 | 913 ms/step , 6885.06 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-04 23:20:45 | Epoch: 0 | Step: 102380 | Dataset: 0-108224 | Loss: 0.815 | 912 ms/step , 6893.58 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-04 23:20:54 | Epoch: 0 | Step: 102390 | Dataset: 0-108544 | Loss: 0.958 | 913 ms/step , 6887.66 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-04 23:21:04 | Epoch: 0 | Step: 102400 | Dataset: 0-108864 | Loss: 0.821 | 912 ms/step , 6894.39 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-04 23:21:05 | Validation | Step: 102400 | Val_loss: 0.956 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:21:14 | Epoch: 0 | Step: 102410 | Dataset: 0-109184 | Loss: 0.944 | 912 ms/step , 6896.16 GFLOP/s , 15265.5 tokens/s INFO:__main__:2024-11-04 23:21:23 | Epoch: 0 | Step: 102420 | Dataset: 0-109504 | Loss: 0.918 | 914 ms/step , 6883.69 GFLOP/s , 17929.1 tokens/s 
INFO:__main__:2024-11-04 23:21:33 | Epoch: 0 | Step: 102430 | Dataset: 0-109824 | Loss: 0.816 | 915 ms/step , 6871.50 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-04 23:21:42 | Epoch: 0 | Step: 102440 | Dataset: 0-110144 | Loss: 0.960 | 913 ms/step , 6890.24 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-04 23:21:51 | Epoch: 0 | Step: 102450 | Dataset: 0-110464 | Loss: 0.828 | 914 ms/step , 6881.62 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-04 23:22:00 | Epoch: 0 | Step: 102460 | Dataset: 0-110784 | Loss: 0.882 | 914 ms/step , 6881.61 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-04 23:22:09 | Epoch: 0 | Step: 102470 | Dataset: 0-111104 | Loss: 0.801 | 913 ms/step , 6889.25 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-04 23:22:18 | Epoch: 0 | Step: 102480 | Dataset: 0-111424 | Loss: 0.784 | 913 ms/step , 6887.45 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-04 23:22:27 | Epoch: 0 | Step: 102490 | Dataset: 0-111744 | Loss: 0.927 | 913 ms/step , 6891.22 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-04 23:22:37 | Epoch: 0 | Step: 102500 | Dataset: 0-112064 | Loss: 0.889 | 913 ms/step , 6891.54 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-04 23:22:38 | Validation | Step: 102500 | Val_loss: 0.909 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:22:47 | Epoch: 0 | Step: 102510 | Dataset: 0-112384 | Loss: 0.751 | 913 ms/step , 6890.82 GFLOP/s , 15278.1 tokens/s INFO:__main__:2024-11-04 23:22:56 | Epoch: 0 | Step: 102520 | Dataset: 0-112704 | Loss: 0.992 | 913 ms/step , 6889.41 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-04 23:23:06 | Epoch: 0 | Step: 102530 | Dataset: 0-113024 | Loss: 0.859 | 913 ms/step , 6887.81 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-04 23:23:15 | Epoch: 0 | Step: 102540 | Dataset: 0-113344 | Loss: 0.813 | 913 ms/step , 6887.53 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-04 23:23:24 | Epoch: 0 | Step: 102550 | Dataset: 0-113664 | Loss: 0.914 | 913 ms/step , 6890.92 GFLOP/s , 17928.8 tokens/s 
INFO:__main__:2024-11-04 23:23:33 | Epoch: 0 | Step: 102560 | Dataset: 0-113984 | Loss: 0.886 | 914 ms/step , 6884.83 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-04 23:23:42 | Epoch: 0 | Step: 102570 | Dataset: 0-114304 | Loss: 0.838 | 913 ms/step , 6888.80 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-04 23:23:51 | Epoch: 0 | Step: 102580 | Dataset: 0-114624 | Loss: 0.895 | 914 ms/step , 6883.97 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-04 23:24:00 | Epoch: 0 | Step: 102590 | Dataset: 0-114944 | Loss: 0.964 | 915 ms/step , 6875.01 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-04 23:24:10 | Epoch: 0 | Step: 102600 | Dataset: 0-115264 | Loss: 0.872 | 914 ms/step , 6879.89 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-04 23:24:11 | Validation | Step: 102600 | Val_loss: 0.961 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:24:20 | Epoch: 0 | Step: 102610 | Dataset: 0-115584 | Loss: 0.917 | 913 ms/step , 6890.51 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-04 23:24:29 | Epoch: 0 | Step: 102620 | Dataset: 0-115904 | Loss: 0.872 | 916 ms/step , 6868.98 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-04 23:24:39 | Epoch: 0 | Step: 102630 | Dataset: 0-116224 | Loss: 0.914 | 914 ms/step , 6882.41 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-04 23:24:48 | Epoch: 0 | Step: 102640 | Dataset: 0-116544 | Loss: 0.899 | 914 ms/step , 6882.82 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-04 23:24:57 | Epoch: 0 | Step: 102650 | Dataset: 0-116864 | Loss: 0.847 | 914 ms/step , 6882.36 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-04 23:25:06 | Epoch: 0 | Step: 102660 | Dataset: 0-117184 | Loss: 0.912 | 913 ms/step , 6885.62 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-04 23:25:15 | Epoch: 0 | Step: 102670 | Dataset: 0-117504 | Loss: 0.777 | 913 ms/step , 6885.77 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-04 23:25:24 | Epoch: 0 | Step: 102680 | Dataset: 0-117824 | Loss: 0.936 | 914 ms/step , 6882.18 GFLOP/s , 17929.4 tokens/s 
INFO:__main__:2024-11-04 23:25:33 | Epoch: 0 | Step: 102690 | Dataset: 0-118144 | Loss: 0.916 | 915 ms/step , 6875.06 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-04 23:25:43 | Epoch: 0 | Step: 102700 | Dataset: 0-118464 | Loss: 0.937 | 913 ms/step , 6891.65 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-04 23:25:44 | Validation | Step: 102700 | Val_loss: 1.022 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:25:53 | Epoch: 0 | Step: 102710 | Dataset: 0-118784 | Loss: 0.884 | 914 ms/step , 6882.80 GFLOP/s , 15274.4 tokens/s INFO:__main__:2024-11-04 23:26:02 | Epoch: 0 | Step: 102720 | Dataset: 0-119104 | Loss: 0.790 | 912 ms/step , 6894.12 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-04 23:26:12 | Epoch: 0 | Step: 102730 | Dataset: 0-119424 | Loss: 0.909 | 913 ms/step , 6887.00 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-04 23:26:21 | Epoch: 0 | Step: 102740 | Dataset: 0-119744 | Loss: 0.834 | 913 ms/step , 6890.07 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-04 23:26:30 | Epoch: 0 | Step: 102750 | Dataset: 0-120064 | Loss: 1.038 | 914 ms/step , 6883.24 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-04 23:26:39 | Epoch: 0 | Step: 102760 | Dataset: 0-120384 | Loss: 0.928 | 913 ms/step , 6891.44 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-04 23:26:48 | Epoch: 0 | Step: 102770 | Dataset: 0-120704 | Loss: 0.753 | 913 ms/step , 6885.62 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-04 23:26:57 | Epoch: 0 | Step: 102780 | Dataset: 0-121024 | Loss: 0.812 | 914 ms/step , 6881.40 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-04 23:27:06 | Epoch: 0 | Step: 102790 | Dataset: 0-121344 | Loss: 0.818 | 912 ms/step , 6897.37 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-04 23:27:15 | Epoch: 0 | Step: 102800 | Dataset: 0-121664 | Loss: 0.895 | 914 ms/step , 6882.46 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-04 23:27:17 | Validation | Step: 102800 | Val_loss: 1.030 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:27:26 | Epoch: 0 | 
Step: 102810 | Dataset: 0-121984 | Loss: 0.857 | 914 ms/step , 6883.78 GFLOP/s , 15289.5 tokens/s INFO:__main__:2024-11-04 23:27:35 | Epoch: 0 | Step: 102820 | Dataset: 0-122304 | Loss: 0.940 | 914 ms/step , 6882.19 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-04 23:27:44 | Epoch: 0 | Step: 102830 | Dataset: 0-122624 | Loss: 0.822 | 912 ms/step , 6899.81 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-04 23:27:54 | Epoch: 0 | Step: 102840 | Dataset: 0-122944 | Loss: 0.933 | 913 ms/step , 6885.14 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-04 23:28:03 | Epoch: 0 | Step: 102850 | Dataset: 0-123264 | Loss: 0.841 | 912 ms/step , 6893.05 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-04 23:28:12 | Epoch: 0 | Step: 102860 | Dataset: 0-123584 | Loss: 0.978 | 914 ms/step , 6878.43 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-04 23:28:21 | Epoch: 0 | Step: 102870 | Dataset: 0-123904 | Loss: 0.598 | 914 ms/step , 6878.68 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-04 23:28:30 | Epoch: 0 | Step: 102880 | Dataset: 0-124224 | Loss: 0.840 | 913 ms/step , 6891.15 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-04 23:28:39 | Epoch: 0 | Step: 102890 | Dataset: 0-124544 | Loss: 0.822 | 912 ms/step , 6892.98 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-04 23:28:48 | Epoch: 0 | Step: 102900 | Dataset: 0-124864 | Loss: 0.861 | 912 ms/step , 6896.77 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-04 23:28:50 | Validation | Step: 102900 | Val_loss: 0.994 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:28:59 | Epoch: 0 | Step: 102910 | Dataset: 0-125184 | Loss: 0.844 | 912 ms/step , 6896.34 GFLOP/s , 15282.3 tokens/s INFO:__main__:2024-11-04 23:29:08 | Epoch: 0 | Step: 102920 | Dataset: 0-125504 | Loss: 0.826 | 913 ms/step , 6888.96 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-04 23:29:17 | Epoch: 0 | Step: 102930 | Dataset: 0-125824 | Loss: 0.812 | 913 ms/step , 6885.59 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-04 23:29:27 | Epoch: 0 | Step: 
102940 | Dataset: 0-126144 | Loss: 0.800 | 912 ms/step , 6892.64 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-04 23:29:36 | Epoch: 0 | Step: 102950 | Dataset: 0-126464 | Loss: 0.759 | 912 ms/step , 6893.62 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-04 23:29:45 | Epoch: 0 | Step: 102960 | Dataset: 0-126784 | Loss: 1.006 | 913 ms/step , 6888.16 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-04 23:29:54 | Epoch: 0 | Step: 102970 | Dataset: 0-127104 | Loss: 0.859 | 914 ms/step , 6878.77 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-04 23:30:03 | Epoch: 0 | Step: 102980 | Dataset: 0-127424 | Loss: 1.013 | 913 ms/step , 6891.19 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-04 23:30:12 | Epoch: 0 | Step: 102990 | Dataset: 0-127744 | Loss: 0.951 | 913 ms/step , 6886.47 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-04 23:30:21 | Epoch: 0 | Step: 103000 | Dataset: 0-128064 | Loss: 0.898 | 913 ms/step , 6887.26 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-04 23:30:23 | Validation | Step: 103000 | Val_loss: 0.996 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:30:23 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_233023_step_103000.pt` INFO:__main__:2024-11-04 23:30:33 | Epoch: 0 | Step: 103010 | Dataset: 0-128384 | Loss: 0.783 | 913 ms/step , 6885.72 GFLOP/s , 13779.5 tokens/s INFO:__main__:2024-11-04 23:30:42 | Epoch: 0 | Step: 103020 | Dataset: 0-128704 | Loss: 0.982 | 915 ms/step , 6873.14 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-04 23:30:52 | Epoch: 0 | Step: 103030 | Dataset: 0-129024 | Loss: 0.849 | 914 ms/step , 6883.33 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-04 23:31:01 | Epoch: 0 | Step: 103040 | Dataset: 0-129344 | Loss: 0.867 | 916 ms/step , 6863.89 GFLOP/s , 17899.4 tokens/s INFO:__main__:2024-11-04 23:31:10 | Epoch: 0 | Step: 103050 | Dataset: 0-129664 | Loss: 0.946 | 914 ms/step , 6883.57 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-04 23:31:19 | Epoch: 0 | Step: 103060 | 
Dataset: 0-129984 | Loss: 0.952 | 916 ms/step , 6869.31 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-04 23:31:28 | Epoch: 0 | Step: 103070 | Dataset: 0-130304 | Loss: 0.925 | 914 ms/step , 6883.23 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-04 23:31:37 | Epoch: 0 | Step: 103080 | Dataset: 0-130624 | Loss: 0.928 | 911 ms/step , 6900.83 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-04 23:31:46 | Epoch: 0 | Step: 103090 | Dataset: 0-130944 | Loss: 0.869 | 914 ms/step , 6883.11 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-04 23:31:56 | Epoch: 0 | Step: 103100 | Dataset: 0-131264 | Loss: 0.899 | 914 ms/step , 6883.33 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-04 23:31:57 | Validation | Step: 103100 | Val_loss: 0.906 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:32:06 | Epoch: 0 | Step: 103110 | Dataset: 0-131584 | Loss: 0.686 | 913 ms/step , 6889.35 GFLOP/s , 15274.8 tokens/s INFO:__main__:2024-11-04 23:32:15 | Epoch: 0 | Step: 103120 | Dataset: 0-131904 | Loss: 0.865 | 914 ms/step , 6884.25 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-04 23:32:25 | Epoch: 0 | Step: 103130 | Dataset: 0-132224 | Loss: 0.818 | 913 ms/step , 6888.63 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-04 23:32:34 | Epoch: 0 | Step: 103140 | Dataset: 0-132544 | Loss: 0.913 | 914 ms/step , 6880.22 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-04 23:32:43 | Epoch: 0 | Step: 103150 | Dataset: 0-132864 | Loss: 0.915 | 913 ms/step , 6889.68 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-04 23:32:52 | Epoch: 0 | Step: 103160 | Dataset: 0-133184 | Loss: 0.886 | 914 ms/step , 6881.87 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-04 23:33:01 | Epoch: 0 | Step: 103170 | Dataset: 0-133504 | Loss: 0.648 | 913 ms/step , 6891.16 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-04 23:33:10 | Epoch: 0 | Step: 103180 | Dataset: 0-133824 | Loss: 0.581 | 913 ms/step , 6892.40 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-04 23:33:19 | Epoch: 0 | Step: 103190 | Dataset: 
0-134144 | Loss: 0.538 | 912 ms/step , 6893.94 GFLOP/s , 17954.9 tokens/s INFO:__main__:2024-11-04 23:33:28 | Epoch: 0 | Step: 103200 | Dataset: 0-134464 | Loss: 0.566 | 911 ms/step , 6902.04 GFLOP/s , 17959.1 tokens/s INFO:__main__:2024-11-04 23:33:30 | Validation | Step: 103200 | Val_loss: 0.940 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:33:39 | Epoch: 0 | Step: 103210 | Dataset: 0-134784 | Loss: 0.537 | 912 ms/step , 6894.09 GFLOP/s , 15304.0 tokens/s INFO:__main__:2024-11-04 23:33:48 | Epoch: 0 | Step: 103220 | Dataset: 0-135104 | Loss: 0.501 | 912 ms/step , 6899.41 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-04 23:33:57 | Epoch: 0 | Step: 103230 | Dataset: 0-135424 | Loss: 0.512 | 913 ms/step , 6891.98 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-04 23:34:07 | Epoch: 0 | Step: 103240 | Dataset: 0-135744 | Loss: 0.488 | 912 ms/step , 6893.69 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-04 23:34:16 | Epoch: 0 | Step: 103250 | Dataset: 0-136064 | Loss: 0.504 | 914 ms/step , 6882.53 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-04 23:34:25 | Epoch: 0 | Step: 103260 | Dataset: 0-136384 | Loss: 0.487 | 914 ms/step , 6884.05 GFLOP/s , 17955.2 tokens/s INFO:__main__:2024-11-04 23:34:34 | Epoch: 0 | Step: 103270 | Dataset: 0-136704 | Loss: 0.451 | 913 ms/step , 6891.23 GFLOP/s , 17954.7 tokens/s INFO:__main__:2024-11-04 23:34:43 | Epoch: 0 | Step: 103280 | Dataset: 0-137024 | Loss: 0.475 | 912 ms/step , 6895.91 GFLOP/s , 17954.0 tokens/s INFO:__main__:2024-11-04 23:34:52 | Epoch: 0 | Step: 103290 | Dataset: 0-137344 | Loss: 0.438 | 912 ms/step , 6893.80 GFLOP/s , 17953.6 tokens/s INFO:__main__:2024-11-04 23:35:01 | Epoch: 0 | Step: 103300 | Dataset: 0-137664 | Loss: 0.442 | 912 ms/step , 6895.73 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-04 23:35:03 | Validation | Step: 103300 | Val_loss: 0.831 | Best_val_loss: 0.8851 INFO:__main__:2024-11-04 23:35:03 | Saving full-param checkpoint to 
`/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_233503_step_103300.pt` INFO:__main__:2024-11-04 23:35:13 | Epoch: 0 | Step: 103310 | Dataset: 0-137984 | Loss: 0.442 | 913 ms/step , 6892.27 GFLOP/s , 13683.9 tokens/s INFO:__main__:2024-11-04 23:35:22 | Epoch: 0 | Step: 103320 | Dataset: 0-138304 | Loss: 0.433 | 912 ms/step , 6897.38 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-04 23:35:32 | Epoch: 0 | Step: 103330 | Dataset: 0-138624 | Loss: 0.491 | 913 ms/step , 6886.31 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-04 23:35:41 | Epoch: 0 | Step: 103340 | Dataset: 0-138944 | Loss: 0.437 | 913 ms/step , 6888.32 GFLOP/s , 17952.5 tokens/s INFO:__main__:2024-11-04 23:35:50 | Epoch: 0 | Step: 103350 | Dataset: 0-139264 | Loss: 0.433 | 912 ms/step , 6896.33 GFLOP/s , 17952.6 tokens/s INFO:__main__:2024-11-04 23:35:59 | Epoch: 0 | Step: 103360 | Dataset: 0-139584 | Loss: 0.457 | 912 ms/step , 6897.15 GFLOP/s , 17956.1 tokens/s INFO:__main__:2024-11-04 23:36:08 | Epoch: 0 | Step: 103370 | Dataset: 0-139904 | Loss: 0.454 | 912 ms/step , 6898.05 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-04 23:36:17 | Epoch: 0 | Step: 103380 | Dataset: 0-140224 | Loss: 0.385 | 912 ms/step , 6896.88 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-04 23:36:26 | Epoch: 0 | Step: 103390 | Dataset: 0-140544 | Loss: 0.432 | 912 ms/step , 6896.75 GFLOP/s , 17955.3 tokens/s INFO:__main__:2024-11-04 23:36:35 | Epoch: 0 | Step: 103400 | Dataset: 0-140864 | Loss: 0.438 | 913 ms/step , 6887.71 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-04 23:36:37 | Validation | Step: 103400 | Val_loss: 0.950 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:36:46 | Epoch: 0 | Step: 103410 | Dataset: 0-141184 | Loss: 0.425 | 911 ms/step , 6900.65 GFLOP/s , 15288.4 tokens/s INFO:__main__:2024-11-04 23:36:55 | Epoch: 0 | Step: 103420 | Dataset: 0-141504 | Loss: 0.457 | 912 ms/step , 6896.23 GFLOP/s , 17953.5 tokens/s INFO:__main__:2024-11-04 23:37:04 | Epoch: 0 | Step: 103430 | Dataset: 0-141824 
| Loss: 0.390 | 912 ms/step , 6893.81 GFLOP/s , 17956.2 tokens/s INFO:__main__:2024-11-04 23:37:14 | Epoch: 0 | Step: 103440 | Dataset: 0-142144 | Loss: 0.412 | 912 ms/step , 6895.73 GFLOP/s , 17953.7 tokens/s INFO:__main__:2024-11-04 23:37:23 | Epoch: 0 | Step: 103450 | Dataset: 0-142464 | Loss: 0.419 | 913 ms/step , 6890.72 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-04 23:37:32 | Epoch: 0 | Step: 103460 | Dataset: 0-142784 | Loss: 0.394 | 913 ms/step , 6891.20 GFLOP/s , 17954.8 tokens/s INFO:__main__:2024-11-04 23:37:41 | Epoch: 0 | Step: 103470 | Dataset: 0-143104 | Loss: 0.423 | 912 ms/step , 6898.15 GFLOP/s , 17956.7 tokens/s INFO:__main__:2024-11-04 23:37:50 | Epoch: 0 | Step: 103480 | Dataset: 0-143424 | Loss: 0.378 | 913 ms/step , 6892.48 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-04 23:37:59 | Epoch: 0 | Step: 103490 | Dataset: 0-143744 | Loss: 0.360 | 911 ms/step , 6906.18 GFLOP/s , 17958.3 tokens/s INFO:__main__:2024-11-04 23:38:08 | Epoch: 0 | Step: 103500 | Dataset: 0-144064 | Loss: 0.437 | 913 ms/step , 6886.85 GFLOP/s , 17957.5 tokens/s INFO:__main__:2024-11-04 23:38:10 | Validation | Step: 103500 | Val_loss: 0.934 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:38:19 | Epoch: 0 | Step: 103510 | Dataset: 0-144384 | Loss: 0.362 | 912 ms/step , 6895.80 GFLOP/s , 15294.2 tokens/s INFO:__main__:2024-11-04 23:38:28 | Epoch: 0 | Step: 103520 | Dataset: 0-144704 | Loss: 0.406 | 913 ms/step , 6891.86 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-04 23:38:37 | Epoch: 0 | Step: 103530 | Dataset: 0-145024 | Loss: 0.386 | 912 ms/step , 6895.97 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-04 23:38:46 | Epoch: 0 | Step: 103540 | Dataset: 0-145344 | Loss: 0.393 | 913 ms/step , 6891.97 GFLOP/s , 17950.6 tokens/s INFO:__main__:2024-11-04 23:38:56 | Epoch: 0 | Step: 103550 | Dataset: 0-145664 | Loss: 0.372 | 912 ms/step , 6897.00 GFLOP/s , 17951.4 tokens/s INFO:__main__:2024-11-04 23:39:05 | Epoch: 0 | Step: 103560 | Dataset: 0-145984 | Loss: 
0.356 | 911 ms/step , 6901.84 GFLOP/s , 17955.5 tokens/s INFO:__main__:2024-11-04 23:39:14 | Epoch: 0 | Step: 103570 | Dataset: 0-146304 | Loss: 0.409 | 912 ms/step , 6899.50 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-04 23:39:23 | Epoch: 0 | Step: 103580 | Dataset: 0-146624 | Loss: 0.354 | 911 ms/step , 6900.29 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-04 23:39:32 | Epoch: 0 | Step: 103590 | Dataset: 0-146944 | Loss: 0.394 | 913 ms/step , 6889.13 GFLOP/s , 17954.6 tokens/s INFO:__main__:2024-11-04 23:39:41 | Epoch: 0 | Step: 103600 | Dataset: 0-147264 | Loss: 0.389 | 911 ms/step , 6902.43 GFLOP/s , 17960.6 tokens/s INFO:__main__:2024-11-04 23:39:43 | Validation | Step: 103600 | Val_loss: 0.872 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:39:52 | Epoch: 0 | Step: 103610 | Dataset: 0-147584 | Loss: 0.454 | 912 ms/step , 6899.81 GFLOP/s , 15303.5 tokens/s INFO:__main__:2024-11-04 23:40:01 | Epoch: 0 | Step: 103620 | Dataset: 0-147904 | Loss: 0.396 | 912 ms/step , 6899.47 GFLOP/s , 17957.0 tokens/s INFO:__main__:2024-11-04 23:40:10 | Epoch: 0 | Step: 103630 | Dataset: 0-148224 | Loss: 0.414 | 912 ms/step , 6895.29 GFLOP/s , 17957.1 tokens/s INFO:__main__:2024-11-04 23:40:19 | Epoch: 0 | Step: 103640 | Dataset: 0-148544 | Loss: 0.369 | 912 ms/step , 6897.19 GFLOP/s , 17957.9 tokens/s INFO:__main__:2024-11-04 23:40:28 | Epoch: 0 | Step: 103650 | Dataset: 0-148864 | Loss: 0.411 | 912 ms/step , 6895.85 GFLOP/s , 17958.7 tokens/s INFO:__main__:2024-11-04 23:40:37 | Epoch: 0 | Step: 103660 | Dataset: 0-149184 | Loss: 0.373 | 912 ms/step , 6894.08 GFLOP/s , 17956.4 tokens/s INFO:__main__:2024-11-04 23:40:47 | Epoch: 0 | Step: 103670 | Dataset: 0-149504 | Loss: 0.365 | 913 ms/step , 6892.25 GFLOP/s , 17953.6 tokens/s INFO:__main__:2024-11-04 23:40:56 | Epoch: 0 | Step: 103680 | Dataset: 0-149824 | Loss: 0.366 | 912 ms/step , 6896.96 GFLOP/s , 17957.0 tokens/s INFO:__main__:2024-11-04 23:41:05 | Epoch: 0 | Step: 103690 | Dataset: 0-150144 | Loss: 0.368 | 
914 ms/step , 6882.77 GFLOP/s , 17955.1 tokens/s INFO:__main__:2024-11-04 23:41:14 | Epoch: 0 | Step: 103700 | Dataset: 0-150464 | Loss: 0.336 | 913 ms/step , 6891.08 GFLOP/s , 17951.1 tokens/s INFO:__main__:2024-11-04 23:41:16 | Validation | Step: 103700 | Val_loss: 1.006 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:41:25 | Epoch: 0 | Step: 103710 | Dataset: 0-150784 | Loss: 0.344 | 912 ms/step , 6896.81 GFLOP/s , 15290.9 tokens/s INFO:__main__:2024-11-04 23:41:34 | Epoch: 0 | Step: 103720 | Dataset: 0-151104 | Loss: 0.385 | 912 ms/step , 6896.78 GFLOP/s , 17958.4 tokens/s INFO:__main__:2024-11-04 23:41:43 | Epoch: 0 | Step: 103730 | Dataset: 0-151424 | Loss: 0.400 | 912 ms/step , 6897.30 GFLOP/s , 17956.1 tokens/s INFO:__main__:2024-11-04 23:41:52 | Epoch: 0 | Step: 103740 | Dataset: 0-151744 | Loss: 0.346 | 913 ms/step , 6891.04 GFLOP/s , 17957.9 tokens/s INFO:__main__:2024-11-04 23:42:01 | Epoch: 0 | Step: 103750 | Dataset: 0-152064 | Loss: 0.371 | 912 ms/step , 6894.96 GFLOP/s , 17956.6 tokens/s INFO:__main__:2024-11-04 23:42:10 | Epoch: 0 | Step: 103760 | Dataset: 0-152384 | Loss: 0.318 | 913 ms/step , 6890.54 GFLOP/s , 17954.1 tokens/s INFO:__main__:2024-11-04 23:42:19 | Epoch: 0 | Step: 103770 | Dataset: 0-152704 | Loss: 0.349 | 911 ms/step , 6901.25 GFLOP/s , 17954.9 tokens/s INFO:__main__:2024-11-04 23:42:29 | Epoch: 0 | Step: 103780 | Dataset: 0-153024 | Loss: 0.390 | 912 ms/step , 6900.14 GFLOP/s , 17955.9 tokens/s INFO:__main__:2024-11-04 23:42:38 | Epoch: 0 | Step: 103790 | Dataset: 0-153344 | Loss: 0.372 | 911 ms/step , 6900.27 GFLOP/s , 17953.8 tokens/s INFO:__main__:2024-11-04 23:42:47 | Epoch: 0 | Step: 103800 | Dataset: 0-153664 | Loss: 0.382 | 911 ms/step , 6901.04 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-04 23:42:48 | Validation | Step: 103800 | Val_loss: 0.839 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:42:58 | Epoch: 0 | Step: 103810 | Dataset: 0-153984 | Loss: 0.349 | 913 ms/step , 6889.80 GFLOP/s , 15283.2 tokens/s 
INFO:__main__:2024-11-04 23:43:07 | Epoch: 0 | Step: 103820 | Dataset: 0-154304 | Loss: 0.385 | 913 ms/step , 6888.96 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-04 23:43:16 | Epoch: 0 | Step: 103830 | Dataset: 0-154624 | Loss: 0.318 | 912 ms/step , 6893.01 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-04 23:43:25 | Epoch: 0 | Step: 103840 | Dataset: 0-154944 | Loss: 0.345 | 913 ms/step , 6887.29 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-04 23:43:34 | Epoch: 0 | Step: 103850 | Dataset: 0-155264 | Loss: 0.356 | 914 ms/step , 6884.95 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-04 23:43:43 | Epoch: 0 | Step: 103860 | Dataset: 0-155584 | Loss: 0.335 | 913 ms/step , 6892.41 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-04 23:43:52 | Epoch: 0 | Step: 103870 | Dataset: 0-155904 | Loss: 0.366 | 912 ms/step , 6895.19 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-04 23:44:01 | Epoch: 0 | Step: 103880 | Dataset: 0-156224 | Loss: 0.367 | 912 ms/step , 6893.00 GFLOP/s , 17951.4 tokens/s INFO:__main__:2024-11-04 23:44:11 | Epoch: 0 | Step: 103890 | Dataset: 0-156544 | Loss: 0.359 | 911 ms/step , 6900.73 GFLOP/s , 17951.3 tokens/s INFO:__main__:2024-11-04 23:44:20 | Epoch: 0 | Step: 103900 | Dataset: 0-156864 | Loss: 0.310 | 912 ms/step , 6896.57 GFLOP/s , 17949.8 tokens/s INFO:__main__:2024-11-04 23:44:21 | Validation | Step: 103900 | Val_loss: 0.940 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:44:30 | Epoch: 0 | Step: 103910 | Dataset: 0-157184 | Loss: 0.350 | 912 ms/step , 6894.90 GFLOP/s , 15288.8 tokens/s INFO:__main__:2024-11-04 23:44:40 | Epoch: 0 | Step: 103920 | Dataset: 0-157504 | Loss: 0.316 | 911 ms/step , 6901.24 GFLOP/s , 17956.1 tokens/s INFO:__main__:2024-11-04 23:44:49 | Epoch: 0 | Step: 103930 | Dataset: 0-157824 | Loss: 0.370 | 911 ms/step , 6901.62 GFLOP/s , 17948.2 tokens/s INFO:__main__:2024-11-04 23:44:58 | Epoch: 0 | Step: 103940 | Dataset: 0-158144 | Loss: 0.339 | 912 ms/step , 6897.60 GFLOP/s , 17958.6 tokens/s 
INFO:__main__:2024-11-04 23:45:07 | Epoch: 0 | Step: 103950 | Dataset: 0-158464 | Loss: 0.351 | 912 ms/step , 6900.02 GFLOP/s , 17958.4 tokens/s INFO:__main__:2024-11-04 23:45:16 | Epoch: 0 | Step: 103960 | Dataset: 0-158784 | Loss: 0.296 | 912 ms/step , 6897.02 GFLOP/s , 17950.6 tokens/s INFO:__main__:2024-11-04 23:45:25 | Epoch: 0 | Step: 103970 | Dataset: 0-159104 | Loss: 0.333 | 913 ms/step , 6887.42 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-04 23:45:34 | Epoch: 0 | Step: 103980 | Dataset: 0-159424 | Loss: 0.357 | 912 ms/step , 6898.85 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-04 23:45:43 | Epoch: 0 | Step: 103990 | Dataset: 0-159744 | Loss: 0.309 | 911 ms/step , 6903.87 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-04 23:45:53 | Epoch: 0 | Step: 104000 | Dataset: 0-160064 | Loss: 0.317 | 912 ms/step , 6899.15 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-04 23:45:54 | Validation | Step: 104000 | Val_loss: 0.956 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:45:54 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_234554_step_104000.pt` INFO:__main__:2024-11-04 23:46:04 | Epoch: 0 | Step: 104010 | Dataset: 0-160384 | Loss: 0.331 | 912 ms/step , 6898.21 GFLOP/s , 13834.4 tokens/s INFO:__main__:2024-11-04 23:46:13 | Epoch: 0 | Step: 104020 | Dataset: 0-160704 | Loss: 0.380 | 912 ms/step , 6893.42 GFLOP/s , 17956.7 tokens/s INFO:__main__:2024-11-04 23:46:23 | Epoch: 0 | Step: 104030 | Dataset: 0-161024 | Loss: 0.328 | 911 ms/step , 6904.32 GFLOP/s , 17957.9 tokens/s INFO:__main__:2024-11-04 23:46:32 | Epoch: 0 | Step: 104040 | Dataset: 0-161344 | Loss: 0.354 | 913 ms/step , 6891.90 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-04 23:46:41 | Epoch: 0 | Step: 104050 | Dataset: 0-161664 | Loss: 0.371 | 913 ms/step , 6891.63 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-04 23:46:50 | Epoch: 0 | Step: 104060 | Dataset: 0-161984 | Loss: 0.338 | 913 ms/step , 6888.57 GFLOP/s , 17953.2 tokens/s 
INFO:__main__:2024-11-04 23:46:59 | Epoch: 0 | Step: 104070 | Dataset: 0-162304 | Loss: 0.350 | 912 ms/step , 6896.25 GFLOP/s , 17951.1 tokens/s INFO:__main__:2024-11-04 23:47:08 | Epoch: 0 | Step: 104080 | Dataset: 0-162624 | Loss: 0.328 | 913 ms/step , 6886.20 GFLOP/s , 17951.0 tokens/s INFO:__main__:2024-11-04 23:47:17 | Epoch: 0 | Step: 104090 | Dataset: 0-162944 | Loss: 0.358 | 912 ms/step , 6898.74 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-04 23:47:26 | Epoch: 0 | Step: 104100 | Dataset: 0-163264 | Loss: 0.342 | 912 ms/step , 6896.69 GFLOP/s , 17954.3 tokens/s INFO:__main__:2024-11-04 23:47:28 | Validation | Step: 104100 | Val_loss: 0.951 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:47:37 | Epoch: 0 | Step: 104110 | Dataset: 0-163584 | Loss: 0.345 | 912 ms/step , 6895.44 GFLOP/s , 15296.9 tokens/s INFO:__main__:2024-11-04 23:47:46 | Epoch: 0 | Step: 104120 | Dataset: 0-163904 | Loss: 0.335 | 911 ms/step , 6900.25 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-04 23:47:55 | Epoch: 0 | Step: 104130 | Dataset: 0-164224 | Loss: 0.266 | 913 ms/step , 6892.52 GFLOP/s , 17950.4 tokens/s INFO:__main__:2024-11-04 23:48:05 | Epoch: 0 | Step: 104140 | Dataset: 0-164544 | Loss: 0.370 | 912 ms/step , 6894.26 GFLOP/s , 17957.3 tokens/s INFO:__main__:2024-11-04 23:48:14 | Epoch: 0 | Step: 104150 | Dataset: 0-164864 | Loss: 0.326 | 911 ms/step , 6904.58 GFLOP/s , 17962.9 tokens/s INFO:__main__:2024-11-04 23:48:23 | Epoch: 0 | Step: 104160 | Dataset: 0-165184 | Loss: 0.309 | 912 ms/step , 6897.24 GFLOP/s , 17954.0 tokens/s INFO:__main__:2024-11-04 23:48:32 | Epoch: 0 | Step: 104170 | Dataset: 0-165504 | Loss: 0.312 | 912 ms/step , 6896.44 GFLOP/s , 17954.3 tokens/s INFO:__main__:2024-11-04 23:48:41 | Epoch: 0 | Step: 104180 | Dataset: 0-165824 | Loss: 0.374 | 912 ms/step , 6896.73 GFLOP/s , 17949.4 tokens/s INFO:__main__:2024-11-04 23:48:50 | Epoch: 0 | Step: 104190 | Dataset: 0-166144 | Loss: 0.390 | 912 ms/step , 6892.78 GFLOP/s , 17955.7 tokens/s 
INFO:__main__:2024-11-04 23:48:59 | Epoch: 0 | Step: 104200 | Dataset: 0-166464 | Loss: 0.290 | 913 ms/step , 6889.04 GFLOP/s , 17962.4 tokens/s INFO:__main__:2024-11-04 23:49:01 | Validation | Step: 104200 | Val_loss: 0.994 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:49:10 | Epoch: 0 | Step: 104210 | Dataset: 0-166784 | Loss: 0.361 | 913 ms/step , 6892.13 GFLOP/s , 15294.0 tokens/s INFO:__main__:2024-11-04 23:49:19 | Epoch: 0 | Step: 104220 | Dataset: 0-167104 | Loss: 0.311 | 912 ms/step , 6898.91 GFLOP/s , 17955.3 tokens/s INFO:__main__:2024-11-04 23:49:28 | Epoch: 0 | Step: 104230 | Dataset: 0-167424 | Loss: 0.292 | 913 ms/step , 6885.16 GFLOP/s , 17953.9 tokens/s INFO:__main__:2024-11-04 23:49:37 | Epoch: 0 | Step: 104240 | Dataset: 0-167744 | Loss: 0.327 | 912 ms/step , 6892.98 GFLOP/s , 17959.3 tokens/s INFO:__main__:2024-11-04 23:49:47 | Epoch: 0 | Step: 104250 | Dataset: 0-168064 | Loss: 0.378 | 912 ms/step , 6898.93 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-04 23:49:56 | Epoch: 0 | Step: 104260 | Dataset: 0-168384 | Loss: 0.322 | 912 ms/step , 6896.95 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-04 23:50:05 | Epoch: 0 | Step: 104270 | Dataset: 0-168704 | Loss: 0.297 | 912 ms/step , 6896.55 GFLOP/s , 17954.4 tokens/s INFO:__main__:2024-11-04 23:50:14 | Epoch: 0 | Step: 104280 | Dataset: 0-169024 | Loss: 0.350 | 913 ms/step , 6890.16 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-04 23:50:23 | Epoch: 0 | Step: 104290 | Dataset: 0-169344 | Loss: 0.341 | 912 ms/step , 6897.74 GFLOP/s , 17952.7 tokens/s INFO:__main__:2024-11-04 23:50:32 | Epoch: 0 | Step: 104300 | Dataset: 0-169664 | Loss: 0.295 | 912 ms/step , 6895.57 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-04 23:50:34 | Validation | Step: 104300 | Val_loss: 1.003 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:50:43 | Epoch: 0 | Step: 104310 | Dataset: 0-169984 | Loss: 0.317 | 912 ms/step , 6893.76 GFLOP/s , 15294.3 tokens/s INFO:__main__:2024-11-04 23:50:52 | Epoch: 0 | 
Step: 104320 | Dataset: 0-170304 | Loss: 0.359 | 912 ms/step , 6895.49 GFLOP/s , 17956.3 tokens/s INFO:__main__:2024-11-04 23:51:01 | Epoch: 0 | Step: 104330 | Dataset: 0-170624 | Loss: 0.285 | 913 ms/step , 6888.91 GFLOP/s , 17955.5 tokens/s INFO:__main__:2024-11-04 23:51:10 | Epoch: 0 | Step: 104340 | Dataset: 0-170944 | Loss: 0.894 | 914 ms/step , 6883.33 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-04 23:51:19 | Epoch: 0 | Step: 104350 | Dataset: 0-171264 | Loss: 0.613 | 913 ms/step , 6887.44 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-04 23:51:29 | Epoch: 0 | Step: 104360 | Dataset: 0-171584 | Loss: 0.783 | 912 ms/step , 6894.91 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-04 23:51:38 | Epoch: 0 | Step: 104370 | Dataset: 0-171904 | Loss: 0.876 | 914 ms/step , 6880.17 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-04 23:51:47 | Epoch: 0 | Step: 104380 | Dataset: 0-172224 | Loss: 0.873 | 913 ms/step , 6885.81 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-04 23:51:56 | Epoch: 0 | Step: 104390 | Dataset: 0-172544 | Loss: 0.892 | 914 ms/step , 6884.54 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-04 23:52:05 | Epoch: 0 | Step: 104400 | Dataset: 0-172864 | Loss: 1.035 | 915 ms/step , 6876.51 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-04 23:52:07 | Validation | Step: 104400 | Val_loss: 0.781 | Best_val_loss: 0.8310 INFO:__main__:2024-11-04 23:52:07 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241104_235207_step_104400.pt` INFO:__main__:2024-11-04 23:52:17 | Epoch: 0 | Step: 104410 | Dataset: 0-173184 | Loss: 0.709 | 917 ms/step , 6861.08 GFLOP/s , 13838.2 tokens/s INFO:__main__:2024-11-04 23:52:26 | Epoch: 0 | Step: 104420 | Dataset: 0-173504 | Loss: 0.716 | 913 ms/step , 6888.08 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-04 23:52:35 | Epoch: 0 | Step: 104430 | Dataset: 0-173824 | Loss: 0.907 | 914 ms/step , 6879.59 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-04 23:52:44 | Epoch: 0 | Step: 
104440 | Dataset: 0-174144 | Loss: 0.913 | 914 ms/step , 6881.19 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-04 23:52:54 | Epoch: 0 | Step: 104450 | Dataset: 0-174464 | Loss: 0.987 | 913 ms/step , 6887.25 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-04 23:53:03 | Epoch: 0 | Step: 104460 | Dataset: 0-174784 | Loss: 0.908 | 916 ms/step , 6867.28 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-04 23:53:12 | Epoch: 0 | Step: 104470 | Dataset: 0-175104 | Loss: 0.874 | 913 ms/step , 6888.61 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-04 23:53:21 | Epoch: 0 | Step: 104480 | Dataset: 0-175424 | Loss: 0.871 | 912 ms/step , 6893.37 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-04 23:53:30 | Epoch: 0 | Step: 104490 | Dataset: 0-175744 | Loss: 0.749 | 915 ms/step , 6876.37 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-04 23:53:39 | Epoch: 0 | Step: 104500 | Dataset: 0-176064 | Loss: 0.889 | 912 ms/step , 6894.57 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-04 23:53:41 | Validation | Step: 104500 | Val_loss: 0.946 | Best_val_loss: 0.7806 INFO:__main__:2024-11-04 23:53:50 | Epoch: 0 | Step: 104510 | Dataset: 0-176384 | Loss: 0.765 | 914 ms/step , 6882.85 GFLOP/s , 15278.7 tokens/s INFO:__main__:2024-11-04 23:53:59 | Epoch: 0 | Step: 104520 | Dataset: 0-176704 | Loss: 0.882 | 914 ms/step , 6880.65 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-04 23:54:08 | Epoch: 0 | Step: 104530 | Dataset: 0-177024 | Loss: 0.884 | 914 ms/step , 6879.31 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-04 23:54:17 | Epoch: 0 | Step: 104540 | Dataset: 0-177344 | Loss: 0.836 | 914 ms/step , 6878.40 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-04 23:54:26 | Epoch: 0 | Step: 104550 | Dataset: 0-177664 | Loss: 0.705 | 912 ms/step , 6895.64 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-04 23:54:36 | Epoch: 0 | Step: 104560 | Dataset: 0-177984 | Loss: 0.829 | 913 ms/step , 6885.29 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-04 23:54:45 | Epoch: 0 | Step: 104570 | 
Dataset: 0-178304 | Loss: 0.835 | 913 ms/step , 6892.09 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-04 23:54:54 | Epoch: 0 | Step: 104580 | Dataset: 0-178624 | Loss: 0.852 | 914 ms/step , 6883.61 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-04 23:55:03 | Epoch: 0 | Step: 104590 | Dataset: 0-178944 | Loss: 0.831 | 911 ms/step , 6903.29 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-04 23:55:12 | Epoch: 0 | Step: 104600 | Dataset: 0-179264 | Loss: 0.845 | 913 ms/step , 6887.14 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-04 23:55:14 | Validation | Step: 104600 | Val_loss: 0.870 | Best_val_loss: 0.7806 INFO:__main__:2024-11-04 23:55:23 | Epoch: 0 | Step: 104610 | Dataset: 0-179584 | Loss: 0.864 | 914 ms/step , 6882.72 GFLOP/s , 15264.6 tokens/s INFO:__main__:2024-11-04 23:55:32 | Epoch: 0 | Step: 104620 | Dataset: 0-179904 | Loss: 0.914 | 914 ms/step , 6883.96 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-04 23:55:41 | Epoch: 0 | Step: 104630 | Dataset: 0-180224 | Loss: 0.700 | 914 ms/step , 6881.62 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-04 23:55:50 | Epoch: 0 | Step: 104640 | Dataset: 0-180544 | Loss: 0.816 | 914 ms/step , 6879.32 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-04 23:55:59 | Epoch: 0 | Step: 104650 | Dataset: 0-180864 | Loss: 0.842 | 913 ms/step , 6886.98 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-04 23:56:09 | Epoch: 0 | Step: 104660 | Dataset: 0-181184 | Loss: 0.906 | 913 ms/step , 6886.60 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-04 23:56:18 | Epoch: 0 | Step: 104670 | Dataset: 0-181504 | Loss: 0.809 | 913 ms/step , 6887.55 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-04 23:56:27 | Epoch: 0 | Step: 104680 | Dataset: 0-181824 | Loss: 0.918 | 914 ms/step , 6883.68 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-04 23:56:36 | Epoch: 0 | Step: 104690 | Dataset: 0-182144 | Loss: 0.822 | 914 ms/step , 6882.18 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-04 23:56:45 | Epoch: 0 | Step: 104700 | Dataset: 
0-182464 | Loss: 0.887 | 912 ms/step , 6893.88 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-04 23:56:47 | Validation | Step: 104700 | Val_loss: 0.905 | Best_val_loss: 0.7806 INFO:__main__:2024-11-04 23:56:56 | Epoch: 0 | Step: 104710 | Dataset: 0-182784 | Loss: 0.731 | 913 ms/step , 6890.76 GFLOP/s , 15274.6 tokens/s INFO:__main__:2024-11-04 23:57:05 | Epoch: 0 | Step: 104720 | Dataset: 0-183104 | Loss: 0.699 | 913 ms/step , 6885.92 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-04 23:57:14 | Epoch: 0 | Step: 104730 | Dataset: 0-183424 | Loss: 0.816 | 913 ms/step , 6892.36 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-04 23:57:23 | Epoch: 0 | Step: 104740 | Dataset: 0-183744 | Loss: 0.714 | 913 ms/step , 6891.77 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-04 23:57:32 | Epoch: 0 | Step: 104750 | Dataset: 0-184064 | Loss: 0.987 | 914 ms/step , 6884.51 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-04 23:57:42 | Epoch: 0 | Step: 104760 | Dataset: 0-184384 | Loss: 0.837 | 913 ms/step , 6889.83 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-04 23:57:51 | Epoch: 0 | Step: 104770 | Dataset: 0-184704 | Loss: 0.892 | 914 ms/step , 6883.66 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-04 23:58:00 | Epoch: 0 | Step: 104780 | Dataset: 0-185024 | Loss: 0.837 | 913 ms/step , 6887.23 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-04 23:58:09 | Epoch: 0 | Step: 104790 | Dataset: 0-185344 | Loss: 0.886 | 913 ms/step , 6885.39 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-04 23:58:18 | Epoch: 0 | Step: 104800 | Dataset: 0-185664 | Loss: 0.998 | 914 ms/step , 6882.66 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-04 23:58:20 | Validation | Step: 104800 | Val_loss: 0.869 | Best_val_loss: 0.7806 INFO:__main__:2024-11-04 23:58:29 | Epoch: 0 | Step: 104810 | Dataset: 0-185984 | Loss: 0.941 | 913 ms/step , 6890.01 GFLOP/s , 15265.4 tokens/s INFO:__main__:2024-11-04 23:58:38 | Epoch: 0 | Step: 104820 | Dataset: 0-186304 | Loss: 0.810 | 914 ms/step , 6880.29 
GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-04 23:58:47 | Epoch: 0 | Step: 104830 | Dataset: 0-186624 | Loss: 0.947 | 916 ms/step , 6869.06 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-04 23:58:56 | Epoch: 0 | Step: 104840 | Dataset: 0-186944 | Loss: 0.867 | 912 ms/step , 6892.82 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-04 23:59:05 | Epoch: 0 | Step: 104850 | Dataset: 0-187264 | Loss: 0.867 | 915 ms/step , 6875.90 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-04 23:59:15 | Epoch: 0 | Step: 104860 | Dataset: 0-187584 | Loss: 0.851 | 914 ms/step , 6881.32 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-04 23:59:24 | Epoch: 0 | Step: 104870 | Dataset: 0-187904 | Loss: 0.811 | 914 ms/step , 6884.39 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-04 23:59:33 | Epoch: 0 | Step: 104880 | Dataset: 0-188224 | Loss: 0.920 | 913 ms/step , 6886.19 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-04 23:59:42 | Epoch: 0 | Step: 104890 | Dataset: 0-188544 | Loss: 0.925 | 914 ms/step , 6884.11 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-04 23:59:51 | Epoch: 0 | Step: 104900 | Dataset: 0-188864 | Loss: 0.739 | 912 ms/step , 6894.99 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-04 23:59:53 | Validation | Step: 104900 | Val_loss: 0.889 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:00:02 | Epoch: 0 | Step: 104910 | Dataset: 0-189184 | Loss: 0.862 | 915 ms/step , 6874.27 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-05 00:00:11 | Epoch: 0 | Step: 104920 | Dataset: 0-189504 | Loss: 0.756 | 912 ms/step , 6893.25 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 00:00:20 | Epoch: 0 | Step: 104930 | Dataset: 0-189824 | Loss: 0.857 | 912 ms/step , 6893.23 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 00:00:29 | Epoch: 0 | Step: 104940 | Dataset: 0-190144 | Loss: 0.847 | 914 ms/step , 6883.19 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 00:00:38 | Epoch: 0 | Step: 104950 | Dataset: 0-190464 | Loss: 0.798 | 914 ms/step , 6877.79 GFLOP/s , 
17915.2 tokens/s INFO:__main__:2024-11-05 00:00:48 | Epoch: 0 | Step: 104960 | Dataset: 0-190784 | Loss: 0.943 | 914 ms/step , 6883.23 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 00:00:57 | Epoch: 0 | Step: 104970 | Dataset: 0-191104 | Loss: 0.598 | 915 ms/step , 6876.09 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 00:01:06 | Epoch: 0 | Step: 104980 | Dataset: 0-191424 | Loss: 0.908 | 914 ms/step , 6880.38 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 00:01:15 | Epoch: 0 | Step: 104990 | Dataset: 0-191744 | Loss: 0.749 | 914 ms/step , 6879.98 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 00:01:24 | Epoch: 0 | Step: 105000 | Dataset: 0-192064 | Loss: 0.835 | 912 ms/step , 6892.74 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 00:01:26 | Validation | Step: 105000 | Val_loss: 0.886 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:01:26 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_000126_step_105000.pt` INFO:__main__:2024-11-05 00:01:36 | Epoch: 0 | Step: 105010 | Dataset: 0-192384 | Loss: 0.923 | 914 ms/step , 6884.35 GFLOP/s , 13793.3 tokens/s INFO:__main__:2024-11-05 00:01:45 | Epoch: 0 | Step: 105020 | Dataset: 0-192704 | Loss: 0.690 | 913 ms/step , 6886.14 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 00:01:54 | Epoch: 0 | Step: 105030 | Dataset: 0-193024 | Loss: 0.706 | 913 ms/step , 6886.82 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 00:02:03 | Epoch: 0 | Step: 105040 | Dataset: 0-193344 | Loss: 0.792 | 914 ms/step , 6882.64 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 00:02:13 | Epoch: 0 | Step: 105050 | Dataset: 0-193664 | Loss: 0.776 | 913 ms/step , 6889.83 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 00:02:22 | Epoch: 0 | Step: 105060 | Dataset: 0-193984 | Loss: 0.812 | 915 ms/step , 6874.91 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 00:02:31 | Epoch: 0 | Step: 105070 | Dataset: 0-194304 | Loss: 0.954 | 914 ms/step , 6879.34 GFLOP/s , 17928.1 
tokens/s INFO:__main__:2024-11-05 00:02:40 | Epoch: 0 | Step: 105080 | Dataset: 0-194624 | Loss: 0.841 | 913 ms/step , 6886.23 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 00:02:49 | Epoch: 0 | Step: 105090 | Dataset: 0-194944 | Loss: 0.781 | 913 ms/step , 6891.45 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 00:02:58 | Epoch: 0 | Step: 105100 | Dataset: 0-195264 | Loss: 0.846 | 914 ms/step , 6882.06 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 00:03:00 | Validation | Step: 105100 | Val_loss: 0.890 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:03:09 | Epoch: 0 | Step: 105110 | Dataset: 0-195584 | Loss: 0.741 | 914 ms/step , 6880.80 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 00:03:18 | Epoch: 0 | Step: 105120 | Dataset: 0-195904 | Loss: 0.672 | 912 ms/step , 6892.88 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 00:03:27 | Epoch: 0 | Step: 105130 | Dataset: 0-196224 | Loss: 0.727 | 913 ms/step , 6886.16 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 00:03:36 | Epoch: 0 | Step: 105140 | Dataset: 0-196544 | Loss: 0.773 | 913 ms/step , 6890.51 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 00:03:46 | Epoch: 0 | Step: 105150 | Dataset: 0-196864 | Loss: 0.831 | 913 ms/step , 6888.34 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 00:03:55 | Epoch: 0 | Step: 105160 | Dataset: 0-197184 | Loss: 0.988 | 913 ms/step , 6886.25 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 00:04:04 | Epoch: 0 | Step: 105170 | Dataset: 0-197504 | Loss: 0.828 | 914 ms/step , 6882.25 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 00:04:13 | Epoch: 0 | Step: 105180 | Dataset: 0-197824 | Loss: 0.868 | 913 ms/step , 6887.66 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 00:04:22 | Epoch: 0 | Step: 105190 | Dataset: 0-198144 | Loss: 0.947 | 914 ms/step , 6878.44 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 00:04:31 | Epoch: 0 | Step: 105200 | Dataset: 0-198464 | Loss: 0.873 | 913 ms/step , 6886.48 GFLOP/s , 17937.0 tokens/s 
INFO:__main__:2024-11-05 00:04:33 | Validation | Step: 105200 | Val_loss: 0.889 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:04:42 | Epoch: 0 | Step: 105210 | Dataset: 0-198784 | Loss: 0.790 | 913 ms/step , 6890.31 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-05 00:04:51 | Epoch: 0 | Step: 105220 | Dataset: 0-199104 | Loss: 0.937 | 913 ms/step , 6891.62 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 00:05:00 | Epoch: 0 | Step: 105230 | Dataset: 0-199424 | Loss: 0.886 | 915 ms/step , 6873.16 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 00:05:09 | Epoch: 0 | Step: 105240 | Dataset: 0-199744 | Loss: 0.862 | 913 ms/step , 6889.60 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 00:05:18 | Epoch: 0 | Step: 105250 | Dataset: 0-200064 | Loss: 0.871 | 914 ms/step , 6881.69 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 00:05:28 | Epoch: 0 | Step: 105260 | Dataset: 0-200384 | Loss: 0.722 | 912 ms/step , 6895.97 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 00:05:37 | Epoch: 0 | Step: 105270 | Dataset: 0-200704 | Loss: 0.733 | 913 ms/step , 6891.26 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 00:05:46 | Epoch: 0 | Step: 105280 | Dataset: 0-201024 | Loss: 0.818 | 914 ms/step , 6884.77 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 00:05:55 | Epoch: 0 | Step: 105290 | Dataset: 0-201344 | Loss: 0.910 | 914 ms/step , 6881.12 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 00:06:04 | Epoch: 0 | Step: 105300 | Dataset: 0-201664 | Loss: 0.659 | 912 ms/step , 6892.69 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 00:06:06 | Validation | Step: 105300 | Val_loss: 0.868 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:06:15 | Epoch: 0 | Step: 105310 | Dataset: 0-201984 | Loss: 0.847 | 916 ms/step , 6867.88 GFLOP/s , 15276.5 tokens/s INFO:__main__:2024-11-05 00:06:24 | Epoch: 0 | Step: 105320 | Dataset: 0-202304 | Loss: 0.917 | 914 ms/step , 6881.40 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 00:06:33 | Epoch: 0 | 
Step: 105330 | Dataset: 0-202624 | Loss: 0.833 | 914 ms/step , 6877.76 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 00:06:42 | Epoch: 0 | Step: 105340 | Dataset: 0-202944 | Loss: 0.896 | 913 ms/step , 6892.30 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 00:06:51 | Epoch: 0 | Step: 105350 | Dataset: 0-203264 | Loss: 0.799 | 914 ms/step , 6881.59 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 00:07:01 | Epoch: 0 | Step: 105360 | Dataset: 0-203584 | Loss: 0.837 | 913 ms/step , 6887.34 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 00:07:10 | Epoch: 0 | Step: 105370 | Dataset: 0-203904 | Loss: 0.983 | 913 ms/step , 6887.88 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 00:07:19 | Epoch: 0 | Step: 105380 | Dataset: 0-204224 | Loss: 0.963 | 914 ms/step , 6881.70 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 00:07:28 | Epoch: 0 | Step: 105390 | Dataset: 0-204544 | Loss: 0.930 | 916 ms/step , 6869.37 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 00:07:37 | Epoch: 0 | Step: 105400 | Dataset: 0-204864 | Loss: 0.785 | 914 ms/step , 6882.31 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 00:07:39 | Validation | Step: 105400 | Val_loss: 0.905 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:07:48 | Epoch: 0 | Step: 105410 | Dataset: 0-205184 | Loss: 0.797 | 914 ms/step , 6880.96 GFLOP/s , 15277.0 tokens/s INFO:__main__:2024-11-05 00:07:57 | Epoch: 0 | Step: 105420 | Dataset: 0-205504 | Loss: 0.839 | 914 ms/step , 6881.47 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 00:08:06 | Epoch: 0 | Step: 105430 | Dataset: 0-205824 | Loss: 0.775 | 913 ms/step , 6890.51 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 00:08:15 | Epoch: 0 | Step: 105440 | Dataset: 0-206144 | Loss: 0.916 | 915 ms/step , 6871.60 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 00:08:24 | Epoch: 0 | Step: 105450 | Dataset: 0-206464 | Loss: 0.850 | 913 ms/step , 6889.68 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 00:08:34 | Epoch: 0 | Step: 
105460 | Dataset: 0-206784 | Loss: 0.832 | 913 ms/step , 6888.75 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 00:08:43 | Epoch: 0 | Step: 105470 | Dataset: 0-207104 | Loss: 0.492 | 913 ms/step , 6885.85 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 00:08:52 | Epoch: 0 | Step: 105480 | Dataset: 0-207424 | Loss: 0.865 | 913 ms/step , 6887.69 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 00:09:01 | Epoch: 0 | Step: 105490 | Dataset: 0-207744 | Loss: 0.935 | 915 ms/step , 6875.35 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 00:09:10 | Epoch: 0 | Step: 105500 | Dataset: 0-208064 | Loss: 0.912 | 914 ms/step , 6883.12 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 00:09:12 | Validation | Step: 105500 | Val_loss: 0.899 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:09:21 | Epoch: 0 | Step: 105510 | Dataset: 0-208384 | Loss: 0.785 | 912 ms/step , 6894.36 GFLOP/s , 15266.5 tokens/s INFO:__main__:2024-11-05 00:09:30 | Epoch: 0 | Step: 105520 | Dataset: 0-208704 | Loss: 0.787 | 913 ms/step , 6890.00 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 00:09:39 | Epoch: 0 | Step: 105530 | Dataset: 0-209024 | Loss: 0.768 | 912 ms/step , 6894.09 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 00:09:48 | Epoch: 0 | Step: 105540 | Dataset: 0-209344 | Loss: 0.852 | 914 ms/step , 6883.23 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 00:09:57 | Epoch: 0 | Step: 105550 | Dataset: 0-209664 | Loss: 1.008 | 915 ms/step , 6874.44 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 00:10:07 | Epoch: 0 | Step: 105560 | Dataset: 0-209984 | Loss: 0.836 | 913 ms/step , 6886.57 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 00:10:16 | Epoch: 0 | Step: 105570 | Dataset: 0-210304 | Loss: 0.806 | 913 ms/step , 6886.13 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 00:10:25 | Epoch: 0 | Step: 105580 | Dataset: 0-210624 | Loss: 0.619 | 914 ms/step , 6884.27 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 00:10:34 | Epoch: 0 | Step: 105590 | 
Dataset: 0-210944 | Loss: 0.845 | 913 ms/step , 6885.47 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 00:10:43 | Epoch: 0 | Step: 105600 | Dataset: 0-211264 | Loss: 0.924 | 914 ms/step , 6883.24 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 00:10:45 | Validation | Step: 105600 | Val_loss: 0.909 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:10:54 | Epoch: 0 | Step: 105610 | Dataset: 0-211584 | Loss: 0.847 | 914 ms/step , 6879.70 GFLOP/s , 15265.9 tokens/s INFO:__main__:2024-11-05 00:11:03 | Epoch: 0 | Step: 105620 | Dataset: 0-211904 | Loss: 0.894 | 916 ms/step , 6869.61 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 00:11:12 | Epoch: 0 | Step: 105630 | Dataset: 0-212224 | Loss: 0.715 | 913 ms/step , 6886.45 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 00:11:21 | Epoch: 0 | Step: 105640 | Dataset: 0-212544 | Loss: 0.818 | 914 ms/step , 6877.61 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 00:11:30 | Epoch: 0 | Step: 105650 | Dataset: 0-212864 | Loss: 0.840 | 915 ms/step , 6871.83 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 00:11:40 | Epoch: 0 | Step: 105660 | Dataset: 0-213184 | Loss: 0.757 | 913 ms/step , 6889.05 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 00:11:49 | Epoch: 0 | Step: 105670 | Dataset: 0-213504 | Loss: 0.823 | 911 ms/step , 6902.07 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 00:11:58 | Epoch: 0 | Step: 105680 | Dataset: 0-213824 | Loss: 0.842 | 914 ms/step , 6879.42 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 00:12:07 | Epoch: 0 | Step: 105690 | Dataset: 0-214144 | Loss: 0.791 | 913 ms/step , 6889.07 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 00:12:16 | Epoch: 0 | Step: 105700 | Dataset: 0-214464 | Loss: 0.823 | 912 ms/step , 6894.18 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 00:12:18 | Validation | Step: 105700 | Val_loss: 0.912 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:12:27 | Epoch: 0 | Step: 105710 | Dataset: 0-214784 | Loss: 0.714 | 913 ms/step , 
6890.46 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 00:12:36 | Epoch: 0 | Step: 105720 | Dataset: 0-215104 | Loss: 0.647 | 911 ms/step , 6901.18 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 00:12:45 | Epoch: 0 | Step: 105730 | Dataset: 0-215424 | Loss: 0.794 | 913 ms/step , 6890.64 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 00:12:54 | Epoch: 0 | Step: 105740 | Dataset: 0-215744 | Loss: 0.691 | 914 ms/step , 6879.01 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 00:13:03 | Epoch: 0 | Step: 105750 | Dataset: 0-216064 | Loss: 0.914 | 914 ms/step , 6882.80 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 00:13:12 | Epoch: 0 | Step: 105760 | Dataset: 0-216384 | Loss: 0.717 | 913 ms/step , 6886.29 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 00:13:22 | Epoch: 0 | Step: 105770 | Dataset: 0-216704 | Loss: 0.920 | 914 ms/step , 6884.61 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 00:13:31 | Epoch: 0 | Step: 105780 | Dataset: 0-217024 | Loss: 0.969 | 913 ms/step , 6887.65 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 00:13:40 | Epoch: 0 | Step: 105790 | Dataset: 0-217344 | Loss: 1.072 | 913 ms/step , 6890.52 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 00:13:49 | Epoch: 0 | Step: 105800 | Dataset: 0-217664 | Loss: 1.054 | 912 ms/step , 6898.33 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 00:13:51 | Validation | Step: 105800 | Val_loss: 0.902 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:14:00 | Epoch: 0 | Step: 105810 | Dataset: 0-217984 | Loss: 0.995 | 913 ms/step , 6885.15 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-05 00:14:09 | Epoch: 0 | Step: 105820 | Dataset: 0-218304 | Loss: 1.093 | 914 ms/step , 6878.13 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 00:14:18 | Epoch: 0 | Step: 105830 | Dataset: 0-218624 | Loss: 1.038 | 911 ms/step , 6901.87 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 00:14:27 | Epoch: 0 | Step: 105840 | Dataset: 0-218944 | Loss: 1.084 | 915 ms/step , 6875.55 
GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 00:14:36 | Epoch: 0 | Step: 105850 | Dataset: 0-219264 | Loss: 0.949 | 912 ms/step , 6893.62 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 00:14:45 | Epoch: 0 | Step: 105860 | Dataset: 0-219584 | Loss: 1.069 | 912 ms/step , 6897.41 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 00:14:55 | Epoch: 0 | Step: 105870 | Dataset: 0-219904 | Loss: 0.990 | 914 ms/step , 6884.89 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 00:15:04 | Epoch: 0 | Step: 105880 | Dataset: 0-220224 | Loss: 0.994 | 914 ms/step , 6883.37 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 00:15:13 | Epoch: 0 | Step: 105890 | Dataset: 0-220544 | Loss: 1.059 | 912 ms/step , 6893.85 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 00:15:22 | Epoch: 0 | Step: 105900 | Dataset: 0-220864 | Loss: 1.067 | 912 ms/step , 6896.10 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 00:15:24 | Validation | Step: 105900 | Val_loss: 0.901 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:15:33 | Epoch: 0 | Step: 105910 | Dataset: 0-221184 | Loss: 1.044 | 912 ms/step , 6898.06 GFLOP/s , 15285.5 tokens/s INFO:__main__:2024-11-05 00:15:42 | Epoch: 0 | Step: 105920 | Dataset: 0-221504 | Loss: 1.108 | 914 ms/step , 6884.78 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 00:15:51 | Epoch: 0 | Step: 105930 | Dataset: 0-221824 | Loss: 1.031 | 913 ms/step , 6889.17 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 00:16:00 | Epoch: 0 | Step: 105940 | Dataset: 0-222144 | Loss: 0.911 | 913 ms/step , 6890.98 GFLOP/s , 17950.9 tokens/s INFO:__main__:2024-11-05 00:16:09 | Epoch: 0 | Step: 105950 | Dataset: 0-222464 | Loss: 1.045 | 914 ms/step , 6883.67 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 00:16:18 | Epoch: 0 | Step: 105960 | Dataset: 0-222784 | Loss: 1.039 | 913 ms/step , 6888.63 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 00:16:27 | Epoch: 0 | Step: 105970 | Dataset: 0-223104 | Loss: 0.975 | 912 ms/step , 6898.30 GFLOP/s , 
17942.3 tokens/s INFO:__main__:2024-11-05 00:16:37 | Epoch: 0 | Step: 105980 | Dataset: 0-223424 | Loss: 0.994 | 912 ms/step , 6893.36 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 00:16:46 | Epoch: 0 | Step: 105990 | Dataset: 0-223744 | Loss: 0.997 | 913 ms/step , 6889.69 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 00:16:55 | Epoch: 0 | Step: 106000 | Dataset: 0-224064 | Loss: 0.914 | 913 ms/step , 6892.35 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 00:16:56 | Validation | Step: 106000 | Val_loss: 0.900 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:16:56 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_001656_step_106000.pt` INFO:__main__:2024-11-05 00:17:07 | Epoch: 0 | Step: 106010 | Dataset: 0-224384 | Loss: 0.968 | 913 ms/step , 6890.14 GFLOP/s , 13784.7 tokens/s INFO:__main__:2024-11-05 00:17:16 | Epoch: 0 | Step: 106020 | Dataset: 0-224704 | Loss: 1.133 | 913 ms/step , 6890.64 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 00:17:25 | Epoch: 0 | Step: 106030 | Dataset: 0-225024 | Loss: 0.964 | 914 ms/step , 6883.08 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 00:17:34 | Epoch: 0 | Step: 106040 | Dataset: 0-225344 | Loss: 0.997 | 913 ms/step , 6892.43 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 00:17:43 | Epoch: 0 | Step: 106050 | Dataset: 0-225664 | Loss: 0.951 | 913 ms/step , 6888.66 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 00:17:52 | Epoch: 0 | Step: 106060 | Dataset: 0-225984 | Loss: 1.049 | 913 ms/step , 6890.57 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 00:18:02 | Epoch: 0 | Step: 106070 | Dataset: 0-226304 | Loss: 0.970 | 913 ms/step , 6889.97 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 00:18:11 | Epoch: 0 | Step: 106080 | Dataset: 0-226624 | Loss: 1.153 | 912 ms/step , 6893.01 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 00:18:20 | Epoch: 0 | Step: 106090 | Dataset: 0-226944 | Loss: 0.973 | 913 ms/step , 6891.06 GFLOP/s , 17947.2 
tokens/s INFO:__main__:2024-11-05 00:18:29 | Epoch: 0 | Step: 106100 | Dataset: 0-227264 | Loss: 0.933 | 913 ms/step , 6886.29 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 00:18:31 | Validation | Step: 106100 | Val_loss: 0.905 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:18:40 | Epoch: 0 | Step: 106110 | Dataset: 0-227584 | Loss: 1.046 | 913 ms/step , 6885.22 GFLOP/s , 15282.1 tokens/s INFO:__main__:2024-11-05 00:18:49 | Epoch: 0 | Step: 106120 | Dataset: 0-227904 | Loss: 0.931 | 913 ms/step , 6891.82 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-05 00:18:58 | Epoch: 0 | Step: 106130 | Dataset: 0-228224 | Loss: 1.002 | 912 ms/step , 6894.43 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 00:19:07 | Epoch: 0 | Step: 106140 | Dataset: 0-228544 | Loss: 0.969 | 912 ms/step , 6894.89 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 00:19:16 | Epoch: 0 | Step: 106150 | Dataset: 0-228864 | Loss: 0.890 | 912 ms/step , 6896.36 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 00:19:25 | Epoch: 0 | Step: 106160 | Dataset: 0-229184 | Loss: 1.013 | 913 ms/step , 6890.62 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 00:19:34 | Epoch: 0 | Step: 106170 | Dataset: 0-229504 | Loss: 1.005 | 912 ms/step , 6894.74 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 00:19:44 | Epoch: 0 | Step: 106180 | Dataset: 0-229824 | Loss: 0.984 | 913 ms/step , 6885.55 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 00:19:53 | Epoch: 0 | Step: 106190 | Dataset: 0-230144 | Loss: 1.008 | 912 ms/step , 6892.83 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-05 00:20:02 | Epoch: 0 | Step: 106200 | Dataset: 0-230464 | Loss: 0.968 | 912 ms/step , 6895.57 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 00:20:03 | Validation | Step: 106200 | Val_loss: 0.861 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:20:13 | Epoch: 0 | Step: 106210 | Dataset: 0-230784 | Loss: 1.053 | 912 ms/step , 6898.25 GFLOP/s , 15290.2 tokens/s INFO:__main__:2024-11-05 00:20:22 | Epoch: 
0 | Step: 106220 | Dataset: 0-231104 | Loss: 0.877 | 912 ms/step , 6895.21 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 00:20:31 | Epoch: 0 | Step: 106230 | Dataset: 0-231424 | Loss: 1.015 | 914 ms/step , 6884.22 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 00:20:40 | Epoch: 0 | Step: 106240 | Dataset: 0-231744 | Loss: 1.071 | 913 ms/step , 6888.85 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 00:20:49 | Epoch: 0 | Step: 106250 | Dataset: 0-232064 | Loss: 0.878 | 912 ms/step , 6893.56 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 00:20:58 | Epoch: 0 | Step: 106260 | Dataset: 0-232384 | Loss: 1.045 | 913 ms/step , 6886.43 GFLOP/s , 17950.6 tokens/s INFO:__main__:2024-11-05 00:21:07 | Epoch: 0 | Step: 106270 | Dataset: 0-232704 | Loss: 1.024 | 912 ms/step , 6897.53 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 00:21:16 | Epoch: 0 | Step: 106280 | Dataset: 0-233024 | Loss: 0.959 | 913 ms/step , 6891.90 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 00:21:26 | Epoch: 0 | Step: 106290 | Dataset: 0-233344 | Loss: 0.987 | 912 ms/step , 6894.15 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 00:21:35 | Epoch: 0 | Step: 106300 | Dataset: 0-233664 | Loss: 1.000 | 913 ms/step , 6890.31 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 00:21:36 | Validation | Step: 106300 | Val_loss: 0.912 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:21:45 | Epoch: 0 | Step: 106310 | Dataset: 0-233984 | Loss: 0.983 | 911 ms/step , 6900.72 GFLOP/s , 15285.9 tokens/s INFO:__main__:2024-11-05 00:21:55 | Epoch: 0 | Step: 106320 | Dataset: 0-234304 | Loss: 1.038 | 914 ms/step , 6879.81 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 00:22:04 | Epoch: 0 | Step: 106330 | Dataset: 0-234624 | Loss: 1.031 | 913 ms/step , 6887.53 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 00:22:13 | Epoch: 0 | Step: 106340 | Dataset: 0-234944 | Loss: 0.978 | 911 ms/step , 6906.49 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 00:22:22 | Epoch: 0 | Step: 
106350 | Dataset: 0-235264 | Loss: 1.064 | 913 ms/step , 6885.65 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 00:22:31 | Epoch: 0 | Step: 106360 | Dataset: 0-235584 | Loss: 1.012 | 912 ms/step , 6893.75 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 00:22:40 | Epoch: 0 | Step: 106370 | Dataset: 0-235904 | Loss: 0.986 | 913 ms/step , 6890.73 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 00:22:49 | Epoch: 0 | Step: 106380 | Dataset: 0-236224 | Loss: 0.835 | 912 ms/step , 6896.01 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 00:22:59 | Epoch: 0 | Step: 106390 | Dataset: 0-236544 | Loss: 0.920 | 911 ms/step , 6901.98 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 00:23:08 | Epoch: 0 | Step: 106400 | Dataset: 0-236864 | Loss: 0.835 | 913 ms/step , 6892.44 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 00:23:09 | Validation | Step: 106400 | Val_loss: 0.908 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:23:18 | Epoch: 0 | Step: 106410 | Dataset: 0-237184 | Loss: 0.955 | 913 ms/step , 6886.26 GFLOP/s , 15290.5 tokens/s INFO:__main__:2024-11-05 00:23:27 | Epoch: 0 | Step: 106420 | Dataset: 0-237504 | Loss: 0.890 | 913 ms/step , 6885.91 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 00:23:37 | Epoch: 0 | Step: 106430 | Dataset: 0-237824 | Loss: 0.922 | 913 ms/step , 6886.10 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 00:23:46 | Epoch: 0 | Step: 106440 | Dataset: 0-238144 | Loss: 0.925 | 912 ms/step , 6894.33 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 00:23:55 | Epoch: 0 | Step: 106450 | Dataset: 0-238464 | Loss: 0.968 | 913 ms/step , 6887.79 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 00:24:04 | Epoch: 0 | Step: 106460 | Dataset: 0-238784 | Loss: 1.000 | 914 ms/step , 6881.36 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 00:24:13 | Epoch: 0 | Step: 106470 | Dataset: 0-239104 | Loss: 0.860 | 913 ms/step , 6890.12 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 00:24:22 | Epoch: 0 | Step: 106480 | 
Dataset: 0-239424 | Loss: 0.964 | 913 ms/step , 6892.23 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 00:24:31 | Epoch: 0 | Step: 106490 | Dataset: 0-239744 | Loss: 0.913 | 913 ms/step , 6889.80 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 00:24:41 | Epoch: 0 | Step: 106500 | Dataset: 0-240064 | Loss: 0.990 | 911 ms/step , 6900.80 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 00:24:42 | Validation | Step: 106500 | Val_loss: 0.899 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:24:51 | Epoch: 0 | Step: 106510 | Dataset: 0-240384 | Loss: 0.952 | 913 ms/step , 6892.47 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 00:25:00 | Epoch: 0 | Step: 106520 | Dataset: 0-240704 | Loss: 0.901 | 913 ms/step , 6887.95 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 00:25:10 | Epoch: 0 | Step: 106530 | Dataset: 0-241024 | Loss: 0.928 | 912 ms/step , 6892.81 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 00:25:19 | Epoch: 0 | Step: 106540 | Dataset: 0-241344 | Loss: 0.933 | 912 ms/step , 6894.32 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 00:25:28 | Epoch: 0 | Step: 106550 | Dataset: 0-241664 | Loss: 0.948 | 914 ms/step , 6883.54 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 00:25:37 | Epoch: 0 | Step: 106560 | Dataset: 0-241984 | Loss: 1.086 | 913 ms/step , 6890.07 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 00:25:46 | Epoch: 0 | Step: 106570 | Dataset: 0-242304 | Loss: 0.799 | 913 ms/step , 6891.36 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 00:25:55 | Epoch: 0 | Step: 106580 | Dataset: 0-242624 | Loss: 0.885 | 912 ms/step , 6893.48 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 00:26:04 | Epoch: 0 | Step: 106590 | Dataset: 0-242944 | Loss: 0.972 | 912 ms/step , 6894.97 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 00:26:13 | Epoch: 0 | Step: 106600 | Dataset: 0-243264 | Loss: 0.989 | 914 ms/step , 6878.15 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 00:26:15 | Validation | Step: 106600 | 
Val_loss: 0.909 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:26:24 | Epoch: 0 | Step: 106610 | Dataset: 0-243584 | Loss: 0.902 | 913 ms/step , 6885.16 GFLOP/s , 15277.9 tokens/s INFO:__main__:2024-11-05 00:26:33 | Epoch: 0 | Step: 106620 | Dataset: 0-243904 | Loss: 0.932 | 913 ms/step , 6889.85 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 00:26:42 | Epoch: 0 | Step: 106630 | Dataset: 0-244224 | Loss: 0.433 | 912 ms/step , 6896.32 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 00:26:52 | Epoch: 0 | Step: 106640 | Dataset: 0-244544 | Loss: 0.394 | 911 ms/step , 6903.58 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-05 00:27:01 | Epoch: 0 | Step: 106650 | Dataset: 0-244864 | Loss: 0.303 | 912 ms/step , 6895.47 GFLOP/s , 17956.4 tokens/s INFO:__main__:2024-11-05 00:27:10 | Epoch: 0 | Step: 106660 | Dataset: 0-245184 | Loss: 0.359 | 911 ms/step , 6901.41 GFLOP/s , 17956.7 tokens/s INFO:__main__:2024-11-05 00:27:19 | Epoch: 0 | Step: 106670 | Dataset: 0-245504 | Loss: 0.312 | 912 ms/step , 6893.19 GFLOP/s , 17956.9 tokens/s INFO:__main__:2024-11-05 00:27:28 | Epoch: 0 | Step: 106680 | Dataset: 0-245824 | Loss: 0.370 | 912 ms/step , 6894.79 GFLOP/s , 17955.3 tokens/s INFO:__main__:2024-11-05 00:27:37 | Epoch: 0 | Step: 106690 | Dataset: 0-246144 | Loss: 0.336 | 912 ms/step , 6894.02 GFLOP/s , 17950.7 tokens/s INFO:__main__:2024-11-05 00:27:46 | Epoch: 0 | Step: 106700 | Dataset: 0-246464 | Loss: 0.279 | 912 ms/step , 6895.61 GFLOP/s , 17951.2 tokens/s INFO:__main__:2024-11-05 00:27:48 | Validation | Step: 106700 | Val_loss: 0.907 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:27:57 | Epoch: 0 | Step: 106710 | Dataset: 0-246784 | Loss: 0.357 | 912 ms/step , 6892.77 GFLOP/s , 15285.9 tokens/s INFO:__main__:2024-11-05 00:28:06 | Epoch: 0 | Step: 106720 | Dataset: 0-247104 | Loss: 0.354 | 912 ms/step , 6897.02 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-05 00:28:15 | Epoch: 0 | Step: 106730 | Dataset: 0-247424 | Loss: 0.391 | 912 ms/step , 
6899.28 GFLOP/s , 17956.8 tokens/s INFO:__main__:2024-11-05 00:28:24 | Epoch: 0 | Step: 106740 | Dataset: 0-247744 | Loss: 0.375 | 913 ms/step , 6891.04 GFLOP/s , 17955.8 tokens/s INFO:__main__:2024-11-05 00:28:34 | Epoch: 0 | Step: 106750 | Dataset: 0-248064 | Loss: 0.310 | 913 ms/step , 6886.45 GFLOP/s , 17949.7 tokens/s INFO:__main__:2024-11-05 00:28:43 | Epoch: 0 | Step: 106760 | Dataset: 0-248384 | Loss: 0.396 | 913 ms/step , 6891.63 GFLOP/s , 17957.0 tokens/s INFO:__main__:2024-11-05 00:28:52 | Epoch: 0 | Step: 106770 | Dataset: 0-248704 | Loss: 0.308 | 913 ms/step , 6891.32 GFLOP/s , 17953.6 tokens/s INFO:__main__:2024-11-05 00:29:01 | Epoch: 0 | Step: 106780 | Dataset: 0-249024 | Loss: 0.326 | 911 ms/step , 6902.17 GFLOP/s , 17951.7 tokens/s INFO:__main__:2024-11-05 00:29:10 | Epoch: 0 | Step: 106790 | Dataset: 0-249344 | Loss: 0.305 | 912 ms/step , 6893.82 GFLOP/s , 17952.3 tokens/s INFO:__main__:2024-11-05 00:29:19 | Epoch: 0 | Step: 106800 | Dataset: 0-249664 | Loss: 0.295 | 911 ms/step , 6901.79 GFLOP/s , 17956.3 tokens/s INFO:__main__:2024-11-05 00:29:21 | Validation | Step: 106800 | Val_loss: 0.931 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:29:30 | Epoch: 0 | Step: 106810 | Dataset: 0-249984 | Loss: 0.341 | 912 ms/step , 6897.45 GFLOP/s , 15281.8 tokens/s INFO:__main__:2024-11-05 00:29:39 | Epoch: 0 | Step: 106820 | Dataset: 0-250304 | Loss: 0.316 | 913 ms/step , 6889.84 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-05 00:29:48 | Epoch: 0 | Step: 106830 | Dataset: 0-250624 | Loss: 0.350 | 911 ms/step , 6900.88 GFLOP/s , 17958.3 tokens/s INFO:__main__:2024-11-05 00:29:57 | Epoch: 0 | Step: 106840 | Dataset: 0-250944 | Loss: 0.304 | 912 ms/step , 6899.15 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-05 00:30:06 | Epoch: 0 | Step: 106850 | Dataset: 0-251264 | Loss: 0.290 | 912 ms/step , 6897.37 GFLOP/s , 17957.2 tokens/s INFO:__main__:2024-11-05 00:30:16 | Epoch: 0 | Step: 106860 | Dataset: 0-251584 | Loss: 0.291 | 912 ms/step , 6897.46 
GFLOP/s , 17953.7 tokens/s INFO:__main__:2024-11-05 00:30:25 | Epoch: 0 | Step: 106870 | Dataset: 0-251904 | Loss: 0.316 | 913 ms/step , 6890.78 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 00:30:34 | Epoch: 0 | Step: 106880 | Dataset: 0-252224 | Loss: 0.350 | 912 ms/step , 6896.36 GFLOP/s , 17954.4 tokens/s INFO:__main__:2024-11-05 00:30:43 | Epoch: 0 | Step: 106890 | Dataset: 0-252544 | Loss: 0.292 | 912 ms/step , 6895.86 GFLOP/s , 17957.3 tokens/s INFO:__main__:2024-11-05 00:30:52 | Epoch: 0 | Step: 106900 | Dataset: 0-252864 | Loss: 0.291 | 912 ms/step , 6895.54 GFLOP/s , 17954.2 tokens/s INFO:__main__:2024-11-05 00:30:54 | Validation | Step: 106900 | Val_loss: 0.878 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:31:03 | Epoch: 0 | Step: 106910 | Dataset: 0-253184 | Loss: 0.324 | 912 ms/step , 6896.35 GFLOP/s , 15292.6 tokens/s INFO:__main__:2024-11-05 00:31:12 | Epoch: 0 | Step: 106920 | Dataset: 0-253504 | Loss: 0.826 | 911 ms/step , 6900.16 GFLOP/s , 17952.2 tokens/s INFO:__main__:2024-11-05 00:31:21 | Epoch: 0 | Step: 106930 | Dataset: 0-253824 | Loss: 0.710 | 911 ms/step , 6900.74 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-05 00:31:30 | Epoch: 0 | Step: 106940 | Dataset: 0-254144 | Loss: 0.779 | 913 ms/step , 6891.55 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 00:31:39 | Epoch: 0 | Step: 106950 | Dataset: 0-254464 | Loss: 0.754 | 912 ms/step , 6897.62 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 00:31:48 | Epoch: 0 | Step: 106960 | Dataset: 0-254784 | Loss: 0.799 | 913 ms/step , 6886.41 GFLOP/s , 17950.9 tokens/s INFO:__main__:2024-11-05 00:31:58 | Epoch: 0 | Step: 106970 | Dataset: 0-255104 | Loss: 0.801 | 913 ms/step , 6886.89 GFLOP/s , 17949.1 tokens/s INFO:__main__:2024-11-05 00:32:07 | Epoch: 0 | Step: 106980 | Dataset: 0-255424 | Loss: 0.782 | 913 ms/step , 6890.70 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 00:32:16 | Epoch: 0 | Step: 106990 | Dataset: 0-255744 | Loss: 0.752 | 913 ms/step , 6888.65 GFLOP/s , 
17951.9 tokens/s INFO:__main__:2024-11-05 00:32:25 | Epoch: 0 | Step: 107000 | Dataset: 0-256064 | Loss: 0.728 | 911 ms/step , 6904.05 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-05 00:32:26 | Validation | Step: 107000 | Val_loss: 0.908 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:32:26 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_003226_step_107000.pt` INFO:__main__:2024-11-05 00:32:37 | Epoch: 0 | Step: 107010 | Dataset: 0-256384 | Loss: 0.725 | 913 ms/step , 6888.37 GFLOP/s , 13844.7 tokens/s INFO:__main__:2024-11-05 00:32:46 | Epoch: 0 | Step: 107020 | Dataset: 0-256704 | Loss: 0.677 | 912 ms/step , 6896.57 GFLOP/s , 17957.8 tokens/s INFO:__main__:2024-11-05 00:32:55 | Epoch: 0 | Step: 107030 | Dataset: 0-257024 | Loss: 0.723 | 912 ms/step , 6899.58 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 00:33:04 | Epoch: 0 | Step: 107040 | Dataset: 0-257344 | Loss: 0.743 | 914 ms/step , 6883.41 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 00:33:13 | Epoch: 0 | Step: 107050 | Dataset: 0-257664 | Loss: 0.674 | 914 ms/step , 6878.45 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 00:33:22 | Epoch: 0 | Step: 107060 | Dataset: 0-257984 | Loss: 0.696 | 913 ms/step , 6892.45 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 00:33:32 | Epoch: 0 | Step: 107070 | Dataset: 0-258304 | Loss: 0.707 | 913 ms/step , 6890.89 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 00:33:41 | Epoch: 0 | Step: 107080 | Dataset: 0-258624 | Loss: 0.716 | 913 ms/step , 6889.75 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 00:33:50 | Epoch: 0 | Step: 107090 | Dataset: 0-258944 | Loss: 0.730 | 913 ms/step , 6892.16 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 00:33:59 | Epoch: 0 | Step: 107100 | Dataset: 0-259264 | Loss: 0.668 | 913 ms/step , 6891.88 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 00:34:01 | Validation | Step: 107100 | Val_loss: 0.888 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:34:10 | 
Epoch: 0 | Step: 107110 | Dataset: 0-259584 | Loss: 0.784 | 913 ms/step , 6890.25 GFLOP/s , 15288.5 tokens/s INFO:__main__:2024-11-05 00:34:19 | Epoch: 0 | Step: 107120 | Dataset: 0-259904 | Loss: 0.738 | 912 ms/step , 6897.08 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 00:34:28 | Epoch: 0 | Step: 107130 | Dataset: 0-260224 | Loss: 0.686 | 912 ms/step , 6896.61 GFLOP/s , 17952.9 tokens/s INFO:__main__:2024-11-05 00:34:37 | Epoch: 0 | Step: 107140 | Dataset: 0-260544 | Loss: 0.643 | 912 ms/step , 6895.38 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-05 00:34:46 | Epoch: 0 | Step: 107150 | Dataset: 0-260864 | Loss: 0.816 | 912 ms/step , 6896.78 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 00:34:55 | Epoch: 0 | Step: 107160 | Dataset: 0-261184 | Loss: 0.652 | 913 ms/step , 6889.30 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 00:35:04 | Epoch: 0 | Step: 107170 | Dataset: 0-261504 | Loss: 0.823 | 913 ms/step , 6891.39 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-05 00:35:14 | Epoch: 0 | Step: 107180 | Dataset: 0-261824 | Loss: 0.795 | 912 ms/step , 6895.95 GFLOP/s , 17951.0 tokens/s INFO:__main__:2024-11-05 00:35:23 | Epoch: 0 | Step: 107190 | Dataset: 0-262144 | Loss: 0.702 | 913 ms/step , 6887.08 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-05 00:35:32 | Epoch: 0 | Step: 107200 | Dataset: 0-262464 | Loss: 0.670 | 913 ms/step , 6887.70 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 00:35:33 | Validation | Step: 107200 | Val_loss: 0.842 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:35:43 | Epoch: 0 | Step: 107210 | Dataset: 0-262784 | Loss: 0.754 | 912 ms/step , 6895.16 GFLOP/s , 15277.7 tokens/s INFO:__main__:2024-11-05 00:35:52 | Epoch: 0 | Step: 107220 | Dataset: 0-263104 | Loss: 0.695 | 913 ms/step , 6891.67 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 00:36:01 | Epoch: 0 | Step: 107230 | Dataset: 0-263424 | Loss: 0.721 | 913 ms/step , 6889.55 GFLOP/s , 17954.8 tokens/s INFO:__main__:2024-11-05 00:36:10 | Epoch: 0 | 
Step: 107240 | Dataset: 0-263744 | Loss: 0.723 | 912 ms/step , 6893.71 GFLOP/s , 17953.2 tokens/s INFO:__main__:2024-11-05 00:36:19 | Epoch: 0 | Step: 107250 | Dataset: 0-264064 | Loss: 0.723 | 913 ms/step , 6889.50 GFLOP/s , 17956.9 tokens/s INFO:__main__:2024-11-05 00:36:28 | Epoch: 0 | Step: 107260 | Dataset: 0-264384 | Loss: 0.675 | 912 ms/step , 6900.05 GFLOP/s , 17951.0 tokens/s INFO:__main__:2024-11-05 00:36:37 | Epoch: 0 | Step: 107270 | Dataset: 0-264704 | Loss: 0.767 | 912 ms/step , 6892.65 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-05 00:36:46 | Epoch: 0 | Step: 107280 | Dataset: 0-265024 | Loss: 0.719 | 913 ms/step , 6891.36 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 00:36:56 | Epoch: 0 | Step: 107290 | Dataset: 0-265344 | Loss: 0.786 | 913 ms/step , 6888.51 GFLOP/s , 17952.4 tokens/s INFO:__main__:2024-11-05 00:37:05 | Epoch: 0 | Step: 107300 | Dataset: 0-265664 | Loss: 0.775 | 913 ms/step , 6889.50 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 00:37:06 | Validation | Step: 107300 | Val_loss: 0.832 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:37:15 | Epoch: 0 | Step: 107310 | Dataset: 0-265984 | Loss: 0.697 | 912 ms/step , 6897.74 GFLOP/s , 15289.6 tokens/s INFO:__main__:2024-11-05 00:37:24 | Epoch: 0 | Step: 107320 | Dataset: 0-266304 | Loss: 0.721 | 912 ms/step , 6896.93 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 00:37:34 | Epoch: 0 | Step: 107330 | Dataset: 0-266624 | Loss: 0.754 | 912 ms/step , 6892.81 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 00:37:43 | Epoch: 0 | Step: 107340 | Dataset: 0-266944 | Loss: 0.657 | 913 ms/step , 6892.35 GFLOP/s , 17950.0 tokens/s INFO:__main__:2024-11-05 00:37:52 | Epoch: 0 | Step: 107350 | Dataset: 0-267264 | Loss: 0.675 | 911 ms/step , 6901.32 GFLOP/s , 17957.0 tokens/s INFO:__main__:2024-11-05 00:38:01 | Epoch: 0 | Step: 107360 | Dataset: 0-267584 | Loss: 0.595 | 912 ms/step , 6896.32 GFLOP/s , 17948.7 tokens/s INFO:__main__:2024-11-05 00:38:10 | Epoch: 0 | Step: 
107370 | Dataset: 0-267904 | Loss: 0.619 | 913 ms/step , 6890.84 GFLOP/s , 17951.0 tokens/s INFO:__main__:2024-11-05 00:38:19 | Epoch: 0 | Step: 107380 | Dataset: 0-268224 | Loss: 0.669 | 913 ms/step , 6891.50 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 00:38:28 | Epoch: 0 | Step: 107390 | Dataset: 0-268544 | Loss: 0.740 | 913 ms/step , 6892.19 GFLOP/s , 17947.5 tokens/s INFO:__main__:2024-11-05 00:38:38 | Epoch: 0 | Step: 107400 | Dataset: 0-268864 | Loss: 0.645 | 912 ms/step , 6898.25 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 00:38:39 | Validation | Step: 107400 | Val_loss: 0.906 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:38:48 | Epoch: 0 | Step: 107410 | Dataset: 0-269184 | Loss: 0.726 | 912 ms/step , 6898.68 GFLOP/s , 15287.6 tokens/s INFO:__main__:2024-11-05 00:38:57 | Epoch: 0 | Step: 107420 | Dataset: 0-269504 | Loss: 0.701 | 913 ms/step , 6891.63 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 00:39:07 | Epoch: 0 | Step: 107430 | Dataset: 0-269824 | Loss: 0.660 | 913 ms/step , 6892.39 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 00:39:16 | Epoch: 0 | Step: 107440 | Dataset: 0-270144 | Loss: 0.749 | 913 ms/step , 6886.65 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 00:39:25 | Epoch: 0 | Step: 107450 | Dataset: 0-270464 | Loss: 0.677 | 913 ms/step , 6889.32 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 00:39:34 | Epoch: 0 | Step: 107460 | Dataset: 0-270784 | Loss: 0.708 | 913 ms/step , 6889.93 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 00:39:43 | Epoch: 0 | Step: 107470 | Dataset: 0-271104 | Loss: 0.663 | 913 ms/step , 6891.55 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 00:39:52 | Epoch: 0 | Step: 107480 | Dataset: 0-271424 | Loss: 0.729 | 913 ms/step , 6889.78 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 00:40:01 | Epoch: 0 | Step: 107490 | Dataset: 0-271744 | Loss: 0.707 | 912 ms/step , 6894.86 GFLOP/s , 17949.2 tokens/s INFO:__main__:2024-11-05 00:40:10 | Epoch: 0 | Step: 107500 | 
Dataset: 0-272064 | Loss: 0.617 | 912 ms/step , 6897.87 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 00:40:12 | Validation | Step: 107500 | Val_loss: 0.900 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:40:21 | Epoch: 0 | Step: 107510 | Dataset: 0-272384 | Loss: 0.585 | 913 ms/step , 6888.26 GFLOP/s , 15276.5 tokens/s INFO:__main__:2024-11-05 00:40:30 | Epoch: 0 | Step: 107520 | Dataset: 0-272704 | Loss: 0.681 | 912 ms/step , 6896.39 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 00:40:39 | Epoch: 0 | Step: 107530 | Dataset: 0-273024 | Loss: 0.754 | 913 ms/step , 6891.51 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 00:40:49 | Epoch: 0 | Step: 107540 | Dataset: 0-273344 | Loss: 0.722 | 913 ms/step , 6887.31 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 00:40:58 | Epoch: 0 | Step: 107550 | Dataset: 0-273664 | Loss: 0.757 | 912 ms/step , 6898.71 GFLOP/s , 17952.4 tokens/s INFO:__main__:2024-11-05 00:41:07 | Epoch: 0 | Step: 107560 | Dataset: 0-273984 | Loss: 0.727 | 912 ms/step , 6896.62 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 00:41:16 | Epoch: 0 | Step: 107570 | Dataset: 0-274304 | Loss: 0.641 | 913 ms/step , 6885.92 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 00:41:25 | Epoch: 0 | Step: 107580 | Dataset: 0-274624 | Loss: 0.654 | 913 ms/step , 6888.62 GFLOP/s , 17952.4 tokens/s INFO:__main__:2024-11-05 00:41:34 | Epoch: 0 | Step: 107590 | Dataset: 0-274944 | Loss: 0.717 | 912 ms/step , 6894.59 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-05 00:41:43 | Epoch: 0 | Step: 107600 | Dataset: 0-275264 | Loss: 0.677 | 912 ms/step , 6894.30 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 00:41:45 | Validation | Step: 107600 | Val_loss: 0.884 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:41:54 | Epoch: 0 | Step: 107610 | Dataset: 0-275584 | Loss: 0.719 | 913 ms/step , 6889.69 GFLOP/s , 15292.4 tokens/s INFO:__main__:2024-11-05 00:42:03 | Epoch: 0 | Step: 107620 | Dataset: 0-275904 | Loss: 0.690 | 913 ms/step , 
6885.83 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 00:42:12 | Epoch: 0 | Step: 107630 | Dataset: 0-276224 | Loss: 0.695 | 913 ms/step , 6892.31 GFLOP/s , 17952.5 tokens/s INFO:__main__:2024-11-05 00:42:21 | Epoch: 0 | Step: 107640 | Dataset: 0-276544 | Loss: 0.643 | 911 ms/step , 6903.73 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-05 00:42:31 | Epoch: 0 | Step: 107650 | Dataset: 0-276864 | Loss: 0.680 | 912 ms/step , 6893.74 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 00:42:40 | Epoch: 0 | Step: 107660 | Dataset: 0-277184 | Loss: 0.649 | 914 ms/step , 6880.66 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 00:42:49 | Epoch: 0 | Step: 107670 | Dataset: 0-277504 | Loss: 0.651 | 913 ms/step , 6889.22 GFLOP/s , 17950.4 tokens/s INFO:__main__:2024-11-05 00:42:58 | Epoch: 0 | Step: 107680 | Dataset: 0-277824 | Loss: 0.680 | 912 ms/step , 6896.67 GFLOP/s , 17951.6 tokens/s INFO:__main__:2024-11-05 00:43:07 | Epoch: 0 | Step: 107690 | Dataset: 0-278144 | Loss: 0.637 | 912 ms/step , 6893.26 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 00:43:16 | Epoch: 0 | Step: 107700 | Dataset: 0-278464 | Loss: 0.759 | 913 ms/step , 6890.88 GFLOP/s , 17952.9 tokens/s INFO:__main__:2024-11-05 00:43:18 | Validation | Step: 107700 | Val_loss: 0.947 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:43:27 | Epoch: 0 | Step: 107710 | Dataset: 0-278784 | Loss: 0.749 | 912 ms/step , 6896.66 GFLOP/s , 15286.6 tokens/s INFO:__main__:2024-11-05 00:43:36 | Epoch: 0 | Step: 107720 | Dataset: 0-279104 | Loss: 0.692 | 913 ms/step , 6890.96 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 00:43:45 | Epoch: 0 | Step: 107730 | Dataset: 0-279424 | Loss: 0.693 | 913 ms/step , 6886.37 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 00:43:54 | Epoch: 0 | Step: 107740 | Dataset: 0-279744 | Loss: 0.689 | 913 ms/step , 6886.35 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 00:44:03 | Epoch: 0 | Step: 107750 | Dataset: 0-280064 | Loss: 0.682 | 912 ms/step , 6893.36 
GFLOP/s , 17947.5 tokens/s INFO:__main__:2024-11-05 00:44:13 | Epoch: 0 | Step: 107760 | Dataset: 0-280384 | Loss: 0.684 | 913 ms/step , 6888.27 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-05 00:44:22 | Epoch: 0 | Step: 107770 | Dataset: 0-280704 | Loss: 0.744 | 914 ms/step , 6884.45 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 00:44:31 | Epoch: 0 | Step: 107780 | Dataset: 0-281024 | Loss: 0.674 | 912 ms/step , 6893.15 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 00:44:40 | Epoch: 0 | Step: 107790 | Dataset: 0-281344 | Loss: 0.578 | 912 ms/step , 6898.37 GFLOP/s , 17949.0 tokens/s INFO:__main__:2024-11-05 00:44:49 | Epoch: 0 | Step: 107800 | Dataset: 0-281664 | Loss: 0.692 | 913 ms/step , 6890.54 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 00:44:51 | Validation | Step: 107800 | Val_loss: 0.900 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:45:00 | Epoch: 0 | Step: 107810 | Dataset: 0-281984 | Loss: 0.711 | 913 ms/step , 6890.72 GFLOP/s , 15284.8 tokens/s INFO:__main__:2024-11-05 00:45:09 | Epoch: 0 | Step: 107820 | Dataset: 0-282304 | Loss: 0.752 | 914 ms/step , 6884.78 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 00:45:18 | Epoch: 0 | Step: 107830 | Dataset: 0-282624 | Loss: 0.768 | 911 ms/step , 6901.69 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 00:45:27 | Epoch: 0 | Step: 107840 | Dataset: 0-282944 | Loss: 0.676 | 912 ms/step , 6897.10 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 00:45:36 | Epoch: 0 | Step: 107850 | Dataset: 0-283264 | Loss: 0.666 | 913 ms/step , 6886.68 GFLOP/s , 17949.1 tokens/s INFO:__main__:2024-11-05 00:45:45 | Epoch: 0 | Step: 107860 | Dataset: 0-283584 | Loss: 0.711 | 912 ms/step , 6898.85 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 00:45:55 | Epoch: 0 | Step: 107870 | Dataset: 0-283904 | Loss: 0.659 | 911 ms/step , 6900.21 GFLOP/s , 17951.1 tokens/s INFO:__main__:2024-11-05 00:46:04 | Epoch: 0 | Step: 107880 | Dataset: 0-284224 | Loss: 0.670 | 912 ms/step , 6897.36 GFLOP/s , 
17948.9 tokens/s INFO:__main__:2024-11-05 00:46:13 | Epoch: 0 | Step: 107890 | Dataset: 0-284544 | Loss: 0.722 | 912 ms/step , 6898.63 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 00:46:22 | Epoch: 0 | Step: 107900 | Dataset: 0-284864 | Loss: 0.765 | 913 ms/step , 6885.78 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 00:46:24 | Validation | Step: 107900 | Val_loss: 0.878 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:46:33 | Epoch: 0 | Step: 107910 | Dataset: 0-285184 | Loss: 0.658 | 911 ms/step , 6900.47 GFLOP/s , 15284.6 tokens/s INFO:__main__:2024-11-05 00:46:42 | Epoch: 0 | Step: 107920 | Dataset: 0-285504 | Loss: 0.643 | 913 ms/step , 6890.95 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 00:46:51 | Epoch: 0 | Step: 107930 | Dataset: 0-285824 | Loss: 0.673 | 912 ms/step , 6897.26 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-05 00:47:00 | Epoch: 0 | Step: 107940 | Dataset: 0-286144 | Loss: 0.701 | 912 ms/step , 6896.04 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-05 00:47:09 | Epoch: 0 | Step: 107950 | Dataset: 0-286464 | Loss: 0.604 | 912 ms/step , 6895.47 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 00:47:18 | Epoch: 0 | Step: 107960 | Dataset: 0-286784 | Loss: 0.681 | 911 ms/step , 6901.38 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 00:47:27 | Epoch: 0 | Step: 107970 | Dataset: 0-287104 | Loss: 0.754 | 913 ms/step , 6888.73 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 00:47:37 | Epoch: 0 | Step: 107980 | Dataset: 0-287424 | Loss: 0.680 | 913 ms/step , 6892.54 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-05 00:47:46 | Epoch: 0 | Step: 107990 | Dataset: 0-287744 | Loss: 0.789 | 913 ms/step , 6889.77 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-05 00:47:55 | Epoch: 0 | Step: 108000 | Dataset: 0-288064 | Loss: 0.650 | 912 ms/step , 6892.98 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 00:47:56 | Validation | Step: 108000 | Val_loss: 0.880 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:47:56 
| Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_004756_step_108000.pt` INFO:__main__:2024-11-05 00:48:07 | Epoch: 0 | Step: 108010 | Dataset: 0-288384 | Loss: 0.748 | 913 ms/step , 6888.14 GFLOP/s , 13786.6 tokens/s INFO:__main__:2024-11-05 00:48:16 | Epoch: 0 | Step: 108020 | Dataset: 0-288704 | Loss: 0.648 | 912 ms/step , 6893.87 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 00:48:25 | Epoch: 0 | Step: 108030 | Dataset: 0-289024 | Loss: 0.745 | 913 ms/step , 6890.17 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 00:48:34 | Epoch: 0 | Step: 108040 | Dataset: 0-289344 | Loss: 0.721 | 914 ms/step , 6883.02 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 00:48:43 | Epoch: 0 | Step: 108050 | Dataset: 0-289664 | Loss: 0.634 | 914 ms/step , 6884.68 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 00:48:52 | Epoch: 0 | Step: 108060 | Dataset: 0-289984 | Loss: 0.662 | 912 ms/step , 6896.31 GFLOP/s , 17948.2 tokens/s INFO:__main__:2024-11-05 00:49:01 | Epoch: 0 | Step: 108070 | Dataset: 0-290304 | Loss: 0.595 | 912 ms/step , 6893.41 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 00:49:11 | Epoch: 0 | Step: 108080 | Dataset: 0-290624 | Loss: 0.687 | 912 ms/step , 6892.88 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 00:49:20 | Epoch: 0 | Step: 108090 | Dataset: 0-290944 | Loss: 0.612 | 913 ms/step , 6890.21 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-05 00:49:29 | Epoch: 0 | Step: 108100 | Dataset: 0-291264 | Loss: 0.723 | 913 ms/step , 6889.29 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 00:49:30 | Validation | Step: 108100 | Val_loss: 0.923 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:49:40 | Epoch: 0 | Step: 108110 | Dataset: 0-291584 | Loss: 0.728 | 913 ms/step , 6890.48 GFLOP/s , 15282.2 tokens/s INFO:__main__:2024-11-05 00:49:49 | Epoch: 0 | Step: 108120 | Dataset: 0-291904 | Loss: 0.636 | 912 ms/step , 6896.55 GFLOP/s , 17951.6 tokens/s INFO:__main__:2024-11-05 00:49:58 | Epoch: 0 
| Step: 108130 | Dataset: 0-292224 | Loss: 0.723 | 912 ms/step , 6899.04 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 00:50:07 | Epoch: 0 | Step: 108140 | Dataset: 0-292544 | Loss: 0.687 | 912 ms/step , 6898.49 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 00:50:16 | Epoch: 0 | Step: 108150 | Dataset: 0-292864 | Loss: 0.624 | 911 ms/step , 6903.67 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-05 00:50:25 | Epoch: 0 | Step: 108160 | Dataset: 0-293184 | Loss: 0.698 | 912 ms/step , 6892.64 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-05 00:50:34 | Epoch: 0 | Step: 108170 | Dataset: 0-293504 | Loss: 0.730 | 913 ms/step , 6889.82 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 00:50:43 | Epoch: 0 | Step: 108180 | Dataset: 0-293824 | Loss: 0.629 | 912 ms/step , 6894.01 GFLOP/s , 17947.2 tokens/s INFO:__main__:2024-11-05 00:50:53 | Epoch: 0 | Step: 108190 | Dataset: 0-294144 | Loss: 0.644 | 912 ms/step , 6893.32 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 00:51:02 | Epoch: 0 | Step: 108200 | Dataset: 0-294464 | Loss: 0.585 | 911 ms/step , 6901.32 GFLOP/s , 17954.3 tokens/s INFO:__main__:2024-11-05 00:51:03 | Validation | Step: 108200 | Val_loss: 0.901 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:51:12 | Epoch: 0 | Step: 108210 | Dataset: 0-294784 | Loss: 0.692 | 912 ms/step , 6894.07 GFLOP/s , 15292.9 tokens/s INFO:__main__:2024-11-05 00:51:22 | Epoch: 0 | Step: 108220 | Dataset: 0-295104 | Loss: 0.608 | 913 ms/step , 6891.23 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 00:51:31 | Epoch: 0 | Step: 108230 | Dataset: 0-295424 | Loss: 0.678 | 914 ms/step , 6884.03 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 00:51:40 | Epoch: 0 | Step: 108240 | Dataset: 0-295744 | Loss: 0.635 | 914 ms/step , 6880.13 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-05 00:51:49 | Epoch: 0 | Step: 108250 | Dataset: 0-296064 | Loss: 0.615 | 912 ms/step , 6892.99 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 00:51:58 | Epoch: 0 | Step: 
108260 | Dataset: 0-296384 | Loss: 0.653 | 912 ms/step , 6898.82 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 00:52:07 | Epoch: 0 | Step: 108270 | Dataset: 0-296704 | Loss: 0.665 | 913 ms/step , 6886.37 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 00:52:16 | Epoch: 0 | Step: 108280 | Dataset: 0-297024 | Loss: 0.706 | 912 ms/step , 6895.61 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 00:52:25 | Epoch: 0 | Step: 108290 | Dataset: 0-297344 | Loss: 0.715 | 913 ms/step , 6887.80 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 00:52:35 | Epoch: 0 | Step: 108300 | Dataset: 0-297664 | Loss: 0.651 | 912 ms/step , 6896.94 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 00:52:36 | Validation | Step: 108300 | Val_loss: 0.897 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:52:45 | Epoch: 0 | Step: 108310 | Dataset: 0-297984 | Loss: 0.732 | 912 ms/step , 6896.91 GFLOP/s , 15287.0 tokens/s INFO:__main__:2024-11-05 00:52:54 | Epoch: 0 | Step: 108320 | Dataset: 0-298304 | Loss: 0.703 | 914 ms/step , 6882.10 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 00:53:04 | Epoch: 0 | Step: 108330 | Dataset: 0-298624 | Loss: 0.685 | 912 ms/step , 6896.71 GFLOP/s , 17947.2 tokens/s INFO:__main__:2024-11-05 00:53:13 | Epoch: 0 | Step: 108340 | Dataset: 0-298944 | Loss: 0.743 | 913 ms/step , 6890.77 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 00:53:22 | Epoch: 0 | Step: 108350 | Dataset: 0-299264 | Loss: 0.638 | 912 ms/step , 6892.79 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 00:53:31 | Epoch: 0 | Step: 108360 | Dataset: 0-299584 | Loss: 0.664 | 911 ms/step , 6904.42 GFLOP/s , 17949.0 tokens/s INFO:__main__:2024-11-05 00:53:40 | Epoch: 0 | Step: 108370 | Dataset: 0-299904 | Loss: 0.630 | 913 ms/step , 6889.39 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 00:53:49 | Epoch: 0 | Step: 108380 | Dataset: 0-300224 | Loss: 0.625 | 913 ms/step , 6890.18 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 00:53:58 | Epoch: 0 | Step: 108390 | 
Dataset: 0-300544 | Loss: 0.619 | 914 ms/step , 6884.88 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 00:54:08 | Epoch: 0 | Step: 108400 | Dataset: 0-300864 | Loss: 0.591 | 912 ms/step , 6895.98 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 00:54:09 | Validation | Step: 108400 | Val_loss: 0.903 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:54:18 | Epoch: 0 | Step: 108410 | Dataset: 0-301184 | Loss: 0.658 | 912 ms/step , 6892.79 GFLOP/s , 15286.2 tokens/s INFO:__main__:2024-11-05 00:54:27 | Epoch: 0 | Step: 108420 | Dataset: 0-301504 | Loss: 0.657 | 913 ms/step , 6887.12 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 00:54:36 | Epoch: 0 | Step: 108430 | Dataset: 0-301824 | Loss: 0.650 | 913 ms/step , 6892.41 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 00:54:46 | Epoch: 0 | Step: 108440 | Dataset: 0-302144 | Loss: 0.676 | 912 ms/step , 6897.66 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 00:54:55 | Epoch: 0 | Step: 108450 | Dataset: 0-302464 | Loss: 0.669 | 913 ms/step , 6887.82 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-05 00:55:04 | Epoch: 0 | Step: 108460 | Dataset: 0-302784 | Loss: 0.592 | 913 ms/step , 6885.07 GFLOP/s , 17951.9 tokens/s INFO:__main__:2024-11-05 00:55:13 | Epoch: 0 | Step: 108470 | Dataset: 0-303104 | Loss: 0.670 | 913 ms/step , 6887.79 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 00:55:22 | Epoch: 0 | Step: 108480 | Dataset: 0-303424 | Loss: 0.720 | 912 ms/step , 6893.49 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 00:55:31 | Epoch: 0 | Step: 108490 | Dataset: 0-303744 | Loss: 0.659 | 913 ms/step , 6889.71 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 00:55:40 | Epoch: 0 | Step: 108500 | Dataset: 0-304064 | Loss: 0.691 | 913 ms/step , 6888.37 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 00:55:42 | Validation | Step: 108500 | Val_loss: 0.852 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:55:51 | Epoch: 0 | Step: 108510 | Dataset: 0-304384 | Loss: 0.719 | 913 ms/step , 
6891.77 GFLOP/s , 15286.4 tokens/s INFO:__main__:2024-11-05 00:56:00 | Epoch: 0 | Step: 108520 | Dataset: 0-304704 | Loss: 0.643 | 914 ms/step , 6883.60 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 00:56:09 | Epoch: 0 | Step: 108530 | Dataset: 0-305024 | Loss: 0.699 | 913 ms/step , 6889.26 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 00:56:19 | Epoch: 0 | Step: 108540 | Dataset: 0-305344 | Loss: 0.695 | 913 ms/step , 6890.96 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 00:56:28 | Epoch: 0 | Step: 108550 | Dataset: 0-305664 | Loss: 0.653 | 913 ms/step , 6885.58 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 00:56:37 | Epoch: 0 | Step: 108560 | Dataset: 0-305984 | Loss: 0.660 | 913 ms/step , 6889.61 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 00:56:46 | Epoch: 0 | Step: 108570 | Dataset: 0-306304 | Loss: 0.647 | 912 ms/step , 6896.52 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 00:56:55 | Epoch: 0 | Step: 108580 | Dataset: 0-306624 | Loss: 0.688 | 913 ms/step , 6890.63 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 00:57:04 | Epoch: 0 | Step: 108590 | Dataset: 0-306944 | Loss: 0.582 | 912 ms/step , 6897.37 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 00:57:13 | Epoch: 0 | Step: 108600 | Dataset: 0-307264 | Loss: 0.579 | 913 ms/step , 6892.33 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 00:57:15 | Validation | Step: 108600 | Val_loss: 0.909 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:57:24 | Epoch: 0 | Step: 108610 | Dataset: 0-307584 | Loss: 0.636 | 913 ms/step , 6891.58 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-05 00:57:33 | Epoch: 0 | Step: 108620 | Dataset: 0-307904 | Loss: 0.639 | 912 ms/step , 6895.78 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 00:57:42 | Epoch: 0 | Step: 108630 | Dataset: 0-308224 | Loss: 0.649 | 913 ms/step , 6888.39 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 00:57:51 | Epoch: 0 | Step: 108640 | Dataset: 0-308544 | Loss: 0.722 | 913 ms/step , 6892.41 
GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 00:58:01 | Epoch: 0 | Step: 108650 | Dataset: 0-308864 | Loss: 0.658 | 913 ms/step , 6887.65 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 00:58:10 | Epoch: 0 | Step: 108660 | Dataset: 0-309184 | Loss: 0.660 | 912 ms/step , 6894.83 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 00:58:19 | Epoch: 0 | Step: 108670 | Dataset: 0-309504 | Loss: 0.631 | 913 ms/step , 6886.52 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 00:58:28 | Epoch: 0 | Step: 108680 | Dataset: 0-309824 | Loss: 0.651 | 912 ms/step , 6895.14 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 00:58:37 | Epoch: 0 | Step: 108690 | Dataset: 0-310144 | Loss: 0.640 | 913 ms/step , 6892.04 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 00:58:46 | Epoch: 0 | Step: 108700 | Dataset: 0-310464 | Loss: 0.701 | 914 ms/step , 6884.98 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 00:58:48 | Validation | Step: 108700 | Val_loss: 0.845 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 00:58:57 | Epoch: 0 | Step: 108710 | Dataset: 0-310784 | Loss: 0.641 | 912 ms/step , 6895.02 GFLOP/s , 15284.6 tokens/s INFO:__main__:2024-11-05 00:59:06 | Epoch: 0 | Step: 108720 | Dataset: 0-311104 | Loss: 0.689 | 912 ms/step , 6897.85 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-05 00:59:15 | Epoch: 0 | Step: 108730 | Dataset: 0-311424 | Loss: 0.672 | 912 ms/step , 6896.72 GFLOP/s , 17951.4 tokens/s INFO:__main__:2024-11-05 00:59:24 | Epoch: 0 | Step: 108740 | Dataset: 0-311744 | Loss: 0.666 | 912 ms/step , 6899.92 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 00:59:33 | Epoch: 0 | Step: 108750 | Dataset: 0-312064 | Loss: 0.638 | 914 ms/step , 6881.13 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 00:59:43 | Epoch: 0 | Step: 108760 | Dataset: 0-312384 | Loss: 0.639 | 912 ms/step , 6894.77 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 00:59:52 | Epoch: 0 | Step: 108770 | Dataset: 0-312704 | Loss: 0.628 | 912 ms/step , 6899.36 GFLOP/s , 
17951.6 tokens/s INFO:__main__:2024-11-05 01:00:01 | Epoch: 0 | Step: 108780 | Dataset: 0-313024 | Loss: 0.738 | 914 ms/step , 6882.91 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 01:00:10 | Epoch: 0 | Step: 108790 | Dataset: 0-313344 | Loss: 0.740 | 914 ms/step , 6884.66 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-05 01:00:19 | Epoch: 0 | Step: 108800 | Dataset: 0-313664 | Loss: 0.582 | 912 ms/step , 6895.35 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 01:00:21 | Validation | Step: 108800 | Val_loss: 0.841 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:00:30 | Epoch: 0 | Step: 108810 | Dataset: 0-313984 | Loss: 0.566 | 912 ms/step , 6895.25 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-05 01:00:39 | Epoch: 0 | Step: 108820 | Dataset: 0-314304 | Loss: 0.641 | 913 ms/step , 6885.99 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 01:00:48 | Epoch: 0 | Step: 108830 | Dataset: 0-314624 | Loss: 0.645 | 913 ms/step , 6885.92 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 01:00:57 | Epoch: 0 | Step: 108840 | Dataset: 0-314944 | Loss: 0.685 | 914 ms/step , 6884.17 GFLOP/s , 17947.2 tokens/s INFO:__main__:2024-11-05 01:01:06 | Epoch: 0 | Step: 108850 | Dataset: 0-315264 | Loss: 0.609 | 913 ms/step , 6890.81 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 01:01:15 | Epoch: 0 | Step: 108860 | Dataset: 0-315584 | Loss: 0.582 | 912 ms/step , 6898.32 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 01:01:25 | Epoch: 0 | Step: 108870 | Dataset: 0-315904 | Loss: 0.717 | 912 ms/step , 6892.61 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 01:01:34 | Epoch: 0 | Step: 108880 | Dataset: 0-316224 | Loss: 0.727 | 913 ms/step , 6891.25 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-05 01:01:43 | Epoch: 0 | Step: 108890 | Dataset: 0-316544 | Loss: 0.671 | 912 ms/step , 6895.88 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 01:01:52 | Epoch: 0 | Step: 108900 | Dataset: 0-316864 | Loss: 0.661 | 913 ms/step , 6890.99 GFLOP/s , 17946.4 
tokens/s INFO:__main__:2024-11-05 01:01:54 | Validation | Step: 108900 | Val_loss: 0.885 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:02:03 | Epoch: 0 | Step: 108910 | Dataset: 0-317184 | Loss: 0.682 | 913 ms/step , 6892.32 GFLOP/s , 15282.2 tokens/s INFO:__main__:2024-11-05 01:02:12 | Epoch: 0 | Step: 108920 | Dataset: 0-317504 | Loss: 0.660 | 912 ms/step , 6895.34 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 01:02:21 | Epoch: 0 | Step: 108930 | Dataset: 0-317824 | Loss: 0.726 | 913 ms/step , 6890.77 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 01:02:30 | Epoch: 0 | Step: 108940 | Dataset: 0-318144 | Loss: 0.713 | 912 ms/step , 6899.54 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-05 01:02:39 | Epoch: 0 | Step: 108950 | Dataset: 0-318464 | Loss: 0.646 | 912 ms/step , 6896.18 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 01:02:48 | Epoch: 0 | Step: 108960 | Dataset: 0-318784 | Loss: 0.664 | 913 ms/step , 6885.84 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 01:02:57 | Epoch: 0 | Step: 108970 | Dataset: 0-319104 | Loss: 0.715 | 912 ms/step , 6893.97 GFLOP/s , 17949.2 tokens/s INFO:__main__:2024-11-05 01:03:07 | Epoch: 0 | Step: 108980 | Dataset: 0-319424 | Loss: 0.661 | 912 ms/step , 6893.31 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 01:03:16 | Epoch: 0 | Step: 108990 | Dataset: 0-319744 | Loss: 0.611 | 912 ms/step , 6895.86 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-05 01:03:25 | Epoch: 0 | Step: 109000 | Dataset: 0-320064 | Loss: 0.741 | 912 ms/step , 6894.62 GFLOP/s , 17953.3 tokens/s INFO:__main__:2024-11-05 01:03:26 | Validation | Step: 109000 | Val_loss: 0.800 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:03:26 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_010326_step_109000.pt` INFO:__main__:2024-11-05 01:03:37 | Epoch: 0 | Step: 109010 | Dataset: 0-320384 | Loss: 0.743 | 913 ms/step , 6891.25 GFLOP/s , 13805.9 tokens/s INFO:__main__:2024-11-05 01:03:46 | Epoch: 
0 | Step: 109020 | Dataset: 0-320704 | Loss: 0.667 | 914 ms/step , 6878.54 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 01:03:55 | Epoch: 0 | Step: 109030 | Dataset: 0-321024 | Loss: 0.676 | 913 ms/step , 6885.18 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 01:04:04 | Epoch: 0 | Step: 109040 | Dataset: 0-321344 | Loss: 0.567 | 913 ms/step , 6887.48 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 01:04:13 | Epoch: 0 | Step: 109050 | Dataset: 0-321664 | Loss: 0.687 | 912 ms/step , 6896.97 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 01:04:22 | Epoch: 0 | Step: 109060 | Dataset: 0-321984 | Loss: 0.724 | 912 ms/step , 6897.19 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 01:04:32 | Epoch: 0 | Step: 109070 | Dataset: 0-322304 | Loss: 0.585 | 911 ms/step , 6906.22 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 01:04:41 | Epoch: 0 | Step: 109080 | Dataset: 0-322624 | Loss: 0.604 | 913 ms/step , 6886.93 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 01:04:50 | Epoch: 0 | Step: 109090 | Dataset: 0-322944 | Loss: 0.631 | 913 ms/step , 6889.07 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 01:04:59 | Epoch: 0 | Step: 109100 | Dataset: 0-323264 | Loss: 0.635 | 913 ms/step , 6886.49 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 01:05:00 | Validation | Step: 109100 | Val_loss: 0.873 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:05:10 | Epoch: 0 | Step: 109110 | Dataset: 0-323584 | Loss: 0.637 | 912 ms/step , 6894.27 GFLOP/s , 15283.3 tokens/s INFO:__main__:2024-11-05 01:05:19 | Epoch: 0 | Step: 109120 | Dataset: 0-323904 | Loss: 0.700 | 912 ms/step , 6894.70 GFLOP/s , 17949.0 tokens/s INFO:__main__:2024-11-05 01:05:28 | Epoch: 0 | Step: 109130 | Dataset: 0-324224 | Loss: 0.667 | 912 ms/step , 6893.08 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-05 01:05:37 | Epoch: 0 | Step: 109140 | Dataset: 0-324544 | Loss: 0.565 | 912 ms/step , 6899.38 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 01:05:46 | Epoch: 0 | Step: 
109150 | Dataset: 0-324864 | Loss: 0.636 | 913 ms/step , 6890.89 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 01:05:55 | Epoch: 0 | Step: 109160 | Dataset: 0-325184 | Loss: 0.629 | 913 ms/step , 6886.47 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 01:06:04 | Epoch: 0 | Step: 109170 | Dataset: 0-325504 | Loss: 0.609 | 912 ms/step , 6896.07 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 01:06:14 | Epoch: 0 | Step: 109180 | Dataset: 0-325824 | Loss: 0.673 | 912 ms/step , 6895.75 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 01:06:23 | Epoch: 0 | Step: 109190 | Dataset: 0-326144 | Loss: 0.666 | 911 ms/step , 6904.44 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 01:06:32 | Epoch: 0 | Step: 109200 | Dataset: 0-326464 | Loss: 0.708 | 913 ms/step , 6889.08 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 01:06:33 | Validation | Step: 109200 | Val_loss: 0.876 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:06:43 | Epoch: 0 | Step: 109210 | Dataset: 0-326784 | Loss: 0.736 | 913 ms/step , 6886.66 GFLOP/s , 15284.3 tokens/s INFO:__main__:2024-11-05 01:06:52 | Epoch: 0 | Step: 109220 | Dataset: 0-327104 | Loss: 0.672 | 913 ms/step , 6889.81 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 01:07:01 | Epoch: 0 | Step: 109230 | Dataset: 0-327424 | Loss: 0.724 | 914 ms/step , 6883.30 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 01:07:10 | Epoch: 0 | Step: 109240 | Dataset: 0-327744 | Loss: 0.570 | 913 ms/step , 6886.32 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 01:07:19 | Epoch: 0 | Step: 109250 | Dataset: 0-328064 | Loss: 0.669 | 913 ms/step , 6890.30 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 01:07:28 | Epoch: 0 | Step: 109260 | Dataset: 0-328384 | Loss: 0.720 | 913 ms/step , 6890.57 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-05 01:07:37 | Epoch: 0 | Step: 109270 | Dataset: 0-328704 | Loss: 0.680 | 914 ms/step , 6880.84 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 01:07:46 | Epoch: 0 | Step: 109280 | 
Dataset: 0-329024 | Loss: 0.624 | 912 ms/step , 6894.42 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 01:07:56 | Epoch: 0 | Step: 109290 | Dataset: 0-329344 | Loss: 0.670 | 913 ms/step , 6885.26 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 01:08:05 | Epoch: 0 | Step: 109300 | Dataset: 0-329664 | Loss: 0.703 | 912 ms/step , 6892.69 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 01:08:06 | Validation | Step: 109300 | Val_loss: 0.873 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:08:15 | Epoch: 0 | Step: 109310 | Dataset: 0-329984 | Loss: 0.574 | 912 ms/step , 6898.28 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-05 01:08:25 | Epoch: 0 | Step: 109320 | Dataset: 0-330304 | Loss: 0.585 | 912 ms/step , 6896.35 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 01:08:34 | Epoch: 0 | Step: 109330 | Dataset: 0-330624 | Loss: 0.621 | 912 ms/step , 6898.25 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-05 01:08:43 | Epoch: 0 | Step: 109340 | Dataset: 0-330944 | Loss: 0.675 | 912 ms/step , 6899.78 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 01:08:52 | Epoch: 0 | Step: 109350 | Dataset: 0-331264 | Loss: 0.677 | 913 ms/step , 6890.54 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 01:09:01 | Epoch: 0 | Step: 109360 | Dataset: 0-331584 | Loss: 0.713 | 912 ms/step , 6898.35 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 01:09:10 | Epoch: 0 | Step: 109370 | Dataset: 0-331904 | Loss: 0.666 | 912 ms/step , 6897.60 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 01:09:19 | Epoch: 0 | Step: 109380 | Dataset: 0-332224 | Loss: 0.686 | 913 ms/step , 6888.44 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 01:09:28 | Epoch: 0 | Step: 109390 | Dataset: 0-332544 | Loss: 0.678 | 913 ms/step , 6890.56 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 01:09:38 | Epoch: 0 | Step: 109400 | Dataset: 0-332864 | Loss: 0.680 | 913 ms/step , 6892.55 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 01:09:39 | Validation | Step: 109400 | 
Val_loss: 0.809 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:09:48 | Epoch: 0 | Step: 109410 | Dataset: 0-333184 | Loss: 0.662 | 912 ms/step , 6895.24 GFLOP/s , 15282.5 tokens/s INFO:__main__:2024-11-05 01:09:57 | Epoch: 0 | Step: 109420 | Dataset: 0-333504 | Loss: 0.609 | 913 ms/step , 6887.84 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 01:10:07 | Epoch: 0 | Step: 109430 | Dataset: 0-333824 | Loss: 0.687 | 913 ms/step , 6891.06 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 01:10:16 | Epoch: 0 | Step: 109440 | Dataset: 0-334144 | Loss: 0.614 | 913 ms/step , 6885.10 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 01:10:25 | Epoch: 0 | Step: 109450 | Dataset: 0-334464 | Loss: 0.687 | 913 ms/step , 6889.18 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 01:10:34 | Epoch: 0 | Step: 109460 | Dataset: 0-334784 | Loss: 0.735 | 912 ms/step , 6899.40 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 01:10:43 | Epoch: 0 | Step: 109470 | Dataset: 0-335104 | Loss: 0.596 | 913 ms/step , 6892.46 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 01:10:52 | Epoch: 0 | Step: 109480 | Dataset: 0-335424 | Loss: 0.658 | 912 ms/step , 6893.12 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 01:11:01 | Epoch: 0 | Step: 109490 | Dataset: 0-335744 | Loss: 0.619 | 912 ms/step , 6893.08 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 01:11:10 | Epoch: 0 | Step: 109500 | Dataset: 0-336064 | Loss: 0.652 | 912 ms/step , 6894.09 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 01:11:12 | Validation | Step: 109500 | Val_loss: 0.882 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:11:21 | Epoch: 0 | Step: 109510 | Dataset: 0-336384 | Loss: 0.617 | 912 ms/step , 6895.75 GFLOP/s , 15287.5 tokens/s INFO:__main__:2024-11-05 01:11:30 | Epoch: 0 | Step: 109520 | Dataset: 0-336704 | Loss: 0.599 | 912 ms/step , 6896.79 GFLOP/s , 17950.0 tokens/s INFO:__main__:2024-11-05 01:11:39 | Epoch: 0 | Step: 109530 | Dataset: 0-337024 | Loss: 0.574 | 913 ms/step , 
6890.11 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-05 01:11:49 | Epoch: 0 | Step: 109540 | Dataset: 0-337344 | Loss: 0.672 | 912 ms/step , 6893.83 GFLOP/s , 17952.2 tokens/s INFO:__main__:2024-11-05 01:11:58 | Epoch: 0 | Step: 109550 | Dataset: 0-337664 | Loss: 0.722 | 911 ms/step , 6902.97 GFLOP/s , 17953.3 tokens/s INFO:__main__:2024-11-05 01:12:07 | Epoch: 0 | Step: 109560 | Dataset: 0-337984 | Loss: 0.621 | 913 ms/step , 6888.05 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-05 01:12:16 | Epoch: 0 | Step: 109570 | Dataset: 0-338304 | Loss: 0.620 | 912 ms/step , 6898.58 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-05 01:12:25 | Epoch: 0 | Step: 109580 | Dataset: 0-338624 | Loss: 0.620 | 912 ms/step , 6896.98 GFLOP/s , 17955.1 tokens/s INFO:__main__:2024-11-05 01:12:34 | Epoch: 0 | Step: 109590 | Dataset: 0-338944 | Loss: 0.650 | 913 ms/step , 6892.49 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-05 01:12:43 | Epoch: 0 | Step: 109600 | Dataset: 0-339264 | Loss: 0.642 | 913 ms/step , 6887.17 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 01:12:45 | Validation | Step: 109600 | Val_loss: 0.934 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:12:54 | Epoch: 0 | Step: 109610 | Dataset: 0-339584 | Loss: 0.605 | 912 ms/step , 6898.05 GFLOP/s , 15286.5 tokens/s INFO:__main__:2024-11-05 01:13:03 | Epoch: 0 | Step: 109620 | Dataset: 0-339904 | Loss: 0.700 | 912 ms/step , 6896.48 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 01:13:12 | Epoch: 0 | Step: 109630 | Dataset: 0-340224 | Loss: 0.704 | 913 ms/step , 6888.31 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 01:13:21 | Epoch: 0 | Step: 109640 | Dataset: 0-340544 | Loss: 0.593 | 912 ms/step , 6894.22 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 01:13:31 | Epoch: 0 | Step: 109650 | Dataset: 0-340864 | Loss: 0.594 | 913 ms/step , 6885.10 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 01:13:40 | Epoch: 0 | Step: 109660 | Dataset: 0-341184 | Loss: 0.625 | 913 ms/step , 6889.08 
GFLOP/s , 17951.1 tokens/s INFO:__main__:2024-11-05 01:13:49 | Epoch: 0 | Step: 109670 | Dataset: 0-341504 | Loss: 0.723 | 912 ms/step , 6897.62 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 01:13:58 | Epoch: 0 | Step: 109680 | Dataset: 0-341824 | Loss: 0.656 | 913 ms/step , 6890.68 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 01:14:07 | Epoch: 0 | Step: 109690 | Dataset: 0-342144 | Loss: 0.638 | 911 ms/step , 6904.67 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-05 01:14:16 | Epoch: 0 | Step: 109700 | Dataset: 0-342464 | Loss: 0.626 | 914 ms/step , 6884.01 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 01:14:18 | Validation | Step: 109700 | Val_loss: 0.922 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:14:27 | Epoch: 0 | Step: 109710 | Dataset: 0-342784 | Loss: 0.681 | 913 ms/step , 6887.26 GFLOP/s , 15294.3 tokens/s INFO:__main__:2024-11-05 01:14:36 | Epoch: 0 | Step: 109720 | Dataset: 0-343104 | Loss: 0.507 | 912 ms/step , 6898.02 GFLOP/s , 17951.9 tokens/s INFO:__main__:2024-11-05 01:14:45 | Epoch: 0 | Step: 109730 | Dataset: 0-343424 | Loss: 0.702 | 913 ms/step , 6892.48 GFLOP/s , 17945.3 tokens/s INFO:__main__:2024-11-05 01:14:54 | Epoch: 0 | Step: 109740 | Dataset: 0-343744 | Loss: 0.681 | 912 ms/step , 6897.68 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 01:15:03 | Epoch: 0 | Step: 109750 | Dataset: 0-344064 | Loss: 0.694 | 912 ms/step , 6894.07 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 01:15:13 | Epoch: 0 | Step: 109760 | Dataset: 0-344384 | Loss: 0.680 | 913 ms/step , 6885.47 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 01:15:22 | Epoch: 0 | Step: 109770 | Dataset: 0-344704 | Loss: 0.607 | 913 ms/step , 6889.75 GFLOP/s , 17949.7 tokens/s INFO:__main__:2024-11-05 01:15:31 | Epoch: 0 | Step: 109780 | Dataset: 0-345024 | Loss: 0.593 | 912 ms/step , 6895.99 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 01:15:40 | Epoch: 0 | Step: 109790 | Dataset: 0-345344 | Loss: 0.614 | 913 ms/step , 6891.90 GFLOP/s , 
17951.1 tokens/s INFO:__main__:2024-11-05 01:15:49 | Epoch: 0 | Step: 109800 | Dataset: 0-345664 | Loss: 0.577 | 913 ms/step , 6891.97 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 01:15:51 | Validation | Step: 109800 | Val_loss: 0.936 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:16:00 | Epoch: 0 | Step: 109810 | Dataset: 0-345984 | Loss: 0.612 | 913 ms/step , 6890.68 GFLOP/s , 15287.5 tokens/s INFO:__main__:2024-11-05 01:16:09 | Epoch: 0 | Step: 109820 | Dataset: 0-346304 | Loss: 0.720 | 912 ms/step , 6894.85 GFLOP/s , 17949.8 tokens/s INFO:__main__:2024-11-05 01:16:18 | Epoch: 0 | Step: 109830 | Dataset: 0-346624 | Loss: 0.608 | 912 ms/step , 6898.52 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 01:16:27 | Epoch: 0 | Step: 109840 | Dataset: 0-346944 | Loss: 0.677 | 912 ms/step , 6896.33 GFLOP/s , 17954.1 tokens/s INFO:__main__:2024-11-05 01:16:36 | Epoch: 0 | Step: 109850 | Dataset: 0-347264 | Loss: 0.609 | 913 ms/step , 6887.12 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 01:16:45 | Epoch: 0 | Step: 109860 | Dataset: 0-347584 | Loss: 0.609 | 914 ms/step , 6884.53 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 01:16:55 | Epoch: 0 | Step: 109870 | Dataset: 0-347904 | Loss: 0.705 | 913 ms/step , 6886.26 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 01:17:04 | Epoch: 0 | Step: 109880 | Dataset: 0-348224 | Loss: 0.736 | 915 ms/step , 6876.69 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 01:17:13 | Epoch: 0 | Step: 109890 | Dataset: 0-348544 | Loss: 0.739 | 913 ms/step , 6885.87 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 01:17:22 | Epoch: 0 | Step: 109900 | Dataset: 0-348864 | Loss: 0.633 | 913 ms/step , 6889.60 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 01:17:24 | Validation | Step: 109900 | Val_loss: 0.859 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:17:33 | Epoch: 0 | Step: 109910 | Dataset: 0-349184 | Loss: 0.753 | 912 ms/step , 6893.25 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-05 01:17:42 
| Epoch: 0 | Step: 109920 | Dataset: 0-349504 | Loss: 0.728 | 913 ms/step , 6891.26 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 01:17:51 | Epoch: 0 | Step: 109930 | Dataset: 0-349824 | Loss: 0.658 | 913 ms/step , 6890.61 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 01:18:00 | Epoch: 0 | Step: 109940 | Dataset: 0-350144 | Loss: 0.570 | 913 ms/step , 6892.37 GFLOP/s , 17953.6 tokens/s INFO:__main__:2024-11-05 01:18:09 | Epoch: 0 | Step: 109950 | Dataset: 0-350464 | Loss: 0.614 | 912 ms/step , 6893.50 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 01:18:18 | Epoch: 0 | Step: 109960 | Dataset: 0-350784 | Loss: 0.573 | 912 ms/step , 6897.76 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 01:18:27 | Epoch: 0 | Step: 109970 | Dataset: 0-351104 | Loss: 0.635 | 912 ms/step , 6897.80 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 01:18:37 | Epoch: 0 | Step: 109980 | Dataset: 0-351424 | Loss: 0.705 | 913 ms/step , 6890.56 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 01:18:46 | Epoch: 0 | Step: 109990 | Dataset: 0-351744 | Loss: 0.634 | 912 ms/step , 6893.93 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 01:18:55 | Epoch: 0 | Step: 110000 | Dataset: 0-352064 | Loss: 0.584 | 912 ms/step , 6896.09 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 01:18:56 | Validation | Step: 110000 | Val_loss: 0.840 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:18:56 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_011856_step_110000.pt` INFO:__main__:2024-11-05 01:19:07 | Epoch: 0 | Step: 110010 | Dataset: 0-352384 | Loss: 0.566 | 911 ms/step , 6900.95 GFLOP/s , 13797.1 tokens/s INFO:__main__:2024-11-05 01:19:16 | Epoch: 0 | Step: 110020 | Dataset: 0-352704 | Loss: 0.646 | 913 ms/step , 6887.82 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 01:19:25 | Epoch: 0 | Step: 110030 | Dataset: 0-353024 | Loss: 0.633 | 913 ms/step , 6890.15 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 01:19:34 | Epoch: 0 
| Step: 110040 | Dataset: 0-353344 | Loss: 0.660 | 912 ms/step , 6896.79 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 01:19:43 | Epoch: 0 | Step: 110050 | Dataset: 0-353664 | Loss: 0.628 | 913 ms/step , 6891.75 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-05 01:19:52 | Epoch: 0 | Step: 110060 | Dataset: 0-353984 | Loss: 0.543 | 912 ms/step , 6896.65 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 01:20:01 | Epoch: 0 | Step: 110070 | Dataset: 0-354304 | Loss: 0.640 | 913 ms/step , 6890.41 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 01:20:11 | Epoch: 0 | Step: 110080 | Dataset: 0-354624 | Loss: 0.590 | 912 ms/step , 6893.08 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 01:20:20 | Epoch: 0 | Step: 110090 | Dataset: 0-354944 | Loss: 0.609 | 913 ms/step , 6892.23 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 01:20:29 | Epoch: 0 | Step: 110100 | Dataset: 0-355264 | Loss: 0.608 | 912 ms/step , 6894.37 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 01:20:30 | Validation | Step: 110100 | Val_loss: 0.909 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:20:40 | Epoch: 0 | Step: 110110 | Dataset: 0-355584 | Loss: 0.656 | 913 ms/step , 6892.17 GFLOP/s , 15290.9 tokens/s INFO:__main__:2024-11-05 01:20:49 | Epoch: 0 | Step: 110120 | Dataset: 0-355904 | Loss: 0.547 | 912 ms/step , 6894.94 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 01:20:58 | Epoch: 0 | Step: 110130 | Dataset: 0-356224 | Loss: 0.601 | 913 ms/step , 6892.59 GFLOP/s , 17946.2 tokens/s INFO:__main__:2024-11-05 01:21:07 | Epoch: 0 | Step: 110140 | Dataset: 0-356544 | Loss: 0.693 | 913 ms/step , 6892.37 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 01:21:16 | Epoch: 0 | Step: 110150 | Dataset: 0-356864 | Loss: 0.656 | 912 ms/step , 6897.70 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 01:21:25 | Epoch: 0 | Step: 110160 | Dataset: 0-357184 | Loss: 0.604 | 913 ms/step , 6887.08 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 01:21:34 | Epoch: 0 | Step: 
110170 | Dataset: 0-357504 | Loss: 1.001 | 913 ms/step , 6885.76 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 01:21:44 | Epoch: 0 | Step: 110180 | Dataset: 0-357824 | Loss: 0.865 | 912 ms/step , 6898.32 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 01:21:53 | Epoch: 0 | Step: 110190 | Dataset: 0-358144 | Loss: 0.889 | 913 ms/step , 6889.06 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 01:22:02 | Epoch: 0 | Step: 110200 | Dataset: 0-358464 | Loss: 0.732 | 912 ms/step , 6898.96 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 01:22:03 | Validation | Step: 110200 | Val_loss: 0.838 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:22:12 | Epoch: 0 | Step: 110210 | Dataset: 0-358784 | Loss: 0.539 | 912 ms/step , 6896.37 GFLOP/s , 15300.1 tokens/s INFO:__main__:2024-11-05 01:22:22 | Epoch: 0 | Step: 110220 | Dataset: 0-359104 | Loss: 0.529 | 913 ms/step , 6890.01 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-05 01:22:31 | Epoch: 0 | Step: 110230 | Dataset: 0-359424 | Loss: 0.572 | 912 ms/step , 6896.93 GFLOP/s , 17960.3 tokens/s INFO:__main__:2024-11-05 01:22:40 | Epoch: 0 | Step: 110240 | Dataset: 0-359744 | Loss: 0.442 | 912 ms/step , 6898.95 GFLOP/s , 17959.2 tokens/s INFO:__main__:2024-11-05 01:22:49 | Epoch: 0 | Step: 110250 | Dataset: 0-360064 | Loss: 0.532 | 911 ms/step , 6900.71 GFLOP/s , 17960.5 tokens/s INFO:__main__:2024-11-05 01:22:58 | Epoch: 0 | Step: 110260 | Dataset: 0-360384 | Loss: 0.277 | 912 ms/step , 6896.29 GFLOP/s , 17957.3 tokens/s INFO:__main__:2024-11-05 01:23:07 | Epoch: 0 | Step: 110270 | Dataset: 0-360704 | Loss: 0.491 | 910 ms/step , 6908.30 GFLOP/s , 17962.0 tokens/s INFO:__main__:2024-11-05 01:23:16 | Epoch: 0 | Step: 110280 | Dataset: 0-361024 | Loss: 0.289 | 911 ms/step , 6901.24 GFLOP/s , 17958.4 tokens/s INFO:__main__:2024-11-05 01:23:25 | Epoch: 0 | Step: 110290 | Dataset: 0-361344 | Loss: 0.532 | 911 ms/step , 6905.25 GFLOP/s , 17963.0 tokens/s INFO:__main__:2024-11-05 01:23:35 | Epoch: 0 | Step: 110300 | 
Dataset: 0-361664 | Loss: 0.489 | 911 ms/step , 6901.06 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-05 01:23:36 | Validation | Step: 110300 | Val_loss: 0.909 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:23:45 | Epoch: 0 | Step: 110310 | Dataset: 0-361984 | Loss: 0.499 | 912 ms/step , 6898.26 GFLOP/s , 15300.2 tokens/s INFO:__main__:2024-11-05 01:23:54 | Epoch: 0 | Step: 110320 | Dataset: 0-362304 | Loss: 0.529 | 912 ms/step , 6899.18 GFLOP/s , 17965.5 tokens/s INFO:__main__:2024-11-05 01:24:04 | Epoch: 0 | Step: 110330 | Dataset: 0-362624 | Loss: 0.423 | 911 ms/step , 6903.52 GFLOP/s , 17965.6 tokens/s INFO:__main__:2024-11-05 01:24:13 | Epoch: 0 | Step: 110340 | Dataset: 0-362944 | Loss: 0.414 | 911 ms/step , 6902.54 GFLOP/s , 17963.6 tokens/s INFO:__main__:2024-11-05 01:24:22 | Epoch: 0 | Step: 110350 | Dataset: 0-363264 | Loss: 0.527 | 912 ms/step , 6899.34 GFLOP/s , 17961.0 tokens/s INFO:__main__:2024-11-05 01:24:31 | Epoch: 0 | Step: 110360 | Dataset: 0-363584 | Loss: 0.355 | 912 ms/step , 6895.23 GFLOP/s , 17960.7 tokens/s INFO:__main__:2024-11-05 01:24:40 | Epoch: 0 | Step: 110370 | Dataset: 0-363904 | Loss: 0.398 | 911 ms/step , 6904.36 GFLOP/s , 17952.7 tokens/s INFO:__main__:2024-11-05 01:24:49 | Epoch: 0 | Step: 110380 | Dataset: 0-364224 | Loss: 0.459 | 913 ms/step , 6887.89 GFLOP/s , 17958.4 tokens/s INFO:__main__:2024-11-05 01:24:58 | Epoch: 0 | Step: 110390 | Dataset: 0-364544 | Loss: 0.557 | 913 ms/step , 6888.83 GFLOP/s , 17960.2 tokens/s INFO:__main__:2024-11-05 01:25:07 | Epoch: 0 | Step: 110400 | Dataset: 0-364864 | Loss: 0.246 | 912 ms/step , 6892.82 GFLOP/s , 17957.8 tokens/s INFO:__main__:2024-11-05 01:25:09 | Validation | Step: 110400 | Val_loss: 0.891 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:25:18 | Epoch: 0 | Step: 110410 | Dataset: 0-365184 | Loss: 0.398 | 912 ms/step , 6893.41 GFLOP/s , 15305.9 tokens/s INFO:__main__:2024-11-05 01:25:27 | Epoch: 0 | Step: 110420 | Dataset: 0-365504 | Loss: 0.548 | 911 ms/step , 
6900.41 GFLOP/s , 17951.8 tokens/s INFO:__main__:2024-11-05 01:25:36 | Epoch: 0 | Step: 110430 | Dataset: 0-365824 | Loss: 0.892 | 913 ms/step , 6890.09 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 01:25:45 | Epoch: 0 | Step: 110440 | Dataset: 0-366144 | Loss: 0.922 | 913 ms/step , 6888.40 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 01:25:55 | Epoch: 0 | Step: 110450 | Dataset: 0-366464 | Loss: 0.916 | 912 ms/step , 6894.75 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 01:26:04 | Epoch: 0 | Step: 110460 | Dataset: 0-366784 | Loss: 0.857 | 913 ms/step , 6886.48 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 01:26:13 | Epoch: 0 | Step: 110470 | Dataset: 0-367104 | Loss: 0.891 | 913 ms/step , 6891.07 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 01:26:22 | Epoch: 0 | Step: 110480 | Dataset: 0-367424 | Loss: 0.974 | 914 ms/step , 6884.29 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 01:26:31 | Epoch: 0 | Step: 110490 | Dataset: 0-367744 | Loss: 0.981 | 915 ms/step , 6870.65 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 01:26:40 | Epoch: 0 | Step: 110500 | Dataset: 0-368064 | Loss: 0.912 | 914 ms/step , 6881.19 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 01:26:42 | Validation | Step: 110500 | Val_loss: 0.889 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:26:51 | Epoch: 0 | Step: 110510 | Dataset: 0-368384 | Loss: 0.929 | 914 ms/step , 6883.11 GFLOP/s , 15283.3 tokens/s INFO:__main__:2024-11-05 01:27:00 | Epoch: 0 | Step: 110520 | Dataset: 0-368704 | Loss: 0.948 | 914 ms/step , 6880.72 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 01:27:09 | Epoch: 0 | Step: 110530 | Dataset: 0-369024 | Loss: 0.874 | 912 ms/step , 6894.29 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 01:27:18 | Epoch: 0 | Step: 110540 | Dataset: 0-369344 | Loss: 0.854 | 913 ms/step , 6892.23 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 01:27:28 | Epoch: 0 | Step: 110550 | Dataset: 0-369664 | Loss: 0.999 | 913 ms/step , 6888.03 
GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 01:27:37 | Epoch: 0 | Step: 110560 | Dataset: 0-369984 | Loss: 0.980 | 912 ms/step , 6894.92 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 01:27:46 | Epoch: 0 | Step: 110570 | Dataset: 0-370304 | Loss: 0.902 | 914 ms/step , 6883.30 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 01:27:55 | Epoch: 0 | Step: 110580 | Dataset: 0-370624 | Loss: 0.967 | 913 ms/step , 6885.80 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 01:28:04 | Epoch: 0 | Step: 110590 | Dataset: 0-370944 | Loss: 0.976 | 914 ms/step , 6883.40 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 01:28:13 | Epoch: 0 | Step: 110600 | Dataset: 0-371264 | Loss: 0.918 | 913 ms/step , 6889.43 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 01:28:15 | Validation | Step: 110600 | Val_loss: 0.903 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:28:24 | Epoch: 0 | Step: 110610 | Dataset: 0-371584 | Loss: 0.958 | 913 ms/step , 6886.41 GFLOP/s , 15277.5 tokens/s INFO:__main__:2024-11-05 01:28:33 | Epoch: 0 | Step: 110620 | Dataset: 0-371904 | Loss: 0.799 | 913 ms/step , 6889.59 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 01:28:42 | Epoch: 0 | Step: 110630 | Dataset: 0-372224 | Loss: 0.866 | 914 ms/step , 6884.36 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 01:28:51 | Epoch: 0 | Step: 110640 | Dataset: 0-372544 | Loss: 0.883 | 914 ms/step , 6879.78 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 01:29:01 | Epoch: 0 | Step: 110650 | Dataset: 0-372864 | Loss: 0.944 | 914 ms/step , 6877.55 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 01:29:10 | Epoch: 0 | Step: 110660 | Dataset: 0-373184 | Loss: 0.841 | 914 ms/step , 6881.80 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 01:29:19 | Epoch: 0 | Step: 110670 | Dataset: 0-373504 | Loss: 0.915 | 914 ms/step , 6878.90 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 01:29:28 | Epoch: 0 | Step: 110680 | Dataset: 0-373824 | Loss: 0.861 | 913 ms/step , 6885.54 GFLOP/s , 
17921.9 tokens/s INFO:__main__:2024-11-05 01:29:37 | Epoch: 0 | Step: 110690 | Dataset: 0-374144 | Loss: 0.840 | 914 ms/step , 6882.97 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 01:29:46 | Epoch: 0 | Step: 110700 | Dataset: 0-374464 | Loss: 0.913 | 915 ms/step , 6876.74 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 01:29:48 | Validation | Step: 110700 | Val_loss: 0.860 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:29:57 | Epoch: 0 | Step: 110710 | Dataset: 0-374784 | Loss: 0.908 | 915 ms/step , 6876.96 GFLOP/s , 15279.5 tokens/s INFO:__main__:2024-11-05 01:30:06 | Epoch: 0 | Step: 110720 | Dataset: 0-375104 | Loss: 0.972 | 914 ms/step , 6880.62 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 01:30:15 | Epoch: 0 | Step: 110730 | Dataset: 0-375424 | Loss: 0.830 | 913 ms/step , 6885.31 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 01:30:24 | Epoch: 0 | Step: 110740 | Dataset: 0-375744 | Loss: 0.828 | 914 ms/step , 6880.02 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 01:30:34 | Epoch: 0 | Step: 110750 | Dataset: 0-376064 | Loss: 0.793 | 913 ms/step , 6886.49 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 01:30:43 | Epoch: 0 | Step: 110760 | Dataset: 0-376384 | Loss: 0.836 | 913 ms/step , 6885.49 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 01:30:52 | Epoch: 0 | Step: 110770 | Dataset: 0-376704 | Loss: 0.919 | 913 ms/step , 6885.88 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 01:31:01 | Epoch: 0 | Step: 110780 | Dataset: 0-377024 | Loss: 0.977 | 914 ms/step , 6880.41 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 01:31:10 | Epoch: 0 | Step: 110790 | Dataset: 0-377344 | Loss: 0.809 | 914 ms/step , 6884.80 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 01:31:19 | Epoch: 0 | Step: 110800 | Dataset: 0-377664 | Loss: 0.893 | 913 ms/step , 6885.89 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 01:31:21 | Validation | Step: 110800 | Val_loss: 0.845 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:31:30 
| Epoch: 0 | Step: 110810 | Dataset: 0-377984 | Loss: 0.868 | 913 ms/step , 6888.51 GFLOP/s , 15264.2 tokens/s INFO:__main__:2024-11-05 01:31:39 | Epoch: 0 | Step: 110820 | Dataset: 0-378304 | Loss: 0.856 | 915 ms/step , 6876.06 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 01:31:48 | Epoch: 0 | Step: 110830 | Dataset: 0-378624 | Loss: 0.920 | 913 ms/step , 6888.85 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 01:31:57 | Epoch: 0 | Step: 110840 | Dataset: 0-378944 | Loss: 0.743 | 914 ms/step , 6884.11 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 01:32:07 | Epoch: 0 | Step: 110850 | Dataset: 0-379264 | Loss: 0.726 | 914 ms/step , 6881.52 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 01:32:16 | Epoch: 0 | Step: 110860 | Dataset: 0-379584 | Loss: 0.932 | 915 ms/step , 6871.30 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 01:32:25 | Epoch: 0 | Step: 110870 | Dataset: 0-379904 | Loss: 0.846 | 915 ms/step , 6875.03 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 01:32:34 | Epoch: 0 | Step: 110880 | Dataset: 0-380224 | Loss: 0.783 | 914 ms/step , 6884.09 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 01:32:43 | Epoch: 0 | Step: 110890 | Dataset: 0-380544 | Loss: 0.937 | 914 ms/step , 6883.49 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 01:32:52 | Epoch: 0 | Step: 110900 | Dataset: 0-380864 | Loss: 0.946 | 914 ms/step , 6883.83 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 01:32:54 | Validation | Step: 110900 | Val_loss: 0.833 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:33:03 | Epoch: 0 | Step: 110910 | Dataset: 0-381184 | Loss: 0.793 | 912 ms/step , 6896.95 GFLOP/s , 15267.3 tokens/s INFO:__main__:2024-11-05 01:33:12 | Epoch: 0 | Step: 110920 | Dataset: 0-381504 | Loss: 0.768 | 914 ms/step , 6880.84 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 01:33:21 | Epoch: 0 | Step: 110930 | Dataset: 0-381824 | Loss: 0.956 | 914 ms/step , 6877.89 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 01:33:30 | Epoch: 0 
| Step: 110940 | Dataset: 0-382144 | Loss: 0.937 | 914 ms/step , 6878.34 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 01:33:40 | Epoch: 0 | Step: 110950 | Dataset: 0-382464 | Loss: 0.776 | 913 ms/step , 6889.78 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 01:33:49 | Epoch: 0 | Step: 110960 | Dataset: 0-382784 | Loss: 0.825 | 913 ms/step , 6886.43 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 01:33:58 | Epoch: 0 | Step: 110970 | Dataset: 0-383104 | Loss: 0.842 | 914 ms/step , 6881.85 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 01:34:07 | Epoch: 0 | Step: 110980 | Dataset: 0-383424 | Loss: 0.867 | 913 ms/step , 6888.26 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 01:34:16 | Epoch: 0 | Step: 110990 | Dataset: 0-383744 | Loss: 0.928 | 914 ms/step , 6882.72 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 01:34:25 | Epoch: 0 | Step: 111000 | Dataset: 0-384064 | Loss: 0.860 | 914 ms/step , 6879.58 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 01:34:27 | Validation | Step: 111000 | Val_loss: 0.826 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:34:27 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_013427_step_111000.pt` INFO:__main__:2024-11-05 01:34:37 | Epoch: 0 | Step: 111010 | Dataset: 0-384384 | Loss: 0.841 | 915 ms/step , 6872.27 GFLOP/s , 13755.1 tokens/s INFO:__main__:2024-11-05 01:34:46 | Epoch: 0 | Step: 111020 | Dataset: 0-384704 | Loss: 0.882 | 915 ms/step , 6872.50 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 01:34:56 | Epoch: 0 | Step: 111030 | Dataset: 0-385024 | Loss: 0.925 | 915 ms/step , 6875.80 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 01:35:05 | Epoch: 0 | Step: 111040 | Dataset: 0-385344 | Loss: 0.923 | 915 ms/step , 6872.30 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-05 01:35:14 | Epoch: 0 | Step: 111050 | Dataset: 0-385664 | Loss: 0.910 | 915 ms/step , 6875.17 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 01:35:23 | Epoch: 0 | Step: 
111060 | Dataset: 0-385984 | Loss: 0.942 | 914 ms/step , 6882.80 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-05 01:35:32 | Epoch: 0 | Step: 111070 | Dataset: 0-386304 | Loss: 0.873 | 913 ms/step , 6888.72 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 01:35:41 | Epoch: 0 | Step: 111080 | Dataset: 0-386624 | Loss: 0.851 | 913 ms/step , 6887.15 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-05 01:35:50 | Epoch: 0 | Step: 111090 | Dataset: 0-386944 | Loss: 0.808 | 913 ms/step , 6885.56 GFLOP/s , 17903.8 tokens/s INFO:__main__:2024-11-05 01:36:00 | Epoch: 0 | Step: 111100 | Dataset: 0-387264 | Loss: 0.866 | 915 ms/step , 6871.29 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 01:36:01 | Validation | Step: 111100 | Val_loss: 0.849 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:36:10 | Epoch: 0 | Step: 111110 | Dataset: 0-387584 | Loss: 0.888 | 915 ms/step , 6876.87 GFLOP/s , 15259.5 tokens/s INFO:__main__:2024-11-05 01:36:19 | Epoch: 0 | Step: 111120 | Dataset: 0-387904 | Loss: 0.860 | 914 ms/step , 6878.82 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 01:36:29 | Epoch: 0 | Step: 111130 | Dataset: 0-388224 | Loss: 0.835 | 915 ms/step , 6876.33 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 01:36:38 | Epoch: 0 | Step: 111140 | Dataset: 0-388544 | Loss: 0.772 | 914 ms/step , 6880.01 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 01:36:47 | Epoch: 0 | Step: 111150 | Dataset: 0-388864 | Loss: 0.696 | 914 ms/step , 6884.48 GFLOP/s , 17907.9 tokens/s INFO:__main__:2024-11-05 01:36:56 | Epoch: 0 | Step: 111160 | Dataset: 0-389184 | Loss: 0.837 | 914 ms/step , 6878.64 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 01:37:05 | Epoch: 0 | Step: 111170 | Dataset: 0-389504 | Loss: 0.849 | 914 ms/step , 6882.40 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 01:37:14 | Epoch: 0 | Step: 111180 | Dataset: 0-389824 | Loss: 0.775 | 915 ms/step , 6874.31 GFLOP/s , 17906.7 tokens/s INFO:__main__:2024-11-05 01:37:24 | Epoch: 0 | Step: 111190 | 
Dataset: 0-390144 | Loss: 0.730 | 913 ms/step , 6886.71 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 01:37:33 | Epoch: 0 | Step: 111200 | Dataset: 0-390464 | Loss: 0.804 | 913 ms/step , 6885.60 GFLOP/s , 17907.7 tokens/s INFO:__main__:2024-11-05 01:37:34 | Validation | Step: 111200 | Val_loss: 0.826 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:37:43 | Epoch: 0 | Step: 111210 | Dataset: 0-390784 | Loss: 0.830 | 914 ms/step , 6878.18 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-05 01:37:53 | Epoch: 0 | Step: 111220 | Dataset: 0-391104 | Loss: 0.813 | 915 ms/step , 6874.42 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 01:38:02 | Epoch: 0 | Step: 111230 | Dataset: 0-391424 | Loss: 0.927 | 914 ms/step , 6883.49 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 01:38:11 | Epoch: 0 | Step: 111240 | Dataset: 0-391744 | Loss: 0.833 | 914 ms/step , 6884.62 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-05 01:38:20 | Epoch: 0 | Step: 111250 | Dataset: 0-392064 | Loss: 0.802 | 914 ms/step , 6882.39 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 01:38:29 | Epoch: 0 | Step: 111260 | Dataset: 0-392384 | Loss: 0.764 | 914 ms/step , 6883.74 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 01:38:38 | Epoch: 0 | Step: 111270 | Dataset: 0-392704 | Loss: 0.851 | 916 ms/step , 6865.04 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 01:38:47 | Epoch: 0 | Step: 111280 | Dataset: 0-393024 | Loss: 0.798 | 915 ms/step , 6871.12 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 01:38:57 | Epoch: 0 | Step: 111290 | Dataset: 0-393344 | Loss: 0.817 | 914 ms/step , 6881.80 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 01:39:06 | Epoch: 0 | Step: 111300 | Dataset: 0-393664 | Loss: 0.898 | 914 ms/step , 6877.58 GFLOP/s , 17908.0 tokens/s INFO:__main__:2024-11-05 01:39:07 | Validation | Step: 111300 | Val_loss: 0.842 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:39:16 | Epoch: 0 | Step: 111310 | Dataset: 0-393984 | Loss: 0.902 | 915 ms/step , 
6877.24 GFLOP/s , 15271.9 tokens/s INFO:__main__:2024-11-05 01:39:26 | Epoch: 0 | Step: 111320 | Dataset: 0-394304 | Loss: 0.870 | 915 ms/step , 6874.29 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 01:39:35 | Epoch: 0 | Step: 111330 | Dataset: 0-394624 | Loss: 0.804 | 914 ms/step , 6881.93 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 01:39:44 | Epoch: 0 | Step: 111340 | Dataset: 0-394944 | Loss: 0.787 | 913 ms/step , 6887.73 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 01:39:53 | Epoch: 0 | Step: 111350 | Dataset: 0-395264 | Loss: 0.855 | 915 ms/step , 6873.66 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 01:40:02 | Epoch: 0 | Step: 111360 | Dataset: 0-395584 | Loss: 0.904 | 914 ms/step , 6879.30 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 01:40:11 | Epoch: 0 | Step: 111370 | Dataset: 0-395904 | Loss: 0.932 | 915 ms/step , 6873.26 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 01:40:20 | Epoch: 0 | Step: 111380 | Dataset: 0-396224 | Loss: 0.868 | 913 ms/step , 6886.03 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 01:40:30 | Epoch: 0 | Step: 111390 | Dataset: 0-396544 | Loss: 0.952 | 915 ms/step , 6876.62 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-05 01:40:39 | Epoch: 0 | Step: 111400 | Dataset: 0-396864 | Loss: 0.832 | 914 ms/step , 6880.89 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 01:40:40 | Validation | Step: 111400 | Val_loss: 0.829 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:40:49 | Epoch: 0 | Step: 111410 | Dataset: 0-397184 | Loss: 0.924 | 915 ms/step , 6871.78 GFLOP/s , 15257.1 tokens/s INFO:__main__:2024-11-05 01:40:59 | Epoch: 0 | Step: 111420 | Dataset: 0-397504 | Loss: 0.688 | 915 ms/step , 6876.53 GFLOP/s , 17907.5 tokens/s INFO:__main__:2024-11-05 01:41:08 | Epoch: 0 | Step: 111430 | Dataset: 0-397824 | Loss: 0.816 | 914 ms/step , 6880.20 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 01:41:17 | Epoch: 0 | Step: 111440 | Dataset: 0-398144 | Loss: 0.797 | 914 ms/step , 6883.68 
GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 01:41:26 | Epoch: 0 | Step: 111450 | Dataset: 0-398464 | Loss: 0.784 | 913 ms/step , 6890.88 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 01:41:35 | Epoch: 0 | Step: 111460 | Dataset: 0-398784 | Loss: 0.792 | 913 ms/step , 6885.37 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 01:41:44 | Epoch: 0 | Step: 111470 | Dataset: 0-399104 | Loss: 0.776 | 914 ms/step , 6883.50 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 01:41:54 | Epoch: 0 | Step: 111480 | Dataset: 0-399424 | Loss: 0.925 | 916 ms/step , 6868.60 GFLOP/s , 17893.0 tokens/s INFO:__main__:2024-11-05 01:42:03 | Epoch: 0 | Step: 111490 | Dataset: 0-399744 | Loss: 0.860 | 913 ms/step , 6886.02 GFLOP/s , 17902.5 tokens/s INFO:__main__:2024-11-05 01:42:12 | Epoch: 0 | Step: 111500 | Dataset: 0-400064 | Loss: 0.880 | 913 ms/step , 6887.62 GFLOP/s , 17902.4 tokens/s INFO:__main__:2024-11-05 01:42:13 | Validation | Step: 111500 | Val_loss: 0.866 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:42:23 | Epoch: 0 | Step: 111510 | Dataset: 0-400384 | Loss: 0.879 | 914 ms/step , 6885.04 GFLOP/s , 15269.5 tokens/s INFO:__main__:2024-11-05 01:42:32 | Epoch: 0 | Step: 111520 | Dataset: 0-400704 | Loss: 0.843 | 915 ms/step , 6876.50 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 01:42:41 | Epoch: 0 | Step: 111530 | Dataset: 0-401024 | Loss: 0.737 | 913 ms/step , 6886.21 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 01:42:50 | Epoch: 0 | Step: 111540 | Dataset: 0-401344 | Loss: 0.880 | 913 ms/step , 6885.57 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 01:42:59 | Epoch: 0 | Step: 111550 | Dataset: 0-401664 | Loss: 0.895 | 915 ms/step , 6877.47 GFLOP/s , 17908.1 tokens/s INFO:__main__:2024-11-05 01:43:08 | Epoch: 0 | Step: 111560 | Dataset: 0-401984 | Loss: 0.792 | 915 ms/step , 6875.37 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 01:43:17 | Epoch: 0 | Step: 111570 | Dataset: 0-402304 | Loss: 0.888 | 914 ms/step , 6884.07 GFLOP/s , 
17912.7 tokens/s INFO:__main__:2024-11-05 01:43:27 | Epoch: 0 | Step: 111580 | Dataset: 0-402624 | Loss: 0.744 | 914 ms/step , 6883.58 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 01:43:36 | Epoch: 0 | Step: 111590 | Dataset: 0-402944 | Loss: 0.716 | 914 ms/step , 6884.31 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 01:43:45 | Epoch: 0 | Step: 111600 | Dataset: 0-403264 | Loss: 0.811 | 914 ms/step , 6879.37 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 01:43:46 | Validation | Step: 111600 | Val_loss: 0.766 | Best_val_loss: 0.7806 INFO:__main__:2024-11-05 01:43:46 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_014346_step_111600.pt` INFO:__main__:2024-11-05 01:43:57 | Epoch: 0 | Step: 111610 | Dataset: 0-403584 | Loss: 0.856 | 914 ms/step , 6882.00 GFLOP/s , 13779.2 tokens/s INFO:__main__:2024-11-05 01:44:06 | Epoch: 0 | Step: 111620 | Dataset: 0-403904 | Loss: 0.738 | 915 ms/step , 6876.38 GFLOP/s , 17903.1 tokens/s INFO:__main__:2024-11-05 01:44:15 | Epoch: 0 | Step: 111630 | Dataset: 0-404224 | Loss: 0.827 | 915 ms/step , 6871.61 GFLOP/s , 17887.0 tokens/s INFO:__main__:2024-11-05 01:44:24 | Epoch: 0 | Step: 111640 | Dataset: 0-404544 | Loss: 0.986 | 916 ms/step , 6869.89 GFLOP/s , 17890.5 tokens/s INFO:__main__:2024-11-05 01:44:33 | Epoch: 0 | Step: 111650 | Dataset: 0-404864 | Loss: 0.875 | 915 ms/step , 6876.08 GFLOP/s , 17903.2 tokens/s INFO:__main__:2024-11-05 01:44:43 | Epoch: 0 | Step: 111660 | Dataset: 0-405184 | Loss: 0.909 | 916 ms/step , 6865.41 GFLOP/s , 17896.3 tokens/s INFO:__main__:2024-11-05 01:44:52 | Epoch: 0 | Step: 111670 | Dataset: 0-405504 | Loss: 0.885 | 915 ms/step , 6870.70 GFLOP/s , 17896.1 tokens/s INFO:__main__:2024-11-05 01:45:01 | Epoch: 0 | Step: 111680 | Dataset: 0-405824 | Loss: 0.952 | 931 ms/step , 6752.08 GFLOP/s , 17871.4 tokens/s INFO:__main__:2024-11-05 01:45:10 | Epoch: 0 | Step: 111690 | Dataset: 0-406144 | Loss: 0.707 | 930 ms/step , 6764.81 GFLOP/s , 17759.9 
tokens/s INFO:__main__:2024-11-05 01:45:19 | Epoch: 0 | Step: 111700 | Dataset: 0-406464 | Loss: 0.911 | 938 ms/step , 6704.62 GFLOP/s , 17701.1 tokens/s INFO:__main__:2024-11-05 01:45:21 | Validation | Step: 111700 | Val_loss: 0.803 | Best_val_loss: 0.7660 INFO:__main__:2024-11-05 01:45:30 | Epoch: 0 | Step: 111710 | Dataset: 0-406784 | Loss: 0.827 | 914 ms/step , 6879.59 GFLOP/s , 15101.7 tokens/s INFO:__main__:2024-11-05 01:45:39 | Epoch: 0 | Step: 111720 | Dataset: 0-407104 | Loss: 0.807 | 914 ms/step , 6881.16 GFLOP/s , 17900.4 tokens/s INFO:__main__:2024-11-05 01:45:48 | Epoch: 0 | Step: 111730 | Dataset: 0-407424 | Loss: 0.955 | 914 ms/step , 6883.85 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 01:45:58 | Epoch: 0 | Step: 111740 | Dataset: 0-407744 | Loss: 0.785 | 914 ms/step , 6880.43 GFLOP/s , 17845.3 tokens/s INFO:__main__:2024-11-05 01:46:07 | Epoch: 0 | Step: 111750 | Dataset: 0-408064 | Loss: 0.800 | 913 ms/step , 6888.73 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 01:46:16 | Epoch: 0 | Step: 111760 | Dataset: 0-408384 | Loss: 0.821 | 914 ms/step , 6883.41 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 01:46:25 | Epoch: 0 | Step: 111770 | Dataset: 0-408704 | Loss: 0.824 | 915 ms/step , 6875.93 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 01:46:34 | Epoch: 0 | Step: 111780 | Dataset: 0-409024 | Loss: 0.804 | 914 ms/step , 6884.06 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 01:46:43 | Epoch: 0 | Step: 111790 | Dataset: 0-409344 | Loss: 0.643 | 914 ms/step , 6878.12 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 01:46:53 | Epoch: 0 | Step: 111800 | Dataset: 0-409664 | Loss: 0.862 | 916 ms/step , 6869.91 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 01:46:54 | Validation | Step: 111800 | Val_loss: 0.812 | Best_val_loss: 0.7660 INFO:__main__:2024-11-05 01:47:03 | Epoch: 0 | Step: 111810 | Dataset: 0-409984 | Loss: 0.840 | 913 ms/step , 6889.65 GFLOP/s , 15273.2 tokens/s INFO:__main__:2024-11-05 01:47:12 | Epoch: 
0 | Step: 111820 | Dataset: 0-410304 | Loss: 0.818 | 914 ms/step , 6881.28 GFLOP/s , 17910.4 tokens/s INFO:__main__:2024-11-05 01:47:22 | Epoch: 0 | Step: 111830 | Dataset: 0-410624 | Loss: 0.898 | 916 ms/step , 6869.72 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 01:47:31 | Epoch: 0 | Step: 111840 | Dataset: 0-410944 | Loss: 0.835 | 915 ms/step , 6875.52 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 01:47:40 | Epoch: 0 | Step: 111850 | Dataset: 0-411264 | Loss: 0.900 | 916 ms/step , 6869.14 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 01:47:49 | Epoch: 0 | Step: 111860 | Dataset: 0-411584 | Loss: 0.920 | 915 ms/step , 6875.16 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 01:47:58 | Epoch: 0 | Step: 111870 | Dataset: 0-411904 | Loss: 0.856 | 915 ms/step , 6873.98 GFLOP/s , 17899.3 tokens/s INFO:__main__:2024-11-05 01:48:07 | Epoch: 0 | Step: 111880 | Dataset: 0-412224 | Loss: 0.796 | 916 ms/step , 6869.33 GFLOP/s , 17901.0 tokens/s INFO:__main__:2024-11-05 01:48:16 | Epoch: 0 | Step: 111890 | Dataset: 0-412544 | Loss: 0.857 | 917 ms/step , 6862.42 GFLOP/s , 17904.1 tokens/s INFO:__main__:2024-11-05 01:48:26 | Epoch: 0 | Step: 111900 | Dataset: 0-412864 | Loss: 0.807 | 914 ms/step , 6880.44 GFLOP/s , 17906.0 tokens/s INFO:__main__:2024-11-05 01:48:27 | Validation | Step: 111900 | Val_loss: 0.808 | Best_val_loss: 0.7660 INFO:__main__:2024-11-05 01:48:36 | Epoch: 0 | Step: 111910 | Dataset: 0-413184 | Loss: 0.838 | 915 ms/step , 6871.73 GFLOP/s , 15264.3 tokens/s INFO:__main__:2024-11-05 01:48:45 | Epoch: 0 | Step: 111920 | Dataset: 0-413504 | Loss: 0.750 | 915 ms/step , 6873.86 GFLOP/s , 17905.9 tokens/s INFO:__main__:2024-11-05 01:48:55 | Epoch: 0 | Step: 111930 | Dataset: 0-413824 | Loss: 0.875 | 915 ms/step , 6877.37 GFLOP/s , 17906.1 tokens/s INFO:__main__:2024-11-05 01:49:04 | Epoch: 0 | Step: 111940 | Dataset: 0-414144 | Loss: 0.734 | 914 ms/step , 6881.22 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 01:49:13 | Epoch: 0 | Step: 
111950 | Dataset: 0-414464 | Loss: 0.817 | 914 ms/step , 6878.54 GFLOP/s , 17907.9 tokens/s INFO:__main__:2024-11-05 01:49:22 | Epoch: 0 | Step: 111960 | Dataset: 0-414784 | Loss: 0.728 | 914 ms/step , 6878.45 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-05 01:49:31 | Epoch: 0 | Step: 111970 | Dataset: 0-415104 | Loss: 0.880 | 913 ms/step , 6886.04 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-05 01:49:40 | Epoch: 0 | Step: 111980 | Dataset: 0-415424 | Loss: 0.800 | 914 ms/step , 6882.59 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 01:49:50 | Epoch: 0 | Step: 111990 | Dataset: 0-415744 | Loss: 0.836 | 915 ms/step , 6874.94 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 01:49:59 | Epoch: 0 | Step: 112000 | Dataset: 0-416064 | Loss: 0.758 | 914 ms/step , 6880.79 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 01:50:00 | Validation | Step: 112000 | Val_loss: 0.831 | Best_val_loss: 0.7660 INFO:__main__:2024-11-05 01:50:00 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_015000_step_112000.pt` INFO:__main__:2024-11-05 01:50:11 | Epoch: 0 | Step: 112010 | Dataset: 0-416384 | Loss: 0.808 | 914 ms/step , 6880.63 GFLOP/s , 13728.0 tokens/s INFO:__main__:2024-11-05 01:50:20 | Epoch: 0 | Step: 112020 | Dataset: 0-416704 | Loss: 0.883 | 914 ms/step , 6878.20 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 01:50:29 | Epoch: 0 | Step: 112030 | Dataset: 0-417024 | Loss: 0.766 | 913 ms/step , 6888.02 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 01:50:38 | Epoch: 0 | Step: 112040 | Dataset: 0-417344 | Loss: 0.689 | 913 ms/step , 6885.23 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 01:50:47 | Epoch: 0 | Step: 112050 | Dataset: 0-417664 | Loss: 0.849 | 915 ms/step , 6875.73 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 01:50:56 | Epoch: 0 | Step: 112060 | Dataset: 0-417984 | Loss: 0.772 | 913 ms/step , 6888.44 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 01:51:05 | Epoch: 0 | Step: 112070 | 
Dataset: 0-418304 | Loss: 0.928 | 916 ms/step , 6868.03 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 01:51:15 | Epoch: 0 | Step: 112080 | Dataset: 0-418624 | Loss: 0.880 | 916 ms/step , 6869.36 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-05 01:51:24 | Epoch: 0 | Step: 112090 | Dataset: 0-418944 | Loss: 0.780 | 916 ms/step , 6866.24 GFLOP/s , 17904.4 tokens/s INFO:__main__:2024-11-05 01:51:33 | Epoch: 0 | Step: 112100 | Dataset: 0-419264 | Loss: 0.841 | 915 ms/step , 6875.00 GFLOP/s , 17902.5 tokens/s INFO:__main__:2024-11-05 01:51:35 | Validation | Step: 112100 | Val_loss: 0.832 | Best_val_loss: 0.7660 INFO:__main__:2024-11-05 01:51:44 | Epoch: 0 | Step: 112110 | Dataset: 0-419584 | Loss: 0.609 | 913 ms/step , 6887.99 GFLOP/s , 15259.7 tokens/s INFO:__main__:2024-11-05 01:51:53 | Epoch: 0 | Step: 112120 | Dataset: 0-419904 | Loss: 0.829 | 914 ms/step , 6881.60 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 01:52:02 | Epoch: 0 | Step: 112130 | Dataset: 0-420224 | Loss: 0.722 | 914 ms/step , 6880.01 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 01:52:11 | Epoch: 0 | Step: 112140 | Dataset: 0-420544 | Loss: 0.778 | 915 ms/step , 6873.73 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 01:52:20 | Epoch: 0 | Step: 112150 | Dataset: 0-420864 | Loss: 0.844 | 914 ms/step , 6883.67 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 01:52:29 | Epoch: 0 | Step: 112160 | Dataset: 0-421184 | Loss: 0.761 | 913 ms/step , 6890.40 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 01:52:39 | Epoch: 0 | Step: 112170 | Dataset: 0-421504 | Loss: 0.793 | 915 ms/step , 6876.75 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 01:52:48 | Epoch: 0 | Step: 112180 | Dataset: 0-421824 | Loss: 0.806 | 915 ms/step , 6876.59 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 01:52:57 | Epoch: 0 | Step: 112190 | Dataset: 0-422144 | Loss: 0.814 | 914 ms/step , 6883.15 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-05 01:53:06 | Epoch: 0 | Step: 112200 | Dataset: 
0-422464 | Loss: 0.785 | 914 ms/step , 6881.99 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 01:53:08 | Validation | Step: 112200 | Val_loss: 0.734 | Best_val_loss: 0.7660 INFO:__main__:2024-11-05 01:53:08 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_015308_step_112200.pt` INFO:__main__:2024-11-05 01:53:18 | Epoch: 0 | Step: 112210 | Dataset: 0-422784 | Loss: 0.805 | 914 ms/step , 6879.73 GFLOP/s , 13784.8 tokens/s INFO:__main__:2024-11-05 01:53:27 | Epoch: 0 | Step: 112220 | Dataset: 0-423104 | Loss: 0.670 | 912 ms/step , 6897.95 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 01:53:36 | Epoch: 0 | Step: 112230 | Dataset: 0-423424 | Loss: 0.784 | 915 ms/step , 6875.54 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 01:53:45 | Epoch: 0 | Step: 112240 | Dataset: 0-423744 | Loss: 0.846 | 914 ms/step , 6881.05 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 01:53:55 | Epoch: 0 | Step: 112250 | Dataset: 0-424064 | Loss: 0.846 | 923 ms/step , 6816.86 GFLOP/s , 17774.5 tokens/s INFO:__main__:2024-11-05 01:54:04 | Epoch: 0 | Step: 112260 | Dataset: 0-424384 | Loss: 0.889 | 920 ms/step , 6839.03 GFLOP/s , 17742.6 tokens/s INFO:__main__:2024-11-05 01:54:13 | Epoch: 0 | Step: 112270 | Dataset: 0-424704 | Loss: 0.706 | 914 ms/step , 6881.54 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 01:54:22 | Epoch: 0 | Step: 112280 | Dataset: 0-425024 | Loss: 0.872 | 914 ms/step , 6881.90 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 01:54:31 | Epoch: 0 | Step: 112290 | Dataset: 0-425344 | Loss: 0.872 | 915 ms/step , 6870.57 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 01:54:40 | Epoch: 0 | Step: 112300 | Dataset: 0-425664 | Loss: 0.858 | 913 ms/step , 6885.99 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 01:54:42 | Validation | Step: 112300 | Val_loss: 0.812 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 01:54:51 | Epoch: 0 | Step: 112310 | Dataset: 0-425984 | Loss: 0.851 | 913 ms/step , 6888.82 
GFLOP/s , 15272.0 tokens/s INFO:__main__:2024-11-05 01:55:00 | Epoch: 0 | Step: 112320 | Dataset: 0-426304 | Loss: 0.888 | 913 ms/step , 6888.41 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 01:55:09 | Epoch: 0 | Step: 112330 | Dataset: 0-426624 | Loss: 0.823 | 914 ms/step , 6883.16 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 01:55:18 | Epoch: 0 | Step: 112340 | Dataset: 0-426944 | Loss: 0.836 | 913 ms/step , 6888.93 GFLOP/s , 17893.2 tokens/s INFO:__main__:2024-11-05 01:55:28 | Epoch: 0 | Step: 112350 | Dataset: 0-427264 | Loss: 0.926 | 914 ms/step , 6882.00 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 01:55:37 | Epoch: 0 | Step: 112360 | Dataset: 0-427584 | Loss: 0.778 | 913 ms/step , 6886.17 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 01:55:46 | Epoch: 0 | Step: 112370 | Dataset: 0-427904 | Loss: 0.796 | 913 ms/step , 6890.84 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 01:55:55 | Epoch: 0 | Step: 112380 | Dataset: 0-428224 | Loss: 0.840 | 914 ms/step , 6882.39 GFLOP/s , 17889.0 tokens/s INFO:__main__:2024-11-05 01:56:04 | Epoch: 0 | Step: 112390 | Dataset: 0-428544 | Loss: 0.879 | 914 ms/step , 6882.43 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 01:56:13 | Epoch: 0 | Step: 112400 | Dataset: 0-428864 | Loss: 0.863 | 915 ms/step , 6876.72 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 01:56:15 | Validation | Step: 112400 | Val_loss: 0.761 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 01:56:24 | Epoch: 0 | Step: 112410 | Dataset: 0-429184 | Loss: 0.861 | 915 ms/step , 6876.95 GFLOP/s , 15268.5 tokens/s INFO:__main__:2024-11-05 01:56:33 | Epoch: 0 | Step: 112420 | Dataset: 0-429504 | Loss: 0.854 | 914 ms/step , 6879.03 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 01:56:42 | Epoch: 0 | Step: 112430 | Dataset: 0-429824 | Loss: 0.772 | 914 ms/step , 6878.16 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 01:56:52 | Epoch: 0 | Step: 112440 | Dataset: 0-430144 | Loss: 0.845 | 914 ms/step , 6884.96 GFLOP/s , 
17923.8 tokens/s INFO:__main__:2024-11-05 01:57:01 | Epoch: 0 | Step: 112450 | Dataset: 0-430464 | Loss: 0.808 | 914 ms/step , 6882.57 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 01:57:10 | Epoch: 0 | Step: 112460 | Dataset: 0-430784 | Loss: 0.835 | 916 ms/step , 6866.77 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 01:57:19 | Epoch: 0 | Step: 112470 | Dataset: 0-431104 | Loss: 0.775 | 916 ms/step , 6869.13 GFLOP/s , 17906.6 tokens/s INFO:__main__:2024-11-05 01:57:28 | Epoch: 0 | Step: 112480 | Dataset: 0-431424 | Loss: 0.807 | 914 ms/step , 6879.64 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 01:57:37 | Epoch: 0 | Step: 112490 | Dataset: 0-431744 | Loss: 0.841 | 913 ms/step , 6887.26 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 01:57:46 | Epoch: 0 | Step: 112500 | Dataset: 0-432064 | Loss: 0.758 | 915 ms/step , 6877.06 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 01:57:48 | Validation | Step: 112500 | Val_loss: 0.854 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 01:57:57 | Epoch: 0 | Step: 112510 | Dataset: 0-432384 | Loss: 0.954 | 914 ms/step , 6880.38 GFLOP/s , 15267.7 tokens/s INFO:__main__:2024-11-05 01:58:06 | Epoch: 0 | Step: 112520 | Dataset: 0-432704 | Loss: 0.822 | 914 ms/step , 6880.09 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 01:58:15 | Epoch: 0 | Step: 112530 | Dataset: 0-433024 | Loss: 0.862 | 913 ms/step , 6890.28 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 01:58:25 | Epoch: 0 | Step: 112540 | Dataset: 0-433344 | Loss: 0.785 | 914 ms/step , 6884.93 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 01:58:34 | Epoch: 0 | Step: 112550 | Dataset: 0-433664 | Loss: 0.852 | 914 ms/step , 6883.25 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 01:58:43 | Epoch: 0 | Step: 112560 | Dataset: 0-433984 | Loss: 0.839 | 914 ms/step , 6878.65 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 01:58:52 | Epoch: 0 | Step: 112570 | Dataset: 0-434304 | Loss: 0.850 | 915 ms/step , 6875.80 GFLOP/s , 17915.8 
tokens/s INFO:__main__:2024-11-05 01:59:01 | Epoch: 0 | Step: 112580 | Dataset: 0-434624 | Loss: 0.781 | 914 ms/step , 6880.01 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 01:59:10 | Epoch: 0 | Step: 112590 | Dataset: 0-434944 | Loss: 0.805 | 913 ms/step , 6890.63 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 01:59:19 | Epoch: 0 | Step: 112600 | Dataset: 0-435264 | Loss: 0.858 | 914 ms/step , 6877.95 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 01:59:21 | Validation | Step: 112600 | Val_loss: 0.838 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 01:59:30 | Epoch: 0 | Step: 112610 | Dataset: 0-435584 | Loss: 0.868 | 915 ms/step , 6877.01 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-05 01:59:39 | Epoch: 0 | Step: 112620 | Dataset: 0-435904 | Loss: 0.838 | 913 ms/step , 6885.57 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 01:59:48 | Epoch: 0 | Step: 112630 | Dataset: 0-436224 | Loss: 0.865 | 914 ms/step , 6883.28 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 01:59:58 | Epoch: 0 | Step: 112640 | Dataset: 0-436544 | Loss: 0.800 | 914 ms/step , 6882.40 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 02:00:07 | Epoch: 0 | Step: 112650 | Dataset: 0-436864 | Loss: 0.829 | 913 ms/step , 6890.53 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 02:00:16 | Epoch: 0 | Step: 112660 | Dataset: 0-437184 | Loss: 0.827 | 915 ms/step , 6874.88 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 02:00:25 | Epoch: 0 | Step: 112670 | Dataset: 0-437504 | Loss: 0.799 | 916 ms/step , 6865.72 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 02:00:34 | Epoch: 0 | Step: 112680 | Dataset: 0-437824 | Loss: 0.784 | 913 ms/step , 6890.44 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 02:00:43 | Epoch: 0 | Step: 112690 | Dataset: 0-438144 | Loss: 0.799 | 915 ms/step , 6872.25 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 02:00:52 | Epoch: 0 | Step: 112700 | Dataset: 0-438464 | Loss: 0.805 | 913 ms/step , 6887.35 GFLOP/s , 17921.4 tokens/s 
INFO:__main__:2024-11-05 02:00:54 | Validation | Step: 112700 | Val_loss: 0.844 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 02:01:03 | Epoch: 0 | Step: 112710 | Dataset: 0-438784 | Loss: 0.803 | 913 ms/step , 6885.91 GFLOP/s , 15262.0 tokens/s INFO:__main__:2024-11-05 02:01:12 | Epoch: 0 | Step: 112720 | Dataset: 0-439104 | Loss: 0.780 | 915 ms/step , 6874.93 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 02:01:21 | Epoch: 0 | Step: 112730 | Dataset: 0-439424 | Loss: 0.868 | 913 ms/step , 6885.63 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 02:01:31 | Epoch: 0 | Step: 112740 | Dataset: 0-439744 | Loss: 0.836 | 913 ms/step , 6890.23 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 02:01:40 | Epoch: 0 | Step: 112750 | Dataset: 0-440064 | Loss: 0.776 | 914 ms/step , 6881.48 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 02:01:49 | Epoch: 0 | Step: 112760 | Dataset: 0-440384 | Loss: 0.861 | 913 ms/step , 6888.75 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 02:01:58 | Epoch: 0 | Step: 112770 | Dataset: 0-440704 | Loss: 0.906 | 916 ms/step , 6865.35 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 02:02:07 | Epoch: 0 | Step: 112780 | Dataset: 0-441024 | Loss: 0.838 | 915 ms/step , 6872.48 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 02:02:16 | Epoch: 0 | Step: 112790 | Dataset: 0-441344 | Loss: 0.738 | 913 ms/step , 6885.50 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 02:02:25 | Epoch: 0 | Step: 112800 | Dataset: 0-441664 | Loss: 0.861 | 913 ms/step , 6889.04 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 02:02:27 | Validation | Step: 112800 | Val_loss: 0.835 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 02:02:36 | Epoch: 0 | Step: 112810 | Dataset: 0-441984 | Loss: 0.792 | 913 ms/step , 6885.22 GFLOP/s , 15267.2 tokens/s INFO:__main__:2024-11-05 02:02:45 | Epoch: 0 | Step: 112820 | Dataset: 0-442304 | Loss: 0.804 | 915 ms/step , 6874.63 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 02:02:54 | Epoch: 0 | 
Step: 112830 | Dataset: 0-442624 | Loss: 0.800 | 913 ms/step , 6890.20 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 02:03:04 | Epoch: 0 | Step: 112840 | Dataset: 0-442944 | Loss: 0.808 | 914 ms/step , 6880.12 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 02:03:13 | Epoch: 0 | Step: 112850 | Dataset: 0-443264 | Loss: 0.844 | 914 ms/step , 6881.75 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 02:03:22 | Epoch: 0 | Step: 112860 | Dataset: 0-443584 | Loss: 0.641 | 913 ms/step , 6889.01 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 02:03:31 | Epoch: 0 | Step: 112870 | Dataset: 0-443904 | Loss: 0.778 | 915 ms/step , 6870.97 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 02:03:40 | Epoch: 0 | Step: 112880 | Dataset: 0-444224 | Loss: 0.746 | 915 ms/step , 6871.07 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 02:03:49 | Epoch: 0 | Step: 112890 | Dataset: 0-444544 | Loss: 0.834 | 913 ms/step , 6891.23 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 02:03:58 | Epoch: 0 | Step: 112900 | Dataset: 0-444864 | Loss: 0.831 | 915 ms/step , 6873.24 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 02:04:00 | Validation | Step: 112900 | Val_loss: 0.726 | Best_val_loss: 0.7341 INFO:__main__:2024-11-05 02:04:00 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_020400_step_112900.pt` INFO:__main__:2024-11-05 02:04:10 | Epoch: 0 | Step: 112910 | Dataset: 0-445184 | Loss: 0.788 | 914 ms/step , 6878.37 GFLOP/s , 13775.6 tokens/s INFO:__main__:2024-11-05 02:04:20 | Epoch: 0 | Step: 112920 | Dataset: 0-445504 | Loss: 0.867 | 912 ms/step , 6894.23 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 02:04:29 | Epoch: 0 | Step: 112930 | Dataset: 0-445824 | Loss: 0.854 | 914 ms/step , 6882.31 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 02:04:38 | Epoch: 0 | Step: 112940 | Dataset: 0-446144 | Loss: 0.810 | 914 ms/step , 6881.04 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 02:04:47 | Epoch: 0 | Step: 
112950 | Dataset: 0-446464 | Loss: 0.849 | 914 ms/step , 6882.00 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 02:04:56 | Epoch: 0 | Step: 112960 | Dataset: 0-446784 | Loss: 0.900 | 914 ms/step , 6879.85 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 02:05:05 | Epoch: 0 | Step: 112970 | Dataset: 0-447104 | Loss: 0.689 | 913 ms/step , 6885.71 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 02:05:14 | Epoch: 0 | Step: 112980 | Dataset: 0-447424 | Loss: 0.733 | 914 ms/step , 6879.95 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 02:05:24 | Epoch: 0 | Step: 112990 | Dataset: 0-447744 | Loss: 0.860 | 914 ms/step , 6884.65 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 02:05:33 | Epoch: 0 | Step: 113000 | Dataset: 0-448064 | Loss: 0.794 | 913 ms/step , 6888.21 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 02:05:34 | Validation | Step: 113000 | Val_loss: 0.766 | Best_val_loss: 0.7259 INFO:__main__:2024-11-05 02:05:34 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_020534_step_113000.pt` INFO:__main__:2024-11-05 02:05:45 | Epoch: 0 | Step: 113010 | Dataset: 0-448384 | Loss: 0.802 | 915 ms/step , 6874.82 GFLOP/s , 13819.0 tokens/s INFO:__main__:2024-11-05 02:05:54 | Epoch: 0 | Step: 113020 | Dataset: 0-448704 | Loss: 0.796 | 913 ms/step , 6891.34 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 02:06:03 | Epoch: 0 | Step: 113030 | Dataset: 0-449024 | Loss: 0.842 | 915 ms/step , 6874.20 GFLOP/s , 17904.5 tokens/s INFO:__main__:2024-11-05 02:06:12 | Epoch: 0 | Step: 113040 | Dataset: 0-449344 | Loss: 0.838 | 914 ms/step , 6883.11 GFLOP/s , 17864.9 tokens/s INFO:__main__:2024-11-05 02:06:21 | Epoch: 0 | Step: 113050 | Dataset: 0-449664 | Loss: 0.779 | 914 ms/step , 6880.88 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 02:06:30 | Epoch: 0 | Step: 113060 | Dataset: 0-449984 | Loss: 0.842 | 916 ms/step , 6867.77 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 02:06:39 | Epoch: 0 | Step: 113070 | 
Dataset: 0-450304 | Loss: 0.805 | 914 ms/step , 6884.70 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 02:06:49 | Epoch: 0 | Step: 113080 | Dataset: 0-450624 | Loss: 0.877 | 915 ms/step , 6876.63 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 02:06:58 | Epoch: 0 | Step: 113090 | Dataset: 0-450944 | Loss: 0.829 | 913 ms/step , 6885.48 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-05 02:07:07 | Epoch: 0 | Step: 113100 | Dataset: 0-451264 | Loss: 0.769 | 914 ms/step , 6878.12 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 02:07:08 | Validation | Step: 113100 | Val_loss: 0.774 | Best_val_loss: 0.7259 INFO:__main__:2024-11-05 02:07:18 | Epoch: 0 | Step: 113110 | Dataset: 0-451584 | Loss: 0.706 | 913 ms/step , 6891.40 GFLOP/s , 15262.4 tokens/s INFO:__main__:2024-11-05 02:07:27 | Epoch: 0 | Step: 113120 | Dataset: 0-451904 | Loss: 0.808 | 912 ms/step , 6897.39 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 02:07:36 | Epoch: 0 | Step: 113130 | Dataset: 0-452224 | Loss: 0.864 | 914 ms/step , 6880.31 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 02:07:45 | Epoch: 0 | Step: 113140 | Dataset: 0-452544 | Loss: 0.849 | 913 ms/step , 6888.15 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 02:07:54 | Epoch: 0 | Step: 113150 | Dataset: 0-452864 | Loss: 0.702 | 913 ms/step , 6889.19 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 02:08:03 | Epoch: 0 | Step: 113160 | Dataset: 0-453184 | Loss: 0.830 | 915 ms/step , 6876.30 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-05 02:08:12 | Epoch: 0 | Step: 113170 | Dataset: 0-453504 | Loss: 0.754 | 913 ms/step , 6885.66 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 02:08:22 | Epoch: 0 | Step: 113180 | Dataset: 0-453824 | Loss: 0.839 | 914 ms/step , 6880.11 GFLOP/s , 17904.9 tokens/s INFO:__main__:2024-11-05 02:08:31 | Epoch: 0 | Step: 113190 | Dataset: 0-454144 | Loss: 0.709 | 913 ms/step , 6888.33 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 02:08:40 | Epoch: 0 | Step: 113200 | Dataset: 
0-454464 | Loss: 0.857 | 915 ms/step , 6875.11 GFLOP/s , 17898.9 tokens/s INFO:__main__:2024-11-05 02:08:41 | Validation | Step: 113200 | Val_loss: 0.785 | Best_val_loss: 0.7259 INFO:__main__:2024-11-05 02:08:51 | Epoch: 0 | Step: 113210 | Dataset: 0-454784 | Loss: 0.711 | 914 ms/step , 6880.10 GFLOP/s , 15259.6 tokens/s INFO:__main__:2024-11-05 02:09:00 | Epoch: 0 | Step: 113220 | Dataset: 0-455104 | Loss: 0.907 | 912 ms/step , 6893.04 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 02:09:09 | Epoch: 0 | Step: 113230 | Dataset: 0-455424 | Loss: 0.908 | 913 ms/step , 6890.66 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 02:09:18 | Epoch: 0 | Step: 113240 | Dataset: 0-455744 | Loss: 0.737 | 913 ms/step , 6890.84 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 02:09:27 | Epoch: 0 | Step: 113250 | Dataset: 0-456064 | Loss: 0.874 | 914 ms/step , 6879.11 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 02:09:36 | Epoch: 0 | Step: 113260 | Dataset: 0-456384 | Loss: 0.928 | 914 ms/step , 6883.69 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 02:09:45 | Epoch: 0 | Step: 113270 | Dataset: 0-456704 | Loss: 0.982 | 915 ms/step , 6870.57 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 02:09:55 | Epoch: 0 | Step: 113280 | Dataset: 0-457024 | Loss: 0.917 | 913 ms/step , 6889.01 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 02:10:04 | Epoch: 0 | Step: 113290 | Dataset: 0-457344 | Loss: 0.954 | 912 ms/step , 6894.39 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 02:10:13 | Epoch: 0 | Step: 113300 | Dataset: 0-457664 | Loss: 0.910 | 913 ms/step , 6890.55 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 02:10:14 | Validation | Step: 113300 | Val_loss: 0.708 | Best_val_loss: 0.7259 INFO:__main__:2024-11-05 02:10:14 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_021014_step_113300.pt` INFO:__main__:2024-11-05 02:10:25 | Epoch: 0 | Step: 113310 | Dataset: 0-457984 | Loss: 0.950 | 914 ms/step , 6884.06 
GFLOP/s , 13802.3 tokens/s INFO:__main__:2024-11-05 02:10:34 | Epoch: 0 | Step: 113320 | Dataset: 0-458304 | Loss: 0.950 | 912 ms/step , 6895.42 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 02:10:43 | Epoch: 0 | Step: 113330 | Dataset: 0-458624 | Loss: 0.921 | 913 ms/step , 6886.91 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 02:10:52 | Epoch: 0 | Step: 113340 | Dataset: 0-458944 | Loss: 1.043 | 914 ms/step , 6880.72 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 02:11:01 | Epoch: 0 | Step: 113350 | Dataset: 0-459264 | Loss: 0.828 | 914 ms/step , 6881.57 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-05 02:11:10 | Epoch: 0 | Step: 113360 | Dataset: 0-459584 | Loss: 0.906 | 915 ms/step , 6872.58 GFLOP/s , 17903.8 tokens/s INFO:__main__:2024-11-05 02:11:20 | Epoch: 0 | Step: 113370 | Dataset: 0-459904 | Loss: 0.940 | 912 ms/step , 6893.27 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 02:11:29 | Epoch: 0 | Step: 113380 | Dataset: 0-460224 | Loss: 0.936 | 914 ms/step , 6882.46 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 02:11:38 | Epoch: 0 | Step: 113390 | Dataset: 0-460544 | Loss: 0.892 | 913 ms/step , 6891.06 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 02:11:47 | Epoch: 0 | Step: 113400 | Dataset: 0-460864 | Loss: 1.011 | 913 ms/step , 6890.42 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 02:11:49 | Validation | Step: 113400 | Val_loss: 0.883 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:11:58 | Epoch: 0 | Step: 113410 | Dataset: 0-461184 | Loss: 0.868 | 913 ms/step , 6890.34 GFLOP/s , 15272.9 tokens/s INFO:__main__:2024-11-05 02:12:07 | Epoch: 0 | Step: 113420 | Dataset: 0-461504 | Loss: 0.890 | 913 ms/step , 6888.68 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 02:12:16 | Epoch: 0 | Step: 113430 | Dataset: 0-461824 | Loss: 0.935 | 914 ms/step , 6879.94 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 02:12:25 | Epoch: 0 | Step: 113440 | Dataset: 0-462144 | Loss: 0.974 | 913 ms/step , 6887.39 GFLOP/s , 
17923.9 tokens/s INFO:__main__:2024-11-05 02:12:34 | Epoch: 0 | Step: 113450 | Dataset: 0-462464 | Loss: 0.914 | 913 ms/step , 6887.64 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 02:12:43 | Epoch: 0 | Step: 113460 | Dataset: 0-462784 | Loss: 0.882 | 913 ms/step , 6888.31 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 02:12:53 | Epoch: 0 | Step: 113470 | Dataset: 0-463104 | Loss: 0.955 | 915 ms/step , 6875.55 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 02:13:02 | Epoch: 0 | Step: 113480 | Dataset: 0-463424 | Loss: 0.870 | 914 ms/step , 6881.89 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 02:13:11 | Epoch: 0 | Step: 113490 | Dataset: 0-463744 | Loss: 0.923 | 914 ms/step , 6881.82 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 02:13:20 | Epoch: 0 | Step: 113500 | Dataset: 0-464064 | Loss: 0.957 | 914 ms/step , 6880.62 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 02:13:22 | Validation | Step: 113500 | Val_loss: 0.821 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:13:31 | Epoch: 0 | Step: 113510 | Dataset: 0-464384 | Loss: 0.875 | 914 ms/step , 6881.46 GFLOP/s , 15274.3 tokens/s INFO:__main__:2024-11-05 02:13:40 | Epoch: 0 | Step: 113520 | Dataset: 0-464704 | Loss: 0.946 | 913 ms/step , 6887.50 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 02:13:49 | Epoch: 0 | Step: 113530 | Dataset: 0-465024 | Loss: 0.775 | 912 ms/step , 6893.24 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 02:13:58 | Epoch: 0 | Step: 113540 | Dataset: 0-465344 | Loss: 0.932 | 913 ms/step , 6890.39 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 02:14:07 | Epoch: 0 | Step: 113550 | Dataset: 0-465664 | Loss: 0.913 | 913 ms/step , 6888.79 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 02:14:16 | Epoch: 0 | Step: 113560 | Dataset: 0-465984 | Loss: 0.952 | 913 ms/step , 6890.78 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 02:14:26 | Epoch: 0 | Step: 113570 | Dataset: 0-466304 | Loss: 0.969 | 913 ms/step , 6891.48 GFLOP/s , 17934.5 
tokens/s INFO:__main__:2024-11-05 02:14:35 | Epoch: 0 | Step: 113580 | Dataset: 0-466624 | Loss: 0.997 | 913 ms/step , 6890.65 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 02:14:44 | Epoch: 0 | Step: 113590 | Dataset: 0-466944 | Loss: 0.845 | 912 ms/step , 6896.65 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 02:14:53 | Epoch: 0 | Step: 113600 | Dataset: 0-467264 | Loss: 0.939 | 913 ms/step , 6889.34 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 02:14:55 | Validation | Step: 113600 | Val_loss: 0.863 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:15:04 | Epoch: 0 | Step: 113610 | Dataset: 0-467584 | Loss: 0.899 | 913 ms/step , 6889.16 GFLOP/s , 15282.3 tokens/s INFO:__main__:2024-11-05 02:15:13 | Epoch: 0 | Step: 113620 | Dataset: 0-467904 | Loss: 0.779 | 913 ms/step , 6891.49 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 02:15:22 | Epoch: 0 | Step: 113630 | Dataset: 0-468224 | Loss: 1.012 | 914 ms/step , 6879.56 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 02:15:31 | Epoch: 0 | Step: 113640 | Dataset: 0-468544 | Loss: 0.835 | 912 ms/step , 6893.01 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 02:15:40 | Epoch: 0 | Step: 113650 | Dataset: 0-468864 | Loss: 0.823 | 912 ms/step , 6895.25 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 02:15:49 | Epoch: 0 | Step: 113660 | Dataset: 0-469184 | Loss: 0.968 | 916 ms/step , 6868.40 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 02:15:58 | Epoch: 0 | Step: 113670 | Dataset: 0-469504 | Loss: 0.815 | 914 ms/step , 6883.86 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 02:16:08 | Epoch: 0 | Step: 113680 | Dataset: 0-469824 | Loss: 0.814 | 914 ms/step , 6879.94 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 02:16:17 | Epoch: 0 | Step: 113690 | Dataset: 0-470144 | Loss: 0.786 | 912 ms/step , 6893.09 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 02:16:26 | Epoch: 0 | Step: 113700 | Dataset: 0-470464 | Loss: 0.904 | 914 ms/step , 6879.96 GFLOP/s , 17925.5 tokens/s 
INFO:__main__:2024-11-05 02:16:28 | Validation | Step: 113700 | Val_loss: 0.889 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:16:37 | Epoch: 0 | Step: 113710 | Dataset: 0-470784 | Loss: 0.980 | 914 ms/step , 6882.96 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-05 02:16:46 | Epoch: 0 | Step: 113720 | Dataset: 0-471104 | Loss: 0.892 | 915 ms/step , 6877.39 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 02:16:55 | Epoch: 0 | Step: 113730 | Dataset: 0-471424 | Loss: 0.891 | 913 ms/step , 6886.43 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 02:17:04 | Epoch: 0 | Step: 113740 | Dataset: 0-471744 | Loss: 0.835 | 913 ms/step , 6885.94 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 02:17:13 | Epoch: 0 | Step: 113750 | Dataset: 0-472064 | Loss: 0.901 | 913 ms/step , 6889.94 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 02:17:22 | Epoch: 0 | Step: 113760 | Dataset: 0-472384 | Loss: 0.805 | 913 ms/step , 6891.46 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 02:17:31 | Epoch: 0 | Step: 113770 | Dataset: 0-472704 | Loss: 0.810 | 913 ms/step , 6887.40 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 02:17:41 | Epoch: 0 | Step: 113780 | Dataset: 0-473024 | Loss: 0.829 | 914 ms/step , 6884.07 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 02:17:50 | Epoch: 0 | Step: 113790 | Dataset: 0-473344 | Loss: 0.679 | 913 ms/step , 6890.19 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 02:17:59 | Epoch: 0 | Step: 113800 | Dataset: 0-473664 | Loss: 0.816 | 912 ms/step , 6897.11 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 02:18:00 | Validation | Step: 113800 | Val_loss: 0.794 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:18:10 | Epoch: 0 | Step: 113810 | Dataset: 0-473984 | Loss: 0.909 | 913 ms/step , 6885.59 GFLOP/s , 15290.7 tokens/s INFO:__main__:2024-11-05 02:18:19 | Epoch: 0 | Step: 113820 | Dataset: 0-474304 | Loss: 0.914 | 913 ms/step , 6887.84 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 02:18:28 | Epoch: 0 | 
Step: 113830 | Dataset: 0-474624 | Loss: 0.954 | 914 ms/step , 6881.62 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 02:18:37 | Epoch: 0 | Step: 113840 | Dataset: 0-474944 | Loss: 0.785 | 914 ms/step , 6881.87 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 02:18:46 | Epoch: 0 | Step: 113850 | Dataset: 0-475264 | Loss: 0.908 | 914 ms/step , 6882.01 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 02:18:55 | Epoch: 0 | Step: 113860 | Dataset: 0-475584 | Loss: 0.837 | 912 ms/step , 6894.98 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 02:19:04 | Epoch: 0 | Step: 113870 | Dataset: 0-475904 | Loss: 0.812 | 913 ms/step , 6889.04 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 02:19:14 | Epoch: 0 | Step: 113880 | Dataset: 0-476224 | Loss: 0.906 | 915 ms/step , 6874.11 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 02:19:23 | Epoch: 0 | Step: 113890 | Dataset: 0-476544 | Loss: 0.880 | 913 ms/step , 6889.25 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 02:19:32 | Epoch: 0 | Step: 113900 | Dataset: 0-476864 | Loss: 0.865 | 913 ms/step , 6888.61 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 02:19:33 | Validation | Step: 113900 | Val_loss: 0.833 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:19:43 | Epoch: 0 | Step: 113910 | Dataset: 0-477184 | Loss: 0.961 | 913 ms/step , 6888.62 GFLOP/s , 15273.2 tokens/s INFO:__main__:2024-11-05 02:19:52 | Epoch: 0 | Step: 113920 | Dataset: 0-477504 | Loss: 0.802 | 913 ms/step , 6885.15 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 02:20:01 | Epoch: 0 | Step: 113930 | Dataset: 0-477824 | Loss: 0.815 | 914 ms/step , 6882.70 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 02:20:10 | Epoch: 0 | Step: 113940 | Dataset: 0-478144 | Loss: 0.824 | 914 ms/step , 6885.00 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 02:20:19 | Epoch: 0 | Step: 113950 | Dataset: 0-478464 | Loss: 0.793 | 913 ms/step , 6888.76 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 02:20:28 | Epoch: 0 | Step: 
113960 | Dataset: 0-478784 | Loss: 0.891 | 914 ms/step , 6882.45 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 02:20:37 | Epoch: 0 | Step: 113970 | Dataset: 0-479104 | Loss: 0.846 | 913 ms/step , 6891.87 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 02:20:47 | Epoch: 0 | Step: 113980 | Dataset: 0-479424 | Loss: 0.795 | 913 ms/step , 6886.26 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 02:20:56 | Epoch: 0 | Step: 113990 | Dataset: 0-479744 | Loss: 0.802 | 913 ms/step , 6886.27 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 02:21:05 | Epoch: 0 | Step: 114000 | Dataset: 0-480064 | Loss: 1.024 | 913 ms/step , 6885.39 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 02:21:06 | Validation | Step: 114000 | Val_loss: 0.846 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:21:06 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_022106_step_114000.pt` INFO:__main__:2024-11-05 02:21:17 | Epoch: 0 | Step: 114010 | Dataset: 0-480384 | Loss: 0.684 | 914 ms/step , 6883.68 GFLOP/s , 13799.5 tokens/s INFO:__main__:2024-11-05 02:21:26 | Epoch: 0 | Step: 114020 | Dataset: 0-480704 | Loss: 0.808 | 913 ms/step , 6888.16 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 02:21:35 | Epoch: 0 | Step: 114030 | Dataset: 0-481024 | Loss: 0.831 | 913 ms/step , 6892.50 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 02:21:44 | Epoch: 0 | Step: 114040 | Dataset: 0-481344 | Loss: 0.874 | 913 ms/step , 6887.79 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 02:21:53 | Epoch: 0 | Step: 114050 | Dataset: 0-481664 | Loss: 0.911 | 913 ms/step , 6892.24 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 02:22:02 | Epoch: 0 | Step: 114060 | Dataset: 0-481984 | Loss: 0.824 | 915 ms/step , 6874.93 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 02:22:12 | Epoch: 0 | Step: 114070 | Dataset: 0-482304 | Loss: 0.906 | 916 ms/step , 6867.37 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 02:22:21 | Epoch: 0 | Step: 114080 | 
Dataset: 0-482624 | Loss: 0.774 | 913 ms/step , 6887.49 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 02:22:30 | Epoch: 0 | Step: 114090 | Dataset: 0-482944 | Loss: 0.817 | 915 ms/step , 6874.27 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-05 02:22:39 | Epoch: 0 | Step: 114100 | Dataset: 0-483264 | Loss: 0.814 | 914 ms/step , 6881.27 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 02:22:41 | Validation | Step: 114100 | Val_loss: 0.759 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:22:50 | Epoch: 0 | Step: 114110 | Dataset: 0-483584 | Loss: 0.769 | 914 ms/step , 6884.12 GFLOP/s , 15268.1 tokens/s INFO:__main__:2024-11-05 02:22:59 | Epoch: 0 | Step: 114120 | Dataset: 0-483904 | Loss: 0.817 | 914 ms/step , 6883.73 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 02:23:08 | Epoch: 0 | Step: 114130 | Dataset: 0-484224 | Loss: 0.879 | 913 ms/step , 6885.84 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 02:23:17 | Epoch: 0 | Step: 114140 | Dataset: 0-484544 | Loss: 0.860 | 914 ms/step , 6878.39 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 02:23:26 | Epoch: 0 | Step: 114150 | Dataset: 0-484864 | Loss: 0.653 | 912 ms/step , 6893.68 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 02:23:35 | Epoch: 0 | Step: 114160 | Dataset: 0-485184 | Loss: 0.801 | 912 ms/step , 6895.95 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 02:23:45 | Epoch: 0 | Step: 114170 | Dataset: 0-485504 | Loss: 0.806 | 913 ms/step , 6892.48 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 02:23:54 | Epoch: 0 | Step: 114180 | Dataset: 0-485824 | Loss: 0.745 | 914 ms/step , 6884.94 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 02:24:03 | Epoch: 0 | Step: 114190 | Dataset: 0-486144 | Loss: 0.833 | 914 ms/step , 6884.01 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 02:24:12 | Epoch: 0 | Step: 114200 | Dataset: 0-486464 | Loss: 0.891 | 912 ms/step , 6893.17 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 02:24:14 | Validation | Step: 114200 | 
Val_loss: 0.806 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:24:23 | Epoch: 0 | Step: 114210 | Dataset: 0-486784 | Loss: 0.821 | 914 ms/step , 6878.83 GFLOP/s , 15265.0 tokens/s INFO:__main__:2024-11-05 02:24:32 | Epoch: 0 | Step: 114220 | Dataset: 0-487104 | Loss: 0.783 | 914 ms/step , 6879.52 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 02:24:41 | Epoch: 0 | Step: 114230 | Dataset: 0-487424 | Loss: 0.838 | 914 ms/step , 6882.64 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 02:24:50 | Epoch: 0 | Step: 114240 | Dataset: 0-487744 | Loss: 0.830 | 914 ms/step , 6884.82 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 02:24:59 | Epoch: 0 | Step: 114250 | Dataset: 0-488064 | Loss: 0.878 | 914 ms/step , 6879.00 GFLOP/s , 17908.0 tokens/s INFO:__main__:2024-11-05 02:25:08 | Epoch: 0 | Step: 114260 | Dataset: 0-488384 | Loss: 0.715 | 914 ms/step , 6883.94 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 02:25:18 | Epoch: 0 | Step: 114270 | Dataset: 0-488704 | Loss: 0.819 | 915 ms/step , 6870.93 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 02:25:27 | Epoch: 0 | Step: 114280 | Dataset: 0-489024 | Loss: 0.817 | 915 ms/step , 6874.16 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 02:25:36 | Epoch: 0 | Step: 114290 | Dataset: 0-489344 | Loss: 0.783 | 913 ms/step , 6885.68 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 02:25:45 | Epoch: 0 | Step: 114300 | Dataset: 0-489664 | Loss: 0.797 | 914 ms/step , 6882.24 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 02:25:47 | Validation | Step: 114300 | Val_loss: 0.872 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:25:56 | Epoch: 0 | Step: 114310 | Dataset: 0-489984 | Loss: 0.825 | 913 ms/step , 6886.46 GFLOP/s , 15262.0 tokens/s INFO:__main__:2024-11-05 02:26:05 | Epoch: 0 | Step: 114320 | Dataset: 0-490304 | Loss: 0.908 | 915 ms/step , 6874.17 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 02:26:14 | Epoch: 0 | Step: 114330 | Dataset: 0-490624 | Loss: 0.903 | 915 ms/step , 
6876.18 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-05 02:26:23 | Epoch: 0 | Step: 114340 | Dataset: 0-490944 | Loss: 0.870 | 915 ms/step , 6875.57 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 02:26:32 | Epoch: 0 | Step: 114350 | Dataset: 0-491264 | Loss: 0.783 | 915 ms/step , 6870.68 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 02:26:41 | Epoch: 0 | Step: 114360 | Dataset: 0-491584 | Loss: 0.835 | 913 ms/step , 6886.64 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 02:26:51 | Epoch: 0 | Step: 114370 | Dataset: 0-491904 | Loss: 0.714 | 915 ms/step , 6872.96 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-05 02:27:00 | Epoch: 0 | Step: 114380 | Dataset: 0-492224 | Loss: 0.832 | 913 ms/step , 6885.90 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 02:27:09 | Epoch: 0 | Step: 114390 | Dataset: 0-492544 | Loss: 0.856 | 915 ms/step , 6876.13 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 02:27:18 | Epoch: 0 | Step: 114400 | Dataset: 0-492864 | Loss: 0.824 | 913 ms/step , 6891.50 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 02:27:20 | Validation | Step: 114400 | Val_loss: 0.911 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:27:29 | Epoch: 0 | Step: 114410 | Dataset: 0-493184 | Loss: 0.806 | 914 ms/step , 6885.01 GFLOP/s , 15251.7 tokens/s INFO:__main__:2024-11-05 02:27:38 | Epoch: 0 | Step: 114420 | Dataset: 0-493504 | Loss: 0.865 | 913 ms/step , 6890.31 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 02:27:47 | Epoch: 0 | Step: 114430 | Dataset: 0-493824 | Loss: 0.827 | 916 ms/step , 6869.54 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 02:27:56 | Epoch: 0 | Step: 114440 | Dataset: 0-494144 | Loss: 0.720 | 913 ms/step , 6885.82 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 02:28:05 | Epoch: 0 | Step: 114450 | Dataset: 0-494464 | Loss: 0.820 | 914 ms/step , 6884.92 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 02:28:15 | Epoch: 0 | Step: 114460 | Dataset: 0-494784 | Loss: 0.845 | 914 ms/step , 6881.96 
GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 02:28:24 | Epoch: 0 | Step: 114470 | Dataset: 0-495104 | Loss: 0.719 | 914 ms/step , 6884.38 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 02:28:33 | Epoch: 0 | Step: 114480 | Dataset: 0-495424 | Loss: 0.709 | 913 ms/step , 6887.34 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 02:28:42 | Epoch: 0 | Step: 114490 | Dataset: 0-495744 | Loss: 0.772 | 913 ms/step , 6886.43 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 02:28:51 | Epoch: 0 | Step: 114500 | Dataset: 0-496064 | Loss: 0.742 | 915 ms/step , 6874.76 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-05 02:28:53 | Validation | Step: 114500 | Val_loss: 0.808 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:29:02 | Epoch: 0 | Step: 114510 | Dataset: 0-496384 | Loss: 0.813 | 914 ms/step , 6884.86 GFLOP/s , 15254.8 tokens/s INFO:__main__:2024-11-05 02:29:11 | Epoch: 0 | Step: 114520 | Dataset: 0-496704 | Loss: 0.825 | 914 ms/step , 6884.30 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 02:29:20 | Epoch: 0 | Step: 114530 | Dataset: 0-497024 | Loss: 0.854 | 914 ms/step , 6880.86 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 02:29:29 | Epoch: 0 | Step: 114540 | Dataset: 0-497344 | Loss: 0.868 | 913 ms/step , 6888.84 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 02:29:38 | Epoch: 0 | Step: 114550 | Dataset: 0-497664 | Loss: 0.812 | 914 ms/step , 6877.60 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 02:29:48 | Epoch: 0 | Step: 114560 | Dataset: 0-497984 | Loss: 0.730 | 915 ms/step , 6876.74 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-05 02:29:57 | Epoch: 0 | Step: 114570 | Dataset: 0-498304 | Loss: 0.811 | 914 ms/step , 6879.59 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-05 02:30:06 | Epoch: 0 | Step: 114580 | Dataset: 0-498624 | Loss: 0.688 | 914 ms/step , 6882.55 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 02:30:15 | Epoch: 0 | Step: 114590 | Dataset: 0-498944 | Loss: 0.816 | 914 ms/step , 6880.00 GFLOP/s , 
17913.8 tokens/s INFO:__main__:2024-11-05 02:30:24 | Epoch: 0 | Step: 114600 | Dataset: 0-499264 | Loss: 0.720 | 913 ms/step , 6888.03 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 02:30:26 | Validation | Step: 114600 | Val_loss: 0.871 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:30:35 | Epoch: 0 | Step: 114610 | Dataset: 0-499584 | Loss: 0.798 | 913 ms/step , 6886.15 GFLOP/s , 15257.6 tokens/s INFO:__main__:2024-11-05 02:30:44 | Epoch: 0 | Step: 114620 | Dataset: 0-499904 | Loss: 0.793 | 914 ms/step , 6882.25 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 02:30:53 | Epoch: 0 | Step: 114630 | Dataset: 0-500224 | Loss: 0.797 | 915 ms/step , 6875.31 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 02:31:02 | Epoch: 0 | Step: 114640 | Dataset: 0-500544 | Loss: 0.773 | 915 ms/step , 6876.62 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-05 02:31:11 | Epoch: 0 | Step: 114650 | Dataset: 0-500864 | Loss: 0.824 | 914 ms/step , 6883.61 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 02:31:21 | Epoch: 0 | Step: 114660 | Dataset: 0-501184 | Loss: 0.750 | 914 ms/step , 6880.78 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 02:31:30 | Epoch: 0 | Step: 114670 | Dataset: 0-501504 | Loss: 0.762 | 914 ms/step , 6884.69 GFLOP/s , 17909.0 tokens/s INFO:__main__:2024-11-05 02:31:39 | Epoch: 0 | Step: 114680 | Dataset: 0-501824 | Loss: 0.714 | 913 ms/step , 6886.19 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 02:31:48 | Epoch: 0 | Step: 114690 | Dataset: 0-502144 | Loss: 0.852 | 914 ms/step , 6877.81 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 02:31:57 | Epoch: 0 | Step: 114700 | Dataset: 0-502464 | Loss: 0.881 | 915 ms/step , 6871.46 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-05 02:31:59 | Validation | Step: 114700 | Val_loss: 0.832 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:32:08 | Epoch: 0 | Step: 114710 | Dataset: 0-502784 | Loss: 0.808 | 914 ms/step , 6878.85 GFLOP/s , 15270.2 tokens/s INFO:__main__:2024-11-05 02:32:17 
| Epoch: 0 | Step: 114720 | Dataset: 0-503104 | Loss: 0.788 | 915 ms/step , 6873.32 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 02:32:26 | Epoch: 0 | Step: 114730 | Dataset: 0-503424 | Loss: 0.709 | 913 ms/step , 6885.74 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 02:32:35 | Epoch: 0 | Step: 114740 | Dataset: 0-503744 | Loss: 0.840 | 915 ms/step , 6877.32 GFLOP/s , 17906.1 tokens/s INFO:__main__:2024-11-05 02:32:45 | Epoch: 0 | Step: 114750 | Dataset: 0-504064 | Loss: 0.782 | 913 ms/step , 6886.23 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 02:32:54 | Epoch: 0 | Step: 114760 | Dataset: 0-504384 | Loss: 0.668 | 913 ms/step , 6888.91 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 02:33:03 | Epoch: 0 | Step: 114770 | Dataset: 0-504704 | Loss: 0.843 | 914 ms/step , 6879.21 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 02:33:12 | Epoch: 0 | Step: 114780 | Dataset: 0-505024 | Loss: 0.803 | 914 ms/step , 6881.80 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 02:33:21 | Epoch: 0 | Step: 114790 | Dataset: 0-505344 | Loss: 0.873 | 913 ms/step , 6891.21 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 02:33:30 | Epoch: 0 | Step: 114800 | Dataset: 0-505664 | Loss: 0.800 | 915 ms/step , 6872.13 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 02:33:32 | Validation | Step: 114800 | Val_loss: 0.804 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:33:41 | Epoch: 0 | Step: 114810 | Dataset: 0-505984 | Loss: 0.848 | 913 ms/step , 6885.99 GFLOP/s , 15263.5 tokens/s INFO:__main__:2024-11-05 02:33:50 | Epoch: 0 | Step: 114820 | Dataset: 0-506304 | Loss: 0.774 | 914 ms/step , 6878.56 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 02:33:59 | Epoch: 0 | Step: 114830 | Dataset: 0-506624 | Loss: 0.763 | 913 ms/step , 6889.16 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 02:34:08 | Epoch: 0 | Step: 114840 | Dataset: 0-506944 | Loss: 0.828 | 915 ms/step , 6876.84 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 02:34:18 | Epoch: 0 
| Step: 114850 | Dataset: 0-507264 | Loss: 0.716 | 914 ms/step , 6881.60 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 02:34:27 | Epoch: 0 | Step: 114860 | Dataset: 0-507584 | Loss: 0.847 | 915 ms/step , 6876.01 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 02:34:36 | Epoch: 0 | Step: 114870 | Dataset: 0-507904 | Loss: 0.806 | 914 ms/step , 6881.94 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 02:34:45 | Epoch: 0 | Step: 114880 | Dataset: 0-508224 | Loss: 0.774 | 915 ms/step , 6875.81 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 02:34:54 | Epoch: 0 | Step: 114890 | Dataset: 0-508544 | Loss: 0.797 | 914 ms/step , 6883.77 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 02:35:03 | Epoch: 0 | Step: 114900 | Dataset: 0-508864 | Loss: 0.828 | 914 ms/step , 6878.60 GFLOP/s , 17906.2 tokens/s INFO:__main__:2024-11-05 02:35:05 | Validation | Step: 114900 | Val_loss: 0.861 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:35:14 | Epoch: 0 | Step: 114910 | Dataset: 0-509184 | Loss: 0.877 | 915 ms/step , 6875.45 GFLOP/s , 15262.0 tokens/s INFO:__main__:2024-11-05 02:35:23 | Epoch: 0 | Step: 114920 | Dataset: 0-509504 | Loss: 0.782 | 914 ms/step , 6883.45 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 02:35:32 | Epoch: 0 | Step: 114930 | Dataset: 0-509824 | Loss: 0.824 | 913 ms/step , 6890.06 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 02:35:41 | Epoch: 0 | Step: 114940 | Dataset: 0-510144 | Loss: 0.790 | 914 ms/step , 6879.47 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 02:35:51 | Epoch: 0 | Step: 114950 | Dataset: 0-510464 | Loss: 0.762 | 913 ms/step , 6886.50 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 02:36:00 | Epoch: 0 | Step: 114960 | Dataset: 0-510784 | Loss: 0.748 | 914 ms/step , 6879.95 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 02:36:09 | Epoch: 0 | Step: 114970 | Dataset: 0-511104 | Loss: 0.684 | 913 ms/step , 6890.48 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 02:36:18 | Epoch: 0 | Step: 
114980 | Dataset: 0-511424 | Loss: 0.748 | 914 ms/step , 6881.37 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 02:36:27 | Epoch: 0 | Step: 114990 | Dataset: 0-511744 | Loss: 0.817 | 913 ms/step , 6891.43 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 02:36:36 | Epoch: 0 | Step: 115000 | Dataset: 0-512064 | Loss: 0.901 | 914 ms/step , 6883.07 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 02:36:38 | Validation | Step: 115000 | Val_loss: 0.807 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:36:38 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_023638_step_115000.pt` INFO:__main__:2024-11-05 02:36:48 | Epoch: 0 | Step: 115010 | Dataset: 0-512384 | Loss: 0.696 | 913 ms/step , 6886.74 GFLOP/s , 13795.2 tokens/s INFO:__main__:2024-11-05 02:36:57 | Epoch: 0 | Step: 115020 | Dataset: 0-512704 | Loss: 0.791 | 914 ms/step , 6877.87 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 02:37:06 | Epoch: 0 | Step: 115030 | Dataset: 0-513024 | Loss: 0.751 | 913 ms/step , 6886.79 GFLOP/s , 17909.2 tokens/s INFO:__main__:2024-11-05 02:37:16 | Epoch: 0 | Step: 115040 | Dataset: 0-513344 | Loss: 0.806 | 913 ms/step , 6886.74 GFLOP/s , 17900.1 tokens/s INFO:__main__:2024-11-05 02:37:25 | Epoch: 0 | Step: 115050 | Dataset: 0-513664 | Loss: 0.810 | 914 ms/step , 6879.65 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 02:37:34 | Epoch: 0 | Step: 115060 | Dataset: 0-513984 | Loss: 0.774 | 914 ms/step , 6884.50 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 02:37:43 | Epoch: 0 | Step: 115070 | Dataset: 0-514304 | Loss: 0.598 | 913 ms/step , 6886.22 GFLOP/s , 17904.9 tokens/s INFO:__main__:2024-11-05 02:37:52 | Epoch: 0 | Step: 115080 | Dataset: 0-514624 | Loss: 0.842 | 915 ms/step , 6876.89 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 02:38:01 | Epoch: 0 | Step: 115090 | Dataset: 0-514944 | Loss: 0.712 | 916 ms/step , 6869.22 GFLOP/s , 17907.7 tokens/s INFO:__main__:2024-11-05 02:38:10 | Epoch: 0 | Step: 115100 | 
Dataset: 0-515264 | Loss: 0.782 | 914 ms/step , 6880.11 GFLOP/s , 17906.7 tokens/s INFO:__main__:2024-11-05 02:38:12 | Validation | Step: 115100 | Val_loss: 0.819 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:38:21 | Epoch: 0 | Step: 115110 | Dataset: 0-515584 | Loss: 0.784 | 915 ms/step , 6871.10 GFLOP/s , 15260.2 tokens/s INFO:__main__:2024-11-05 02:38:30 | Epoch: 0 | Step: 115120 | Dataset: 0-515904 | Loss: 0.818 | 913 ms/step , 6885.51 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-05 02:38:40 | Epoch: 0 | Step: 115130 | Dataset: 0-516224 | Loss: 0.722 | 913 ms/step , 6890.24 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 02:38:49 | Epoch: 0 | Step: 115140 | Dataset: 0-516544 | Loss: 0.827 | 914 ms/step , 6881.89 GFLOP/s , 17905.5 tokens/s INFO:__main__:2024-11-05 02:38:58 | Epoch: 0 | Step: 115150 | Dataset: 0-516864 | Loss: 0.871 | 914 ms/step , 6882.35 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-05 02:39:07 | Epoch: 0 | Step: 115160 | Dataset: 0-517184 | Loss: 0.738 | 914 ms/step , 6880.15 GFLOP/s , 17905.5 tokens/s INFO:__main__:2024-11-05 02:39:16 | Epoch: 0 | Step: 115170 | Dataset: 0-517504 | Loss: 0.837 | 915 ms/step , 6872.15 GFLOP/s , 17904.8 tokens/s INFO:__main__:2024-11-05 02:39:25 | Epoch: 0 | Step: 115180 | Dataset: 0-517824 | Loss: 0.758 | 914 ms/step , 6881.50 GFLOP/s , 17910.3 tokens/s INFO:__main__:2024-11-05 02:39:34 | Epoch: 0 | Step: 115190 | Dataset: 0-518144 | Loss: 0.819 | 913 ms/step , 6888.29 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 02:39:44 | Epoch: 0 | Step: 115200 | Dataset: 0-518464 | Loss: 0.663 | 913 ms/step , 6885.48 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 02:39:45 | Validation | Step: 115200 | Val_loss: 0.781 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:39:54 | Epoch: 0 | Step: 115210 | Dataset: 0-518784 | Loss: 0.759 | 913 ms/step , 6888.44 GFLOP/s , 15266.4 tokens/s INFO:__main__:2024-11-05 02:40:03 | Epoch: 0 | Step: 115220 | Dataset: 0-519104 | Loss: 0.855 | 914 ms/step , 
6884.00 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 02:40:13 | Epoch: 0 | Step: 115230 | Dataset: 0-519424 | Loss: 0.830 | 914 ms/step , 6884.62 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 02:40:22 | Epoch: 0 | Step: 115240 | Dataset: 0-519744 | Loss: 0.829 | 914 ms/step , 6882.25 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 02:40:31 | Epoch: 0 | Step: 115250 | Dataset: 0-520064 | Loss: 0.759 | 915 ms/step , 6876.89 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 02:40:40 | Epoch: 0 | Step: 115260 | Dataset: 0-520384 | Loss: 0.761 | 914 ms/step , 6878.30 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 02:40:49 | Epoch: 0 | Step: 115270 | Dataset: 0-520704 | Loss: 0.773 | 915 ms/step , 6876.36 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 02:40:58 | Epoch: 0 | Step: 115280 | Dataset: 0-521024 | Loss: 0.762 | 914 ms/step , 6885.00 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 02:41:07 | Epoch: 0 | Step: 115290 | Dataset: 0-521344 | Loss: 0.711 | 911 ms/step , 6902.06 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 02:41:17 | Epoch: 0 | Step: 115300 | Dataset: 0-521664 | Loss: 0.882 | 913 ms/step , 6885.90 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 02:41:18 | Validation | Step: 115300 | Val_loss: 0.846 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:41:27 | Epoch: 0 | Step: 115310 | Dataset: 0-521984 | Loss: 0.829 | 914 ms/step , 6880.13 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-05 02:41:36 | Epoch: 0 | Step: 115320 | Dataset: 0-522304 | Loss: 0.902 | 913 ms/step , 6886.37 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 02:41:46 | Epoch: 0 | Step: 115330 | Dataset: 0-522624 | Loss: 0.793 | 915 ms/step , 6874.05 GFLOP/s , 17907.9 tokens/s INFO:__main__:2024-11-05 02:41:55 | Epoch: 0 | Step: 115340 | Dataset: 0-522944 | Loss: 0.795 | 915 ms/step , 6875.56 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 02:42:04 | Epoch: 0 | Step: 115350 | Dataset: 0-523264 | Loss: 0.686 | 912 ms/step , 6895.92 
GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 02:42:13 | Epoch: 0 | Step: 115360 | Dataset: 0-523584 | Loss: 0.876 | 913 ms/step , 6887.78 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 02:42:22 | Epoch: 0 | Step: 115370 | Dataset: 0-523904 | Loss: 0.754 | 913 ms/step , 6885.41 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 02:42:31 | Epoch: 0 | Step: 115380 | Dataset: 0-524224 | Loss: 0.778 | 913 ms/step , 6890.71 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 02:42:40 | Epoch: 0 | Step: 115390 | Dataset: 0-524544 | Loss: 0.823 | 915 ms/step , 6877.02 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 02:42:50 | Epoch: 0 | Step: 115400 | Dataset: 0-524864 | Loss: 0.891 | 914 ms/step , 6883.52 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 02:42:51 | Validation | Step: 115400 | Val_loss: 0.816 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:43:00 | Epoch: 0 | Step: 115410 | Dataset: 0-525184 | Loss: 0.816 | 914 ms/step , 6882.13 GFLOP/s , 15258.6 tokens/s INFO:__main__:2024-11-05 02:43:09 | Epoch: 0 | Step: 115420 | Dataset: 0-525504 | Loss: 0.753 | 915 ms/step , 6876.40 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 02:43:19 | Epoch: 0 | Step: 115430 | Dataset: 0-525824 | Loss: 0.724 | 914 ms/step , 6881.40 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 02:43:28 | Epoch: 0 | Step: 115440 | Dataset: 0-526144 | Loss: 0.656 | 914 ms/step , 6882.54 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 02:43:37 | Epoch: 0 | Step: 115450 | Dataset: 0-526464 | Loss: 0.846 | 913 ms/step , 6885.72 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 02:43:46 | Epoch: 0 | Step: 115460 | Dataset: 0-526784 | Loss: 0.755 | 915 ms/step , 6870.68 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 02:43:55 | Epoch: 0 | Step: 115470 | Dataset: 0-527104 | Loss: 0.789 | 915 ms/step , 6872.75 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 02:44:04 | Epoch: 0 | Step: 115480 | Dataset: 0-527424 | Loss: 0.824 | 915 ms/step , 6872.32 GFLOP/s , 
17921.2 tokens/s INFO:__main__:2024-11-05 02:44:14 | Epoch: 0 | Step: 115490 | Dataset: 0-527744 | Loss: 0.802 | 915 ms/step , 6871.64 GFLOP/s , 17904.8 tokens/s INFO:__main__:2024-11-05 02:44:23 | Epoch: 0 | Step: 115500 | Dataset: 0-528064 | Loss: 0.777 | 915 ms/step , 6873.90 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 02:44:24 | Validation | Step: 115500 | Val_loss: 0.853 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:44:33 | Epoch: 0 | Step: 115510 | Dataset: 0-528384 | Loss: 0.787 | 914 ms/step , 6882.21 GFLOP/s , 15260.3 tokens/s INFO:__main__:2024-11-05 02:44:43 | Epoch: 0 | Step: 115520 | Dataset: 0-528704 | Loss: 0.868 | 915 ms/step , 6876.33 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 02:44:52 | Epoch: 0 | Step: 115530 | Dataset: 0-529024 | Loss: 0.839 | 914 ms/step , 6881.25 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 02:45:01 | Epoch: 0 | Step: 115540 | Dataset: 0-529344 | Loss: 0.809 | 914 ms/step , 6882.44 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 02:45:10 | Epoch: 0 | Step: 115550 | Dataset: 0-529664 | Loss: 0.698 | 914 ms/step , 6883.84 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 02:45:19 | Epoch: 0 | Step: 115560 | Dataset: 0-529984 | Loss: 0.872 | 914 ms/step , 6881.33 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 02:45:28 | Epoch: 0 | Step: 115570 | Dataset: 0-530304 | Loss: 0.803 | 914 ms/step , 6880.56 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 02:45:37 | Epoch: 0 | Step: 115580 | Dataset: 0-530624 | Loss: 0.868 | 912 ms/step , 6893.87 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 02:45:47 | Epoch: 0 | Step: 115590 | Dataset: 0-530944 | Loss: 0.843 | 915 ms/step , 6876.46 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 02:45:56 | Epoch: 0 | Step: 115600 | Dataset: 0-531264 | Loss: 0.851 | 915 ms/step , 6877.35 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 02:45:57 | Validation | Step: 115600 | Val_loss: 0.791 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:46:06 
| Epoch: 0 | Step: 115610 | Dataset: 0-531584 | Loss: 0.882 | 915 ms/step , 6875.56 GFLOP/s , 15267.8 tokens/s INFO:__main__:2024-11-05 02:46:16 | Epoch: 0 | Step: 115620 | Dataset: 0-531904 | Loss: 0.880 | 915 ms/step , 6876.51 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 02:46:25 | Epoch: 0 | Step: 115630 | Dataset: 0-532224 | Loss: 0.694 | 914 ms/step , 6879.61 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 02:46:34 | Epoch: 0 | Step: 115640 | Dataset: 0-532544 | Loss: 0.785 | 913 ms/step , 6892.26 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 02:46:43 | Epoch: 0 | Step: 115650 | Dataset: 0-532864 | Loss: 0.793 | 913 ms/step , 6889.25 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 02:46:52 | Epoch: 0 | Step: 115660 | Dataset: 0-533184 | Loss: 0.748 | 914 ms/step , 6884.03 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 02:47:01 | Epoch: 0 | Step: 115670 | Dataset: 0-533504 | Loss: 0.719 | 914 ms/step , 6882.75 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 02:47:10 | Epoch: 0 | Step: 115680 | Dataset: 0-533824 | Loss: 0.725 | 916 ms/step , 6866.43 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 02:47:20 | Epoch: 0 | Step: 115690 | Dataset: 0-534144 | Loss: 0.838 | 914 ms/step , 6880.26 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 02:47:29 | Epoch: 0 | Step: 115700 | Dataset: 0-534464 | Loss: 0.789 | 915 ms/step , 6875.49 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 02:47:30 | Validation | Step: 115700 | Val_loss: 0.816 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:47:39 | Epoch: 0 | Step: 115710 | Dataset: 0-534784 | Loss: 0.782 | 914 ms/step , 6883.75 GFLOP/s , 15261.2 tokens/s INFO:__main__:2024-11-05 02:47:49 | Epoch: 0 | Step: 115720 | Dataset: 0-535104 | Loss: 0.775 | 914 ms/step , 6883.56 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 02:47:58 | Epoch: 0 | Step: 115730 | Dataset: 0-535424 | Loss: 0.809 | 915 ms/step , 6877.28 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 02:48:07 | Epoch: 0 
| Step: 115740 | Dataset: 0-535744 | Loss: 0.790 | 914 ms/step , 6884.50 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 02:48:16 | Epoch: 0 | Step: 115750 | Dataset: 0-536064 | Loss: 0.730 | 914 ms/step , 6879.40 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 02:48:25 | Epoch: 0 | Step: 115760 | Dataset: 0-536384 | Loss: 0.893 | 915 ms/step , 6872.55 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 02:48:34 | Epoch: 0 | Step: 115770 | Dataset: 0-536704 | Loss: 0.749 | 914 ms/step , 6881.52 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 02:48:43 | Epoch: 0 | Step: 115780 | Dataset: 0-537024 | Loss: 0.752 | 914 ms/step , 6881.96 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 02:48:53 | Epoch: 0 | Step: 115790 | Dataset: 0-537344 | Loss: 0.699 | 913 ms/step , 6887.88 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 02:49:02 | Epoch: 0 | Step: 115800 | Dataset: 0-537664 | Loss: 0.854 | 913 ms/step , 6887.26 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 02:49:03 | Validation | Step: 115800 | Val_loss: 0.858 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:49:12 | Epoch: 0 | Step: 115810 | Dataset: 0-537984 | Loss: 0.831 | 914 ms/step , 6878.77 GFLOP/s , 15263.4 tokens/s INFO:__main__:2024-11-05 02:49:22 | Epoch: 0 | Step: 115820 | Dataset: 0-538304 | Loss: 0.816 | 914 ms/step , 6880.58 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 02:49:31 | Epoch: 0 | Step: 115830 | Dataset: 0-538624 | Loss: 0.672 | 914 ms/step , 6881.45 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 02:49:40 | Epoch: 0 | Step: 115840 | Dataset: 0-538944 | Loss: 0.777 | 913 ms/step , 6886.87 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 02:49:49 | Epoch: 0 | Step: 115850 | Dataset: 0-539264 | Loss: 0.812 | 915 ms/step , 6874.95 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 02:49:58 | Epoch: 0 | Step: 115860 | Dataset: 0-539584 | Loss: 0.765 | 913 ms/step , 6891.57 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 02:50:07 | Epoch: 0 | Step: 
115870 | Dataset: 0-539904 | Loss: 0.746 | 914 ms/step , 6878.04 GFLOP/s , 17908.5 tokens/s INFO:__main__:2024-11-05 02:50:16 | Epoch: 0 | Step: 115880 | Dataset: 0-540224 | Loss: 0.826 | 916 ms/step , 6862.92 GFLOP/s , 17902.4 tokens/s INFO:__main__:2024-11-05 02:50:26 | Epoch: 0 | Step: 115890 | Dataset: 0-540544 | Loss: 0.859 | 913 ms/step , 6889.07 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 02:50:35 | Epoch: 0 | Step: 115900 | Dataset: 0-540864 | Loss: 0.776 | 913 ms/step , 6891.17 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 02:50:36 | Validation | Step: 115900 | Val_loss: 0.822 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:50:45 | Epoch: 0 | Step: 115910 | Dataset: 0-541184 | Loss: 0.833 | 915 ms/step , 6872.04 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-05 02:50:55 | Epoch: 0 | Step: 115920 | Dataset: 0-541504 | Loss: 0.799 | 913 ms/step , 6889.19 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 02:51:04 | Epoch: 0 | Step: 115930 | Dataset: 0-541824 | Loss: 0.787 | 914 ms/step , 6878.54 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 02:51:13 | Epoch: 0 | Step: 115940 | Dataset: 0-542144 | Loss: 0.845 | 914 ms/step , 6883.16 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 02:51:22 | Epoch: 0 | Step: 115950 | Dataset: 0-542464 | Loss: 0.750 | 914 ms/step , 6883.36 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 02:51:31 | Epoch: 0 | Step: 115960 | Dataset: 0-542784 | Loss: 0.808 | 914 ms/step , 6883.55 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 02:51:40 | Epoch: 0 | Step: 115970 | Dataset: 0-543104 | Loss: 0.776 | 915 ms/step , 6873.65 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 02:51:50 | Epoch: 0 | Step: 115980 | Dataset: 0-543424 | Loss: 0.889 | 913 ms/step , 6886.26 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 02:51:59 | Epoch: 0 | Step: 115990 | Dataset: 0-543744 | Loss: 0.739 | 914 ms/step , 6882.58 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 02:52:08 | Epoch: 0 | Step: 116000 | 
Dataset: 0-544064 | Loss: 0.835 | 914 ms/step , 6880.11 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 02:52:09 | Validation | Step: 116000 | Val_loss: 0.894 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:52:09 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_025209_step_116000.pt` INFO:__main__:2024-11-05 02:52:20 | Epoch: 0 | Step: 116010 | Dataset: 0-544384 | Loss: 0.856 | 914 ms/step , 6882.77 GFLOP/s , 13797.9 tokens/s INFO:__main__:2024-11-05 02:52:29 | Epoch: 0 | Step: 116020 | Dataset: 0-544704 | Loss: 0.841 | 916 ms/step , 6865.81 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 02:52:38 | Epoch: 0 | Step: 116030 | Dataset: 0-545024 | Loss: 0.776 | 914 ms/step , 6880.15 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 02:52:47 | Epoch: 0 | Step: 116040 | Dataset: 0-545344 | Loss: 0.724 | 913 ms/step , 6891.72 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 02:52:56 | Epoch: 0 | Step: 116050 | Dataset: 0-545664 | Loss: 0.752 | 915 ms/step , 6874.76 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 02:53:05 | Epoch: 0 | Step: 116060 | Dataset: 0-545984 | Loss: 0.707 | 913 ms/step , 6886.10 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 02:53:15 | Epoch: 0 | Step: 116070 | Dataset: 0-546304 | Loss: 0.788 | 913 ms/step , 6886.05 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 02:53:24 | Epoch: 0 | Step: 116080 | Dataset: 0-546624 | Loss: 0.773 | 914 ms/step , 6879.86 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 02:53:33 | Epoch: 0 | Step: 116090 | Dataset: 0-546944 | Loss: 0.755 | 912 ms/step , 6896.17 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 02:53:42 | Epoch: 0 | Step: 116100 | Dataset: 0-547264 | Loss: 0.876 | 913 ms/step , 6889.04 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 02:53:44 | Validation | Step: 116100 | Val_loss: 0.870 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:53:53 | Epoch: 0 | Step: 116110 | Dataset: 0-547584 | Loss: 0.839 | 913 ms/step , 
6886.81 GFLOP/s , 15275.0 tokens/s INFO:__main__:2024-11-05 02:54:02 | Epoch: 0 | Step: 116120 | Dataset: 0-547904 | Loss: 0.721 | 913 ms/step , 6885.63 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 02:54:11 | Epoch: 0 | Step: 116130 | Dataset: 0-548224 | Loss: 0.829 | 913 ms/step , 6885.76 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 02:54:20 | Epoch: 0 | Step: 116140 | Dataset: 0-548544 | Loss: 0.785 | 912 ms/step , 6892.80 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 02:54:29 | Epoch: 0 | Step: 116150 | Dataset: 0-548864 | Loss: 0.699 | 913 ms/step , 6891.88 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 02:54:38 | Epoch: 0 | Step: 116160 | Dataset: 0-549184 | Loss: 0.765 | 913 ms/step , 6885.75 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 02:54:48 | Epoch: 0 | Step: 116170 | Dataset: 0-549504 | Loss: 0.750 | 914 ms/step , 6883.52 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 02:54:57 | Epoch: 0 | Step: 116180 | Dataset: 0-549824 | Loss: 0.768 | 914 ms/step , 6879.23 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 02:55:06 | Epoch: 0 | Step: 116190 | Dataset: 0-550144 | Loss: 0.766 | 914 ms/step , 6881.24 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 02:55:15 | Epoch: 0 | Step: 116200 | Dataset: 0-550464 | Loss: 0.874 | 913 ms/step , 6886.54 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 02:55:17 | Validation | Step: 116200 | Val_loss: 0.837 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:55:26 | Epoch: 0 | Step: 116210 | Dataset: 0-550784 | Loss: 0.765 | 913 ms/step , 6888.90 GFLOP/s , 15262.2 tokens/s INFO:__main__:2024-11-05 02:55:35 | Epoch: 0 | Step: 116220 | Dataset: 0-551104 | Loss: 0.754 | 914 ms/step , 6882.71 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 02:55:44 | Epoch: 0 | Step: 116230 | Dataset: 0-551424 | Loss: 0.790 | 914 ms/step , 6882.05 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 02:55:53 | Epoch: 0 | Step: 116240 | Dataset: 0-551744 | Loss: 0.860 | 914 ms/step , 6881.17 
GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 02:56:02 | Epoch: 0 | Step: 116250 | Dataset: 0-552064 | Loss: 0.667 | 914 ms/step , 6879.94 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 02:56:11 | Epoch: 0 | Step: 116260 | Dataset: 0-552384 | Loss: 0.840 | 914 ms/step , 6881.95 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 02:56:21 | Epoch: 0 | Step: 116270 | Dataset: 0-552704 | Loss: 0.808 | 915 ms/step , 6873.73 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-05 02:56:30 | Epoch: 0 | Step: 116280 | Dataset: 0-553024 | Loss: 0.937 | 916 ms/step , 6869.53 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 02:56:39 | Epoch: 0 | Step: 116290 | Dataset: 0-553344 | Loss: 0.868 | 913 ms/step , 6886.12 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 02:56:48 | Epoch: 0 | Step: 116300 | Dataset: 0-553664 | Loss: 0.807 | 915 ms/step , 6876.94 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 02:56:50 | Validation | Step: 116300 | Val_loss: 0.811 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:56:59 | Epoch: 0 | Step: 116310 | Dataset: 0-553984 | Loss: 0.846 | 914 ms/step , 6883.12 GFLOP/s , 15260.4 tokens/s INFO:__main__:2024-11-05 02:57:08 | Epoch: 0 | Step: 116320 | Dataset: 0-554304 | Loss: 0.840 | 913 ms/step , 6889.07 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 02:57:17 | Epoch: 0 | Step: 116330 | Dataset: 0-554624 | Loss: 0.785 | 914 ms/step , 6884.16 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 02:57:26 | Epoch: 0 | Step: 116340 | Dataset: 0-554944 | Loss: 0.836 | 914 ms/step , 6878.97 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 02:57:35 | Epoch: 0 | Step: 116350 | Dataset: 0-555264 | Loss: 0.817 | 914 ms/step , 6883.83 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 02:57:44 | Epoch: 0 | Step: 116360 | Dataset: 0-555584 | Loss: 0.713 | 913 ms/step , 6892.38 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 02:57:54 | Epoch: 0 | Step: 116370 | Dataset: 0-555904 | Loss: 0.820 | 913 ms/step , 6888.18 GFLOP/s , 
17918.2 tokens/s INFO:__main__:2024-11-05 02:58:03 | Epoch: 0 | Step: 116380 | Dataset: 0-556224 | Loss: 0.945 | 912 ms/step , 6895.44 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 02:58:12 | Epoch: 0 | Step: 116390 | Dataset: 0-556544 | Loss: 0.792 | 914 ms/step , 6880.92 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 02:58:21 | Epoch: 0 | Step: 116400 | Dataset: 0-556864 | Loss: 0.748 | 915 ms/step , 6871.99 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 02:58:23 | Validation | Step: 116400 | Val_loss: 0.517 | Best_val_loss: 0.7084 INFO:__main__:2024-11-05 02:58:23 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_025823_step_116400.pt` INFO:__main__:2024-11-05 02:58:33 | Epoch: 0 | Step: 116410 | Dataset: 0-557184 | Loss: 0.791 | 916 ms/step , 6869.44 GFLOP/s , 13789.9 tokens/s INFO:__main__:2024-11-05 02:58:42 | Epoch: 0 | Step: 116420 | Dataset: 0-557504 | Loss: 0.740 | 914 ms/step , 6878.02 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 02:58:51 | Epoch: 0 | Step: 116430 | Dataset: 0-557824 | Loss: 0.779 | 913 ms/step , 6891.95 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 02:59:00 | Epoch: 0 | Step: 116440 | Dataset: 0-558144 | Loss: 0.764 | 913 ms/step , 6888.43 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 02:59:09 | Epoch: 0 | Step: 116450 | Dataset: 0-558464 | Loss: 0.739 | 913 ms/step , 6888.34 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 02:59:19 | Epoch: 0 | Step: 116460 | Dataset: 0-558784 | Loss: 0.800 | 915 ms/step , 6876.05 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 02:59:28 | Epoch: 0 | Step: 116470 | Dataset: 0-559104 | Loss: 0.722 | 913 ms/step , 6887.18 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 02:59:37 | Epoch: 0 | Step: 116480 | Dataset: 0-559424 | Loss: 0.760 | 915 ms/step , 6876.13 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 02:59:46 | Epoch: 0 | Step: 116490 | Dataset: 0-559744 | Loss: 0.716 | 913 ms/step , 6889.55 GFLOP/s , 17918.2 
tokens/s INFO:__main__:2024-11-05 02:59:55 | Epoch: 0 | Step: 116500 | Dataset: 0-560064 | Loss: 0.858 | 914 ms/step , 6882.42 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 02:59:57 | Validation | Step: 116500 | Val_loss: 0.465 | Best_val_loss: 0.5172 INFO:__main__:2024-11-05 02:59:57 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_025957_step_116500.pt` INFO:__main__:2024-11-05 03:00:07 | Epoch: 0 | Step: 116510 | Dataset: 0-560384 | Loss: 0.804 | 915 ms/step , 6876.96 GFLOP/s , 13841.1 tokens/s INFO:__main__:2024-11-05 03:00:16 | Epoch: 0 | Step: 116520 | Dataset: 0-560704 | Loss: 0.747 | 913 ms/step , 6887.31 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 03:00:25 | Epoch: 0 | Step: 116530 | Dataset: 0-561024 | Loss: 0.842 | 914 ms/step , 6880.56 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 03:00:34 | Epoch: 0 | Step: 116540 | Dataset: 0-561344 | Loss: 0.801 | 913 ms/step , 6888.64 GFLOP/s , 17892.5 tokens/s INFO:__main__:2024-11-05 03:00:44 | Epoch: 0 | Step: 116550 | Dataset: 0-561664 | Loss: 0.765 | 913 ms/step , 6888.80 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 03:00:53 | Epoch: 0 | Step: 116560 | Dataset: 0-561984 | Loss: 0.823 | 914 ms/step , 6879.30 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 03:01:02 | Epoch: 0 | Step: 116570 | Dataset: 0-562304 | Loss: 0.676 | 913 ms/step , 6892.39 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 03:01:11 | Epoch: 0 | Step: 116580 | Dataset: 0-562624 | Loss: 0.754 | 914 ms/step , 6882.77 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 03:01:20 | Epoch: 0 | Step: 116590 | Dataset: 0-562944 | Loss: 0.658 | 913 ms/step , 6888.69 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 03:01:29 | Epoch: 0 | Step: 116600 | Dataset: 0-563264 | Loss: 0.759 | 914 ms/step , 6881.67 GFLOP/s , 17907.4 tokens/s INFO:__main__:2024-11-05 03:01:31 | Validation | Step: 116600 | Val_loss: 0.448 | Best_val_loss: 0.4645 INFO:__main__:2024-11-05 03:01:31 | Saving 
full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_030131_step_116600.pt` INFO:__main__:2024-11-05 03:01:41 | Epoch: 0 | Step: 116610 | Dataset: 0-563584 | Loss: 0.832 | 921 ms/step , 6832.67 GFLOP/s , 13834.4 tokens/s INFO:__main__:2024-11-05 03:01:50 | Epoch: 0 | Step: 116620 | Dataset: 0-563904 | Loss: 0.796 | 914 ms/step , 6883.12 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 03:01:59 | Epoch: 0 | Step: 116630 | Dataset: 0-564224 | Loss: 0.752 | 914 ms/step , 6880.62 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-05 03:02:09 | Epoch: 0 | Step: 116640 | Dataset: 0-564544 | Loss: 0.700 | 914 ms/step , 6882.60 GFLOP/s , 17906.1 tokens/s INFO:__main__:2024-11-05 03:02:18 | Epoch: 0 | Step: 116650 | Dataset: 0-564864 | Loss: 0.794 | 914 ms/step , 6877.82 GFLOP/s , 17899.0 tokens/s INFO:__main__:2024-11-05 03:02:27 | Epoch: 0 | Step: 116660 | Dataset: 0-565184 | Loss: 0.756 | 915 ms/step , 6874.43 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 03:02:36 | Epoch: 0 | Step: 116670 | Dataset: 0-565504 | Loss: 0.740 | 914 ms/step , 6881.45 GFLOP/s , 17908.8 tokens/s INFO:__main__:2024-11-05 03:02:45 | Epoch: 0 | Step: 116680 | Dataset: 0-565824 | Loss: 0.693 | 913 ms/step , 6888.42 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 03:02:54 | Epoch: 0 | Step: 116690 | Dataset: 0-566144 | Loss: 0.834 | 914 ms/step , 6880.65 GFLOP/s , 17909.0 tokens/s INFO:__main__:2024-11-05 03:03:03 | Epoch: 0 | Step: 116700 | Dataset: 0-566464 | Loss: 0.785 | 914 ms/step , 6883.48 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 03:03:05 | Validation | Step: 116700 | Val_loss: 0.776 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:03:14 | Epoch: 0 | Step: 116710 | Dataset: 0-566784 | Loss: 0.750 | 913 ms/step , 6886.73 GFLOP/s , 15259.3 tokens/s INFO:__main__:2024-11-05 03:03:23 | Epoch: 0 | Step: 116720 | Dataset: 0-567104 | Loss: 0.838 | 914 ms/step , 6877.93 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 03:03:33 | Epoch: 0 | Step: 
116730 | Dataset: 0-567424 | Loss: 0.788 | 915 ms/step , 6875.91 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 03:03:42 | Epoch: 0 | Step: 116740 | Dataset: 0-567744 | Loss: 0.759 | 914 ms/step , 6883.00 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 03:03:51 | Epoch: 0 | Step: 116750 | Dataset: 0-568064 | Loss: 0.815 | 915 ms/step , 6871.64 GFLOP/s , 17903.3 tokens/s INFO:__main__:2024-11-05 03:04:00 | Epoch: 0 | Step: 116760 | Dataset: 0-568384 | Loss: 0.726 | 913 ms/step , 6888.14 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 03:04:09 | Epoch: 0 | Step: 116770 | Dataset: 0-568704 | Loss: 0.819 | 914 ms/step , 6877.73 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 03:04:18 | Epoch: 0 | Step: 116780 | Dataset: 0-569024 | Loss: 0.827 | 914 ms/step , 6883.80 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 03:04:27 | Epoch: 0 | Step: 116790 | Dataset: 0-569344 | Loss: 0.883 | 916 ms/step , 6864.40 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 03:04:37 | Epoch: 0 | Step: 116800 | Dataset: 0-569664 | Loss: 0.740 | 914 ms/step , 6879.53 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 03:04:38 | Validation | Step: 116800 | Val_loss: 0.729 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:04:47 | Epoch: 0 | Step: 116810 | Dataset: 0-569984 | Loss: 0.829 | 913 ms/step , 6886.62 GFLOP/s , 15253.0 tokens/s INFO:__main__:2024-11-05 03:04:56 | Epoch: 0 | Step: 116820 | Dataset: 0-570304 | Loss: 0.834 | 914 ms/step , 6882.10 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-05 03:05:06 | Epoch: 0 | Step: 116830 | Dataset: 0-570624 | Loss: 0.690 | 915 ms/step , 6874.64 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 03:05:15 | Epoch: 0 | Step: 116840 | Dataset: 0-570944 | Loss: 0.783 | 916 ms/step , 6862.79 GFLOP/s , 17904.3 tokens/s INFO:__main__:2024-11-05 03:05:24 | Epoch: 0 | Step: 116850 | Dataset: 0-571264 | Loss: 0.818 | 914 ms/step , 6880.59 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 03:05:33 | Epoch: 0 | Step: 116860 | 
Dataset: 0-571584 | Loss: 0.779 | 914 ms/step , 6880.79 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-05 03:05:42 | Epoch: 0 | Step: 116870 | Dataset: 0-571904 | Loss: 0.734 | 915 ms/step , 6877.27 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 03:05:51 | Epoch: 0 | Step: 116880 | Dataset: 0-572224 | Loss: 0.777 | 917 ms/step , 6861.83 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 03:06:00 | Epoch: 0 | Step: 116890 | Dataset: 0-572544 | Loss: 0.717 | 914 ms/step , 6878.94 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 03:06:10 | Epoch: 0 | Step: 116900 | Dataset: 0-572864 | Loss: 0.791 | 913 ms/step , 6885.63 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 03:06:11 | Validation | Step: 116900 | Val_loss: 0.770 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:06:20 | Epoch: 0 | Step: 116910 | Dataset: 0-573184 | Loss: 0.743 | 913 ms/step , 6887.02 GFLOP/s , 15269.4 tokens/s INFO:__main__:2024-11-05 03:06:29 | Epoch: 0 | Step: 116920 | Dataset: 0-573504 | Loss: 0.817 | 914 ms/step , 6884.85 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 03:06:39 | Epoch: 0 | Step: 116930 | Dataset: 0-573824 | Loss: 0.779 | 913 ms/step , 6889.83 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 03:06:48 | Epoch: 0 | Step: 116940 | Dataset: 0-574144 | Loss: 0.760 | 914 ms/step , 6878.09 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 03:06:57 | Epoch: 0 | Step: 116950 | Dataset: 0-574464 | Loss: 0.916 | 914 ms/step , 6878.89 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 03:07:06 | Epoch: 0 | Step: 116960 | Dataset: 0-574784 | Loss: 0.857 | 915 ms/step , 6875.96 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 03:07:15 | Epoch: 0 | Step: 116970 | Dataset: 0-575104 | Loss: 0.798 | 914 ms/step , 6882.20 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 03:07:24 | Epoch: 0 | Step: 116980 | Dataset: 0-575424 | Loss: 0.780 | 912 ms/step , 6893.88 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 03:07:33 | Epoch: 0 | Step: 116990 | Dataset: 
0-575744 | Loss: 0.767 | 915 ms/step , 6876.69 GFLOP/s , 17902.4 tokens/s INFO:__main__:2024-11-05 03:07:43 | Epoch: 0 | Step: 117000 | Dataset: 0-576064 | Loss: 0.782 | 915 ms/step , 6877.05 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 03:07:44 | Validation | Step: 117000 | Val_loss: 0.760 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:07:44 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_030744_step_117000.pt` INFO:__main__:2024-11-05 03:07:55 | Epoch: 0 | Step: 117010 | Dataset: 0-576384 | Loss: 0.738 | 914 ms/step , 6884.06 GFLOP/s , 13773.1 tokens/s INFO:__main__:2024-11-05 03:08:04 | Epoch: 0 | Step: 117020 | Dataset: 0-576704 | Loss: 0.733 | 917 ms/step , 6861.49 GFLOP/s , 17884.0 tokens/s INFO:__main__:2024-11-05 03:08:13 | Epoch: 0 | Step: 117030 | Dataset: 0-577024 | Loss: 0.728 | 915 ms/step , 6872.30 GFLOP/s , 17885.3 tokens/s INFO:__main__:2024-11-05 03:08:22 | Epoch: 0 | Step: 117040 | Dataset: 0-577344 | Loss: 0.801 | 915 ms/step , 6877.10 GFLOP/s , 17908.1 tokens/s INFO:__main__:2024-11-05 03:08:31 | Epoch: 0 | Step: 117050 | Dataset: 0-577664 | Loss: 0.705 | 914 ms/step , 6880.44 GFLOP/s , 17903.5 tokens/s INFO:__main__:2024-11-05 03:08:40 | Epoch: 0 | Step: 117060 | Dataset: 0-577984 | Loss: 0.873 | 915 ms/step , 6875.50 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-05 03:08:49 | Epoch: 0 | Step: 117070 | Dataset: 0-578304 | Loss: 0.804 | 914 ms/step , 6882.23 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 03:08:59 | Epoch: 0 | Step: 117080 | Dataset: 0-578624 | Loss: 0.737 | 912 ms/step , 6892.95 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 03:09:08 | Epoch: 0 | Step: 117090 | Dataset: 0-578944 | Loss: 0.733 | 914 ms/step , 6884.29 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 03:09:17 | Epoch: 0 | Step: 117100 | Dataset: 0-579264 | Loss: 0.746 | 912 ms/step , 6896.12 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 03:09:18 | Validation | Step: 117100 | Val_loss: 0.771 | 
Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:09:28 | Epoch: 0 | Step: 117110 | Dataset: 0-579584 | Loss: 0.781 | 913 ms/step , 6886.43 GFLOP/s , 15259.8 tokens/s INFO:__main__:2024-11-05 03:09:37 | Epoch: 0 | Step: 117120 | Dataset: 0-579904 | Loss: 0.708 | 914 ms/step , 6879.84 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 03:09:46 | Epoch: 0 | Step: 117130 | Dataset: 0-580224 | Loss: 0.899 | 915 ms/step , 6872.16 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-05 03:09:55 | Epoch: 0 | Step: 117140 | Dataset: 0-580544 | Loss: 0.804 | 914 ms/step , 6880.85 GFLOP/s , 17906.7 tokens/s INFO:__main__:2024-11-05 03:10:04 | Epoch: 0 | Step: 117150 | Dataset: 0-580864 | Loss: 0.750 | 914 ms/step , 6882.25 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 03:10:13 | Epoch: 0 | Step: 117160 | Dataset: 0-581184 | Loss: 0.694 | 913 ms/step , 6887.59 GFLOP/s , 17904.2 tokens/s INFO:__main__:2024-11-05 03:10:22 | Epoch: 0 | Step: 117170 | Dataset: 0-581504 | Loss: 0.810 | 913 ms/step , 6885.41 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 03:10:32 | Epoch: 0 | Step: 117180 | Dataset: 0-581824 | Loss: 0.814 | 914 ms/step , 6883.73 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 03:10:41 | Epoch: 0 | Step: 117190 | Dataset: 0-582144 | Loss: 0.749 | 913 ms/step , 6887.32 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 03:10:50 | Epoch: 0 | Step: 117200 | Dataset: 0-582464 | Loss: 0.717 | 913 ms/step , 6886.72 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 03:10:52 | Validation | Step: 117200 | Val_loss: 0.809 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:11:01 | Epoch: 0 | Step: 117210 | Dataset: 0-582784 | Loss: 0.760 | 913 ms/step , 6886.86 GFLOP/s , 15258.1 tokens/s INFO:__main__:2024-11-05 03:11:10 | Epoch: 0 | Step: 117220 | Dataset: 0-583104 | Loss: 0.765 | 913 ms/step , 6889.82 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 03:11:19 | Epoch: 0 | Step: 117230 | Dataset: 0-583424 | Loss: 0.792 | 913 ms/step , 6886.27 GFLOP/s , 
17914.8 tokens/s INFO:__main__:2024-11-05 03:11:28 | Epoch: 0 | Step: 117240 | Dataset: 0-583744 | Loss: 0.820 | 915 ms/step , 6872.08 GFLOP/s , 17906.2 tokens/s INFO:__main__:2024-11-05 03:11:37 | Epoch: 0 | Step: 117250 | Dataset: 0-584064 | Loss: 0.834 | 916 ms/step , 6866.52 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 03:11:46 | Epoch: 0 | Step: 117260 | Dataset: 0-584384 | Loss: 0.773 | 913 ms/step , 6889.27 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 03:11:56 | Epoch: 0 | Step: 117270 | Dataset: 0-584704 | Loss: 0.785 | 914 ms/step , 6881.43 GFLOP/s , 17906.2 tokens/s INFO:__main__:2024-11-05 03:12:05 | Epoch: 0 | Step: 117280 | Dataset: 0-585024 | Loss: 0.797 | 914 ms/step , 6880.47 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 03:12:14 | Epoch: 0 | Step: 117290 | Dataset: 0-585344 | Loss: 0.870 | 914 ms/step , 6879.74 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 03:12:23 | Epoch: 0 | Step: 117300 | Dataset: 0-585664 | Loss: 0.764 | 914 ms/step , 6877.81 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 03:12:25 | Validation | Step: 117300 | Val_loss: 0.746 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:12:34 | Epoch: 0 | Step: 117310 | Dataset: 0-585984 | Loss: 0.785 | 913 ms/step , 6885.89 GFLOP/s , 15268.5 tokens/s INFO:__main__:2024-11-05 03:12:43 | Epoch: 0 | Step: 117320 | Dataset: 0-586304 | Loss: 0.839 | 916 ms/step , 6868.02 GFLOP/s , 17906.1 tokens/s INFO:__main__:2024-11-05 03:12:52 | Epoch: 0 | Step: 117330 | Dataset: 0-586624 | Loss: 0.765 | 915 ms/step , 6876.97 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 03:13:01 | Epoch: 0 | Step: 117340 | Dataset: 0-586944 | Loss: 0.714 | 913 ms/step , 6885.27 GFLOP/s , 17907.4 tokens/s INFO:__main__:2024-11-05 03:13:10 | Epoch: 0 | Step: 117350 | Dataset: 0-587264 | Loss: 0.712 | 915 ms/step , 6877.09 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-05 03:13:19 | Epoch: 0 | Step: 117360 | Dataset: 0-587584 | Loss: 0.856 | 915 ms/step , 6877.30 GFLOP/s , 17907.2 
tokens/s INFO:__main__:2024-11-05 03:13:29 | Epoch: 0 | Step: 117370 | Dataset: 0-587904 | Loss: 0.840 | 914 ms/step , 6879.53 GFLOP/s , 17910.3 tokens/s INFO:__main__:2024-11-05 03:13:38 | Epoch: 0 | Step: 117380 | Dataset: 0-588224 | Loss: 0.748 | 914 ms/step , 6878.97 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 03:13:47 | Epoch: 0 | Step: 117390 | Dataset: 0-588544 | Loss: 0.721 | 913 ms/step , 6887.42 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 03:13:56 | Epoch: 0 | Step: 117400 | Dataset: 0-588864 | Loss: 0.692 | 914 ms/step , 6883.83 GFLOP/s , 17911.8 tokens/s INFO:__main__:2024-11-05 03:13:58 | Validation | Step: 117400 | Val_loss: 0.771 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:14:07 | Epoch: 0 | Step: 117410 | Dataset: 0-589184 | Loss: 0.685 | 915 ms/step , 6870.89 GFLOP/s , 15252.5 tokens/s INFO:__main__:2024-11-05 03:14:16 | Epoch: 0 | Step: 117420 | Dataset: 0-589504 | Loss: 0.765 | 914 ms/step , 6883.78 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 03:14:25 | Epoch: 0 | Step: 117430 | Dataset: 0-589824 | Loss: 0.801 | 913 ms/step , 6886.02 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 03:14:34 | Epoch: 0 | Step: 117440 | Dataset: 0-590144 | Loss: 0.714 | 914 ms/step , 6882.36 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 03:14:43 | Epoch: 0 | Step: 117450 | Dataset: 0-590464 | Loss: 0.781 | 914 ms/step , 6882.29 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-05 03:14:52 | Epoch: 0 | Step: 117460 | Dataset: 0-590784 | Loss: 0.721 | 914 ms/step , 6884.26 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 03:15:02 | Epoch: 0 | Step: 117470 | Dataset: 0-591104 | Loss: 0.842 | 915 ms/step , 6877.30 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 03:15:11 | Epoch: 0 | Step: 117480 | Dataset: 0-591424 | Loss: 0.836 | 914 ms/step , 6880.38 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 03:15:20 | Epoch: 0 | Step: 117490 | Dataset: 0-591744 | Loss: 0.827 | 914 ms/step , 6881.56 GFLOP/s , 17920.7 tokens/s 
INFO:__main__:2024-11-05 03:15:29 | Epoch: 0 | Step: 117500 | Dataset: 0-592064 | Loss: 0.832 | 915 ms/step , 6874.80 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 03:15:31 | Validation | Step: 117500 | Val_loss: 0.813 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:15:40 | Epoch: 0 | Step: 117510 | Dataset: 0-592384 | Loss: 0.721 | 916 ms/step , 6866.32 GFLOP/s , 15261.5 tokens/s INFO:__main__:2024-11-05 03:15:49 | Epoch: 0 | Step: 117520 | Dataset: 0-592704 | Loss: 0.876 | 914 ms/step , 6883.24 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 03:15:58 | Epoch: 0 | Step: 117530 | Dataset: 0-593024 | Loss: 0.774 | 915 ms/step , 6873.39 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 03:16:07 | Epoch: 0 | Step: 117540 | Dataset: 0-593344 | Loss: 0.887 | 914 ms/step , 6882.36 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 03:16:16 | Epoch: 0 | Step: 117550 | Dataset: 0-593664 | Loss: 0.810 | 915 ms/step , 6872.05 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 03:16:26 | Epoch: 0 | Step: 117560 | Dataset: 0-593984 | Loss: 0.810 | 914 ms/step , 6883.55 GFLOP/s , 17905.1 tokens/s INFO:__main__:2024-11-05 03:16:35 | Epoch: 0 | Step: 117570 | Dataset: 0-594304 | Loss: 0.908 | 915 ms/step , 6870.58 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 03:16:44 | Epoch: 0 | Step: 117580 | Dataset: 0-594624 | Loss: 0.811 | 915 ms/step , 6873.36 GFLOP/s , 17897.4 tokens/s INFO:__main__:2024-11-05 03:16:53 | Epoch: 0 | Step: 117590 | Dataset: 0-594944 | Loss: 0.743 | 914 ms/step , 6877.59 GFLOP/s , 17898.6 tokens/s INFO:__main__:2024-11-05 03:17:02 | Epoch: 0 | Step: 117600 | Dataset: 0-595264 | Loss: 0.763 | 914 ms/step , 6880.16 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 03:17:04 | Validation | Step: 117600 | Val_loss: 0.786 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:17:13 | Epoch: 0 | Step: 117610 | Dataset: 0-595584 | Loss: 0.856 | 914 ms/step , 6877.83 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-05 03:17:22 | Epoch: 0 | 
Step: 117620 | Dataset: 0-595904 | Loss: 0.879 | 913 ms/step , 6891.11 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 03:17:31 | Epoch: 0 | Step: 117630 | Dataset: 0-596224 | Loss: 0.771 | 915 ms/step , 6872.69 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-05 03:17:40 | Epoch: 0 | Step: 117640 | Dataset: 0-596544 | Loss: 0.765 | 912 ms/step , 6894.03 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 03:17:49 | Epoch: 0 | Step: 117650 | Dataset: 0-596864 | Loss: 0.783 | 915 ms/step , 6876.30 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-05 03:17:59 | Epoch: 0 | Step: 117660 | Dataset: 0-597184 | Loss: 0.861 | 914 ms/step , 6882.25 GFLOP/s , 17909.0 tokens/s INFO:__main__:2024-11-05 03:18:08 | Epoch: 0 | Step: 117670 | Dataset: 0-597504 | Loss: 0.798 | 914 ms/step , 6880.47 GFLOP/s , 17905.7 tokens/s INFO:__main__:2024-11-05 03:18:17 | Epoch: 0 | Step: 117680 | Dataset: 0-597824 | Loss: 0.806 | 915 ms/step , 6872.96 GFLOP/s , 17905.3 tokens/s INFO:__main__:2024-11-05 03:18:26 | Epoch: 0 | Step: 117690 | Dataset: 0-598144 | Loss: 0.816 | 914 ms/step , 6884.41 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 03:18:35 | Epoch: 0 | Step: 117700 | Dataset: 0-598464 | Loss: 0.720 | 915 ms/step , 6876.39 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 03:18:37 | Validation | Step: 117700 | Val_loss: 0.730 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:18:46 | Epoch: 0 | Step: 117710 | Dataset: 0-598784 | Loss: 0.828 | 913 ms/step , 6889.97 GFLOP/s , 15270.1 tokens/s INFO:__main__:2024-11-05 03:18:55 | Epoch: 0 | Step: 117720 | Dataset: 0-599104 | Loss: 0.736 | 915 ms/step , 6875.25 GFLOP/s , 17906.3 tokens/s INFO:__main__:2024-11-05 03:19:04 | Epoch: 0 | Step: 117730 | Dataset: 0-599424 | Loss: 0.856 | 915 ms/step , 6876.83 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 03:19:13 | Epoch: 0 | Step: 117740 | Dataset: 0-599744 | Loss: 0.705 | 913 ms/step , 6886.82 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 03:19:22 | Epoch: 0 | Step: 
117750 | Dataset: 0-600064 | Loss: 0.857 | 914 ms/step , 6883.37 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 03:19:32 | Epoch: 0 | Step: 117760 | Dataset: 0-600384 | Loss: 0.830 | 914 ms/step , 6883.20 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 03:19:41 | Epoch: 0 | Step: 117770 | Dataset: 0-600704 | Loss: 0.819 | 914 ms/step , 6877.71 GFLOP/s , 17903.9 tokens/s INFO:__main__:2024-11-05 03:19:50 | Epoch: 0 | Step: 117780 | Dataset: 0-601024 | Loss: 0.793 | 915 ms/step , 6871.89 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 03:19:59 | Epoch: 0 | Step: 117790 | Dataset: 0-601344 | Loss: 0.831 | 914 ms/step , 6882.66 GFLOP/s , 17902.7 tokens/s INFO:__main__:2024-11-05 03:20:08 | Epoch: 0 | Step: 117800 | Dataset: 0-601664 | Loss: 0.677 | 913 ms/step , 6890.43 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 03:20:10 | Validation | Step: 117800 | Val_loss: 0.744 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:20:19 | Epoch: 0 | Step: 117810 | Dataset: 0-601984 | Loss: 0.811 | 914 ms/step , 6879.32 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 03:20:28 | Epoch: 0 | Step: 117820 | Dataset: 0-602304 | Loss: 0.866 | 915 ms/step , 6874.94 GFLOP/s , 17897.6 tokens/s INFO:__main__:2024-11-05 03:20:37 | Epoch: 0 | Step: 117830 | Dataset: 0-602624 | Loss: 0.814 | 914 ms/step , 6883.20 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 03:20:46 | Epoch: 0 | Step: 117840 | Dataset: 0-602944 | Loss: 0.770 | 914 ms/step , 6880.94 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 03:20:56 | Epoch: 0 | Step: 117850 | Dataset: 0-603264 | Loss: 0.764 | 915 ms/step , 6872.72 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 03:21:05 | Epoch: 0 | Step: 117860 | Dataset: 0-603584 | Loss: 0.843 | 914 ms/step , 6879.49 GFLOP/s , 17911.8 tokens/s INFO:__main__:2024-11-05 03:21:14 | Epoch: 0 | Step: 117870 | Dataset: 0-603904 | Loss: 0.809 | 914 ms/step , 6882.43 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 03:21:23 | Epoch: 0 | Step: 117880 | 
Dataset: 0-604224 | Loss: 0.828 | 916 ms/step , 6865.09 GFLOP/s , 17905.1 tokens/s INFO:__main__:2024-11-05 03:21:32 | Epoch: 0 | Step: 117890 | Dataset: 0-604544 | Loss: 0.930 | 915 ms/step , 6875.82 GFLOP/s , 17907.0 tokens/s INFO:__main__:2024-11-05 03:21:41 | Epoch: 0 | Step: 117900 | Dataset: 0-604864 | Loss: 0.786 | 914 ms/step , 6879.83 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-05 03:21:43 | Validation | Step: 117900 | Val_loss: 0.758 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:21:52 | Epoch: 0 | Step: 117910 | Dataset: 0-605184 | Loss: 0.813 | 915 ms/step , 6871.31 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-05 03:22:01 | Epoch: 0 | Step: 117920 | Dataset: 0-605504 | Loss: 0.784 | 916 ms/step , 6864.08 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 03:22:10 | Epoch: 0 | Step: 117930 | Dataset: 0-605824 | Loss: 0.746 | 914 ms/step , 6880.57 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 03:22:19 | Epoch: 0 | Step: 117940 | Dataset: 0-606144 | Loss: 0.859 | 913 ms/step , 6886.42 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 03:22:29 | Epoch: 0 | Step: 117950 | Dataset: 0-606464 | Loss: 0.842 | 913 ms/step , 6885.07 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 03:22:38 | Epoch: 0 | Step: 117960 | Dataset: 0-606784 | Loss: 0.868 | 914 ms/step , 6880.87 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 03:22:47 | Epoch: 0 | Step: 117970 | Dataset: 0-607104 | Loss: 0.775 | 913 ms/step , 6887.17 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 03:22:56 | Epoch: 0 | Step: 117980 | Dataset: 0-607424 | Loss: 0.834 | 913 ms/step , 6886.25 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 03:23:05 | Epoch: 0 | Step: 117990 | Dataset: 0-607744 | Loss: 0.758 | 914 ms/step , 6879.11 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 03:23:14 | Epoch: 0 | Step: 118000 | Dataset: 0-608064 | Loss: 0.766 | 913 ms/step , 6889.89 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 03:23:16 | Validation | Step: 118000 | 
Val_loss: 0.860 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:23:16 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_032316_step_118000.pt` INFO:__main__:2024-11-05 03:23:26 | Epoch: 0 | Step: 118010 | Dataset: 0-608384 | Loss: 0.836 | 914 ms/step , 6881.57 GFLOP/s , 13845.4 tokens/s INFO:__main__:2024-11-05 03:23:35 | Epoch: 0 | Step: 118020 | Dataset: 0-608704 | Loss: 0.883 | 915 ms/step , 6876.17 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 03:23:44 | Epoch: 0 | Step: 118030 | Dataset: 0-609024 | Loss: 0.648 | 913 ms/step , 6887.04 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 03:23:54 | Epoch: 0 | Step: 118040 | Dataset: 0-609344 | Loss: 0.789 | 914 ms/step , 6883.09 GFLOP/s , 17900.1 tokens/s INFO:__main__:2024-11-05 03:24:03 | Epoch: 0 | Step: 118050 | Dataset: 0-609664 | Loss: 0.866 | 915 ms/step , 6877.11 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 03:24:12 | Epoch: 0 | Step: 118060 | Dataset: 0-609984 | Loss: 0.935 | 914 ms/step , 6881.83 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 03:24:21 | Epoch: 0 | Step: 118070 | Dataset: 0-610304 | Loss: 0.771 | 914 ms/step , 6880.57 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 03:24:30 | Epoch: 0 | Step: 118080 | Dataset: 0-610624 | Loss: 0.813 | 915 ms/step , 6875.63 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 03:24:39 | Epoch: 0 | Step: 118090 | Dataset: 0-610944 | Loss: 0.812 | 914 ms/step , 6877.86 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 03:24:48 | Epoch: 0 | Step: 118100 | Dataset: 0-611264 | Loss: 0.875 | 914 ms/step , 6883.46 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 03:24:50 | Validation | Step: 118100 | Val_loss: 0.868 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:24:59 | Epoch: 0 | Step: 118110 | Dataset: 0-611584 | Loss: 0.877 | 914 ms/step , 6879.51 GFLOP/s , 15265.0 tokens/s INFO:__main__:2024-11-05 03:25:08 | Epoch: 0 | Step: 118120 | Dataset: 0-611904 | Loss: 0.953 | 916 ms/step , 
6866.32 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 03:25:17 | Epoch: 0 | Step: 118130 | Dataset: 0-612224 | Loss: 0.935 | 913 ms/step , 6885.17 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 03:25:27 | Epoch: 0 | Step: 118140 | Dataset: 0-612544 | Loss: 1.001 | 915 ms/step , 6875.21 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 03:25:36 | Epoch: 0 | Step: 118150 | Dataset: 0-612864 | Loss: 0.834 | 915 ms/step , 6874.98 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 03:25:45 | Epoch: 0 | Step: 118160 | Dataset: 0-613184 | Loss: 0.878 | 915 ms/step , 6877.36 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 03:25:54 | Epoch: 0 | Step: 118170 | Dataset: 0-613504 | Loss: 0.873 | 914 ms/step , 6883.28 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 03:26:03 | Epoch: 0 | Step: 118180 | Dataset: 0-613824 | Loss: 0.804 | 914 ms/step , 6879.71 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 03:26:12 | Epoch: 0 | Step: 118190 | Dataset: 0-614144 | Loss: 0.974 | 914 ms/step , 6882.55 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 03:26:21 | Epoch: 0 | Step: 118200 | Dataset: 0-614464 | Loss: 0.832 | 914 ms/step , 6881.59 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 03:26:23 | Validation | Step: 118200 | Val_loss: 0.863 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:26:32 | Epoch: 0 | Step: 118210 | Dataset: 0-614784 | Loss: 0.805 | 913 ms/step , 6885.93 GFLOP/s , 15268.3 tokens/s INFO:__main__:2024-11-05 03:26:41 | Epoch: 0 | Step: 118220 | Dataset: 0-615104 | Loss: 0.769 | 913 ms/step , 6887.46 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 03:26:50 | Epoch: 0 | Step: 118230 | Dataset: 0-615424 | Loss: 0.852 | 912 ms/step , 6896.85 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 03:27:00 | Epoch: 0 | Step: 118240 | Dataset: 0-615744 | Loss: 0.883 | 915 ms/step , 6871.83 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 03:27:09 | Epoch: 0 | Step: 118250 | Dataset: 0-616064 | Loss: 0.548 | 912 ms/step , 6895.94 
GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 03:27:18 | Epoch: 0 | Step: 118260 | Dataset: 0-616384 | Loss: 0.788 | 914 ms/step , 6881.57 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 03:27:27 | Epoch: 0 | Step: 118270 | Dataset: 0-616704 | Loss: 0.700 | 913 ms/step , 6891.54 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 03:27:36 | Epoch: 0 | Step: 118280 | Dataset: 0-617024 | Loss: 0.943 | 913 ms/step , 6887.21 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 03:27:45 | Epoch: 0 | Step: 118290 | Dataset: 0-617344 | Loss: 0.831 | 915 ms/step , 6874.80 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 03:27:54 | Epoch: 0 | Step: 118300 | Dataset: 0-617664 | Loss: 0.848 | 914 ms/step , 6883.53 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 03:27:56 | Validation | Step: 118300 | Val_loss: 0.866 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:28:05 | Epoch: 0 | Step: 118310 | Dataset: 0-617984 | Loss: 0.714 | 913 ms/step , 6888.26 GFLOP/s , 15265.7 tokens/s INFO:__main__:2024-11-05 03:28:14 | Epoch: 0 | Step: 118320 | Dataset: 0-618304 | Loss: 0.770 | 914 ms/step , 6883.87 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 03:28:23 | Epoch: 0 | Step: 118330 | Dataset: 0-618624 | Loss: 0.864 | 915 ms/step , 6873.54 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 03:28:33 | Epoch: 0 | Step: 118340 | Dataset: 0-618944 | Loss: 0.714 | 912 ms/step , 6895.40 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 03:28:42 | Epoch: 0 | Step: 118350 | Dataset: 0-619264 | Loss: 0.617 | 913 ms/step , 6888.38 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 03:28:51 | Epoch: 0 | Step: 118360 | Dataset: 0-619584 | Loss: 0.760 | 912 ms/step , 6893.95 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 03:29:00 | Epoch: 0 | Step: 118370 | Dataset: 0-619904 | Loss: 0.711 | 912 ms/step , 6899.16 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 03:29:09 | Epoch: 0 | Step: 118380 | Dataset: 0-620224 | Loss: 0.749 | 914 ms/step , 6882.15 GFLOP/s , 
17920.3 tokens/s INFO:__main__:2024-11-05 03:29:18 | Epoch: 0 | Step: 118390 | Dataset: 0-620544 | Loss: 0.739 | 914 ms/step , 6883.80 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 03:29:27 | Epoch: 0 | Step: 118400 | Dataset: 0-620864 | Loss: 0.744 | 913 ms/step , 6887.42 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 03:29:29 | Validation | Step: 118400 | Val_loss: 0.862 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:29:38 | Epoch: 0 | Step: 118410 | Dataset: 0-621184 | Loss: 0.830 | 913 ms/step , 6890.62 GFLOP/s , 15293.8 tokens/s INFO:__main__:2024-11-05 03:29:47 | Epoch: 0 | Step: 118420 | Dataset: 0-621504 | Loss: 0.565 | 913 ms/step , 6891.14 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 03:29:56 | Epoch: 0 | Step: 118430 | Dataset: 0-621824 | Loss: 0.889 | 913 ms/step , 6885.40 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 03:30:06 | Epoch: 0 | Step: 118440 | Dataset: 0-622144 | Loss: 0.870 | 917 ms/step , 6861.14 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 03:30:15 | Epoch: 0 | Step: 118450 | Dataset: 0-622464 | Loss: 0.792 | 915 ms/step , 6874.93 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 03:30:24 | Epoch: 0 | Step: 118460 | Dataset: 0-622784 | Loss: 0.760 | 913 ms/step , 6889.54 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 03:30:33 | Epoch: 0 | Step: 118470 | Dataset: 0-623104 | Loss: 0.855 | 915 ms/step , 6870.33 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 03:30:42 | Epoch: 0 | Step: 118480 | Dataset: 0-623424 | Loss: 0.857 | 914 ms/step , 6882.85 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 03:30:51 | Epoch: 0 | Step: 118490 | Dataset: 0-623744 | Loss: 0.989 | 915 ms/step , 6871.89 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 03:31:00 | Epoch: 0 | Step: 118500 | Dataset: 0-624064 | Loss: 0.805 | 913 ms/step , 6886.58 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 03:31:02 | Validation | Step: 118500 | Val_loss: 0.852 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:31:11 
| Epoch: 0 | Step: 118510 | Dataset: 0-624384 | Loss: 0.751 | 913 ms/step , 6891.64 GFLOP/s , 15277.2 tokens/s INFO:__main__:2024-11-05 03:31:20 | Epoch: 0 | Step: 118520 | Dataset: 0-624704 | Loss: 0.808 | 914 ms/step , 6883.54 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 03:31:29 | Epoch: 0 | Step: 118530 | Dataset: 0-625024 | Loss: 0.734 | 913 ms/step , 6890.43 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 03:31:39 | Epoch: 0 | Step: 118540 | Dataset: 0-625344 | Loss: 0.758 | 913 ms/step , 6889.24 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 03:31:48 | Epoch: 0 | Step: 118550 | Dataset: 0-625664 | Loss: 0.803 | 913 ms/step , 6886.67 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 03:31:57 | Epoch: 0 | Step: 118560 | Dataset: 0-625984 | Loss: 0.881 | 914 ms/step , 6883.37 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 03:32:06 | Epoch: 0 | Step: 118570 | Dataset: 0-626304 | Loss: 0.772 | 914 ms/step , 6883.75 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 03:32:15 | Epoch: 0 | Step: 118580 | Dataset: 0-626624 | Loss: 0.750 | 913 ms/step , 6889.61 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 03:32:24 | Epoch: 0 | Step: 118590 | Dataset: 0-626944 | Loss: 0.935 | 914 ms/step , 6880.49 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 03:32:33 | Epoch: 0 | Step: 118600 | Dataset: 0-627264 | Loss: 0.691 | 913 ms/step , 6889.09 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 03:32:35 | Validation | Step: 118600 | Val_loss: 0.877 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:32:44 | Epoch: 0 | Step: 118610 | Dataset: 0-627584 | Loss: 0.716 | 914 ms/step , 6881.23 GFLOP/s , 15267.9 tokens/s INFO:__main__:2024-11-05 03:32:53 | Epoch: 0 | Step: 118620 | Dataset: 0-627904 | Loss: 0.975 | 915 ms/step , 6871.10 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 03:33:02 | Epoch: 0 | Step: 118630 | Dataset: 0-628224 | Loss: 0.880 | 915 ms/step , 6874.22 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 03:33:12 | Epoch: 0 
| Step: 118640 | Dataset: 0-628544 | Loss: 0.820 | 913 ms/step , 6892.57 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 03:33:21 | Epoch: 0 | Step: 118650 | Dataset: 0-628864 | Loss: 0.768 | 913 ms/step , 6887.53 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 03:33:30 | Epoch: 0 | Step: 118660 | Dataset: 0-629184 | Loss: 0.768 | 914 ms/step , 6878.51 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 03:33:39 | Epoch: 0 | Step: 118670 | Dataset: 0-629504 | Loss: 0.810 | 913 ms/step , 6887.43 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 03:33:48 | Epoch: 0 | Step: 118680 | Dataset: 0-629824 | Loss: 0.856 | 914 ms/step , 6879.68 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 03:33:57 | Epoch: 0 | Step: 118690 | Dataset: 0-630144 | Loss: 0.856 | 913 ms/step , 6890.61 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 03:34:06 | Epoch: 0 | Step: 118700 | Dataset: 0-630464 | Loss: 0.829 | 914 ms/step , 6880.39 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 03:34:08 | Validation | Step: 118700 | Val_loss: 0.892 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:34:17 | Epoch: 0 | Step: 118710 | Dataset: 0-630784 | Loss: 0.728 | 912 ms/step , 6892.95 GFLOP/s , 15262.7 tokens/s INFO:__main__:2024-11-05 03:34:26 | Epoch: 0 | Step: 118720 | Dataset: 0-631104 | Loss: 0.733 | 913 ms/step , 6891.96 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 03:34:35 | Epoch: 0 | Step: 118730 | Dataset: 0-631424 | Loss: 0.745 | 914 ms/step , 6877.90 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 03:34:45 | Epoch: 0 | Step: 118740 | Dataset: 0-631744 | Loss: 0.685 | 912 ms/step , 6894.58 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 03:34:54 | Epoch: 0 | Step: 118750 | Dataset: 0-632064 | Loss: 0.799 | 914 ms/step , 6878.96 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 03:35:03 | Epoch: 0 | Step: 118760 | Dataset: 0-632384 | Loss: 0.767 | 914 ms/step , 6884.78 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 03:35:12 | Epoch: 0 | Step: 
118770 | Dataset: 0-632704 | Loss: 0.801 | 913 ms/step , 6886.98 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 03:35:21 | Epoch: 0 | Step: 118780 | Dataset: 0-633024 | Loss: 0.859 | 915 ms/step , 6874.96 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 03:35:30 | Epoch: 0 | Step: 118790 | Dataset: 0-633344 | Loss: 0.893 | 916 ms/step , 6866.38 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 03:35:39 | Epoch: 0 | Step: 118800 | Dataset: 0-633664 | Loss: 0.762 | 914 ms/step , 6882.28 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 03:35:41 | Validation | Step: 118800 | Val_loss: 0.852 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:35:50 | Epoch: 0 | Step: 118810 | Dataset: 0-633984 | Loss: 0.753 | 915 ms/step , 6874.48 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 03:35:59 | Epoch: 0 | Step: 118820 | Dataset: 0-634304 | Loss: 0.877 | 914 ms/step , 6884.01 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 03:36:08 | Epoch: 0 | Step: 118830 | Dataset: 0-634624 | Loss: 0.808 | 914 ms/step , 6883.86 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 03:36:18 | Epoch: 0 | Step: 118840 | Dataset: 0-634944 | Loss: 0.836 | 915 ms/step , 6875.49 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 03:36:27 | Epoch: 0 | Step: 118850 | Dataset: 0-635264 | Loss: 0.689 | 913 ms/step , 6888.00 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 03:36:36 | Epoch: 0 | Step: 118860 | Dataset: 0-635584 | Loss: 0.843 | 913 ms/step , 6889.62 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 03:36:45 | Epoch: 0 | Step: 118870 | Dataset: 0-635904 | Loss: 0.780 | 915 ms/step , 6874.49 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 03:36:54 | Epoch: 0 | Step: 118880 | Dataset: 0-636224 | Loss: 0.750 | 913 ms/step , 6888.24 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 03:37:03 | Epoch: 0 | Step: 118890 | Dataset: 0-636544 | Loss: 0.793 | 915 ms/step , 6875.95 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 03:37:12 | Epoch: 0 | Step: 118900 | 
Dataset: 0-636864 | Loss: 0.803 | 913 ms/step , 6886.04 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 03:37:14 | Validation | Step: 118900 | Val_loss: 0.842 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:37:23 | Epoch: 0 | Step: 118910 | Dataset: 0-637184 | Loss: 0.917 | 916 ms/step , 6863.68 GFLOP/s , 15267.7 tokens/s INFO:__main__:2024-11-05 03:37:32 | Epoch: 0 | Step: 118920 | Dataset: 0-637504 | Loss: 0.878 | 913 ms/step , 6891.49 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 03:37:41 | Epoch: 0 | Step: 118930 | Dataset: 0-637824 | Loss: 0.751 | 911 ms/step , 6900.60 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 03:37:51 | Epoch: 0 | Step: 118940 | Dataset: 0-638144 | Loss: 0.827 | 913 ms/step , 6886.82 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 03:38:00 | Epoch: 0 | Step: 118950 | Dataset: 0-638464 | Loss: 0.774 | 913 ms/step , 6888.53 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 03:38:09 | Epoch: 0 | Step: 118960 | Dataset: 0-638784 | Loss: 0.614 | 913 ms/step , 6890.98 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 03:38:18 | Epoch: 0 | Step: 118970 | Dataset: 0-639104 | Loss: 0.775 | 913 ms/step , 6890.90 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 03:38:27 | Epoch: 0 | Step: 118980 | Dataset: 0-639424 | Loss: 0.898 | 914 ms/step , 6879.53 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 03:38:36 | Epoch: 0 | Step: 118990 | Dataset: 0-639744 | Loss: 0.557 | 913 ms/step , 6889.44 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 03:38:45 | Epoch: 0 | Step: 119000 | Dataset: 0-640064 | Loss: 0.950 | 913 ms/step , 6891.09 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 03:38:47 | Validation | Step: 119000 | Val_loss: 0.870 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:38:47 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_033847_step_119000.pt` INFO:__main__:2024-11-05 03:38:57 | Epoch: 0 | Step: 119010 | Dataset: 0-640384 | Loss: 0.671 | 915 ms/step , 
6876.58 GFLOP/s , 13811.4 tokens/s INFO:__main__:2024-11-05 03:39:06 | Epoch: 0 | Step: 119020 | Dataset: 0-640704 | Loss: 0.860 | 915 ms/step , 6876.99 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 03:39:16 | Epoch: 0 | Step: 119030 | Dataset: 0-641024 | Loss: 0.922 | 916 ms/step , 6869.66 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 03:39:25 | Epoch: 0 | Step: 119040 | Dataset: 0-641344 | Loss: 0.806 | 912 ms/step , 6895.75 GFLOP/s , 17890.3 tokens/s INFO:__main__:2024-11-05 03:39:34 | Epoch: 0 | Step: 119050 | Dataset: 0-641664 | Loss: 0.515 | 912 ms/step , 6898.41 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 03:39:43 | Epoch: 0 | Step: 119060 | Dataset: 0-641984 | Loss: 0.681 | 913 ms/step , 6885.49 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 03:39:52 | Epoch: 0 | Step: 119070 | Dataset: 0-642304 | Loss: 0.857 | 915 ms/step , 6870.71 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 03:40:01 | Epoch: 0 | Step: 119080 | Dataset: 0-642624 | Loss: 0.892 | 914 ms/step , 6879.06 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 03:40:10 | Epoch: 0 | Step: 119090 | Dataset: 0-642944 | Loss: 0.711 | 913 ms/step , 6891.16 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 03:40:20 | Epoch: 0 | Step: 119100 | Dataset: 0-643264 | Loss: 0.869 | 915 ms/step , 6876.53 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 03:40:21 | Validation | Step: 119100 | Val_loss: 0.843 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:40:30 | Epoch: 0 | Step: 119110 | Dataset: 0-643584 | Loss: 0.740 | 914 ms/step , 6884.13 GFLOP/s , 15263.9 tokens/s INFO:__main__:2024-11-05 03:40:39 | Epoch: 0 | Step: 119120 | Dataset: 0-643904 | Loss: 0.846 | 914 ms/step , 6883.62 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 03:40:49 | Epoch: 0 | Step: 119130 | Dataset: 0-644224 | Loss: 0.747 | 913 ms/step , 6889.62 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 03:40:58 | Epoch: 0 | Step: 119140 | Dataset: 0-644544 | Loss: 0.737 | 914 ms/step , 6881.47 
GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 03:41:07 | Epoch: 0 | Step: 119150 | Dataset: 0-644864 | Loss: 0.846 | 914 ms/step , 6880.99 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 03:41:16 | Epoch: 0 | Step: 119160 | Dataset: 0-645184 | Loss: 0.827 | 914 ms/step , 6882.26 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 03:41:25 | Epoch: 0 | Step: 119170 | Dataset: 0-645504 | Loss: 0.824 | 913 ms/step , 6888.16 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 03:41:34 | Epoch: 0 | Step: 119180 | Dataset: 0-645824 | Loss: 0.920 | 915 ms/step , 6875.27 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 03:41:43 | Epoch: 0 | Step: 119190 | Dataset: 0-646144 | Loss: 0.814 | 913 ms/step , 6885.71 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 03:41:53 | Epoch: 0 | Step: 119200 | Dataset: 0-646464 | Loss: 0.866 | 913 ms/step , 6887.10 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 03:41:54 | Validation | Step: 119200 | Val_loss: 0.853 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:42:03 | Epoch: 0 | Step: 119210 | Dataset: 0-646784 | Loss: 0.996 | 914 ms/step , 6881.26 GFLOP/s , 15262.7 tokens/s INFO:__main__:2024-11-05 03:42:12 | Epoch: 0 | Step: 119220 | Dataset: 0-647104 | Loss: 0.678 | 914 ms/step , 6881.60 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 03:42:22 | Epoch: 0 | Step: 119230 | Dataset: 0-647424 | Loss: 0.829 | 913 ms/step , 6890.84 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 03:42:31 | Epoch: 0 | Step: 119240 | Dataset: 0-647744 | Loss: 0.955 | 914 ms/step , 6883.49 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 03:42:40 | Epoch: 0 | Step: 119250 | Dataset: 0-648064 | Loss: 0.863 | 914 ms/step , 6880.35 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 03:42:49 | Epoch: 0 | Step: 119260 | Dataset: 0-648384 | Loss: 0.861 | 914 ms/step , 6881.38 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 03:42:58 | Epoch: 0 | Step: 119270 | Dataset: 0-648704 | Loss: 0.980 | 914 ms/step , 6882.09 GFLOP/s , 
17927.9 tokens/s INFO:__main__:2024-11-05 03:43:07 | Epoch: 0 | Step: 119280 | Dataset: 0-649024 | Loss: 0.865 | 915 ms/step , 6874.24 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 03:43:16 | Epoch: 0 | Step: 119290 | Dataset: 0-649344 | Loss: 0.753 | 914 ms/step , 6883.47 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 03:43:26 | Epoch: 0 | Step: 119300 | Dataset: 0-649664 | Loss: 0.671 | 913 ms/step , 6890.40 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 03:43:27 | Validation | Step: 119300 | Val_loss: 0.840 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:43:36 | Epoch: 0 | Step: 119310 | Dataset: 0-649984 | Loss: 0.740 | 914 ms/step , 6884.37 GFLOP/s , 15265.9 tokens/s INFO:__main__:2024-11-05 03:43:45 | Epoch: 0 | Step: 119320 | Dataset: 0-650304 | Loss: 0.627 | 913 ms/step , 6891.38 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 03:43:55 | Epoch: 0 | Step: 119330 | Dataset: 0-650624 | Loss: 0.756 | 914 ms/step , 6884.78 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 03:44:04 | Epoch: 0 | Step: 119340 | Dataset: 0-650944 | Loss: 0.595 | 912 ms/step , 6895.15 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 03:44:13 | Epoch: 0 | Step: 119350 | Dataset: 0-651264 | Loss: 0.697 | 913 ms/step , 6890.55 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 03:44:22 | Epoch: 0 | Step: 119360 | Dataset: 0-651584 | Loss: 0.777 | 913 ms/step , 6885.60 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 03:44:31 | Epoch: 0 | Step: 119370 | Dataset: 0-651904 | Loss: 0.828 | 913 ms/step , 6891.31 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 03:44:40 | Epoch: 0 | Step: 119380 | Dataset: 0-652224 | Loss: 0.808 | 914 ms/step , 6880.06 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 03:44:49 | Epoch: 0 | Step: 119390 | Dataset: 0-652544 | Loss: 0.854 | 912 ms/step , 6893.68 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 03:44:59 | Epoch: 0 | Step: 119400 | Dataset: 0-652864 | Loss: 0.781 | 912 ms/step , 6893.53 GFLOP/s , 17933.2 
tokens/s INFO:__main__:2024-11-05 03:45:00 | Validation | Step: 119400 | Val_loss: 0.862 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:45:09 | Epoch: 0 | Step: 119410 | Dataset: 0-653184 | Loss: 0.707 | 912 ms/step , 6894.88 GFLOP/s , 15289.6 tokens/s INFO:__main__:2024-11-05 03:45:18 | Epoch: 0 | Step: 119420 | Dataset: 0-653504 | Loss: 0.923 | 914 ms/step , 6881.90 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 03:45:27 | Epoch: 0 | Step: 119430 | Dataset: 0-653824 | Loss: 0.841 | 914 ms/step , 6883.61 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 03:45:37 | Epoch: 0 | Step: 119440 | Dataset: 0-654144 | Loss: 0.872 | 913 ms/step , 6891.41 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 03:45:46 | Epoch: 0 | Step: 119450 | Dataset: 0-654464 | Loss: 0.903 | 914 ms/step , 6882.40 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 03:45:55 | Epoch: 0 | Step: 119460 | Dataset: 0-654784 | Loss: 0.965 | 913 ms/step , 6888.99 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 03:46:04 | Epoch: 0 | Step: 119470 | Dataset: 0-655104 | Loss: 0.847 | 913 ms/step , 6891.27 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 03:46:13 | Epoch: 0 | Step: 119480 | Dataset: 0-655424 | Loss: 0.837 | 913 ms/step , 6889.53 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 03:46:22 | Epoch: 0 | Step: 119490 | Dataset: 0-655744 | Loss: 0.931 | 914 ms/step , 6882.29 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 03:46:31 | Epoch: 0 | Step: 119500 | Dataset: 0-656064 | Loss: 0.790 | 912 ms/step , 6897.78 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 03:46:33 | Validation | Step: 119500 | Val_loss: 0.906 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:46:42 | Epoch: 0 | Step: 119510 | Dataset: 0-656384 | Loss: 0.891 | 913 ms/step , 6889.30 GFLOP/s , 15271.5 tokens/s INFO:__main__:2024-11-05 03:46:51 | Epoch: 0 | Step: 119520 | Dataset: 0-656704 | Loss: 0.926 | 913 ms/step , 6892.53 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 03:47:00 | Epoch: 
0 | Step: 119530 | Dataset: 0-657024 | Loss: 0.916 | 914 ms/step , 6884.69 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 03:47:10 | Epoch: 0 | Step: 119540 | Dataset: 0-657344 | Loss: 0.777 | 913 ms/step , 6885.57 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 03:47:19 | Epoch: 0 | Step: 119550 | Dataset: 0-657664 | Loss: 0.801 | 912 ms/step , 6894.22 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 03:47:28 | Epoch: 0 | Step: 119560 | Dataset: 0-657984 | Loss: 1.050 | 914 ms/step , 6884.53 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 03:47:37 | Epoch: 0 | Step: 119570 | Dataset: 0-658304 | Loss: 0.767 | 913 ms/step , 6891.33 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 03:47:46 | Epoch: 0 | Step: 119580 | Dataset: 0-658624 | Loss: 0.876 | 914 ms/step , 6883.73 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 03:47:55 | Epoch: 0 | Step: 119590 | Dataset: 0-658944 | Loss: 0.865 | 913 ms/step , 6890.83 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 03:48:04 | Epoch: 0 | Step: 119600 | Dataset: 0-659264 | Loss: 0.997 | 913 ms/step , 6886.24 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 03:48:06 | Validation | Step: 119600 | Val_loss: 0.847 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:48:15 | Epoch: 0 | Step: 119610 | Dataset: 0-659584 | Loss: 0.792 | 914 ms/step , 6884.61 GFLOP/s , 15273.0 tokens/s INFO:__main__:2024-11-05 03:48:24 | Epoch: 0 | Step: 119620 | Dataset: 0-659904 | Loss: 0.818 | 913 ms/step , 6892.03 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 03:48:33 | Epoch: 0 | Step: 119630 | Dataset: 0-660224 | Loss: 0.707 | 913 ms/step , 6885.50 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 03:48:43 | Epoch: 0 | Step: 119640 | Dataset: 0-660544 | Loss: 0.792 | 912 ms/step , 6893.64 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 03:48:52 | Epoch: 0 | Step: 119650 | Dataset: 0-660864 | Loss: 0.943 | 913 ms/step , 6888.12 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 03:49:01 | Epoch: 0 | Step: 
119660 | Dataset: 0-661184 | Loss: 0.740 | 912 ms/step , 6894.67 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 03:49:10 | Epoch: 0 | Step: 119670 | Dataset: 0-661504 | Loss: 0.906 | 914 ms/step , 6881.71 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 03:49:19 | Epoch: 0 | Step: 119680 | Dataset: 0-661824 | Loss: 0.812 | 914 ms/step , 6878.48 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 03:49:28 | Epoch: 0 | Step: 119690 | Dataset: 0-662144 | Loss: 0.855 | 913 ms/step , 6887.86 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 03:49:37 | Epoch: 0 | Step: 119700 | Dataset: 0-662464 | Loss: 0.883 | 913 ms/step , 6886.69 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 03:49:39 | Validation | Step: 119700 | Val_loss: 0.881 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:49:48 | Epoch: 0 | Step: 119710 | Dataset: 0-662784 | Loss: 0.835 | 913 ms/step , 6890.29 GFLOP/s , 15276.1 tokens/s INFO:__main__:2024-11-05 03:49:57 | Epoch: 0 | Step: 119720 | Dataset: 0-663104 | Loss: 0.883 | 913 ms/step , 6885.92 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 03:50:06 | Epoch: 0 | Step: 119730 | Dataset: 0-663424 | Loss: 0.763 | 913 ms/step , 6886.30 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 03:50:15 | Epoch: 0 | Step: 119740 | Dataset: 0-663744 | Loss: 0.828 | 914 ms/step , 6879.77 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 03:50:25 | Epoch: 0 | Step: 119750 | Dataset: 0-664064 | Loss: 0.900 | 914 ms/step , 6881.38 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 03:50:34 | Epoch: 0 | Step: 119760 | Dataset: 0-664384 | Loss: 0.917 | 913 ms/step , 6887.24 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 03:50:43 | Epoch: 0 | Step: 119770 | Dataset: 0-664704 | Loss: 0.830 | 913 ms/step , 6892.52 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 03:50:52 | Epoch: 0 | Step: 119780 | Dataset: 0-665024 | Loss: 0.774 | 912 ms/step , 6897.10 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 03:51:01 | Epoch: 0 | Step: 119790 | 
Dataset: 0-665344 | Loss: 0.858 | 913 ms/step , 6892.27 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 03:51:10 | Epoch: 0 | Step: 119800 | Dataset: 0-665664 | Loss: 0.767 | 913 ms/step , 6888.84 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 03:51:12 | Validation | Step: 119800 | Val_loss: 0.874 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:51:21 | Epoch: 0 | Step: 119810 | Dataset: 0-665984 | Loss: 0.883 | 913 ms/step , 6891.64 GFLOP/s , 15281.2 tokens/s INFO:__main__:2024-11-05 03:51:30 | Epoch: 0 | Step: 119820 | Dataset: 0-666304 | Loss: 0.888 | 914 ms/step , 6877.61 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 03:51:39 | Epoch: 0 | Step: 119830 | Dataset: 0-666624 | Loss: 0.901 | 913 ms/step , 6891.90 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 03:51:48 | Epoch: 0 | Step: 119840 | Dataset: 0-666944 | Loss: 0.950 | 913 ms/step , 6885.38 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 03:51:58 | Epoch: 0 | Step: 119850 | Dataset: 0-667264 | Loss: 0.764 | 915 ms/step , 6877.33 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 03:52:07 | Epoch: 0 | Step: 119860 | Dataset: 0-667584 | Loss: 0.815 | 914 ms/step , 6882.91 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 03:52:16 | Epoch: 0 | Step: 119870 | Dataset: 0-667904 | Loss: 0.810 | 914 ms/step , 6881.95 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 03:52:25 | Epoch: 0 | Step: 119880 | Dataset: 0-668224 | Loss: 0.861 | 913 ms/step , 6888.66 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 03:52:34 | Epoch: 0 | Step: 119890 | Dataset: 0-668544 | Loss: 0.813 | 913 ms/step , 6887.44 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 03:52:43 | Epoch: 0 | Step: 119900 | Dataset: 0-668864 | Loss: 1.001 | 914 ms/step , 6884.97 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 03:52:45 | Validation | Step: 119900 | Val_loss: 0.880 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:52:54 | Epoch: 0 | Step: 119910 | Dataset: 0-669184 | Loss: 0.821 | 914 ms/step , 
6884.82 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-05 03:53:03 | Epoch: 0 | Step: 119920 | Dataset: 0-669504 | Loss: 0.803 | 912 ms/step , 6894.51 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 03:53:12 | Epoch: 0 | Step: 119930 | Dataset: 0-669824 | Loss: 0.899 | 913 ms/step , 6886.64 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 03:53:21 | Epoch: 0 | Step: 119940 | Dataset: 0-670144 | Loss: 0.943 | 913 ms/step , 6887.21 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 03:53:31 | Epoch: 0 | Step: 119950 | Dataset: 0-670464 | Loss: 0.896 | 914 ms/step , 6882.08 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 03:53:40 | Epoch: 0 | Step: 119960 | Dataset: 0-670784 | Loss: 0.766 | 912 ms/step , 6893.27 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 03:53:49 | Epoch: 0 | Step: 119970 | Dataset: 0-671104 | Loss: 0.924 | 913 ms/step , 6887.80 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 03:53:58 | Epoch: 0 | Step: 119980 | Dataset: 0-671424 | Loss: 0.632 | 912 ms/step , 6896.81 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 03:54:07 | Epoch: 0 | Step: 119990 | Dataset: 0-671744 | Loss: 0.910 | 912 ms/step , 6894.15 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 03:54:16 | Epoch: 0 | Step: 120000 | Dataset: 0-672064 | Loss: 0.842 | 912 ms/step , 6893.41 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 03:54:18 | Validation | Step: 120000 | Val_loss: 0.874 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:54:18 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_035418_step_120000.pt` INFO:__main__:2024-11-05 03:54:28 | Epoch: 0 | Step: 120010 | Dataset: 0-672384 | Loss: 0.767 | 913 ms/step , 6891.25 GFLOP/s , 13811.9 tokens/s INFO:__main__:2024-11-05 03:54:37 | Epoch: 0 | Step: 120020 | Dataset: 0-672704 | Loss: 0.931 | 913 ms/step , 6887.74 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 03:54:46 | Epoch: 0 | Step: 120030 | Dataset: 0-673024 | Loss: 0.895 | 913 ms/step , 6889.00 
GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 03:54:55 | Epoch: 0 | Step: 120040 | Dataset: 0-673344 | Loss: 0.808 | 913 ms/step , 6890.17 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 03:55:05 | Epoch: 0 | Step: 120050 | Dataset: 0-673664 | Loss: 0.536 | 913 ms/step , 6890.90 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 03:55:14 | Epoch: 0 | Step: 120060 | Dataset: 0-673984 | Loss: 0.759 | 913 ms/step , 6886.58 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 03:55:23 | Epoch: 0 | Step: 120070 | Dataset: 0-674304 | Loss: 0.683 | 913 ms/step , 6888.86 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 03:55:32 | Epoch: 0 | Step: 120080 | Dataset: 0-674624 | Loss: 0.705 | 915 ms/step , 6873.54 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 03:55:41 | Epoch: 0 | Step: 120090 | Dataset: 0-674944 | Loss: 0.645 | 912 ms/step , 6896.93 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 03:55:50 | Epoch: 0 | Step: 120100 | Dataset: 0-675264 | Loss: 0.662 | 913 ms/step , 6891.27 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 03:55:52 | Validation | Step: 120100 | Val_loss: 0.861 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:56:01 | Epoch: 0 | Step: 120110 | Dataset: 0-675584 | Loss: 0.746 | 914 ms/step , 6883.06 GFLOP/s , 15271.1 tokens/s INFO:__main__:2024-11-05 03:56:10 | Epoch: 0 | Step: 120120 | Dataset: 0-675904 | Loss: 0.781 | 914 ms/step , 6877.63 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 03:56:19 | Epoch: 0 | Step: 120130 | Dataset: 0-676224 | Loss: 0.836 | 912 ms/step , 6893.91 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 03:56:28 | Epoch: 0 | Step: 120140 | Dataset: 0-676544 | Loss: 0.726 | 912 ms/step , 6892.79 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 03:56:38 | Epoch: 0 | Step: 120150 | Dataset: 0-676864 | Loss: 0.738 | 913 ms/step , 6889.53 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 03:56:47 | Epoch: 0 | Step: 120160 | Dataset: 0-677184 | Loss: 0.630 | 913 ms/step , 6886.60 GFLOP/s , 
17929.3 tokens/s INFO:__main__:2024-11-05 03:56:56 | Epoch: 0 | Step: 120170 | Dataset: 0-677504 | Loss: 0.727 | 914 ms/step , 6881.61 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 03:57:05 | Epoch: 0 | Step: 120180 | Dataset: 0-677824 | Loss: 0.961 | 913 ms/step , 6891.63 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 03:57:14 | Epoch: 0 | Step: 120190 | Dataset: 0-678144 | Loss: 0.749 | 913 ms/step , 6887.51 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 03:57:23 | Epoch: 0 | Step: 120200 | Dataset: 0-678464 | Loss: 0.764 | 913 ms/step , 6891.49 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 03:57:25 | Validation | Step: 120200 | Val_loss: 0.851 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:57:34 | Epoch: 0 | Step: 120210 | Dataset: 0-678784 | Loss: 0.854 | 912 ms/step , 6894.21 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 03:57:43 | Epoch: 0 | Step: 120220 | Dataset: 0-679104 | Loss: 0.729 | 913 ms/step , 6891.89 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 03:57:52 | Epoch: 0 | Step: 120230 | Dataset: 0-679424 | Loss: 0.749 | 913 ms/step , 6889.51 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 03:58:01 | Epoch: 0 | Step: 120240 | Dataset: 0-679744 | Loss: 0.806 | 914 ms/step , 6883.54 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 03:58:11 | Epoch: 0 | Step: 120250 | Dataset: 0-680064 | Loss: 0.832 | 914 ms/step , 6882.78 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 03:58:20 | Epoch: 0 | Step: 120260 | Dataset: 0-680384 | Loss: 0.802 | 914 ms/step , 6878.24 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 03:58:29 | Epoch: 0 | Step: 120270 | Dataset: 0-680704 | Loss: 0.809 | 914 ms/step , 6882.89 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 03:58:38 | Epoch: 0 | Step: 120280 | Dataset: 0-681024 | Loss: 0.838 | 914 ms/step , 6881.22 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 03:58:47 | Epoch: 0 | Step: 120290 | Dataset: 0-681344 | Loss: 0.518 | 913 ms/step , 6889.62 GFLOP/s , 17924.4 
tokens/s INFO:__main__:2024-11-05 03:58:56 | Epoch: 0 | Step: 120300 | Dataset: 0-681664 | Loss: 0.816 | 914 ms/step , 6881.99 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 03:58:58 | Validation | Step: 120300 | Val_loss: 0.561 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 03:59:07 | Epoch: 0 | Step: 120310 | Dataset: 0-681984 | Loss: 0.901 | 913 ms/step , 6890.70 GFLOP/s , 15293.7 tokens/s INFO:__main__:2024-11-05 03:59:16 | Epoch: 0 | Step: 120320 | Dataset: 0-682304 | Loss: 0.578 | 912 ms/step , 6894.87 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 03:59:25 | Epoch: 0 | Step: 120330 | Dataset: 0-682624 | Loss: 0.851 | 914 ms/step , 6884.12 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 03:59:34 | Epoch: 0 | Step: 120340 | Dataset: 0-682944 | Loss: 0.828 | 913 ms/step , 6886.25 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 03:59:43 | Epoch: 0 | Step: 120350 | Dataset: 0-683264 | Loss: 0.759 | 913 ms/step , 6889.69 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 03:59:53 | Epoch: 0 | Step: 120360 | Dataset: 0-683584 | Loss: 0.863 | 915 ms/step , 6876.39 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 04:00:02 | Epoch: 0 | Step: 120370 | Dataset: 0-683904 | Loss: 0.932 | 914 ms/step , 6879.02 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 04:00:11 | Epoch: 0 | Step: 120380 | Dataset: 0-684224 | Loss: 0.833 | 914 ms/step , 6881.89 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 04:00:20 | Epoch: 0 | Step: 120390 | Dataset: 0-684544 | Loss: 0.742 | 913 ms/step , 6892.22 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 04:00:29 | Epoch: 0 | Step: 120400 | Dataset: 0-684864 | Loss: 0.826 | 914 ms/step , 6881.88 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 04:00:31 | Validation | Step: 120400 | Val_loss: 0.458 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:00:40 | Epoch: 0 | Step: 120410 | Dataset: 0-685184 | Loss: 0.912 | 914 ms/step , 6884.51 GFLOP/s , 15266.9 tokens/s INFO:__main__:2024-11-05 04:00:49 | Epoch: 
0 | Step: 120420 | Dataset: 0-685504 | Loss: 0.843 | 914 ms/step , 6878.23 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 04:00:58 | Epoch: 0 | Step: 120430 | Dataset: 0-685824 | Loss: 0.816 | 913 ms/step , 6886.15 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 04:01:07 | Epoch: 0 | Step: 120440 | Dataset: 0-686144 | Loss: 0.792 | 914 ms/step , 6878.30 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 04:01:16 | Epoch: 0 | Step: 120450 | Dataset: 0-686464 | Loss: 0.776 | 913 ms/step , 6888.98 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 04:01:26 | Epoch: 0 | Step: 120460 | Dataset: 0-686784 | Loss: 0.746 | 914 ms/step , 6877.76 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 04:01:35 | Epoch: 0 | Step: 120470 | Dataset: 0-687104 | Loss: 0.824 | 914 ms/step , 6883.13 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 04:01:44 | Epoch: 0 | Step: 120480 | Dataset: 0-687424 | Loss: 0.873 | 915 ms/step , 6874.99 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 04:01:53 | Epoch: 0 | Step: 120490 | Dataset: 0-687744 | Loss: 0.794 | 914 ms/step , 6882.05 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 04:02:02 | Epoch: 0 | Step: 120500 | Dataset: 0-688064 | Loss: 0.837 | 915 ms/step , 6876.65 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 04:02:04 | Validation | Step: 120500 | Val_loss: 0.510 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:02:13 | Epoch: 0 | Step: 120510 | Dataset: 0-688384 | Loss: 0.900 | 914 ms/step , 6878.43 GFLOP/s , 15282.0 tokens/s INFO:__main__:2024-11-05 04:02:22 | Epoch: 0 | Step: 120520 | Dataset: 0-688704 | Loss: 0.878 | 915 ms/step , 6873.06 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 04:02:31 | Epoch: 0 | Step: 120530 | Dataset: 0-689024 | Loss: 0.882 | 915 ms/step , 6871.24 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 04:02:40 | Epoch: 0 | Step: 120540 | Dataset: 0-689344 | Loss: 0.815 | 913 ms/step , 6887.44 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 04:02:49 | Epoch: 0 | Step: 
120550 | Dataset: 0-689664 | Loss: 0.932 | 915 ms/step , 6872.79 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 04:02:59 | Epoch: 0 | Step: 120560 | Dataset: 0-689984 | Loss: 0.707 | 913 ms/step , 6885.14 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 04:03:08 | Epoch: 0 | Step: 120570 | Dataset: 0-690304 | Loss: 0.790 | 915 ms/step , 6874.41 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 04:03:17 | Epoch: 0 | Step: 120580 | Dataset: 0-690624 | Loss: 0.747 | 913 ms/step , 6887.85 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 04:03:26 | Epoch: 0 | Step: 120590 | Dataset: 0-690944 | Loss: 0.809 | 914 ms/step , 6880.73 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 04:03:35 | Epoch: 0 | Step: 120600 | Dataset: 0-691264 | Loss: 0.665 | 914 ms/step , 6880.15 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 04:03:37 | Validation | Step: 120600 | Val_loss: 0.614 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:03:46 | Epoch: 0 | Step: 120610 | Dataset: 0-691584 | Loss: 0.882 | 915 ms/step , 6874.50 GFLOP/s , 15282.9 tokens/s INFO:__main__:2024-11-05 04:03:55 | Epoch: 0 | Step: 120620 | Dataset: 0-691904 | Loss: 0.728 | 913 ms/step , 6887.87 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 04:04:04 | Epoch: 0 | Step: 120630 | Dataset: 0-692224 | Loss: 0.795 | 915 ms/step , 6877.49 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 04:04:13 | Epoch: 0 | Step: 120640 | Dataset: 0-692544 | Loss: 0.878 | 915 ms/step , 6875.04 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 04:04:22 | Epoch: 0 | Step: 120650 | Dataset: 0-692864 | Loss: 0.694 | 914 ms/step , 6877.93 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 04:04:32 | Epoch: 0 | Step: 120660 | Dataset: 0-693184 | Loss: 0.580 | 915 ms/step , 6876.12 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 04:04:41 | Epoch: 0 | Step: 120670 | Dataset: 0-693504 | Loss: 0.758 | 914 ms/step , 6884.72 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 04:04:50 | Epoch: 0 | Step: 120680 | 
Dataset: 0-693824 | Loss: 0.778 | 914 ms/step , 6883.50 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 04:04:59 | Epoch: 0 | Step: 120690 | Dataset: 0-694144 | Loss: 0.895 | 916 ms/step , 6869.08 GFLOP/s , 17912.4 tokens/s INFO:__main__:2024-11-05 04:05:08 | Epoch: 0 | Step: 120700 | Dataset: 0-694464 | Loss: 0.832 | 913 ms/step , 6887.89 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 04:05:10 | Validation | Step: 120700 | Val_loss: 0.807 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:05:19 | Epoch: 0 | Step: 120710 | Dataset: 0-694784 | Loss: 0.771 | 913 ms/step , 6887.91 GFLOP/s , 15272.1 tokens/s INFO:__main__:2024-11-05 04:05:28 | Epoch: 0 | Step: 120720 | Dataset: 0-695104 | Loss: 0.798 | 912 ms/step , 6892.89 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 04:05:37 | Epoch: 0 | Step: 120730 | Dataset: 0-695424 | Loss: 0.735 | 914 ms/step , 6879.88 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 04:05:46 | Epoch: 0 | Step: 120740 | Dataset: 0-695744 | Loss: 0.717 | 914 ms/step , 6877.54 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 04:05:55 | Epoch: 0 | Step: 120750 | Dataset: 0-696064 | Loss: 0.690 | 913 ms/step , 6885.29 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 04:06:05 | Epoch: 0 | Step: 120760 | Dataset: 0-696384 | Loss: 0.878 | 915 ms/step , 6877.36 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 04:06:14 | Epoch: 0 | Step: 120770 | Dataset: 0-696704 | Loss: 0.826 | 913 ms/step , 6885.26 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 04:06:23 | Epoch: 0 | Step: 120780 | Dataset: 0-697024 | Loss: 0.784 | 913 ms/step , 6888.43 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 04:06:32 | Epoch: 0 | Step: 120790 | Dataset: 0-697344 | Loss: 0.909 | 914 ms/step , 6877.65 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 04:06:41 | Epoch: 0 | Step: 120800 | Dataset: 0-697664 | Loss: 0.872 | 915 ms/step , 6872.82 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 04:06:43 | Validation | Step: 120800 | 
Val_loss: 0.767 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:06:52 | Epoch: 0 | Step: 120810 | Dataset: 0-697984 | Loss: 0.772 | 914 ms/step , 6884.83 GFLOP/s , 15272.8 tokens/s INFO:__main__:2024-11-05 04:07:01 | Epoch: 0 | Step: 120820 | Dataset: 0-698304 | Loss: 0.881 | 914 ms/step , 6878.87 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 04:07:10 | Epoch: 0 | Step: 120830 | Dataset: 0-698624 | Loss: 0.668 | 913 ms/step , 6889.05 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 04:07:19 | Epoch: 0 | Step: 120840 | Dataset: 0-698944 | Loss: 0.925 | 913 ms/step , 6889.63 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 04:07:28 | Epoch: 0 | Step: 120850 | Dataset: 0-699264 | Loss: 0.769 | 914 ms/step , 6883.89 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 04:07:38 | Epoch: 0 | Step: 120860 | Dataset: 0-699584 | Loss: 0.694 | 915 ms/step , 6877.44 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 04:07:47 | Epoch: 0 | Step: 120870 | Dataset: 0-699904 | Loss: 0.490 | 912 ms/step , 6899.95 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 04:07:56 | Epoch: 0 | Step: 120880 | Dataset: 0-700224 | Loss: 0.880 | 915 ms/step , 6872.96 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 04:08:05 | Epoch: 0 | Step: 120890 | Dataset: 0-700544 | Loss: 0.757 | 915 ms/step , 6871.00 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 04:08:14 | Epoch: 0 | Step: 120900 | Dataset: 0-700864 | Loss: 0.929 | 914 ms/step , 6881.60 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 04:08:16 | Validation | Step: 120900 | Val_loss: 0.795 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:08:25 | Epoch: 0 | Step: 120910 | Dataset: 0-701184 | Loss: 0.737 | 913 ms/step , 6888.01 GFLOP/s , 15264.3 tokens/s INFO:__main__:2024-11-05 04:08:34 | Epoch: 0 | Step: 120920 | Dataset: 0-701504 | Loss: 0.773 | 913 ms/step , 6890.61 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 04:08:43 | Epoch: 0 | Step: 120930 | Dataset: 0-701824 | Loss: 0.863 | 914 ms/step , 
6880.08 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 04:08:52 | Epoch: 0 | Step: 120940 | Dataset: 0-702144 | Loss: 0.747 | 914 ms/step , 6882.38 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 04:09:02 | Epoch: 0 | Step: 120950 | Dataset: 0-702464 | Loss: 0.978 | 916 ms/step , 6865.93 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 04:09:11 | Epoch: 0 | Step: 120960 | Dataset: 0-702784 | Loss: 0.797 | 912 ms/step , 6893.71 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 04:09:20 | Epoch: 0 | Step: 120970 | Dataset: 0-703104 | Loss: 0.526 | 913 ms/step , 6889.97 GFLOP/s , 17911.8 tokens/s INFO:__main__:2024-11-05 04:09:29 | Epoch: 0 | Step: 120980 | Dataset: 0-703424 | Loss: 0.838 | 914 ms/step , 6884.48 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 04:09:38 | Epoch: 0 | Step: 120990 | Dataset: 0-703744 | Loss: 0.726 | 913 ms/step , 6886.02 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 04:09:47 | Epoch: 0 | Step: 121000 | Dataset: 0-704064 | Loss: 0.666 | 913 ms/step , 6887.93 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 04:09:49 | Validation | Step: 121000 | Val_loss: 0.792 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:09:49 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_040949_step_121000.pt` INFO:__main__:2024-11-05 04:09:59 | Epoch: 0 | Step: 121010 | Dataset: 0-704384 | Loss: 0.936 | 914 ms/step , 6879.77 GFLOP/s , 13814.8 tokens/s INFO:__main__:2024-11-05 04:10:08 | Epoch: 0 | Step: 121020 | Dataset: 0-704704 | Loss: 0.848 | 914 ms/step , 6884.13 GFLOP/s , 17905.3 tokens/s INFO:__main__:2024-11-05 04:10:17 | Epoch: 0 | Step: 121030 | Dataset: 0-705024 | Loss: 0.620 | 913 ms/step , 6888.58 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 04:10:27 | Epoch: 0 | Step: 121040 | Dataset: 0-705344 | Loss: 0.791 | 914 ms/step , 6881.93 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 04:10:36 | Epoch: 0 | Step: 121050 | Dataset: 0-705664 | Loss: 0.716 | 913 ms/step , 6887.36 
GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 04:10:45 | Epoch: 0 | Step: 121060 | Dataset: 0-705984 | Loss: 0.823 | 913 ms/step , 6891.99 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 04:10:54 | Epoch: 0 | Step: 121070 | Dataset: 0-706304 | Loss: 0.772 | 914 ms/step , 6883.95 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 04:11:03 | Epoch: 0 | Step: 121080 | Dataset: 0-706624 | Loss: 0.729 | 913 ms/step , 6889.65 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 04:11:12 | Epoch: 0 | Step: 121090 | Dataset: 0-706944 | Loss: 0.880 | 915 ms/step , 6877.44 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 04:11:21 | Epoch: 0 | Step: 121100 | Dataset: 0-707264 | Loss: 0.813 | 916 ms/step , 6867.91 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 04:11:23 | Validation | Step: 121100 | Val_loss: 0.770 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:11:32 | Epoch: 0 | Step: 121110 | Dataset: 0-707584 | Loss: 0.773 | 914 ms/step , 6884.12 GFLOP/s , 15268.9 tokens/s INFO:__main__:2024-11-05 04:11:41 | Epoch: 0 | Step: 121120 | Dataset: 0-707904 | Loss: 0.584 | 912 ms/step , 6895.94 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 04:11:50 | Epoch: 0 | Step: 121130 | Dataset: 0-708224 | Loss: 0.853 | 914 ms/step , 6884.25 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 04:12:00 | Epoch: 0 | Step: 121140 | Dataset: 0-708544 | Loss: 0.769 | 914 ms/step , 6883.05 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 04:12:09 | Epoch: 0 | Step: 121150 | Dataset: 0-708864 | Loss: 0.947 | 914 ms/step , 6879.54 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 04:12:18 | Epoch: 0 | Step: 121160 | Dataset: 0-709184 | Loss: 0.834 | 913 ms/step , 6889.64 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 04:12:27 | Epoch: 0 | Step: 121170 | Dataset: 0-709504 | Loss: 0.879 | 915 ms/step , 6874.56 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 04:12:36 | Epoch: 0 | Step: 121180 | Dataset: 0-709824 | Loss: 0.802 | 914 ms/step , 6879.12 GFLOP/s , 
17922.2 tokens/s INFO:__main__:2024-11-05 04:12:45 | Epoch: 0 | Step: 121190 | Dataset: 0-710144 | Loss: 0.822 | 914 ms/step , 6883.89 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 04:12:54 | Epoch: 0 | Step: 121200 | Dataset: 0-710464 | Loss: 0.654 | 913 ms/step , 6890.90 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 04:12:56 | Validation | Step: 121200 | Val_loss: 0.790 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:13:05 | Epoch: 0 | Step: 121210 | Dataset: 0-710784 | Loss: 0.883 | 914 ms/step , 6883.80 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-05 04:13:14 | Epoch: 0 | Step: 121220 | Dataset: 0-711104 | Loss: 0.752 | 915 ms/step , 6875.03 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 04:13:23 | Epoch: 0 | Step: 121230 | Dataset: 0-711424 | Loss: 0.579 | 913 ms/step , 6889.76 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 04:13:33 | Epoch: 0 | Step: 121240 | Dataset: 0-711744 | Loss: 0.813 | 912 ms/step , 6895.63 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 04:13:42 | Epoch: 0 | Step: 121250 | Dataset: 0-712064 | Loss: 0.725 | 915 ms/step , 6873.49 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 04:13:51 | Epoch: 0 | Step: 121260 | Dataset: 0-712384 | Loss: 0.566 | 914 ms/step , 6884.25 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 04:14:00 | Epoch: 0 | Step: 121270 | Dataset: 0-712704 | Loss: 0.848 | 913 ms/step , 6886.73 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 04:14:09 | Epoch: 0 | Step: 121280 | Dataset: 0-713024 | Loss: 0.655 | 914 ms/step , 6882.02 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 04:14:18 | Epoch: 0 | Step: 121290 | Dataset: 0-713344 | Loss: 0.716 | 915 ms/step , 6873.06 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 04:14:27 | Epoch: 0 | Step: 121300 | Dataset: 0-713664 | Loss: 0.944 | 916 ms/step , 6868.61 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 04:14:29 | Validation | Step: 121300 | Val_loss: 0.783 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:14:38 
| Epoch: 0 | Step: 121310 | Dataset: 0-713984 | Loss: 0.863 | 914 ms/step , 6884.69 GFLOP/s , 15276.9 tokens/s INFO:__main__:2024-11-05 04:14:47 | Epoch: 0 | Step: 121320 | Dataset: 0-714304 | Loss: 0.808 | 912 ms/step , 6892.96 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 04:14:56 | Epoch: 0 | Step: 121330 | Dataset: 0-714624 | Loss: 0.789 | 913 ms/step , 6887.55 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 04:15:05 | Epoch: 0 | Step: 121340 | Dataset: 0-714944 | Loss: 0.828 | 913 ms/step , 6888.03 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 04:15:15 | Epoch: 0 | Step: 121350 | Dataset: 0-715264 | Loss: 0.884 | 914 ms/step , 6882.86 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 04:15:24 | Epoch: 0 | Step: 121360 | Dataset: 0-715584 | Loss: 0.804 | 914 ms/step , 6881.05 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 04:15:33 | Epoch: 0 | Step: 121370 | Dataset: 0-715904 | Loss: 0.575 | 912 ms/step , 6898.19 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 04:15:42 | Epoch: 0 | Step: 121380 | Dataset: 0-716224 | Loss: 0.504 | 913 ms/step , 6887.48 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 04:15:51 | Epoch: 0 | Step: 121390 | Dataset: 0-716544 | Loss: 0.904 | 914 ms/step , 6881.68 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 04:16:00 | Epoch: 0 | Step: 121400 | Dataset: 0-716864 | Loss: 0.799 | 916 ms/step , 6869.63 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 04:16:02 | Validation | Step: 121400 | Val_loss: 0.783 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:16:11 | Epoch: 0 | Step: 121410 | Dataset: 0-717184 | Loss: 0.717 | 914 ms/step , 6883.30 GFLOP/s , 15264.5 tokens/s INFO:__main__:2024-11-05 04:16:20 | Epoch: 0 | Step: 121420 | Dataset: 0-717504 | Loss: 0.893 | 914 ms/step , 6877.58 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 04:16:29 | Epoch: 0 | Step: 121430 | Dataset: 0-717824 | Loss: 0.885 | 913 ms/step , 6886.87 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 04:16:38 | Epoch: 0 
| Step: 121440 | Dataset: 0-718144 | Loss: 0.629 | 913 ms/step , 6888.40 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 04:16:48 | Epoch: 0 | Step: 121450 | Dataset: 0-718464 | Loss: 0.823 | 914 ms/step , 6881.64 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 04:16:57 | Epoch: 0 | Step: 121460 | Dataset: 0-718784 | Loss: 0.805 | 913 ms/step , 6887.60 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 04:17:06 | Epoch: 0 | Step: 121470 | Dataset: 0-719104 | Loss: 0.822 | 914 ms/step , 6880.43 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 04:17:15 | Epoch: 0 | Step: 121480 | Dataset: 0-719424 | Loss: 0.859 | 913 ms/step , 6891.05 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 04:17:24 | Epoch: 0 | Step: 121490 | Dataset: 0-719744 | Loss: 0.786 | 913 ms/step , 6888.62 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 04:17:33 | Epoch: 0 | Step: 121500 | Dataset: 0-720064 | Loss: 0.634 | 913 ms/step , 6888.00 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 04:17:35 | Validation | Step: 121500 | Val_loss: 0.783 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:17:44 | Epoch: 0 | Step: 121510 | Dataset: 0-720384 | Loss: 0.737 | 913 ms/step , 6891.25 GFLOP/s , 15278.4 tokens/s INFO:__main__:2024-11-05 04:17:53 | Epoch: 0 | Step: 121520 | Dataset: 0-720704 | Loss: 0.825 | 912 ms/step , 6893.27 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 04:18:02 | Epoch: 0 | Step: 121530 | Dataset: 0-721024 | Loss: 0.880 | 913 ms/step , 6888.81 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 04:18:11 | Epoch: 0 | Step: 121540 | Dataset: 0-721344 | Loss: 0.714 | 913 ms/step , 6890.57 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 04:18:21 | Epoch: 0 | Step: 121550 | Dataset: 0-721664 | Loss: 0.673 | 912 ms/step , 6892.76 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 04:18:30 | Epoch: 0 | Step: 121560 | Dataset: 0-721984 | Loss: 0.801 | 915 ms/step , 6876.18 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 04:18:39 | Epoch: 0 | Step: 
121570 | Dataset: 0-722304 | Loss: 0.791 | 913 ms/step , 6889.28 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 04:18:48 | Epoch: 0 | Step: 121580 | Dataset: 0-722624 | Loss: 0.739 | 913 ms/step , 6888.96 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 04:18:57 | Epoch: 0 | Step: 121590 | Dataset: 0-722944 | Loss: 0.756 | 914 ms/step , 6878.26 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 04:19:06 | Epoch: 0 | Step: 121600 | Dataset: 0-723264 | Loss: 0.612 | 912 ms/step , 6893.97 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 04:19:08 | Validation | Step: 121600 | Val_loss: 0.824 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:19:17 | Epoch: 0 | Step: 121610 | Dataset: 0-723584 | Loss: 0.739 | 912 ms/step , 6895.32 GFLOP/s , 15274.4 tokens/s INFO:__main__:2024-11-05 04:19:26 | Epoch: 0 | Step: 121620 | Dataset: 0-723904 | Loss: 0.777 | 915 ms/step , 6870.56 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 04:19:35 | Epoch: 0 | Step: 121630 | Dataset: 0-724224 | Loss: 0.801 | 913 ms/step , 6888.77 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 04:19:44 | Epoch: 0 | Step: 121640 | Dataset: 0-724544 | Loss: 0.630 | 913 ms/step , 6892.30 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 04:19:54 | Epoch: 0 | Step: 121650 | Dataset: 0-724864 | Loss: 0.812 | 914 ms/step , 6880.61 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 04:20:03 | Epoch: 0 | Step: 121660 | Dataset: 0-725184 | Loss: 0.841 | 914 ms/step , 6879.57 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 04:20:12 | Epoch: 0 | Step: 121670 | Dataset: 0-725504 | Loss: 0.772 | 914 ms/step , 6883.03 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 04:20:21 | Epoch: 0 | Step: 121680 | Dataset: 0-725824 | Loss: 0.800 | 913 ms/step , 6889.24 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 04:20:30 | Epoch: 0 | Step: 121690 | Dataset: 0-726144 | Loss: 0.912 | 913 ms/step , 6887.09 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 04:20:39 | Epoch: 0 | Step: 121700 | 
Dataset: 0-726464 | Loss: 0.771 | 913 ms/step , 6892.15 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 04:20:41 | Validation | Step: 121700 | Val_loss: 0.764 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:20:50 | Epoch: 0 | Step: 121710 | Dataset: 0-726784 | Loss: 0.880 | 914 ms/step , 6883.99 GFLOP/s , 15276.1 tokens/s INFO:__main__:2024-11-05 04:20:59 | Epoch: 0 | Step: 121720 | Dataset: 0-727104 | Loss: 0.799 | 913 ms/step , 6891.79 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 04:21:08 | Epoch: 0 | Step: 121730 | Dataset: 0-727424 | Loss: 0.717 | 913 ms/step , 6891.84 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 04:21:17 | Epoch: 0 | Step: 121740 | Dataset: 0-727744 | Loss: 0.770 | 912 ms/step , 6893.89 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 04:21:27 | Epoch: 0 | Step: 121750 | Dataset: 0-728064 | Loss: 0.888 | 914 ms/step , 6884.64 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 04:21:36 | Epoch: 0 | Step: 121760 | Dataset: 0-728384 | Loss: 0.858 | 914 ms/step , 6877.87 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 04:21:45 | Epoch: 0 | Step: 121770 | Dataset: 0-728704 | Loss: 0.839 | 914 ms/step , 6879.76 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 04:21:54 | Epoch: 0 | Step: 121780 | Dataset: 0-729024 | Loss: 0.900 | 914 ms/step , 6880.65 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 04:22:03 | Epoch: 0 | Step: 121790 | Dataset: 0-729344 | Loss: 0.758 | 915 ms/step , 6874.47 GFLOP/s , 17906.8 tokens/s INFO:__main__:2024-11-05 04:22:12 | Epoch: 0 | Step: 121800 | Dataset: 0-729664 | Loss: 0.790 | 913 ms/step , 6889.52 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 04:22:14 | Validation | Step: 121800 | Val_loss: 0.798 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:22:23 | Epoch: 0 | Step: 121810 | Dataset: 0-729984 | Loss: 0.771 | 914 ms/step , 6880.70 GFLOP/s , 15263.1 tokens/s INFO:__main__:2024-11-05 04:22:32 | Epoch: 0 | Step: 121820 | Dataset: 0-730304 | Loss: 0.793 | 914 ms/step , 
6878.79 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 04:22:41 | Epoch: 0 | Step: 121830 | Dataset: 0-730624 | Loss: 0.864 | 916 ms/step , 6866.95 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 04:22:50 | Epoch: 0 | Step: 121840 | Dataset: 0-730944 | Loss: 0.722 | 914 ms/step , 6884.51 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 04:23:00 | Epoch: 0 | Step: 121850 | Dataset: 0-731264 | Loss: 0.821 | 913 ms/step , 6888.01 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 04:23:09 | Epoch: 0 | Step: 121860 | Dataset: 0-731584 | Loss: 0.890 | 916 ms/step , 6862.67 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-05 04:23:18 | Epoch: 0 | Step: 121870 | Dataset: 0-731904 | Loss: 0.715 | 912 ms/step , 6892.77 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 04:23:27 | Epoch: 0 | Step: 121880 | Dataset: 0-732224 | Loss: 0.785 | 915 ms/step , 6875.64 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 04:23:36 | Epoch: 0 | Step: 121890 | Dataset: 0-732544 | Loss: 0.768 | 915 ms/step , 6873.73 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 04:23:45 | Epoch: 0 | Step: 121900 | Dataset: 0-732864 | Loss: 0.857 | 917 ms/step , 6858.90 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 04:23:47 | Validation | Step: 121900 | Val_loss: 0.745 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:23:56 | Epoch: 0 | Step: 121910 | Dataset: 0-733184 | Loss: 0.730 | 914 ms/step , 6882.47 GFLOP/s , 15267.3 tokens/s INFO:__main__:2024-11-05 04:24:05 | Epoch: 0 | Step: 121920 | Dataset: 0-733504 | Loss: 0.811 | 915 ms/step , 6872.23 GFLOP/s , 17912.4 tokens/s INFO:__main__:2024-11-05 04:24:14 | Epoch: 0 | Step: 121930 | Dataset: 0-733824 | Loss: 0.697 | 914 ms/step , 6877.70 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 04:24:23 | Epoch: 0 | Step: 121940 | Dataset: 0-734144 | Loss: 0.778 | 912 ms/step , 6896.77 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 04:24:33 | Epoch: 0 | Step: 121950 | Dataset: 0-734464 | Loss: 0.889 | 914 ms/step , 6878.54 
GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 04:24:42 | Epoch: 0 | Step: 121960 | Dataset: 0-734784 | Loss: 0.769 | 914 ms/step , 6885.04 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 04:24:51 | Epoch: 0 | Step: 121970 | Dataset: 0-735104 | Loss: 0.834 | 915 ms/step , 6871.24 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 04:25:00 | Epoch: 0 | Step: 121980 | Dataset: 0-735424 | Loss: 0.824 | 915 ms/step , 6875.36 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 04:25:09 | Epoch: 0 | Step: 121990 | Dataset: 0-735744 | Loss: 0.740 | 913 ms/step , 6889.45 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 04:25:18 | Epoch: 0 | Step: 122000 | Dataset: 0-736064 | Loss: 0.735 | 912 ms/step , 6892.87 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 04:25:20 | Validation | Step: 122000 | Val_loss: 0.944 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:25:20 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_042520_step_122000.pt` INFO:__main__:2024-11-05 04:25:30 | Epoch: 0 | Step: 122010 | Dataset: 0-736384 | Loss: 0.667 | 914 ms/step , 6884.67 GFLOP/s , 13817.4 tokens/s INFO:__main__:2024-11-05 04:25:39 | Epoch: 0 | Step: 122020 | Dataset: 0-736704 | Loss: 0.881 | 915 ms/step , 6871.95 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 04:25:48 | Epoch: 0 | Step: 122030 | Dataset: 0-737024 | Loss: 0.778 | 915 ms/step , 6877.30 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 04:25:58 | Epoch: 0 | Step: 122040 | Dataset: 0-737344 | Loss: 0.793 | 913 ms/step , 6885.17 GFLOP/s , 17891.9 tokens/s INFO:__main__:2024-11-05 04:26:07 | Epoch: 0 | Step: 122050 | Dataset: 0-737664 | Loss: 0.759 | 913 ms/step , 6889.74 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 04:26:16 | Epoch: 0 | Step: 122060 | Dataset: 0-737984 | Loss: 0.755 | 913 ms/step , 6889.08 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 04:26:25 | Epoch: 0 | Step: 122070 | Dataset: 0-738304 | Loss: 0.729 | 914 ms/step , 6883.51 GFLOP/s , 
17920.4 tokens/s INFO:__main__:2024-11-05 04:26:34 | Epoch: 0 | Step: 122080 | Dataset: 0-738624 | Loss: 0.688 | 913 ms/step , 6886.49 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 04:26:43 | Epoch: 0 | Step: 122090 | Dataset: 0-738944 | Loss: 0.687 | 914 ms/step , 6883.42 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 04:26:52 | Epoch: 0 | Step: 122100 | Dataset: 0-739264 | Loss: 0.754 | 914 ms/step , 6881.84 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 04:26:54 | Validation | Step: 122100 | Val_loss: 0.800 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:27:03 | Epoch: 0 | Step: 122110 | Dataset: 0-739584 | Loss: 0.710 | 914 ms/step , 6879.24 GFLOP/s , 15275.4 tokens/s INFO:__main__:2024-11-05 04:27:12 | Epoch: 0 | Step: 122120 | Dataset: 0-739904 | Loss: 0.757 | 915 ms/step , 6875.15 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 04:27:21 | Epoch: 0 | Step: 122130 | Dataset: 0-740224 | Loss: 0.630 | 913 ms/step , 6886.08 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 04:27:31 | Epoch: 0 | Step: 122140 | Dataset: 0-740544 | Loss: 0.751 | 916 ms/step , 6865.82 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 04:27:40 | Epoch: 0 | Step: 122150 | Dataset: 0-740864 | Loss: 0.824 | 914 ms/step , 6880.01 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 04:27:49 | Epoch: 0 | Step: 122160 | Dataset: 0-741184 | Loss: 0.856 | 913 ms/step , 6885.13 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 04:27:58 | Epoch: 0 | Step: 122170 | Dataset: 0-741504 | Loss: 0.681 | 913 ms/step , 6889.96 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 04:28:07 | Epoch: 0 | Step: 122180 | Dataset: 0-741824 | Loss: 0.832 | 915 ms/step , 6875.96 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 04:28:16 | Epoch: 0 | Step: 122190 | Dataset: 0-742144 | Loss: 0.761 | 913 ms/step , 6885.50 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 04:28:25 | Epoch: 0 | Step: 122200 | Dataset: 0-742464 | Loss: 0.766 | 914 ms/step , 6883.31 GFLOP/s , 17913.9 
tokens/s INFO:__main__:2024-11-05 04:28:27 | Validation | Step: 122200 | Val_loss: 0.883 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:28:36 | Epoch: 0 | Step: 122210 | Dataset: 0-742784 | Loss: 0.819 | 915 ms/step , 6874.49 GFLOP/s , 15267.6 tokens/s INFO:__main__:2024-11-05 04:28:45 | Epoch: 0 | Step: 122220 | Dataset: 0-743104 | Loss: 0.784 | 913 ms/step , 6886.98 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 04:28:54 | Epoch: 0 | Step: 122230 | Dataset: 0-743424 | Loss: 0.685 | 914 ms/step , 6878.32 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 04:29:04 | Epoch: 0 | Step: 122240 | Dataset: 0-743744 | Loss: 0.764 | 914 ms/step , 6882.24 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-05 04:29:13 | Epoch: 0 | Step: 122250 | Dataset: 0-744064 | Loss: 0.727 | 914 ms/step , 6883.15 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 04:29:22 | Epoch: 0 | Step: 122260 | Dataset: 0-744384 | Loss: 0.721 | 913 ms/step , 6887.24 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 04:29:31 | Epoch: 0 | Step: 122270 | Dataset: 0-744704 | Loss: 0.703 | 914 ms/step , 6883.24 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 04:29:40 | Epoch: 0 | Step: 122280 | Dataset: 0-745024 | Loss: 0.770 | 914 ms/step , 6880.25 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 04:29:49 | Epoch: 0 | Step: 122290 | Dataset: 0-745344 | Loss: 0.836 | 913 ms/step , 6885.15 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 04:29:58 | Epoch: 0 | Step: 122300 | Dataset: 0-745664 | Loss: 0.786 | 915 ms/step , 6870.92 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-05 04:30:00 | Validation | Step: 122300 | Val_loss: 0.903 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:30:09 | Epoch: 0 | Step: 122310 | Dataset: 0-745984 | Loss: 0.755 | 914 ms/step , 6879.73 GFLOP/s , 15269.5 tokens/s INFO:__main__:2024-11-05 04:30:18 | Epoch: 0 | Step: 122320 | Dataset: 0-746304 | Loss: 0.876 | 913 ms/step , 6885.15 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 04:30:27 | Epoch: 
0 | Step: 122330 | Dataset: 0-746624 | Loss: 0.892 | 916 ms/step , 6868.17 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 04:30:37 | Epoch: 0 | Step: 122340 | Dataset: 0-746944 | Loss: 0.784 | 914 ms/step , 6881.74 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 04:30:46 | Epoch: 0 | Step: 122350 | Dataset: 0-747264 | Loss: 0.813 | 914 ms/step , 6881.05 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 04:30:55 | Epoch: 0 | Step: 122360 | Dataset: 0-747584 | Loss: 0.798 | 914 ms/step , 6882.88 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 04:31:04 | Epoch: 0 | Step: 122370 | Dataset: 0-747904 | Loss: 0.764 | 913 ms/step , 6885.25 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-05 04:31:13 | Epoch: 0 | Step: 122380 | Dataset: 0-748224 | Loss: 0.814 | 913 ms/step , 6892.25 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 04:31:22 | Epoch: 0 | Step: 122390 | Dataset: 0-748544 | Loss: 0.789 | 914 ms/step , 6880.80 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 04:31:31 | Epoch: 0 | Step: 122400 | Dataset: 0-748864 | Loss: 0.737 | 914 ms/step , 6884.61 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 04:31:33 | Validation | Step: 122400 | Val_loss: 0.917 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:31:42 | Epoch: 0 | Step: 122410 | Dataset: 0-749184 | Loss: 0.853 | 915 ms/step , 6874.44 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-05 04:31:51 | Epoch: 0 | Step: 122420 | Dataset: 0-749504 | Loss: 0.806 | 914 ms/step , 6880.50 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 04:32:01 | Epoch: 0 | Step: 122430 | Dataset: 0-749824 | Loss: 0.772 | 913 ms/step , 6890.13 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 04:32:10 | Epoch: 0 | Step: 122440 | Dataset: 0-750144 | Loss: 0.714 | 914 ms/step , 6882.73 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 04:32:19 | Epoch: 0 | Step: 122450 | Dataset: 0-750464 | Loss: 0.755 | 912 ms/step , 6893.98 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 04:32:28 | Epoch: 0 | Step: 
122460 | Dataset: 0-750784 | Loss: 0.710 | 915 ms/step , 6877.29 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 04:32:37 | Epoch: 0 | Step: 122470 | Dataset: 0-751104 | Loss: 0.727 | 913 ms/step , 6892.15 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 04:32:46 | Epoch: 0 | Step: 122480 | Dataset: 0-751424 | Loss: 0.903 | 917 ms/step , 6862.29 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-05 04:32:55 | Epoch: 0 | Step: 122490 | Dataset: 0-751744 | Loss: 0.748 | 916 ms/step , 6866.46 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 04:33:05 | Epoch: 0 | Step: 122500 | Dataset: 0-752064 | Loss: 0.735 | 913 ms/step , 6889.06 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 04:33:06 | Validation | Step: 122500 | Val_loss: 0.747 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:33:15 | Epoch: 0 | Step: 122510 | Dataset: 0-752384 | Loss: 0.733 | 913 ms/step , 6888.82 GFLOP/s , 15262.4 tokens/s INFO:__main__:2024-11-05 04:33:24 | Epoch: 0 | Step: 122520 | Dataset: 0-752704 | Loss: 0.623 | 913 ms/step , 6889.27 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 04:33:34 | Epoch: 0 | Step: 122530 | Dataset: 0-753024 | Loss: 0.784 | 913 ms/step , 6885.70 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 04:33:43 | Epoch: 0 | Step: 122540 | Dataset: 0-753344 | Loss: 0.703 | 914 ms/step , 6883.19 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 04:33:52 | Epoch: 0 | Step: 122550 | Dataset: 0-753664 | Loss: 0.717 | 914 ms/step , 6878.24 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 04:34:01 | Epoch: 0 | Step: 122560 | Dataset: 0-753984 | Loss: 0.762 | 915 ms/step , 6872.71 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 04:34:10 | Epoch: 0 | Step: 122570 | Dataset: 0-754304 | Loss: 0.742 | 915 ms/step , 6875.07 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 04:34:19 | Epoch: 0 | Step: 122580 | Dataset: 0-754624 | Loss: 0.773 | 913 ms/step , 6885.54 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 04:34:28 | Epoch: 0 | Step: 122590 | 
Dataset: 0-754944 | Loss: 0.760 | 914 ms/step , 6880.38 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 04:34:38 | Epoch: 0 | Step: 122600 | Dataset: 0-755264 | Loss: 0.836 | 912 ms/step , 6897.18 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 04:34:39 | Validation | Step: 122600 | Val_loss: 0.789 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:34:48 | Epoch: 0 | Step: 122610 | Dataset: 0-755584 | Loss: 0.840 | 914 ms/step , 6883.79 GFLOP/s , 15265.5 tokens/s INFO:__main__:2024-11-05 04:34:57 | Epoch: 0 | Step: 122620 | Dataset: 0-755904 | Loss: 0.744 | 914 ms/step , 6882.59 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 04:35:07 | Epoch: 0 | Step: 122630 | Dataset: 0-756224 | Loss: 0.706 | 913 ms/step , 6888.87 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 04:35:16 | Epoch: 0 | Step: 122640 | Dataset: 0-756544 | Loss: 0.874 | 915 ms/step , 6874.37 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 04:35:25 | Epoch: 0 | Step: 122650 | Dataset: 0-756864 | Loss: 0.809 | 914 ms/step , 6879.25 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 04:35:34 | Epoch: 0 | Step: 122660 | Dataset: 0-757184 | Loss: 0.743 | 913 ms/step , 6889.81 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 04:35:43 | Epoch: 0 | Step: 122670 | Dataset: 0-757504 | Loss: 0.790 | 914 ms/step , 6878.17 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 04:35:52 | Epoch: 0 | Step: 122680 | Dataset: 0-757824 | Loss: 0.627 | 914 ms/step , 6879.63 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 04:36:01 | Epoch: 0 | Step: 122690 | Dataset: 0-758144 | Loss: 0.809 | 914 ms/step , 6884.58 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 04:36:11 | Epoch: 0 | Step: 122700 | Dataset: 0-758464 | Loss: 0.754 | 914 ms/step , 6883.64 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 04:36:12 | Validation | Step: 122700 | Val_loss: 0.905 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:36:21 | Epoch: 0 | Step: 122710 | Dataset: 0-758784 | Loss: 0.747 | 912 ms/step , 
6892.63 GFLOP/s , 15266.2 tokens/s INFO:__main__:2024-11-05 04:36:30 | Epoch: 0 | Step: 122720 | Dataset: 0-759104 | Loss: 0.724 | 915 ms/step , 6876.86 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 04:36:40 | Epoch: 0 | Step: 122730 | Dataset: 0-759424 | Loss: 0.790 | 914 ms/step , 6877.72 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 04:36:49 | Epoch: 0 | Step: 122740 | Dataset: 0-759744 | Loss: 0.738 | 915 ms/step , 6870.41 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 04:36:58 | Epoch: 0 | Step: 122750 | Dataset: 0-760064 | Loss: 0.843 | 913 ms/step , 6887.33 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 04:37:07 | Epoch: 0 | Step: 122760 | Dataset: 0-760384 | Loss: 0.812 | 914 ms/step , 6879.39 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 04:37:16 | Epoch: 0 | Step: 122770 | Dataset: 0-760704 | Loss: 0.844 | 913 ms/step , 6885.97 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 04:37:25 | Epoch: 0 | Step: 122780 | Dataset: 0-761024 | Loss: 0.772 | 913 ms/step , 6892.49 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 04:37:34 | Epoch: 0 | Step: 122790 | Dataset: 0-761344 | Loss: 0.836 | 915 ms/step , 6873.04 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 04:37:44 | Epoch: 0 | Step: 122800 | Dataset: 0-761664 | Loss: 0.749 | 914 ms/step , 6884.87 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 04:37:45 | Validation | Step: 122800 | Val_loss: 0.882 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:37:54 | Epoch: 0 | Step: 122810 | Dataset: 0-761984 | Loss: 0.790 | 913 ms/step , 6887.07 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-05 04:38:03 | Epoch: 0 | Step: 122820 | Dataset: 0-762304 | Loss: 0.765 | 915 ms/step , 6873.84 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 04:38:13 | Epoch: 0 | Step: 122830 | Dataset: 0-762624 | Loss: 0.794 | 914 ms/step , 6881.25 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 04:38:22 | Epoch: 0 | Step: 122840 | Dataset: 0-762944 | Loss: 0.755 | 914 ms/step , 6881.53 
GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 04:38:31 | Epoch: 0 | Step: 122850 | Dataset: 0-763264 | Loss: 0.774 | 915 ms/step , 6874.94 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 04:38:40 | Epoch: 0 | Step: 122860 | Dataset: 0-763584 | Loss: 0.807 | 916 ms/step , 6868.42 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-05 04:38:49 | Epoch: 0 | Step: 122870 | Dataset: 0-763904 | Loss: 0.918 | 914 ms/step , 6881.70 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 04:38:58 | Epoch: 0 | Step: 122880 | Dataset: 0-764224 | Loss: 0.757 | 915 ms/step , 6875.41 GFLOP/s , 17906.8 tokens/s INFO:__main__:2024-11-05 04:39:07 | Epoch: 0 | Step: 122890 | Dataset: 0-764544 | Loss: 0.823 | 916 ms/step , 6866.92 GFLOP/s , 17908.0 tokens/s INFO:__main__:2024-11-05 04:39:17 | Epoch: 0 | Step: 122900 | Dataset: 0-764864 | Loss: 0.862 | 914 ms/step , 6878.88 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 04:39:18 | Validation | Step: 122900 | Val_loss: 0.922 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:39:27 | Epoch: 0 | Step: 122910 | Dataset: 0-765184 | Loss: 0.818 | 914 ms/step , 6878.58 GFLOP/s , 15271.3 tokens/s INFO:__main__:2024-11-05 04:39:36 | Epoch: 0 | Step: 122920 | Dataset: 0-765504 | Loss: 0.586 | 913 ms/step , 6891.92 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 04:39:46 | Epoch: 0 | Step: 122930 | Dataset: 0-765824 | Loss: 0.748 | 915 ms/step , 6875.23 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 04:39:55 | Epoch: 0 | Step: 122940 | Dataset: 0-766144 | Loss: 0.740 | 914 ms/step , 6883.49 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 04:40:04 | Epoch: 0 | Step: 122950 | Dataset: 0-766464 | Loss: 0.796 | 914 ms/step , 6880.99 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 04:40:13 | Epoch: 0 | Step: 122960 | Dataset: 0-766784 | Loss: 0.785 | 914 ms/step , 6878.98 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 04:40:22 | Epoch: 0 | Step: 122970 | Dataset: 0-767104 | Loss: 0.766 | 913 ms/step , 6885.69 GFLOP/s , 
17917.9 tokens/s INFO:__main__:2024-11-05 04:40:31 | Epoch: 0 | Step: 122980 | Dataset: 0-767424 | Loss: 0.730 | 913 ms/step , 6891.34 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 04:40:40 | Epoch: 0 | Step: 122990 | Dataset: 0-767744 | Loss: 0.752 | 913 ms/step , 6889.53 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 04:40:50 | Epoch: 0 | Step: 123000 | Dataset: 0-768064 | Loss: 0.831 | 914 ms/step , 6879.46 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 04:40:51 | Validation | Step: 123000 | Val_loss: 0.921 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:40:51 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_044051_step_123000.pt` INFO:__main__:2024-11-05 04:41:01 | Epoch: 0 | Step: 123010 | Dataset: 0-768384 | Loss: 0.821 | 913 ms/step , 6887.83 GFLOP/s , 13795.0 tokens/s INFO:__main__:2024-11-05 04:41:11 | Epoch: 0 | Step: 123020 | Dataset: 0-768704 | Loss: 0.760 | 913 ms/step , 6891.17 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 04:41:20 | Epoch: 0 | Step: 123030 | Dataset: 0-769024 | Loss: 0.705 | 913 ms/step , 6886.71 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 04:41:29 | Epoch: 0 | Step: 123040 | Dataset: 0-769344 | Loss: 0.787 | 914 ms/step , 6883.01 GFLOP/s , 17834.3 tokens/s INFO:__main__:2024-11-05 04:41:38 | Epoch: 0 | Step: 123050 | Dataset: 0-769664 | Loss: 0.793 | 913 ms/step , 6886.06 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 04:41:47 | Epoch: 0 | Step: 123060 | Dataset: 0-769984 | Loss: 0.857 | 913 ms/step , 6891.62 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 04:41:56 | Epoch: 0 | Step: 123070 | Dataset: 0-770304 | Loss: 0.731 | 913 ms/step , 6891.35 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 04:42:06 | Epoch: 0 | Step: 123080 | Dataset: 0-770624 | Loss: 0.911 | 913 ms/step , 6890.75 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 04:42:15 | Epoch: 0 | Step: 123090 | Dataset: 0-770944 | Loss: 0.934 | 913 ms/step , 6889.46 GFLOP/s , 17941.5 
tokens/s INFO:__main__:2024-11-05 04:42:24 | Epoch: 0 | Step: 123100 | Dataset: 0-771264 | Loss: 0.891 | 912 ms/step , 6892.93 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 04:42:25 | Validation | Step: 123100 | Val_loss: 0.865 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:42:35 | Epoch: 0 | Step: 123110 | Dataset: 0-771584 | Loss: 0.935 | 913 ms/step , 6888.19 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-05 04:42:44 | Epoch: 0 | Step: 123120 | Dataset: 0-771904 | Loss: 0.892 | 912 ms/step , 6897.07 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 04:42:53 | Epoch: 0 | Step: 123130 | Dataset: 0-772224 | Loss: 0.945 | 913 ms/step , 6889.50 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 04:43:02 | Epoch: 0 | Step: 123140 | Dataset: 0-772544 | Loss: 0.963 | 913 ms/step , 6891.91 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 04:43:11 | Epoch: 0 | Step: 123150 | Dataset: 0-772864 | Loss: 0.840 | 913 ms/step , 6892.48 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 04:43:20 | Epoch: 0 | Step: 123160 | Dataset: 0-773184 | Loss: 0.880 | 914 ms/step , 6882.43 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 04:43:29 | Epoch: 0 | Step: 123170 | Dataset: 0-773504 | Loss: 0.737 | 912 ms/step , 6893.78 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 04:43:38 | Epoch: 0 | Step: 123180 | Dataset: 0-773824 | Loss: 0.860 | 913 ms/step , 6890.79 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 04:43:48 | Epoch: 0 | Step: 123190 | Dataset: 0-774144 | Loss: 0.784 | 912 ms/step , 6900.02 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 04:43:57 | Epoch: 0 | Step: 123200 | Dataset: 0-774464 | Loss: 0.812 | 911 ms/step , 6900.62 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 04:43:58 | Validation | Step: 123200 | Val_loss: 0.917 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:44:07 | Epoch: 0 | Step: 123210 | Dataset: 0-774784 | Loss: 0.910 | 913 ms/step , 6891.49 GFLOP/s , 15283.4 tokens/s INFO:__main__:2024-11-05 04:44:17 | Epoch: 
0 | Step: 123220 | Dataset: 0-775104 | Loss: 0.811 | 914 ms/step , 6884.00 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 04:44:26 | Epoch: 0 | Step: 123230 | Dataset: 0-775424 | Loss: 0.932 | 913 ms/step , 6887.43 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 04:44:35 | Epoch: 0 | Step: 123240 | Dataset: 0-775744 | Loss: 0.875 | 914 ms/step , 6884.85 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 04:44:44 | Epoch: 0 | Step: 123250 | Dataset: 0-776064 | Loss: 0.755 | 914 ms/step , 6884.43 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 04:44:53 | Epoch: 0 | Step: 123260 | Dataset: 0-776384 | Loss: 0.822 | 912 ms/step , 6896.62 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 04:45:02 | Epoch: 0 | Step: 123270 | Dataset: 0-776704 | Loss: 0.903 | 913 ms/step , 6891.40 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 04:45:11 | Epoch: 0 | Step: 123280 | Dataset: 0-777024 | Loss: 0.704 | 912 ms/step , 6896.22 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 04:45:21 | Epoch: 0 | Step: 123290 | Dataset: 0-777344 | Loss: 0.928 | 914 ms/step , 6883.10 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 04:45:30 | Epoch: 0 | Step: 123300 | Dataset: 0-777664 | Loss: 0.881 | 912 ms/step , 6892.77 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 04:45:31 | Validation | Step: 123300 | Val_loss: 0.911 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:45:40 | Epoch: 0 | Step: 123310 | Dataset: 0-777984 | Loss: 0.850 | 912 ms/step , 6894.13 GFLOP/s , 15283.0 tokens/s INFO:__main__:2024-11-05 04:45:50 | Epoch: 0 | Step: 123320 | Dataset: 0-778304 | Loss: 0.938 | 913 ms/step , 6889.49 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 04:45:59 | Epoch: 0 | Step: 123330 | Dataset: 0-778624 | Loss: 0.890 | 914 ms/step , 6881.78 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 04:46:08 | Epoch: 0 | Step: 123340 | Dataset: 0-778944 | Loss: 0.895 | 912 ms/step , 6893.06 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 04:46:17 | Epoch: 0 | Step: 
123350 | Dataset: 0-779264 | Loss: 0.958 | 914 ms/step , 6883.00 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 04:46:26 | Epoch: 0 | Step: 123360 | Dataset: 0-779584 | Loss: 0.852 | 913 ms/step , 6892.58 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 04:46:35 | Epoch: 0 | Step: 123370 | Dataset: 0-779904 | Loss: 0.819 | 913 ms/step , 6886.64 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 04:46:44 | Epoch: 0 | Step: 123380 | Dataset: 0-780224 | Loss: 0.795 | 912 ms/step , 6895.07 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 04:46:53 | Epoch: 0 | Step: 123390 | Dataset: 0-780544 | Loss: 0.769 | 914 ms/step , 6883.35 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 04:47:03 | Epoch: 0 | Step: 123400 | Dataset: 0-780864 | Loss: 0.826 | 913 ms/step , 6888.30 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 04:47:04 | Validation | Step: 123400 | Val_loss: 0.885 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:47:13 | Epoch: 0 | Step: 123410 | Dataset: 0-781184 | Loss: 0.790 | 913 ms/step , 6887.27 GFLOP/s , 15284.0 tokens/s INFO:__main__:2024-11-05 04:47:22 | Epoch: 0 | Step: 123420 | Dataset: 0-781504 | Loss: 0.831 | 914 ms/step , 6878.72 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 04:47:32 | Epoch: 0 | Step: 123430 | Dataset: 0-781824 | Loss: 0.832 | 912 ms/step , 6897.42 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 04:47:41 | Epoch: 0 | Step: 123440 | Dataset: 0-782144 | Loss: 0.909 | 913 ms/step , 6886.66 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 04:47:50 | Epoch: 0 | Step: 123450 | Dataset: 0-782464 | Loss: 0.883 | 913 ms/step , 6892.36 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 04:47:59 | Epoch: 0 | Step: 123460 | Dataset: 0-782784 | Loss: 0.853 | 913 ms/step , 6889.13 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 04:48:08 | Epoch: 0 | Step: 123470 | Dataset: 0-783104 | Loss: 0.878 | 913 ms/step , 6890.13 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 04:48:17 | Epoch: 0 | Step: 123480 | 
Dataset: 0-783424 | Loss: 0.876 | 913 ms/step , 6890.96 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 04:48:26 | Epoch: 0 | Step: 123490 | Dataset: 0-783744 | Loss: 0.809 | 913 ms/step , 6889.49 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 04:48:35 | Epoch: 0 | Step: 123500 | Dataset: 0-784064 | Loss: 0.746 | 913 ms/step , 6885.97 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 04:48:37 | Validation | Step: 123500 | Val_loss: 0.902 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:48:46 | Epoch: 0 | Step: 123510 | Dataset: 0-784384 | Loss: 0.872 | 914 ms/step , 6881.97 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-05 04:48:55 | Epoch: 0 | Step: 123520 | Dataset: 0-784704 | Loss: 0.693 | 914 ms/step , 6881.09 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 04:49:04 | Epoch: 0 | Step: 123530 | Dataset: 0-785024 | Loss: 0.761 | 912 ms/step , 6899.22 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 04:49:14 | Epoch: 0 | Step: 123540 | Dataset: 0-785344 | Loss: 0.852 | 912 ms/step , 6895.45 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 04:49:23 | Epoch: 0 | Step: 123550 | Dataset: 0-785664 | Loss: 0.864 | 914 ms/step , 6880.77 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 04:49:32 | Epoch: 0 | Step: 123560 | Dataset: 0-785984 | Loss: 0.921 | 913 ms/step , 6888.49 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 04:49:41 | Epoch: 0 | Step: 123570 | Dataset: 0-786304 | Loss: 0.718 | 912 ms/step , 6895.52 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 04:49:50 | Epoch: 0 | Step: 123580 | Dataset: 0-786624 | Loss: 0.794 | 914 ms/step , 6884.29 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 04:49:59 | Epoch: 0 | Step: 123590 | Dataset: 0-786944 | Loss: 0.799 | 913 ms/step , 6887.93 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 04:50:08 | Epoch: 0 | Step: 123600 | Dataset: 0-787264 | Loss: 0.883 | 913 ms/step , 6888.44 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 04:50:10 | Validation | Step: 123600 | 
Val_loss: 0.806 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:50:19 | Epoch: 0 | Step: 123610 | Dataset: 0-787584 | Loss: 0.961 | 914 ms/step , 6880.29 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-05 04:50:28 | Epoch: 0 | Step: 123620 | Dataset: 0-787904 | Loss: 0.723 | 913 ms/step , 6890.19 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 04:50:37 | Epoch: 0 | Step: 123630 | Dataset: 0-788224 | Loss: 0.819 | 915 ms/step , 6870.69 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 04:50:47 | Epoch: 0 | Step: 123640 | Dataset: 0-788544 | Loss: 0.694 | 913 ms/step , 6885.96 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 04:50:56 | Epoch: 0 | Step: 123650 | Dataset: 0-788864 | Loss: 0.771 | 912 ms/step , 6896.25 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 04:51:05 | Epoch: 0 | Step: 123660 | Dataset: 0-789184 | Loss: 0.775 | 913 ms/step , 6887.03 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 04:51:14 | Epoch: 0 | Step: 123670 | Dataset: 0-789504 | Loss: 0.901 | 913 ms/step , 6886.92 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 04:51:23 | Epoch: 0 | Step: 123680 | Dataset: 0-789824 | Loss: 0.844 | 912 ms/step , 6895.55 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 04:51:32 | Epoch: 0 | Step: 123690 | Dataset: 0-790144 | Loss: 0.949 | 913 ms/step , 6888.31 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 04:51:41 | Epoch: 0 | Step: 123700 | Dataset: 0-790464 | Loss: 0.799 | 912 ms/step , 6894.06 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 04:51:43 | Validation | Step: 123700 | Val_loss: 0.895 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:51:52 | Epoch: 0 | Step: 123710 | Dataset: 0-790784 | Loss: 0.907 | 913 ms/step , 6885.51 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 04:52:01 | Epoch: 0 | Step: 123720 | Dataset: 0-791104 | Loss: 0.909 | 914 ms/step , 6883.39 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 04:52:10 | Epoch: 0 | Step: 123730 | Dataset: 0-791424 | Loss: 0.858 | 914 ms/step , 
6884.00 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 04:52:19 | Epoch: 0 | Step: 123740 | Dataset: 0-791744 | Loss: 0.898 | 913 ms/step , 6890.02 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 04:52:29 | Epoch: 0 | Step: 123750 | Dataset: 0-792064 | Loss: 0.890 | 915 ms/step , 6877.11 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 04:52:38 | Epoch: 0 | Step: 123760 | Dataset: 0-792384 | Loss: 0.807 | 913 ms/step , 6890.93 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 04:52:47 | Epoch: 0 | Step: 123770 | Dataset: 0-792704 | Loss: 0.749 | 912 ms/step , 6895.90 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 04:52:56 | Epoch: 0 | Step: 123780 | Dataset: 0-793024 | Loss: 0.793 | 912 ms/step , 6897.99 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 04:53:05 | Epoch: 0 | Step: 123790 | Dataset: 0-793344 | Loss: 0.759 | 913 ms/step , 6889.29 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-05 04:53:14 | Epoch: 0 | Step: 123800 | Dataset: 0-793664 | Loss: 0.855 | 912 ms/step , 6892.94 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 04:53:16 | Validation | Step: 123800 | Val_loss: 0.913 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:53:25 | Epoch: 0 | Step: 123810 | Dataset: 0-793984 | Loss: 0.683 | 912 ms/step , 6893.06 GFLOP/s , 15284.4 tokens/s INFO:__main__:2024-11-05 04:53:34 | Epoch: 0 | Step: 123820 | Dataset: 0-794304 | Loss: 0.822 | 914 ms/step , 6883.56 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 04:53:43 | Epoch: 0 | Step: 123830 | Dataset: 0-794624 | Loss: 0.682 | 912 ms/step , 6893.52 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 04:53:52 | Epoch: 0 | Step: 123840 | Dataset: 0-794944 | Loss: 0.908 | 915 ms/step , 6875.68 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 04:54:02 | Epoch: 0 | Step: 123850 | Dataset: 0-795264 | Loss: 0.756 | 913 ms/step , 6888.84 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 04:54:11 | Epoch: 0 | Step: 123860 | Dataset: 0-795584 | Loss: 0.767 | 914 ms/step , 6882.59 
GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 04:54:20 | Epoch: 0 | Step: 123870 | Dataset: 0-795904 | Loss: 0.816 | 915 ms/step , 6871.50 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 04:54:29 | Epoch: 0 | Step: 123880 | Dataset: 0-796224 | Loss: 0.913 | 913 ms/step , 6885.38 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 04:54:38 | Epoch: 0 | Step: 123890 | Dataset: 0-796544 | Loss: 0.826 | 913 ms/step , 6890.13 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 04:54:47 | Epoch: 0 | Step: 123900 | Dataset: 0-796864 | Loss: 0.889 | 912 ms/step , 6894.64 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 04:54:49 | Validation | Step: 123900 | Val_loss: 0.924 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:54:58 | Epoch: 0 | Step: 123910 | Dataset: 0-797184 | Loss: 0.860 | 912 ms/step , 6894.76 GFLOP/s , 15274.2 tokens/s INFO:__main__:2024-11-05 04:55:07 | Epoch: 0 | Step: 123920 | Dataset: 0-797504 | Loss: 0.804 | 913 ms/step , 6885.92 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 04:55:16 | Epoch: 0 | Step: 123930 | Dataset: 0-797824 | Loss: 0.791 | 912 ms/step , 6898.89 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 04:55:25 | Epoch: 0 | Step: 123940 | Dataset: 0-798144 | Loss: 0.790 | 912 ms/step , 6892.65 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 04:55:34 | Epoch: 0 | Step: 123950 | Dataset: 0-798464 | Loss: 0.815 | 913 ms/step , 6887.03 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 04:55:44 | Epoch: 0 | Step: 123960 | Dataset: 0-798784 | Loss: 0.785 | 913 ms/step , 6892.14 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 04:55:53 | Epoch: 0 | Step: 123970 | Dataset: 0-799104 | Loss: 0.701 | 914 ms/step , 6884.55 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 04:56:02 | Epoch: 0 | Step: 123980 | Dataset: 0-799424 | Loss: 0.858 | 914 ms/step , 6883.69 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 04:56:11 | Epoch: 0 | Step: 123990 | Dataset: 0-799744 | Loss: 0.746 | 913 ms/step , 6892.45 GFLOP/s , 
17932.5 tokens/s INFO:__main__:2024-11-05 04:56:20 | Epoch: 0 | Step: 124000 | Dataset: 0-800064 | Loss: 0.795 | 912 ms/step , 6894.58 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 04:56:22 | Validation | Step: 124000 | Val_loss: 0.908 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:56:22 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_045622_step_124000.pt` INFO:__main__:2024-11-05 04:56:32 | Epoch: 0 | Step: 124010 | Dataset: 0-800384 | Loss: 0.870 | 913 ms/step , 6891.24 GFLOP/s , 13812.0 tokens/s INFO:__main__:2024-11-05 04:56:41 | Epoch: 0 | Step: 124020 | Dataset: 0-800704 | Loss: 0.725 | 914 ms/step , 6878.15 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 04:56:50 | Epoch: 0 | Step: 124030 | Dataset: 0-801024 | Loss: 0.719 | 912 ms/step , 6895.12 GFLOP/s , 17945.3 tokens/s INFO:__main__:2024-11-05 04:56:59 | Epoch: 0 | Step: 124040 | Dataset: 0-801344 | Loss: 0.891 | 914 ms/step , 6880.26 GFLOP/s , 17891.3 tokens/s INFO:__main__:2024-11-05 04:57:09 | Epoch: 0 | Step: 124050 | Dataset: 0-801664 | Loss: 0.729 | 915 ms/step , 6876.69 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 04:57:18 | Epoch: 0 | Step: 124060 | Dataset: 0-801984 | Loss: 0.844 | 914 ms/step , 6884.55 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 04:57:27 | Epoch: 0 | Step: 124070 | Dataset: 0-802304 | Loss: 0.873 | 913 ms/step , 6886.32 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 04:57:36 | Epoch: 0 | Step: 124080 | Dataset: 0-802624 | Loss: 0.808 | 913 ms/step , 6886.99 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 04:57:45 | Epoch: 0 | Step: 124090 | Dataset: 0-802944 | Loss: 0.778 | 912 ms/step , 6897.12 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 04:57:54 | Epoch: 0 | Step: 124100 | Dataset: 0-803264 | Loss: 0.680 | 912 ms/step , 6892.84 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 04:57:56 | Validation | Step: 124100 | Val_loss: 0.911 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:58:05 | 
Epoch: 0 | Step: 124110 | Dataset: 0-803584 | Loss: 0.734 | 912 ms/step , 6896.60 GFLOP/s , 15287.2 tokens/s INFO:__main__:2024-11-05 04:58:14 | Epoch: 0 | Step: 124120 | Dataset: 0-803904 | Loss: 0.801 | 911 ms/step , 6903.66 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 04:58:23 | Epoch: 0 | Step: 124130 | Dataset: 0-804224 | Loss: 0.821 | 913 ms/step , 6888.60 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 04:58:32 | Epoch: 0 | Step: 124140 | Dataset: 0-804544 | Loss: 0.684 | 911 ms/step , 6902.24 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 04:58:41 | Epoch: 0 | Step: 124150 | Dataset: 0-804864 | Loss: 0.704 | 913 ms/step , 6886.06 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 04:58:51 | Epoch: 0 | Step: 124160 | Dataset: 0-805184 | Loss: 0.790 | 913 ms/step , 6887.06 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 04:59:00 | Epoch: 0 | Step: 124170 | Dataset: 0-805504 | Loss: 0.927 | 914 ms/step , 6884.29 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 04:59:09 | Epoch: 0 | Step: 124180 | Dataset: 0-805824 | Loss: 0.626 | 911 ms/step , 6904.10 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 04:59:18 | Epoch: 0 | Step: 124190 | Dataset: 0-806144 | Loss: 0.783 | 913 ms/step , 6886.84 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 04:59:27 | Epoch: 0 | Step: 124200 | Dataset: 0-806464 | Loss: 0.818 | 912 ms/step , 6896.94 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 04:59:29 | Validation | Step: 124200 | Val_loss: 0.948 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 04:59:38 | Epoch: 0 | Step: 124210 | Dataset: 0-806784 | Loss: 0.739 | 912 ms/step , 6894.83 GFLOP/s , 15279.6 tokens/s INFO:__main__:2024-11-05 04:59:47 | Epoch: 0 | Step: 124220 | Dataset: 0-807104 | Loss: 0.933 | 914 ms/step , 6879.28 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 04:59:56 | Epoch: 0 | Step: 124230 | Dataset: 0-807424 | Loss: 0.815 | 912 ms/step , 6896.79 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 05:00:05 | Epoch: 0 | 
Step: 124240 | Dataset: 0-807744 | Loss: 0.939 | 914 ms/step , 6884.59 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 05:00:14 | Epoch: 0 | Step: 124250 | Dataset: 0-808064 | Loss: 0.917 | 912 ms/step , 6895.20 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 05:00:24 | Epoch: 0 | Step: 124260 | Dataset: 0-808384 | Loss: 0.817 | 913 ms/step , 6888.91 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 05:00:33 | Epoch: 0 | Step: 124270 | Dataset: 0-808704 | Loss: 0.692 | 913 ms/step , 6889.30 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 05:00:42 | Epoch: 0 | Step: 124280 | Dataset: 0-809024 | Loss: 0.744 | 913 ms/step , 6886.54 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 05:00:51 | Epoch: 0 | Step: 124290 | Dataset: 0-809344 | Loss: 0.842 | 913 ms/step , 6886.91 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 05:01:00 | Epoch: 0 | Step: 124300 | Dataset: 0-809664 | Loss: 0.761 | 913 ms/step , 6887.93 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 05:01:02 | Validation | Step: 124300 | Val_loss: 0.946 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:01:11 | Epoch: 0 | Step: 124310 | Dataset: 0-809984 | Loss: 0.788 | 912 ms/step , 6892.87 GFLOP/s , 15283.0 tokens/s INFO:__main__:2024-11-05 05:01:20 | Epoch: 0 | Step: 124320 | Dataset: 0-810304 | Loss: 0.772 | 912 ms/step , 6896.73 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 05:01:29 | Epoch: 0 | Step: 124330 | Dataset: 0-810624 | Loss: 0.751 | 914 ms/step , 6879.17 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 05:01:38 | Epoch: 0 | Step: 124340 | Dataset: 0-810944 | Loss: 0.732 | 915 ms/step , 6876.15 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 05:01:47 | Epoch: 0 | Step: 124350 | Dataset: 0-811264 | Loss: 0.824 | 913 ms/step , 6888.99 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 05:01:56 | Epoch: 0 | Step: 124360 | Dataset: 0-811584 | Loss: 0.835 | 912 ms/step , 6894.95 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 05:02:06 | Epoch: 0 | Step: 
124370 | Dataset: 0-811904 | Loss: 0.868 | 912 ms/step , 6896.65 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 05:02:15 | Epoch: 0 | Step: 124380 | Dataset: 0-812224 | Loss: 0.840 | 913 ms/step , 6887.65 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 05:02:24 | Epoch: 0 | Step: 124390 | Dataset: 0-812544 | Loss: 0.734 | 913 ms/step , 6891.27 GFLOP/s , 17945.3 tokens/s INFO:__main__:2024-11-05 05:02:33 | Epoch: 0 | Step: 124400 | Dataset: 0-812864 | Loss: 0.853 | 913 ms/step , 6890.82 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 05:02:35 | Validation | Step: 124400 | Val_loss: 0.812 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:02:44 | Epoch: 0 | Step: 124410 | Dataset: 0-813184 | Loss: 0.761 | 915 ms/step , 6875.29 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 05:02:53 | Epoch: 0 | Step: 124420 | Dataset: 0-813504 | Loss: 0.813 | 913 ms/step , 6889.10 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 05:03:02 | Epoch: 0 | Step: 124430 | Dataset: 0-813824 | Loss: 0.844 | 912 ms/step , 6897.07 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 05:03:11 | Epoch: 0 | Step: 124440 | Dataset: 0-814144 | Loss: 0.722 | 913 ms/step , 6887.47 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 05:03:20 | Epoch: 0 | Step: 124450 | Dataset: 0-814464 | Loss: 0.831 | 914 ms/step , 6881.76 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 05:03:29 | Epoch: 0 | Step: 124460 | Dataset: 0-814784 | Loss: 0.826 | 914 ms/step , 6879.71 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 05:03:39 | Epoch: 0 | Step: 124470 | Dataset: 0-815104 | Loss: 0.741 | 913 ms/step , 6889.86 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 05:03:48 | Epoch: 0 | Step: 124480 | Dataset: 0-815424 | Loss: 0.772 | 911 ms/step , 6900.33 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 05:03:57 | Epoch: 0 | Step: 124490 | Dataset: 0-815744 | Loss: 0.763 | 913 ms/step , 6890.83 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 05:04:06 | Epoch: 0 | Step: 124500 | 
Dataset: 0-816064 | Loss: 0.829 | 912 ms/step , 6895.26 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 05:04:08 | Validation | Step: 124500 | Val_loss: 0.800 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:04:17 | Epoch: 0 | Step: 124510 | Dataset: 0-816384 | Loss: 0.902 | 912 ms/step , 6896.97 GFLOP/s , 15291.9 tokens/s INFO:__main__:2024-11-05 05:04:26 | Epoch: 0 | Step: 124520 | Dataset: 0-816704 | Loss: 0.759 | 912 ms/step , 6895.77 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 05:04:35 | Epoch: 0 | Step: 124530 | Dataset: 0-817024 | Loss: 0.700 | 912 ms/step , 6892.92 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 05:04:44 | Epoch: 0 | Step: 124540 | Dataset: 0-817344 | Loss: 0.922 | 914 ms/step , 6880.09 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 05:04:53 | Epoch: 0 | Step: 124550 | Dataset: 0-817664 | Loss: 0.843 | 912 ms/step , 6893.11 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 05:05:02 | Epoch: 0 | Step: 124560 | Dataset: 0-817984 | Loss: 0.778 | 914 ms/step , 6882.45 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 05:05:11 | Epoch: 0 | Step: 124570 | Dataset: 0-818304 | Loss: 0.688 | 912 ms/step , 6893.58 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 05:05:21 | Epoch: 0 | Step: 124580 | Dataset: 0-818624 | Loss: 0.779 | 913 ms/step , 6892.41 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 05:05:30 | Epoch: 0 | Step: 124590 | Dataset: 0-818944 | Loss: 0.664 | 912 ms/step , 6896.07 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 05:05:39 | Epoch: 0 | Step: 124600 | Dataset: 0-819264 | Loss: 0.828 | 912 ms/step , 6894.15 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 05:05:40 | Validation | Step: 124600 | Val_loss: 0.800 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:05:50 | Epoch: 0 | Step: 124610 | Dataset: 0-819584 | Loss: 0.788 | 913 ms/step , 6890.12 GFLOP/s , 15282.6 tokens/s INFO:__main__:2024-11-05 05:05:59 | Epoch: 0 | Step: 124620 | Dataset: 0-819904 | Loss: 0.754 | 912 ms/step , 
6896.33 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 05:06:08 | Epoch: 0 | Step: 124630 | Dataset: 0-820224 | Loss: 0.552 | 912 ms/step , 6892.81 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 05:06:17 | Epoch: 0 | Step: 124640 | Dataset: 0-820544 | Loss: 0.761 | 914 ms/step , 6878.32 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 05:06:26 | Epoch: 0 | Step: 124650 | Dataset: 0-820864 | Loss: 0.884 | 913 ms/step , 6888.70 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 05:06:35 | Epoch: 0 | Step: 124660 | Dataset: 0-821184 | Loss: 0.828 | 913 ms/step , 6889.72 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 05:06:44 | Epoch: 0 | Step: 124670 | Dataset: 0-821504 | Loss: 0.811 | 913 ms/step , 6887.76 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 05:06:54 | Epoch: 0 | Step: 124680 | Dataset: 0-821824 | Loss: 0.846 | 914 ms/step , 6878.28 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 05:07:03 | Epoch: 0 | Step: 124690 | Dataset: 0-822144 | Loss: 0.769 | 913 ms/step , 6887.82 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 05:07:12 | Epoch: 0 | Step: 124700 | Dataset: 0-822464 | Loss: 0.864 | 913 ms/step , 6887.14 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 05:07:13 | Validation | Step: 124700 | Val_loss: 0.805 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:07:22 | Epoch: 0 | Step: 124710 | Dataset: 0-822784 | Loss: 0.950 | 914 ms/step , 6880.69 GFLOP/s , 15276.9 tokens/s INFO:__main__:2024-11-05 05:07:32 | Epoch: 0 | Step: 124720 | Dataset: 0-823104 | Loss: 0.842 | 913 ms/step , 6889.64 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 05:07:41 | Epoch: 0 | Step: 124730 | Dataset: 0-823424 | Loss: 0.794 | 913 ms/step , 6886.03 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 05:07:50 | Epoch: 0 | Step: 124740 | Dataset: 0-823744 | Loss: 0.889 | 913 ms/step , 6888.33 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 05:07:59 | Epoch: 0 | Step: 124750 | Dataset: 0-824064 | Loss: 0.758 | 912 ms/step , 6896.92 
GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 05:08:08 | Epoch: 0 | Step: 124760 | Dataset: 0-824384 | Loss: 0.776 | 913 ms/step , 6886.12 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 05:08:17 | Epoch: 0 | Step: 124770 | Dataset: 0-824704 | Loss: 0.699 | 913 ms/step , 6891.28 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 05:08:26 | Epoch: 0 | Step: 124780 | Dataset: 0-825024 | Loss: 0.751 | 914 ms/step , 6882.51 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 05:08:36 | Epoch: 0 | Step: 124790 | Dataset: 0-825344 | Loss: 0.854 | 913 ms/step , 6890.08 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 05:08:45 | Epoch: 0 | Step: 124800 | Dataset: 0-825664 | Loss: 0.796 | 912 ms/step , 6892.83 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 05:08:46 | Validation | Step: 124800 | Val_loss: 0.768 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:08:55 | Epoch: 0 | Step: 124810 | Dataset: 0-825984 | Loss: 0.827 | 914 ms/step , 6882.77 GFLOP/s , 15278.0 tokens/s INFO:__main__:2024-11-05 05:09:05 | Epoch: 0 | Step: 124820 | Dataset: 0-826304 | Loss: 0.822 | 914 ms/step , 6884.18 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 05:09:14 | Epoch: 0 | Step: 124830 | Dataset: 0-826624 | Loss: 0.875 | 913 ms/step , 6890.08 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 05:09:23 | Epoch: 0 | Step: 124840 | Dataset: 0-826944 | Loss: 0.693 | 911 ms/step , 6902.76 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 05:09:32 | Epoch: 0 | Step: 124850 | Dataset: 0-827264 | Loss: 0.810 | 912 ms/step , 6893.82 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 05:09:41 | Epoch: 0 | Step: 124860 | Dataset: 0-827584 | Loss: 0.793 | 912 ms/step , 6894.14 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 05:09:50 | Epoch: 0 | Step: 124870 | Dataset: 0-827904 | Loss: 0.738 | 913 ms/step , 6892.20 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 05:09:59 | Epoch: 0 | Step: 124880 | Dataset: 0-828224 | Loss: 0.830 | 913 ms/step , 6892.29 GFLOP/s , 
17934.5 tokens/s INFO:__main__:2024-11-05 05:10:08 | Epoch: 0 | Step: 124890 | Dataset: 0-828544 | Loss: 0.753 | 913 ms/step , 6885.24 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 05:10:18 | Epoch: 0 | Step: 124900 | Dataset: 0-828864 | Loss: 0.745 | 914 ms/step , 6882.83 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 05:10:19 | Validation | Step: 124900 | Val_loss: 0.842 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:10:28 | Epoch: 0 | Step: 124910 | Dataset: 0-829184 | Loss: 0.742 | 913 ms/step , 6885.99 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-05 05:10:37 | Epoch: 0 | Step: 124920 | Dataset: 0-829504 | Loss: 0.931 | 914 ms/step , 6884.82 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 05:10:47 | Epoch: 0 | Step: 124930 | Dataset: 0-829824 | Loss: 0.776 | 913 ms/step , 6886.64 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 05:10:56 | Epoch: 0 | Step: 124940 | Dataset: 0-830144 | Loss: 0.873 | 913 ms/step , 6891.66 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 05:11:05 | Epoch: 0 | Step: 124950 | Dataset: 0-830464 | Loss: 0.779 | 912 ms/step , 6896.20 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 05:11:14 | Epoch: 0 | Step: 124960 | Dataset: 0-830784 | Loss: 0.862 | 912 ms/step , 6895.69 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 05:11:23 | Epoch: 0 | Step: 124970 | Dataset: 0-831104 | Loss: 0.753 | 913 ms/step , 6889.03 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 05:11:32 | Epoch: 0 | Step: 124980 | Dataset: 0-831424 | Loss: 0.767 | 913 ms/step , 6891.90 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 05:11:41 | Epoch: 0 | Step: 124990 | Dataset: 0-831744 | Loss: 0.675 | 912 ms/step , 6893.83 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 05:11:51 | Epoch: 0 | Step: 125000 | Dataset: 0-832064 | Loss: 0.838 | 914 ms/step , 6884.15 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 05:11:52 | Validation | Step: 125000 | Val_loss: 0.766 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:11:52 
| Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_051152_step_125000.pt` INFO:__main__:2024-11-05 05:12:02 | Epoch: 0 | Step: 125010 | Dataset: 0-832384 | Loss: 0.716 | 912 ms/step , 6894.00 GFLOP/s , 13821.2 tokens/s INFO:__main__:2024-11-05 05:12:12 | Epoch: 0 | Step: 125020 | Dataset: 0-832704 | Loss: 0.801 | 914 ms/step , 6881.04 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 05:12:21 | Epoch: 0 | Step: 125030 | Dataset: 0-833024 | Loss: 0.858 | 914 ms/step , 6878.97 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 05:12:30 | Epoch: 0 | Step: 125040 | Dataset: 0-833344 | Loss: 0.778 | 914 ms/step , 6881.23 GFLOP/s , 17902.6 tokens/s INFO:__main__:2024-11-05 05:12:39 | Epoch: 0 | Step: 125050 | Dataset: 0-833664 | Loss: 0.813 | 913 ms/step , 6885.96 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 05:12:48 | Epoch: 0 | Step: 125060 | Dataset: 0-833984 | Loss: 0.715 | 912 ms/step , 6893.19 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 05:12:57 | Epoch: 0 | Step: 125070 | Dataset: 0-834304 | Loss: 0.826 | 912 ms/step , 6892.96 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 05:13:06 | Epoch: 0 | Step: 125080 | Dataset: 0-834624 | Loss: 0.698 | 913 ms/step , 6891.42 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 05:13:15 | Epoch: 0 | Step: 125090 | Dataset: 0-834944 | Loss: 0.756 | 912 ms/step , 6895.55 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 05:13:25 | Epoch: 0 | Step: 125100 | Dataset: 0-835264 | Loss: 0.812 | 913 ms/step , 6887.51 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 05:13:26 | Validation | Step: 125100 | Val_loss: 0.789 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:13:35 | Epoch: 0 | Step: 125110 | Dataset: 0-835584 | Loss: 0.838 | 913 ms/step , 6887.85 GFLOP/s , 15278.4 tokens/s INFO:__main__:2024-11-05 05:13:44 | Epoch: 0 | Step: 125120 | Dataset: 0-835904 | Loss: 0.782 | 912 ms/step , 6894.74 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 05:13:54 | Epoch: 0 
| Step: 125130 | Dataset: 0-836224 | Loss: 0.770 | 913 ms/step , 6887.35 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 05:14:03 | Epoch: 0 | Step: 125140 | Dataset: 0-836544 | Loss: 0.683 | 913 ms/step , 6885.51 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 05:14:12 | Epoch: 0 | Step: 125150 | Dataset: 0-836864 | Loss: 0.739 | 912 ms/step , 6894.47 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 05:14:21 | Epoch: 0 | Step: 125160 | Dataset: 0-837184 | Loss: 0.759 | 913 ms/step , 6886.79 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 05:14:30 | Epoch: 0 | Step: 125170 | Dataset: 0-837504 | Loss: 0.782 | 912 ms/step , 6893.12 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 05:14:39 | Epoch: 0 | Step: 125180 | Dataset: 0-837824 | Loss: 0.822 | 912 ms/step , 6894.02 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 05:14:48 | Epoch: 0 | Step: 125190 | Dataset: 0-838144 | Loss: 0.882 | 914 ms/step , 6883.55 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 05:14:58 | Epoch: 0 | Step: 125200 | Dataset: 0-838464 | Loss: 0.672 | 912 ms/step , 6896.16 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 05:14:59 | Validation | Step: 125200 | Val_loss: 0.596 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:15:08 | Epoch: 0 | Step: 125210 | Dataset: 0-838784 | Loss: 0.827 | 912 ms/step , 6898.85 GFLOP/s , 15284.6 tokens/s INFO:__main__:2024-11-05 05:15:17 | Epoch: 0 | Step: 125220 | Dataset: 0-839104 | Loss: 0.811 | 913 ms/step , 6885.72 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 05:15:27 | Epoch: 0 | Step: 125230 | Dataset: 0-839424 | Loss: 0.778 | 913 ms/step , 6887.88 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 05:15:36 | Epoch: 0 | Step: 125240 | Dataset: 0-839744 | Loss: 0.763 | 913 ms/step , 6887.33 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 05:15:45 | Epoch: 0 | Step: 125250 | Dataset: 0-840064 | Loss: 0.710 | 912 ms/step , 6895.36 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 05:15:54 | Epoch: 0 | Step: 
125260 | Dataset: 0-840384 | Loss: 0.794 | 913 ms/step , 6890.08 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 05:16:03 | Epoch: 0 | Step: 125270 | Dataset: 0-840704 | Loss: 0.789 | 914 ms/step , 6883.39 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 05:16:12 | Epoch: 0 | Step: 125280 | Dataset: 0-841024 | Loss: 0.807 | 912 ms/step , 6894.10 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 05:16:21 | Epoch: 0 | Step: 125290 | Dataset: 0-841344 | Loss: 0.810 | 914 ms/step , 6884.99 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 05:16:30 | Epoch: 0 | Step: 125300 | Dataset: 0-841664 | Loss: 0.724 | 913 ms/step , 6891.72 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 05:16:32 | Validation | Step: 125300 | Val_loss: 0.385 | Best_val_loss: 0.4485 INFO:__main__:2024-11-05 05:16:32 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_051632_step_125300.pt` INFO:__main__:2024-11-05 05:16:42 | Epoch: 0 | Step: 125310 | Dataset: 0-841984 | Loss: 0.830 | 912 ms/step , 6893.77 GFLOP/s , 13739.9 tokens/s INFO:__main__:2024-11-05 05:16:52 | Epoch: 0 | Step: 125320 | Dataset: 0-842304 | Loss: 0.751 | 913 ms/step , 6890.80 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 05:17:01 | Epoch: 0 | Step: 125330 | Dataset: 0-842624 | Loss: 0.916 | 913 ms/step , 6891.15 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 05:17:10 | Epoch: 0 | Step: 125340 | Dataset: 0-842944 | Loss: 0.902 | 913 ms/step , 6886.80 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 05:17:19 | Epoch: 0 | Step: 125350 | Dataset: 0-843264 | Loss: 0.750 | 913 ms/step , 6886.67 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 05:17:28 | Epoch: 0 | Step: 125360 | Dataset: 0-843584 | Loss: 0.709 | 913 ms/step , 6891.21 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 05:17:37 | Epoch: 0 | Step: 125370 | Dataset: 0-843904 | Loss: 0.737 | 913 ms/step , 6891.26 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 05:17:46 | Epoch: 0 | Step: 125380 | 
Dataset: 0-844224 | Loss: 0.772 | 912 ms/step , 6899.09 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 05:17:55 | Epoch: 0 | Step: 125390 | Dataset: 0-844544 | Loss: 0.858 | 914 ms/step , 6882.01 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 05:18:05 | Epoch: 0 | Step: 125400 | Dataset: 0-844864 | Loss: 0.789 | 913 ms/step , 6888.53 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 05:18:06 | Validation | Step: 125400 | Val_loss: 0.395 | Best_val_loss: 0.3846 INFO:__main__:2024-11-05 05:18:15 | Epoch: 0 | Step: 125410 | Dataset: 0-845184 | Loss: 0.795 | 913 ms/step , 6892.31 GFLOP/s , 15276.1 tokens/s INFO:__main__:2024-11-05 05:18:24 | Epoch: 0 | Step: 125420 | Dataset: 0-845504 | Loss: 0.767 | 914 ms/step , 6883.40 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 05:18:34 | Epoch: 0 | Step: 125430 | Dataset: 0-845824 | Loss: 0.768 | 913 ms/step , 6886.29 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 05:18:43 | Epoch: 0 | Step: 125440 | Dataset: 0-846144 | Loss: 0.668 | 913 ms/step , 6885.66 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 05:18:52 | Epoch: 0 | Step: 125450 | Dataset: 0-846464 | Loss: 0.700 | 913 ms/step , 6892.01 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 05:19:01 | Epoch: 0 | Step: 125460 | Dataset: 0-846784 | Loss: 0.775 | 914 ms/step , 6881.95 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 05:19:10 | Epoch: 0 | Step: 125470 | Dataset: 0-847104 | Loss: 0.771 | 913 ms/step , 6888.20 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 05:19:19 | Epoch: 0 | Step: 125480 | Dataset: 0-847424 | Loss: 0.780 | 914 ms/step , 6884.33 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 05:19:28 | Epoch: 0 | Step: 125490 | Dataset: 0-847744 | Loss: 0.842 | 912 ms/step , 6894.10 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 05:19:38 | Epoch: 0 | Step: 125500 | Dataset: 0-848064 | Loss: 0.829 | 913 ms/step , 6890.88 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 05:19:39 | Validation | Step: 125500 | 
Val_loss: 0.378 | Best_val_loss: 0.3846 INFO:__main__:2024-11-05 05:19:39 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_051939_step_125500.pt` INFO:__main__:2024-11-05 05:19:49 | Epoch: 0 | Step: 125510 | Dataset: 0-848384 | Loss: 0.782 | 913 ms/step , 6888.85 GFLOP/s , 13801.4 tokens/s INFO:__main__:2024-11-05 05:19:59 | Epoch: 0 | Step: 125520 | Dataset: 0-848704 | Loss: 0.833 | 913 ms/step , 6891.05 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 05:20:08 | Epoch: 0 | Step: 125530 | Dataset: 0-849024 | Loss: 0.741 | 913 ms/step , 6889.25 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 05:20:17 | Epoch: 0 | Step: 125540 | Dataset: 0-849344 | Loss: 0.757 | 913 ms/step , 6890.71 GFLOP/s , 17850.2 tokens/s INFO:__main__:2024-11-05 05:20:26 | Epoch: 0 | Step: 125550 | Dataset: 0-849664 | Loss: 0.809 | 913 ms/step , 6888.32 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 05:20:35 | Epoch: 0 | Step: 125560 | Dataset: 0-849984 | Loss: 0.728 | 911 ms/step , 6900.84 GFLOP/s , 17947.2 tokens/s INFO:__main__:2024-11-05 05:20:44 | Epoch: 0 | Step: 125570 | Dataset: 0-850304 | Loss: 0.659 | 912 ms/step , 6895.17 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 05:20:53 | Epoch: 0 | Step: 125580 | Dataset: 0-850624 | Loss: 0.954 | 913 ms/step , 6889.40 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 05:21:03 | Epoch: 0 | Step: 125590 | Dataset: 0-850944 | Loss: 1.023 | 913 ms/step , 6888.67 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 05:21:12 | Epoch: 0 | Step: 125600 | Dataset: 0-851264 | Loss: 0.911 | 913 ms/step , 6888.83 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 05:21:13 | Validation | Step: 125600 | Val_loss: 0.374 | Best_val_loss: 0.3782 INFO:__main__:2024-11-05 05:21:13 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_052113_step_125600.pt` INFO:__main__:2024-11-05 05:21:23 | Epoch: 0 | Step: 125610 | Dataset: 0-851584 | Loss: 0.652 | 914 ms/step , 6882.17 
GFLOP/s , 13838.8 tokens/s INFO:__main__:2024-11-05 05:21:33 | Epoch: 0 | Step: 125620 | Dataset: 0-851904 | Loss: 0.948 | 913 ms/step , 6886.29 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 05:21:42 | Epoch: 0 | Step: 125630 | Dataset: 0-852224 | Loss: 0.828 | 913 ms/step , 6886.20 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 05:21:51 | Epoch: 0 | Step: 125640 | Dataset: 0-852544 | Loss: 0.893 | 913 ms/step , 6886.85 GFLOP/s , 17891.4 tokens/s INFO:__main__:2024-11-05 05:22:00 | Epoch: 0 | Step: 125650 | Dataset: 0-852864 | Loss: 0.972 | 914 ms/step , 6882.02 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 05:22:09 | Epoch: 0 | Step: 125660 | Dataset: 0-853184 | Loss: 0.902 | 914 ms/step , 6881.37 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 05:22:18 | Epoch: 0 | Step: 125670 | Dataset: 0-853504 | Loss: 0.831 | 912 ms/step , 6895.45 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 05:22:27 | Epoch: 0 | Step: 125680 | Dataset: 0-853824 | Loss: 0.862 | 913 ms/step , 6885.52 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 05:22:37 | Epoch: 0 | Step: 125690 | Dataset: 0-854144 | Loss: 0.712 | 911 ms/step , 6900.29 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 05:22:46 | Epoch: 0 | Step: 125700 | Dataset: 0-854464 | Loss: 0.761 | 912 ms/step , 6895.78 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 05:22:47 | Validation | Step: 125700 | Val_loss: 0.374 | Best_val_loss: 0.3743 INFO:__main__:2024-11-05 05:22:47 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_052247_step_125700.pt` INFO:__main__:2024-11-05 05:22:58 | Epoch: 0 | Step: 125710 | Dataset: 0-854784 | Loss: 0.864 | 912 ms/step , 6898.22 GFLOP/s , 13825.2 tokens/s INFO:__main__:2024-11-05 05:23:07 | Epoch: 0 | Step: 125720 | Dataset: 0-855104 | Loss: 0.911 | 913 ms/step , 6886.16 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 05:23:16 | Epoch: 0 | Step: 125730 | Dataset: 0-855424 | Loss: 1.015 | 913 ms/step , 6889.90 GFLOP/s , 
17937.1 tokens/s INFO:__main__:2024-11-05 05:23:25 | Epoch: 0 | Step: 125740 | Dataset: 0-855744 | Loss: 0.940 | 913 ms/step , 6890.06 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 05:23:34 | Epoch: 0 | Step: 125750 | Dataset: 0-856064 | Loss: 0.869 | 912 ms/step , 6894.07 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 05:23:43 | Epoch: 0 | Step: 125760 | Dataset: 0-856384 | Loss: 0.868 | 913 ms/step , 6892.12 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 05:23:52 | Epoch: 0 | Step: 125770 | Dataset: 0-856704 | Loss: 0.920 | 913 ms/step , 6891.26 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 05:24:02 | Epoch: 0 | Step: 125780 | Dataset: 0-857024 | Loss: 0.858 | 913 ms/step , 6886.61 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 05:24:11 | Epoch: 0 | Step: 125790 | Dataset: 0-857344 | Loss: 0.817 | 912 ms/step , 6899.51 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 05:24:20 | Epoch: 0 | Step: 125800 | Dataset: 0-857664 | Loss: 0.748 | 913 ms/step , 6891.92 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 05:24:21 | Validation | Step: 125800 | Val_loss: 0.381 | Best_val_loss: 0.3735 INFO:__main__:2024-11-05 05:24:31 | Epoch: 0 | Step: 125810 | Dataset: 0-857984 | Loss: 0.842 | 914 ms/step , 6879.59 GFLOP/s , 15270.8 tokens/s INFO:__main__:2024-11-05 05:24:40 | Epoch: 0 | Step: 125820 | Dataset: 0-858304 | Loss: 1.007 | 914 ms/step , 6881.38 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 05:24:49 | Epoch: 0 | Step: 125830 | Dataset: 0-858624 | Loss: 0.812 | 912 ms/step , 6897.53 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 05:24:58 | Epoch: 0 | Step: 125840 | Dataset: 0-858944 | Loss: 0.823 | 914 ms/step , 6883.12 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 05:25:07 | Epoch: 0 | Step: 125850 | Dataset: 0-859264 | Loss: 0.873 | 915 ms/step , 6873.62 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 05:25:16 | Epoch: 0 | Step: 125860 | Dataset: 0-859584 | Loss: 0.772 | 914 ms/step , 6884.78 GFLOP/s , 17922.1 
tokens/s INFO:__main__:2024-11-05 05:25:25 | Epoch: 0 | Step: 125870 | Dataset: 0-859904 | Loss: 0.884 | 913 ms/step , 6888.85 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 05:25:35 | Epoch: 0 | Step: 125880 | Dataset: 0-860224 | Loss: 0.907 | 912 ms/step , 6894.73 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 05:25:44 | Epoch: 0 | Step: 125890 | Dataset: 0-860544 | Loss: 0.748 | 912 ms/step , 6895.79 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 05:25:53 | Epoch: 0 | Step: 125900 | Dataset: 0-860864 | Loss: 0.950 | 912 ms/step , 6894.10 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 05:25:54 | Validation | Step: 125900 | Val_loss: 0.357 | Best_val_loss: 0.3735 INFO:__main__:2024-11-05 05:25:54 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_052554_step_125900.pt` INFO:__main__:2024-11-05 05:26:05 | Epoch: 0 | Step: 125910 | Dataset: 0-861184 | Loss: 0.642 | 912 ms/step , 6896.57 GFLOP/s , 13808.9 tokens/s INFO:__main__:2024-11-05 05:26:14 | Epoch: 0 | Step: 125920 | Dataset: 0-861504 | Loss: 0.928 | 913 ms/step , 6891.22 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 05:26:23 | Epoch: 0 | Step: 125930 | Dataset: 0-861824 | Loss: 0.948 | 914 ms/step , 6884.62 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 05:26:32 | Epoch: 0 | Step: 125940 | Dataset: 0-862144 | Loss: 0.853 | 914 ms/step , 6883.04 GFLOP/s , 17892.8 tokens/s INFO:__main__:2024-11-05 05:26:41 | Epoch: 0 | Step: 125950 | Dataset: 0-862464 | Loss: 0.766 | 913 ms/step , 6887.40 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 05:26:50 | Epoch: 0 | Step: 125960 | Dataset: 0-862784 | Loss: 0.912 | 913 ms/step , 6889.21 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 05:26:59 | Epoch: 0 | Step: 125970 | Dataset: 0-863104 | Loss: 0.946 | 915 ms/step , 6876.19 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 05:27:09 | Epoch: 0 | Step: 125980 | Dataset: 0-863424 | Loss: 0.862 | 913 ms/step , 6886.26 GFLOP/s , 17920.7 tokens/s 
INFO:__main__:2024-11-05 05:27:18 | Epoch: 0 | Step: 125990 | Dataset: 0-863744 | Loss: 0.969 | 914 ms/step , 6881.86 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 05:27:27 | Epoch: 0 | Step: 126000 | Dataset: 0-864064 | Loss: 0.818 | 912 ms/step , 6896.59 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 05:27:28 | Validation | Step: 126000 | Val_loss: 0.362 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:27:28 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_052728_step_126000.pt` INFO:__main__:2024-11-05 05:27:39 | Epoch: 0 | Step: 126010 | Dataset: 0-864384 | Loss: 0.819 | 931 ms/step , 6752.62 GFLOP/s , 13788.1 tokens/s INFO:__main__:2024-11-05 05:27:48 | Epoch: 0 | Step: 126020 | Dataset: 0-864704 | Loss: 0.984 | 915 ms/step , 6875.63 GFLOP/s , 17909.2 tokens/s INFO:__main__:2024-11-05 05:27:57 | Epoch: 0 | Step: 126030 | Dataset: 0-865024 | Loss: 0.910 | 914 ms/step , 6879.56 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 05:28:06 | Epoch: 0 | Step: 126040 | Dataset: 0-865344 | Loss: 0.915 | 914 ms/step , 6883.50 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 05:28:15 | Epoch: 0 | Step: 126050 | Dataset: 0-865664 | Loss: 0.873 | 914 ms/step , 6882.20 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 05:28:25 | Epoch: 0 | Step: 126060 | Dataset: 0-865984 | Loss: 0.898 | 913 ms/step , 6889.93 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 05:28:34 | Epoch: 0 | Step: 126070 | Dataset: 0-866304 | Loss: 0.981 | 913 ms/step , 6885.46 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 05:28:43 | Epoch: 0 | Step: 126080 | Dataset: 0-866624 | Loss: 0.874 | 914 ms/step , 6883.10 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 05:28:52 | Epoch: 0 | Step: 126090 | Dataset: 0-866944 | Loss: 0.797 | 913 ms/step , 6886.55 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 05:29:01 | Epoch: 0 | Step: 126100 | Dataset: 0-867264 | Loss: 0.563 | 913 ms/step , 6891.79 GFLOP/s , 17930.2 tokens/s 
INFO:__main__:2024-11-05 05:29:03 | Validation | Step: 126100 | Val_loss: 0.360 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:29:12 | Epoch: 0 | Step: 126110 | Dataset: 0-867584 | Loss: 0.779 | 913 ms/step , 6888.54 GFLOP/s , 15275.1 tokens/s INFO:__main__:2024-11-05 05:29:21 | Epoch: 0 | Step: 126120 | Dataset: 0-867904 | Loss: 0.925 | 914 ms/step , 6884.22 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 05:29:30 | Epoch: 0 | Step: 126130 | Dataset: 0-868224 | Loss: 0.796 | 914 ms/step , 6878.66 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 05:29:39 | Epoch: 0 | Step: 126140 | Dataset: 0-868544 | Loss: 0.984 | 915 ms/step , 6873.99 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 05:29:48 | Epoch: 0 | Step: 126150 | Dataset: 0-868864 | Loss: 0.812 | 912 ms/step , 6892.94 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 05:29:57 | Epoch: 0 | Step: 126160 | Dataset: 0-869184 | Loss: 0.860 | 914 ms/step , 6885.02 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 05:30:07 | Epoch: 0 | Step: 126170 | Dataset: 0-869504 | Loss: 0.969 | 915 ms/step , 6875.60 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 05:30:16 | Epoch: 0 | Step: 126180 | Dataset: 0-869824 | Loss: 0.613 | 911 ms/step , 6901.91 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 05:30:25 | Epoch: 0 | Step: 126190 | Dataset: 0-870144 | Loss: 0.733 | 912 ms/step , 6894.56 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 05:30:34 | Epoch: 0 | Step: 126200 | Dataset: 0-870464 | Loss: 0.909 | 912 ms/step , 6897.53 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 05:30:36 | Validation | Step: 126200 | Val_loss: 0.374 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:30:45 | Epoch: 0 | Step: 126210 | Dataset: 0-870784 | Loss: 0.935 | 914 ms/step , 6878.83 GFLOP/s , 15276.0 tokens/s INFO:__main__:2024-11-05 05:30:54 | Epoch: 0 | Step: 126220 | Dataset: 0-871104 | Loss: 1.010 | 914 ms/step , 6884.62 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 05:31:03 | Epoch: 0 | 
Step: 126230 | Dataset: 0-871424 | Loss: 0.773 | 913 ms/step , 6888.93 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 05:31:12 | Epoch: 0 | Step: 126240 | Dataset: 0-871744 | Loss: 0.792 | 914 ms/step , 6878.69 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 05:31:21 | Epoch: 0 | Step: 126250 | Dataset: 0-872064 | Loss: 0.904 | 915 ms/step , 6876.98 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 05:31:30 | Epoch: 0 | Step: 126260 | Dataset: 0-872384 | Loss: 0.715 | 913 ms/step , 6885.57 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 05:31:40 | Epoch: 0 | Step: 126270 | Dataset: 0-872704 | Loss: 0.797 | 914 ms/step , 6883.19 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 05:31:49 | Epoch: 0 | Step: 126280 | Dataset: 0-873024 | Loss: 0.834 | 914 ms/step , 6878.34 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 05:31:58 | Epoch: 0 | Step: 126290 | Dataset: 0-873344 | Loss: 0.640 | 913 ms/step , 6887.87 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 05:32:07 | Epoch: 0 | Step: 126300 | Dataset: 0-873664 | Loss: 0.910 | 913 ms/step , 6886.62 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 05:32:09 | Validation | Step: 126300 | Val_loss: 0.377 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:32:18 | Epoch: 0 | Step: 126310 | Dataset: 0-873984 | Loss: 0.809 | 914 ms/step , 6878.62 GFLOP/s , 15299.9 tokens/s INFO:__main__:2024-11-05 05:32:27 | Epoch: 0 | Step: 126320 | Dataset: 0-874304 | Loss: 0.980 | 913 ms/step , 6885.87 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 05:32:36 | Epoch: 0 | Step: 126330 | Dataset: 0-874624 | Loss: 0.896 | 912 ms/step , 6896.42 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 05:32:45 | Epoch: 0 | Step: 126340 | Dataset: 0-874944 | Loss: 0.922 | 912 ms/step , 6893.10 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 05:32:54 | Epoch: 0 | Step: 126350 | Dataset: 0-875264 | Loss: 0.893 | 914 ms/step , 6884.92 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 05:33:03 | Epoch: 0 | Step: 
126360 | Dataset: 0-875584 | Loss: 0.890 | 914 ms/step , 6880.74 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 05:33:12 | Epoch: 0 | Step: 126370 | Dataset: 0-875904 | Loss: 1.006 | 913 ms/step , 6886.03 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 05:33:22 | Epoch: 0 | Step: 126380 | Dataset: 0-876224 | Loss: 0.904 | 914 ms/step , 6881.14 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 05:33:31 | Epoch: 0 | Step: 126390 | Dataset: 0-876544 | Loss: 0.693 | 913 ms/step , 6892.50 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 05:33:40 | Epoch: 0 | Step: 126400 | Dataset: 0-876864 | Loss: 0.975 | 914 ms/step , 6882.75 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 05:33:41 | Validation | Step: 126400 | Val_loss: 0.380 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:33:51 | Epoch: 0 | Step: 126410 | Dataset: 0-877184 | Loss: 0.825 | 912 ms/step , 6894.36 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-05 05:34:00 | Epoch: 0 | Step: 126420 | Dataset: 0-877504 | Loss: 0.946 | 914 ms/step , 6883.32 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 05:34:09 | Epoch: 0 | Step: 126430 | Dataset: 0-877824 | Loss: 0.892 | 914 ms/step , 6881.52 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 05:34:18 | Epoch: 0 | Step: 126440 | Dataset: 0-878144 | Loss: 0.928 | 915 ms/step , 6877.10 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 05:34:27 | Epoch: 0 | Step: 126450 | Dataset: 0-878464 | Loss: 0.910 | 914 ms/step , 6879.81 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 05:34:36 | Epoch: 0 | Step: 126460 | Dataset: 0-878784 | Loss: 0.678 | 912 ms/step , 6896.41 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 05:34:45 | Epoch: 0 | Step: 126470 | Dataset: 0-879104 | Loss: 0.922 | 913 ms/step , 6886.13 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 05:34:55 | Epoch: 0 | Step: 126480 | Dataset: 0-879424 | Loss: 0.934 | 914 ms/step , 6877.73 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 05:35:04 | Epoch: 0 | Step: 126490 | 
Dataset: 0-879744 | Loss: 0.853 | 914 ms/step , 6882.54 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 05:35:13 | Epoch: 0 | Step: 126500 | Dataset: 0-880064 | Loss: 0.802 | 914 ms/step , 6878.48 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 05:35:14 | Validation | Step: 126500 | Val_loss: 0.379 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:35:24 | Epoch: 0 | Step: 126510 | Dataset: 0-880384 | Loss: 0.907 | 913 ms/step , 6888.26 GFLOP/s , 15273.5 tokens/s INFO:__main__:2024-11-05 05:35:33 | Epoch: 0 | Step: 126520 | Dataset: 0-880704 | Loss: 0.891 | 913 ms/step , 6888.25 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 05:35:42 | Epoch: 0 | Step: 126530 | Dataset: 0-881024 | Loss: 0.988 | 914 ms/step , 6881.63 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 05:35:51 | Epoch: 0 | Step: 126540 | Dataset: 0-881344 | Loss: 0.994 | 914 ms/step , 6879.07 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 05:36:00 | Epoch: 0 | Step: 126550 | Dataset: 0-881664 | Loss: 0.905 | 914 ms/step , 6884.53 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 05:36:09 | Epoch: 0 | Step: 126560 | Dataset: 0-881984 | Loss: 0.961 | 912 ms/step , 6892.75 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 05:36:18 | Epoch: 0 | Step: 126570 | Dataset: 0-882304 | Loss: 0.604 | 912 ms/step , 6897.23 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 05:36:28 | Epoch: 0 | Step: 126580 | Dataset: 0-882624 | Loss: 0.665 | 913 ms/step , 6891.19 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 05:36:37 | Epoch: 0 | Step: 126590 | Dataset: 0-882944 | Loss: 0.856 | 912 ms/step , 6893.99 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 05:36:46 | Epoch: 0 | Step: 126600 | Dataset: 0-883264 | Loss: 0.824 | 914 ms/step , 6883.43 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 05:36:47 | Validation | Step: 126600 | Val_loss: 0.392 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:36:57 | Epoch: 0 | Step: 126610 | Dataset: 0-883584 | Loss: 0.851 | 913 ms/step , 
6887.19 GFLOP/s , 15278.4 tokens/s INFO:__main__:2024-11-05 05:37:06 | Epoch: 0 | Step: 126620 | Dataset: 0-883904 | Loss: 0.790 | 915 ms/step , 6876.19 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 05:37:15 | Epoch: 0 | Step: 126630 | Dataset: 0-884224 | Loss: 0.717 | 914 ms/step , 6880.64 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 05:37:24 | Epoch: 0 | Step: 126640 | Dataset: 0-884544 | Loss: 0.845 | 914 ms/step , 6884.25 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 05:37:33 | Epoch: 0 | Step: 126650 | Dataset: 0-884864 | Loss: 0.766 | 913 ms/step , 6889.75 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 05:37:42 | Epoch: 0 | Step: 126660 | Dataset: 0-885184 | Loss: 0.844 | 914 ms/step , 6879.36 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 05:37:51 | Epoch: 0 | Step: 126670 | Dataset: 0-885504 | Loss: 0.792 | 913 ms/step , 6886.22 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 05:38:01 | Epoch: 0 | Step: 126680 | Dataset: 0-885824 | Loss: 0.857 | 913 ms/step , 6885.26 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 05:38:10 | Epoch: 0 | Step: 126690 | Dataset: 0-886144 | Loss: 0.925 | 914 ms/step , 6881.40 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 05:38:19 | Epoch: 0 | Step: 126700 | Dataset: 0-886464 | Loss: 0.787 | 913 ms/step , 6891.61 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 05:38:20 | Validation | Step: 126700 | Val_loss: 0.357 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:38:30 | Epoch: 0 | Step: 126710 | Dataset: 0-886784 | Loss: 0.654 | 912 ms/step , 6898.67 GFLOP/s , 15285.7 tokens/s INFO:__main__:2024-11-05 05:38:39 | Epoch: 0 | Step: 126720 | Dataset: 0-887104 | Loss: 0.967 | 914 ms/step , 6884.88 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 05:38:48 | Epoch: 0 | Step: 126730 | Dataset: 0-887424 | Loss: 0.782 | 913 ms/step , 6886.68 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 05:38:57 | Epoch: 0 | Step: 126740 | Dataset: 0-887744 | Loss: 0.827 | 914 ms/step , 6879.32 
GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 05:39:06 | Epoch: 0 | Step: 126750 | Dataset: 0-888064 | Loss: 0.571 | 912 ms/step , 6896.43 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 05:39:15 | Epoch: 0 | Step: 126760 | Dataset: 0-888384 | Loss: 0.937 | 914 ms/step , 6883.19 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 05:39:24 | Epoch: 0 | Step: 126770 | Dataset: 0-888704 | Loss: 0.868 | 913 ms/step , 6889.11 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 05:39:33 | Epoch: 0 | Step: 126780 | Dataset: 0-889024 | Loss: 0.720 | 912 ms/step , 6893.26 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 05:39:43 | Epoch: 0 | Step: 126790 | Dataset: 0-889344 | Loss: 0.902 | 914 ms/step , 6884.58 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 05:39:52 | Epoch: 0 | Step: 126800 | Dataset: 0-889664 | Loss: 0.771 | 914 ms/step , 6883.97 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 05:39:53 | Validation | Step: 126800 | Val_loss: 0.423 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:40:02 | Epoch: 0 | Step: 126810 | Dataset: 0-889984 | Loss: 0.978 | 913 ms/step , 6885.29 GFLOP/s , 15296.3 tokens/s INFO:__main__:2024-11-05 05:40:12 | Epoch: 0 | Step: 126820 | Dataset: 0-890304 | Loss: 0.811 | 913 ms/step , 6891.96 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 05:40:21 | Epoch: 0 | Step: 126830 | Dataset: 0-890624 | Loss: 0.793 | 915 ms/step , 6875.11 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 05:40:30 | Epoch: 0 | Step: 126840 | Dataset: 0-890944 | Loss: 0.949 | 914 ms/step , 6878.08 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 05:40:39 | Epoch: 0 | Step: 126850 | Dataset: 0-891264 | Loss: 0.900 | 914 ms/step , 6882.89 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 05:40:48 | Epoch: 0 | Step: 126860 | Dataset: 0-891584 | Loss: 0.944 | 913 ms/step , 6886.62 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 05:40:57 | Epoch: 0 | Step: 126870 | Dataset: 0-891904 | Loss: 0.798 | 913 ms/step , 6885.35 GFLOP/s , 
17932.9 tokens/s INFO:__main__:2024-11-05 05:41:06 | Epoch: 0 | Step: 126880 | Dataset: 0-892224 | Loss: 0.828 | 913 ms/step , 6892.50 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 05:41:16 | Epoch: 0 | Step: 126890 | Dataset: 0-892544 | Loss: 0.934 | 914 ms/step , 6878.87 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 05:41:25 | Epoch: 0 | Step: 126900 | Dataset: 0-892864 | Loss: 0.966 | 914 ms/step , 6881.07 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 05:41:26 | Validation | Step: 126900 | Val_loss: 0.779 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:41:35 | Epoch: 0 | Step: 126910 | Dataset: 0-893184 | Loss: 0.720 | 913 ms/step , 6887.47 GFLOP/s , 15271.2 tokens/s INFO:__main__:2024-11-05 05:41:45 | Epoch: 0 | Step: 126920 | Dataset: 0-893504 | Loss: 0.900 | 912 ms/step , 6893.76 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 05:41:54 | Epoch: 0 | Step: 126930 | Dataset: 0-893824 | Loss: 0.805 | 912 ms/step , 6896.40 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 05:42:03 | Epoch: 0 | Step: 126940 | Dataset: 0-894144 | Loss: 0.945 | 914 ms/step , 6877.71 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 05:42:12 | Epoch: 0 | Step: 126950 | Dataset: 0-894464 | Loss: 1.043 | 914 ms/step , 6884.51 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 05:42:21 | Epoch: 0 | Step: 126960 | Dataset: 0-894784 | Loss: 0.774 | 912 ms/step , 6894.95 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 05:42:30 | Epoch: 0 | Step: 126970 | Dataset: 0-895104 | Loss: 0.978 | 914 ms/step , 6882.32 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 05:42:39 | Epoch: 0 | Step: 126980 | Dataset: 0-895424 | Loss: 0.752 | 913 ms/step , 6892.42 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 05:42:49 | Epoch: 0 | Step: 126990 | Dataset: 0-895744 | Loss: 0.738 | 913 ms/step , 6886.74 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 05:42:58 | Epoch: 0 | Step: 127000 | Dataset: 0-896064 | Loss: 0.930 | 913 ms/step , 6886.70 GFLOP/s , 17930.9 
tokens/s INFO:__main__:2024-11-05 05:42:59 | Validation | Step: 127000 | Val_loss: 0.774 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:42:59 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_054259_step_127000.pt` INFO:__main__:2024-11-05 05:43:10 | Epoch: 0 | Step: 127010 | Dataset: 0-896384 | Loss: 0.812 | 913 ms/step , 6886.09 GFLOP/s , 13821.1 tokens/s INFO:__main__:2024-11-05 05:43:19 | Epoch: 0 | Step: 127020 | Dataset: 0-896704 | Loss: 0.796 | 913 ms/step , 6888.52 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 05:43:28 | Epoch: 0 | Step: 127030 | Dataset: 0-897024 | Loss: 0.925 | 913 ms/step , 6885.43 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 05:43:37 | Epoch: 0 | Step: 127040 | Dataset: 0-897344 | Loss: 0.936 | 914 ms/step , 6884.33 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-05 05:43:46 | Epoch: 0 | Step: 127050 | Dataset: 0-897664 | Loss: 0.830 | 913 ms/step , 6891.63 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 05:43:55 | Epoch: 0 | Step: 127060 | Dataset: 0-897984 | Loss: 0.691 | 913 ms/step , 6885.69 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 05:44:04 | Epoch: 0 | Step: 127070 | Dataset: 0-898304 | Loss: 0.661 | 914 ms/step , 6882.91 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 05:44:14 | Epoch: 0 | Step: 127080 | Dataset: 0-898624 | Loss: 0.862 | 914 ms/step , 6885.05 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 05:44:23 | Epoch: 0 | Step: 127090 | Dataset: 0-898944 | Loss: 0.851 | 914 ms/step , 6884.56 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 05:44:32 | Epoch: 0 | Step: 127100 | Dataset: 0-899264 | Loss: 0.703 | 912 ms/step , 6892.64 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 05:44:33 | Validation | Step: 127100 | Val_loss: 0.799 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:44:43 | Epoch: 0 | Step: 127110 | Dataset: 0-899584 | Loss: 0.907 | 913 ms/step , 6892.34 GFLOP/s , 15285.2 tokens/s INFO:__main__:2024-11-05 05:44:52 | Epoch: 
0 | Step: 127120 | Dataset: 0-899904 | Loss: 0.801 | 912 ms/step , 6896.26 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 05:45:01 | Epoch: 0 | Step: 127130 | Dataset: 0-900224 | Loss: 0.855 | 912 ms/step , 6896.74 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 05:45:10 | Epoch: 0 | Step: 127140 | Dataset: 0-900544 | Loss: 0.766 | 912 ms/step , 6895.95 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 05:45:19 | Epoch: 0 | Step: 127150 | Dataset: 0-900864 | Loss: 0.825 | 913 ms/step , 6886.60 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 05:45:28 | Epoch: 0 | Step: 127160 | Dataset: 0-901184 | Loss: 0.870 | 914 ms/step , 6884.19 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 05:45:37 | Epoch: 0 | Step: 127170 | Dataset: 0-901504 | Loss: 1.017 | 914 ms/step , 6884.56 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 05:45:46 | Epoch: 0 | Step: 127180 | Dataset: 0-901824 | Loss: 0.884 | 913 ms/step , 6887.16 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 05:45:56 | Epoch: 0 | Step: 127190 | Dataset: 0-902144 | Loss: 0.709 | 911 ms/step , 6900.56 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 05:46:05 | Epoch: 0 | Step: 127200 | Dataset: 0-902464 | Loss: 0.873 | 913 ms/step , 6885.43 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 05:46:06 | Validation | Step: 127200 | Val_loss: 0.821 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:46:15 | Epoch: 0 | Step: 127210 | Dataset: 0-902784 | Loss: 0.860 | 913 ms/step , 6887.84 GFLOP/s , 15267.9 tokens/s INFO:__main__:2024-11-05 05:46:25 | Epoch: 0 | Step: 127220 | Dataset: 0-903104 | Loss: 0.801 | 912 ms/step , 6895.35 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 05:46:34 | Epoch: 0 | Step: 127230 | Dataset: 0-903424 | Loss: 0.835 | 914 ms/step , 6881.63 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 05:46:43 | Epoch: 0 | Step: 127240 | Dataset: 0-903744 | Loss: 0.837 | 912 ms/step , 6898.13 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 05:46:52 | Epoch: 0 | Step: 
127250 | Dataset: 0-904064 | Loss: 0.926 | 913 ms/step , 6886.85 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 05:47:01 | Epoch: 0 | Step: 127260 | Dataset: 0-904384 | Loss: 0.841 | 914 ms/step , 6882.31 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 05:47:10 | Epoch: 0 | Step: 127270 | Dataset: 0-904704 | Loss: 0.982 | 915 ms/step , 6874.60 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 05:47:19 | Epoch: 0 | Step: 127280 | Dataset: 0-905024 | Loss: 0.816 | 913 ms/step , 6886.45 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 05:47:29 | Epoch: 0 | Step: 127290 | Dataset: 0-905344 | Loss: 0.890 | 914 ms/step , 6879.76 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 05:47:38 | Epoch: 0 | Step: 127300 | Dataset: 0-905664 | Loss: 1.086 | 913 ms/step , 6885.73 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 05:47:39 | Validation | Step: 127300 | Val_loss: 0.754 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:47:48 | Epoch: 0 | Step: 127310 | Dataset: 0-905984 | Loss: 0.948 | 915 ms/step , 6874.33 GFLOP/s , 15278.0 tokens/s INFO:__main__:2024-11-05 05:47:58 | Epoch: 0 | Step: 127320 | Dataset: 0-906304 | Loss: 0.964 | 914 ms/step , 6881.25 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 05:48:07 | Epoch: 0 | Step: 127330 | Dataset: 0-906624 | Loss: 0.887 | 914 ms/step , 6880.50 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 05:48:16 | Epoch: 0 | Step: 127340 | Dataset: 0-906944 | Loss: 0.786 | 912 ms/step , 6893.17 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 05:48:25 | Epoch: 0 | Step: 127350 | Dataset: 0-907264 | Loss: 0.946 | 913 ms/step , 6887.04 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 05:48:34 | Epoch: 0 | Step: 127360 | Dataset: 0-907584 | Loss: 0.824 | 913 ms/step , 6890.03 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 05:48:43 | Epoch: 0 | Step: 127370 | Dataset: 0-907904 | Loss: 0.890 | 913 ms/step , 6890.01 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 05:48:52 | Epoch: 0 | Step: 127380 | 
Dataset: 0-908224 | Loss: 0.964 | 912 ms/step , 6893.10 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 05:49:02 | Epoch: 0 | Step: 127390 | Dataset: 0-908544 | Loss: 0.819 | 913 ms/step , 6885.34 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 05:49:11 | Epoch: 0 | Step: 127400 | Dataset: 0-908864 | Loss: 0.731 | 913 ms/step , 6890.96 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 05:49:12 | Validation | Step: 127400 | Val_loss: 0.797 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:49:21 | Epoch: 0 | Step: 127410 | Dataset: 0-909184 | Loss: 0.884 | 914 ms/step , 6884.59 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 05:49:31 | Epoch: 0 | Step: 127420 | Dataset: 0-909504 | Loss: 0.769 | 914 ms/step , 6883.22 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 05:49:40 | Epoch: 0 | Step: 127430 | Dataset: 0-909824 | Loss: 0.835 | 914 ms/step , 6880.33 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 05:49:49 | Epoch: 0 | Step: 127440 | Dataset: 0-910144 | Loss: 0.830 | 913 ms/step , 6886.68 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 05:49:58 | Epoch: 0 | Step: 127450 | Dataset: 0-910464 | Loss: 0.815 | 913 ms/step , 6891.45 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 05:50:07 | Epoch: 0 | Step: 127460 | Dataset: 0-910784 | Loss: 0.863 | 912 ms/step , 6892.88 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 05:50:16 | Epoch: 0 | Step: 127470 | Dataset: 0-911104 | Loss: 0.883 | 913 ms/step , 6889.11 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 05:50:25 | Epoch: 0 | Step: 127480 | Dataset: 0-911424 | Loss: 0.831 | 912 ms/step , 6897.94 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 05:50:34 | Epoch: 0 | Step: 127490 | Dataset: 0-911744 | Loss: 0.940 | 915 ms/step , 6877.51 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 05:50:44 | Epoch: 0 | Step: 127500 | Dataset: 0-912064 | Loss: 0.684 | 913 ms/step , 6890.71 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 05:50:45 | Validation | Step: 127500 | 
Val_loss: 0.722 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:50:54 | Epoch: 0 | Step: 127510 | Dataset: 0-912384 | Loss: 0.785 | 913 ms/step , 6891.73 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 05:51:03 | Epoch: 0 | Step: 127520 | Dataset: 0-912704 | Loss: 0.935 | 913 ms/step , 6890.34 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 05:51:13 | Epoch: 0 | Step: 127530 | Dataset: 0-913024 | Loss: 0.973 | 914 ms/step , 6880.55 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 05:51:22 | Epoch: 0 | Step: 127540 | Dataset: 0-913344 | Loss: 0.798 | 912 ms/step , 6894.05 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 05:51:31 | Epoch: 0 | Step: 127550 | Dataset: 0-913664 | Loss: 0.807 | 914 ms/step , 6883.51 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 05:51:40 | Epoch: 0 | Step: 127560 | Dataset: 0-913984 | Loss: 0.894 | 914 ms/step , 6878.99 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 05:51:49 | Epoch: 0 | Step: 127570 | Dataset: 0-914304 | Loss: 0.986 | 914 ms/step , 6879.39 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 05:51:58 | Epoch: 0 | Step: 127580 | Dataset: 0-914624 | Loss: 0.904 | 913 ms/step , 6890.15 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 05:52:07 | Epoch: 0 | Step: 127590 | Dataset: 0-914944 | Loss: 0.860 | 913 ms/step , 6891.06 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 05:52:17 | Epoch: 0 | Step: 127600 | Dataset: 0-915264 | Loss: 0.721 | 913 ms/step , 6885.71 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 05:52:18 | Validation | Step: 127600 | Val_loss: 0.747 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:52:27 | Epoch: 0 | Step: 127610 | Dataset: 0-915584 | Loss: 0.935 | 914 ms/step , 6883.15 GFLOP/s , 15269.9 tokens/s INFO:__main__:2024-11-05 05:52:36 | Epoch: 0 | Step: 127620 | Dataset: 0-915904 | Loss: 1.019 | 915 ms/step , 6874.16 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 05:52:46 | Epoch: 0 | Step: 127630 | Dataset: 0-916224 | Loss: 0.914 | 914 ms/step , 
6882.40 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 05:52:55 | Epoch: 0 | Step: 127640 | Dataset: 0-916544 | Loss: 0.882 | 913 ms/step , 6887.50 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 05:53:04 | Epoch: 0 | Step: 127650 | Dataset: 0-916864 | Loss: 0.645 | 915 ms/step , 6876.16 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 05:53:13 | Epoch: 0 | Step: 127660 | Dataset: 0-917184 | Loss: 0.832 | 913 ms/step , 6887.33 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 05:53:22 | Epoch: 0 | Step: 127670 | Dataset: 0-917504 | Loss: 0.849 | 913 ms/step , 6887.79 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 05:53:31 | Epoch: 0 | Step: 127680 | Dataset: 0-917824 | Loss: 0.914 | 913 ms/step , 6889.19 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 05:53:40 | Epoch: 0 | Step: 127690 | Dataset: 0-918144 | Loss: 0.879 | 914 ms/step , 6884.47 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 05:53:50 | Epoch: 0 | Step: 127700 | Dataset: 0-918464 | Loss: 0.981 | 915 ms/step , 6875.08 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 05:53:51 | Validation | Step: 127700 | Val_loss: 0.765 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:54:00 | Epoch: 0 | Step: 127710 | Dataset: 0-918784 | Loss: 0.827 | 912 ms/step , 6894.02 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-05 05:54:09 | Epoch: 0 | Step: 127720 | Dataset: 0-919104 | Loss: 0.936 | 914 ms/step , 6879.43 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 05:54:19 | Epoch: 0 | Step: 127730 | Dataset: 0-919424 | Loss: 0.789 | 913 ms/step , 6885.08 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 05:54:28 | Epoch: 0 | Step: 127740 | Dataset: 0-919744 | Loss: 0.990 | 914 ms/step , 6883.23 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 05:54:37 | Epoch: 0 | Step: 127750 | Dataset: 0-920064 | Loss: 0.746 | 913 ms/step , 6885.10 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 05:54:46 | Epoch: 0 | Step: 127760 | Dataset: 0-920384 | Loss: 0.881 | 913 ms/step , 6888.85 
GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 05:54:55 | Epoch: 0 | Step: 127770 | Dataset: 0-920704 | Loss: 0.954 | 913 ms/step , 6889.35 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 05:55:04 | Epoch: 0 | Step: 127780 | Dataset: 0-921024 | Loss: 0.843 | 913 ms/step , 6892.01 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 05:55:13 | Epoch: 0 | Step: 127790 | Dataset: 0-921344 | Loss: 0.956 | 914 ms/step , 6878.77 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 05:55:23 | Epoch: 0 | Step: 127800 | Dataset: 0-921664 | Loss: 0.666 | 913 ms/step , 6889.58 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 05:55:24 | Validation | Step: 127800 | Val_loss: 0.819 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:55:33 | Epoch: 0 | Step: 127810 | Dataset: 0-921984 | Loss: 0.939 | 914 ms/step , 6880.54 GFLOP/s , 15262.2 tokens/s INFO:__main__:2024-11-05 05:55:42 | Epoch: 0 | Step: 127820 | Dataset: 0-922304 | Loss: 0.966 | 913 ms/step , 6889.86 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 05:55:52 | Epoch: 0 | Step: 127830 | Dataset: 0-922624 | Loss: 0.902 | 913 ms/step , 6886.53 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 05:56:01 | Epoch: 0 | Step: 127840 | Dataset: 0-922944 | Loss: 0.811 | 913 ms/step , 6886.40 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 05:56:10 | Epoch: 0 | Step: 127850 | Dataset: 0-923264 | Loss: 0.897 | 913 ms/step , 6885.38 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 05:56:19 | Epoch: 0 | Step: 127860 | Dataset: 0-923584 | Loss: 0.814 | 915 ms/step , 6877.15 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 05:56:28 | Epoch: 0 | Step: 127870 | Dataset: 0-923904 | Loss: 0.892 | 915 ms/step , 6874.80 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 05:56:37 | Epoch: 0 | Step: 127880 | Dataset: 0-924224 | Loss: 0.867 | 914 ms/step , 6881.22 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 05:56:46 | Epoch: 0 | Step: 127890 | Dataset: 0-924544 | Loss: 0.836 | 912 ms/step , 6892.98 GFLOP/s , 
17928.1 tokens/s INFO:__main__:2024-11-05 05:56:56 | Epoch: 0 | Step: 127900 | Dataset: 0-924864 | Loss: 0.694 | 914 ms/step , 6882.70 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 05:56:57 | Validation | Step: 127900 | Val_loss: 0.776 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:57:06 | Epoch: 0 | Step: 127910 | Dataset: 0-925184 | Loss: 0.953 | 913 ms/step , 6885.62 GFLOP/s , 15269.4 tokens/s INFO:__main__:2024-11-05 05:57:15 | Epoch: 0 | Step: 127920 | Dataset: 0-925504 | Loss: 0.926 | 913 ms/step , 6886.69 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 05:57:25 | Epoch: 0 | Step: 127930 | Dataset: 0-925824 | Loss: 0.898 | 913 ms/step , 6889.96 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 05:57:34 | Epoch: 0 | Step: 127940 | Dataset: 0-926144 | Loss: 0.871 | 914 ms/step , 6881.47 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 05:57:43 | Epoch: 0 | Step: 127950 | Dataset: 0-926464 | Loss: 0.900 | 914 ms/step , 6880.19 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 05:57:52 | Epoch: 0 | Step: 127960 | Dataset: 0-926784 | Loss: 0.956 | 913 ms/step , 6888.02 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 05:58:01 | Epoch: 0 | Step: 127970 | Dataset: 0-927104 | Loss: 0.557 | 913 ms/step , 6887.98 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 05:58:10 | Epoch: 0 | Step: 127980 | Dataset: 0-927424 | Loss: 0.809 | 912 ms/step , 6896.27 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 05:58:19 | Epoch: 0 | Step: 127990 | Dataset: 0-927744 | Loss: 0.733 | 913 ms/step , 6888.54 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 05:58:28 | Epoch: 0 | Step: 128000 | Dataset: 0-928064 | Loss: 0.903 | 913 ms/step , 6885.84 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 05:58:30 | Validation | Step: 128000 | Val_loss: 0.792 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 05:58:30 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_055830_step_128000.pt` INFO:__main__:2024-11-05 05:58:40 | 
Epoch: 0 | Step: 128010 | Dataset: 0-928384 | Loss: 0.895 | 914 ms/step , 6883.59 GFLOP/s , 13834.1 tokens/s INFO:__main__:2024-11-05 05:58:49 | Epoch: 0 | Step: 128020 | Dataset: 0-928704 | Loss: 0.932 | 914 ms/step , 6880.32 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 05:58:59 | Epoch: 0 | Step: 128030 | Dataset: 0-929024 | Loss: 0.846 | 914 ms/step , 6884.12 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 05:59:08 | Epoch: 0 | Step: 128040 | Dataset: 0-929344 | Loss: 0.836 | 913 ms/step , 6889.23 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 05:59:17 | Epoch: 0 | Step: 128050 | Dataset: 0-929664 | Loss: 0.477 | 911 ms/step , 6900.55 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 05:59:26 | Epoch: 0 | Step: 128060 | Dataset: 0-929984 | Loss: 0.978 | 914 ms/step , 6882.24 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 05:59:35 | Epoch: 0 | Step: 128070 | Dataset: 0-930304 | Loss: 0.921 | 914 ms/step , 6880.82 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 05:59:44 | Epoch: 0 | Step: 128080 | Dataset: 0-930624 | Loss: 0.794 | 912 ms/step , 6895.11 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 05:59:53 | Epoch: 0 | Step: 128090 | Dataset: 0-930944 | Loss: 0.778 | 914 ms/step , 6882.99 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 06:00:03 | Epoch: 0 | Step: 128100 | Dataset: 0-931264 | Loss: 0.940 | 914 ms/step , 6884.13 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 06:00:04 | Validation | Step: 128100 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:00:13 | Epoch: 0 | Step: 128110 | Dataset: 0-931584 | Loss: 0.885 | 913 ms/step , 6889.12 GFLOP/s , 15276.4 tokens/s INFO:__main__:2024-11-05 06:00:22 | Epoch: 0 | Step: 128120 | Dataset: 0-931904 | Loss: 0.718 | 913 ms/step , 6891.07 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 06:00:32 | Epoch: 0 | Step: 128130 | Dataset: 0-932224 | Loss: 0.810 | 911 ms/step , 6907.31 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 06:00:41 | Epoch: 0 | 
Step: 128140 | Dataset: 0-932544 | Loss: 0.808 | 913 ms/step , 6887.30 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 06:00:50 | Epoch: 0 | Step: 128150 | Dataset: 0-932864 | Loss: 0.777 | 915 ms/step , 6875.67 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 06:00:59 | Epoch: 0 | Step: 128160 | Dataset: 0-933184 | Loss: 0.798 | 913 ms/step , 6888.91 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 06:01:08 | Epoch: 0 | Step: 128170 | Dataset: 0-933504 | Loss: 0.622 | 913 ms/step , 6889.63 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 06:01:17 | Epoch: 0 | Step: 128180 | Dataset: 0-933824 | Loss: 0.741 | 913 ms/step , 6886.11 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 06:01:26 | Epoch: 0 | Step: 128190 | Dataset: 0-934144 | Loss: 0.830 | 912 ms/step , 6894.44 GFLOP/s , 17949.4 tokens/s INFO:__main__:2024-11-05 06:01:35 | Epoch: 0 | Step: 128200 | Dataset: 0-934464 | Loss: 0.766 | 913 ms/step , 6888.18 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 06:01:37 | Validation | Step: 128200 | Val_loss: 0.817 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:01:46 | Epoch: 0 | Step: 128210 | Dataset: 0-934784 | Loss: 0.763 | 914 ms/step , 6879.22 GFLOP/s , 15272.1 tokens/s INFO:__main__:2024-11-05 06:01:55 | Epoch: 0 | Step: 128220 | Dataset: 0-935104 | Loss: 0.841 | 914 ms/step , 6882.86 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 06:02:04 | Epoch: 0 | Step: 128230 | Dataset: 0-935424 | Loss: 0.805 | 912 ms/step , 6893.82 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 06:02:14 | Epoch: 0 | Step: 128240 | Dataset: 0-935744 | Loss: 0.737 | 913 ms/step , 6885.50 GFLOP/s , 17864.2 tokens/s INFO:__main__:2024-11-05 06:02:23 | Epoch: 0 | Step: 128250 | Dataset: 0-936064 | Loss: 0.815 | 914 ms/step , 6881.24 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 06:02:32 | Epoch: 0 | Step: 128260 | Dataset: 0-936384 | Loss: 0.766 | 914 ms/step , 6884.45 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 06:02:41 | Epoch: 0 | Step: 
128270 | Dataset: 0-936704 | Loss: 0.868 | 922 ms/step , 6824.31 GFLOP/s , 17779.5 tokens/s INFO:__main__:2024-11-05 06:02:50 | Epoch: 0 | Step: 128280 | Dataset: 0-937024 | Loss: 0.789 | 913 ms/step , 6887.08 GFLOP/s , 17821.8 tokens/s INFO:__main__:2024-11-05 06:02:59 | Epoch: 0 | Step: 128290 | Dataset: 0-937344 | Loss: 0.852 | 913 ms/step , 6885.08 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 06:03:09 | Epoch: 0 | Step: 128300 | Dataset: 0-937664 | Loss: 0.800 | 913 ms/step , 6885.21 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:03:10 | Validation | Step: 128300 | Val_loss: 0.786 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:03:19 | Epoch: 0 | Step: 128310 | Dataset: 0-937984 | Loss: 0.754 | 911 ms/step , 6902.38 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 06:03:28 | Epoch: 0 | Step: 128320 | Dataset: 0-938304 | Loss: 0.807 | 913 ms/step , 6889.76 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 06:03:38 | Epoch: 0 | Step: 128330 | Dataset: 0-938624 | Loss: 0.820 | 913 ms/step , 6889.32 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 06:03:47 | Epoch: 0 | Step: 128340 | Dataset: 0-938944 | Loss: 0.833 | 914 ms/step , 6884.98 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 06:03:56 | Epoch: 0 | Step: 128350 | Dataset: 0-939264 | Loss: 0.761 | 914 ms/step , 6884.30 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 06:04:05 | Epoch: 0 | Step: 128360 | Dataset: 0-939584 | Loss: 0.949 | 913 ms/step , 6887.58 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 06:04:14 | Epoch: 0 | Step: 128370 | Dataset: 0-939904 | Loss: 0.797 | 913 ms/step , 6890.17 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 06:04:23 | Epoch: 0 | Step: 128380 | Dataset: 0-940224 | Loss: 0.742 | 912 ms/step , 6893.68 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 06:04:32 | Epoch: 0 | Step: 128390 | Dataset: 0-940544 | Loss: 0.740 | 913 ms/step , 6889.69 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 06:04:42 | Epoch: 0 | Step: 128400 | 
Dataset: 0-940864 | Loss: 0.807 | 913 ms/step , 6885.60 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 06:04:43 | Validation | Step: 128400 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:04:52 | Epoch: 0 | Step: 128410 | Dataset: 0-941184 | Loss: 0.784 | 929 ms/step , 6772.37 GFLOP/s , 15253.5 tokens/s INFO:__main__:2024-11-05 06:05:01 | Epoch: 0 | Step: 128420 | Dataset: 0-941504 | Loss: 0.830 | 914 ms/step , 6884.43 GFLOP/s , 17831.1 tokens/s INFO:__main__:2024-11-05 06:05:11 | Epoch: 0 | Step: 128430 | Dataset: 0-941824 | Loss: 0.713 | 912 ms/step , 6892.74 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 06:05:20 | Epoch: 0 | Step: 128440 | Dataset: 0-942144 | Loss: 0.808 | 913 ms/step , 6892.01 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 06:05:29 | Epoch: 0 | Step: 128450 | Dataset: 0-942464 | Loss: 0.822 | 914 ms/step , 6879.40 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 06:05:38 | Epoch: 0 | Step: 128460 | Dataset: 0-942784 | Loss: 0.686 | 913 ms/step , 6886.80 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 06:05:47 | Epoch: 0 | Step: 128470 | Dataset: 0-943104 | Loss: 0.766 | 913 ms/step , 6889.06 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 06:05:56 | Epoch: 0 | Step: 128480 | Dataset: 0-943424 | Loss: 0.824 | 914 ms/step , 6882.00 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 06:06:05 | Epoch: 0 | Step: 128490 | Dataset: 0-943744 | Loss: 0.823 | 914 ms/step , 6883.01 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 06:06:15 | Epoch: 0 | Step: 128500 | Dataset: 0-944064 | Loss: 0.420 | 913 ms/step , 6889.63 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 06:06:16 | Validation | Step: 128500 | Val_loss: 0.806 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:06:25 | Epoch: 0 | Step: 128510 | Dataset: 0-944384 | Loss: 0.842 | 914 ms/step , 6884.54 GFLOP/s , 15270.6 tokens/s INFO:__main__:2024-11-05 06:06:34 | Epoch: 0 | Step: 128520 | Dataset: 0-944704 | Loss: 0.757 | 913 ms/step , 
6888.09 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 06:06:44 | Epoch: 0 | Step: 128530 | Dataset: 0-945024 | Loss: 0.874 | 913 ms/step , 6886.32 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 06:06:53 | Epoch: 0 | Step: 128540 | Dataset: 0-945344 | Loss: 0.732 | 913 ms/step , 6888.75 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 06:07:02 | Epoch: 0 | Step: 128550 | Dataset: 0-945664 | Loss: 0.820 | 912 ms/step , 6893.01 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 06:07:11 | Epoch: 0 | Step: 128560 | Dataset: 0-945984 | Loss: 0.761 | 912 ms/step , 6893.59 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 06:07:20 | Epoch: 0 | Step: 128570 | Dataset: 0-946304 | Loss: 0.813 | 914 ms/step , 6883.30 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 06:07:29 | Epoch: 0 | Step: 128580 | Dataset: 0-946624 | Loss: 0.779 | 912 ms/step , 6893.98 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 06:07:38 | Epoch: 0 | Step: 128590 | Dataset: 0-946944 | Loss: 0.877 | 912 ms/step , 6897.45 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 06:07:48 | Epoch: 0 | Step: 128600 | Dataset: 0-947264 | Loss: 0.772 | 913 ms/step , 6887.76 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 06:07:49 | Validation | Step: 128600 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:07:58 | Epoch: 0 | Step: 128610 | Dataset: 0-947584 | Loss: 0.795 | 913 ms/step , 6887.83 GFLOP/s , 15277.2 tokens/s INFO:__main__:2024-11-05 06:08:07 | Epoch: 0 | Step: 128620 | Dataset: 0-947904 | Loss: 0.803 | 913 ms/step , 6885.68 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 06:08:17 | Epoch: 0 | Step: 128630 | Dataset: 0-948224 | Loss: 0.810 | 915 ms/step , 6876.36 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 06:08:26 | Epoch: 0 | Step: 128640 | Dataset: 0-948544 | Loss: 0.670 | 912 ms/step , 6894.89 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 06:08:35 | Epoch: 0 | Step: 128650 | Dataset: 0-948864 | Loss: 0.705 | 913 ms/step , 6892.14 
GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 06:08:44 | Epoch: 0 | Step: 128660 | Dataset: 0-949184 | Loss: 0.854 | 913 ms/step , 6885.61 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 06:08:53 | Epoch: 0 | Step: 128670 | Dataset: 0-949504 | Loss: 0.828 | 914 ms/step , 6882.92 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 06:09:02 | Epoch: 0 | Step: 128680 | Dataset: 0-949824 | Loss: 0.768 | 913 ms/step , 6885.55 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 06:09:11 | Epoch: 0 | Step: 128690 | Dataset: 0-950144 | Loss: 0.698 | 912 ms/step , 6894.21 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 06:09:21 | Epoch: 0 | Step: 128700 | Dataset: 0-950464 | Loss: 0.748 | 915 ms/step , 6875.99 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 06:09:22 | Validation | Step: 128700 | Val_loss: 0.670 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:09:31 | Epoch: 0 | Step: 128710 | Dataset: 0-950784 | Loss: 0.735 | 912 ms/step , 6893.36 GFLOP/s , 15284.7 tokens/s INFO:__main__:2024-11-05 06:09:40 | Epoch: 0 | Step: 128720 | Dataset: 0-951104 | Loss: 0.829 | 915 ms/step , 6873.36 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 06:09:50 | Epoch: 0 | Step: 128730 | Dataset: 0-951424 | Loss: 0.784 | 913 ms/step , 6887.26 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 06:09:59 | Epoch: 0 | Step: 128740 | Dataset: 0-951744 | Loss: 0.769 | 913 ms/step , 6889.65 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 06:10:08 | Epoch: 0 | Step: 128750 | Dataset: 0-952064 | Loss: 0.792 | 912 ms/step , 6894.09 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 06:10:17 | Epoch: 0 | Step: 128760 | Dataset: 0-952384 | Loss: 0.734 | 913 ms/step , 6889.79 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 06:10:26 | Epoch: 0 | Step: 128770 | Dataset: 0-952704 | Loss: 0.842 | 912 ms/step , 6892.81 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 06:10:35 | Epoch: 0 | Step: 128780 | Dataset: 0-953024 | Loss: 0.758 | 912 ms/step , 6895.07 GFLOP/s , 
17937.9 tokens/s INFO:__main__:2024-11-05 06:10:44 | Epoch: 0 | Step: 128790 | Dataset: 0-953344 | Loss: 0.756 | 912 ms/step , 6896.63 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 06:10:53 | Epoch: 0 | Step: 128800 | Dataset: 0-953664 | Loss: 0.854 | 913 ms/step , 6891.90 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 06:10:55 | Validation | Step: 128800 | Val_loss: 0.683 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:11:04 | Epoch: 0 | Step: 128810 | Dataset: 0-953984 | Loss: 0.767 | 913 ms/step , 6892.24 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-05 06:11:13 | Epoch: 0 | Step: 128820 | Dataset: 0-954304 | Loss: 0.800 | 915 ms/step , 6877.36 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 06:11:22 | Epoch: 0 | Step: 128830 | Dataset: 0-954624 | Loss: 0.811 | 913 ms/step , 6891.95 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 06:11:32 | Epoch: 0 | Step: 128840 | Dataset: 0-954944 | Loss: 0.629 | 913 ms/step , 6885.43 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 06:11:41 | Epoch: 0 | Step: 128850 | Dataset: 0-955264 | Loss: 0.816 | 913 ms/step , 6886.18 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 06:11:50 | Epoch: 0 | Step: 128860 | Dataset: 0-955584 | Loss: 0.715 | 914 ms/step , 6883.54 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 06:11:59 | Epoch: 0 | Step: 128870 | Dataset: 0-955904 | Loss: 0.830 | 913 ms/step , 6887.33 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 06:12:08 | Epoch: 0 | Step: 128880 | Dataset: 0-956224 | Loss: 0.793 | 912 ms/step , 6899.03 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 06:12:17 | Epoch: 0 | Step: 128890 | Dataset: 0-956544 | Loss: 0.871 | 913 ms/step , 6885.13 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 06:12:26 | Epoch: 0 | Step: 128900 | Dataset: 0-956864 | Loss: 0.841 | 913 ms/step , 6887.33 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 06:12:28 | Validation | Step: 128900 | Val_loss: 0.796 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:12:37 
| Epoch: 0 | Step: 128910 | Dataset: 0-957184 | Loss: 0.708 | 912 ms/step , 6899.58 GFLOP/s , 15284.8 tokens/s INFO:__main__:2024-11-05 06:12:46 | Epoch: 0 | Step: 128920 | Dataset: 0-957504 | Loss: 0.826 | 913 ms/step , 6889.63 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 06:12:55 | Epoch: 0 | Step: 128930 | Dataset: 0-957824 | Loss: 0.825 | 914 ms/step , 6883.72 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 06:13:04 | Epoch: 0 | Step: 128940 | Dataset: 0-958144 | Loss: 0.800 | 912 ms/step , 6893.07 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 06:13:14 | Epoch: 0 | Step: 128950 | Dataset: 0-958464 | Loss: 0.518 | 912 ms/step , 6896.74 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 06:13:23 | Epoch: 0 | Step: 128960 | Dataset: 0-958784 | Loss: 0.740 | 913 ms/step , 6886.38 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 06:13:32 | Epoch: 0 | Step: 128970 | Dataset: 0-959104 | Loss: 0.829 | 914 ms/step , 6884.37 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 06:13:41 | Epoch: 0 | Step: 128980 | Dataset: 0-959424 | Loss: 0.850 | 912 ms/step , 6894.65 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 06:13:50 | Epoch: 0 | Step: 128990 | Dataset: 0-959744 | Loss: 0.824 | 913 ms/step , 6889.19 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 06:13:59 | Epoch: 0 | Step: 129000 | Dataset: 0-960064 | Loss: 0.706 | 913 ms/step , 6891.57 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:14:01 | Validation | Step: 129000 | Val_loss: 0.796 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:14:01 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_061401_step_129000.pt` INFO:__main__:2024-11-05 06:14:11 | Epoch: 0 | Step: 129010 | Dataset: 0-960384 | Loss: 0.725 | 912 ms/step , 6896.17 GFLOP/s , 13801.6 tokens/s INFO:__main__:2024-11-05 06:14:20 | Epoch: 0 | Step: 129020 | Dataset: 0-960704 | Loss: 0.733 | 913 ms/step , 6891.14 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 06:14:29 | Epoch: 0 
| Step: 129030 | Dataset: 0-961024 | Loss: 0.756 | 915 ms/step , 6875.42 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 06:14:39 | Epoch: 0 | Step: 129040 | Dataset: 0-961344 | Loss: 0.772 | 912 ms/step , 6897.04 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 06:14:48 | Epoch: 0 | Step: 129050 | Dataset: 0-961664 | Loss: 0.816 | 914 ms/step , 6883.72 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 06:14:57 | Epoch: 0 | Step: 129060 | Dataset: 0-961984 | Loss: 0.745 | 913 ms/step , 6888.94 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 06:15:06 | Epoch: 0 | Step: 129070 | Dataset: 0-962304 | Loss: 0.666 | 914 ms/step , 6884.75 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 06:15:15 | Epoch: 0 | Step: 129080 | Dataset: 0-962624 | Loss: 0.854 | 914 ms/step , 6881.83 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 06:15:24 | Epoch: 0 | Step: 129090 | Dataset: 0-962944 | Loss: 0.840 | 914 ms/step , 6881.96 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 06:15:33 | Epoch: 0 | Step: 129100 | Dataset: 0-963264 | Loss: 0.729 | 912 ms/step , 6893.19 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 06:15:35 | Validation | Step: 129100 | Val_loss: 0.768 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:15:44 | Epoch: 0 | Step: 129110 | Dataset: 0-963584 | Loss: 0.656 | 912 ms/step , 6892.74 GFLOP/s , 15298.3 tokens/s INFO:__main__:2024-11-05 06:15:53 | Epoch: 0 | Step: 129120 | Dataset: 0-963904 | Loss: 0.759 | 914 ms/step , 6884.82 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 06:16:02 | Epoch: 0 | Step: 129130 | Dataset: 0-964224 | Loss: 0.845 | 914 ms/step , 6881.55 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 06:16:12 | Epoch: 0 | Step: 129140 | Dataset: 0-964544 | Loss: 0.928 | 915 ms/step , 6876.66 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 06:16:21 | Epoch: 0 | Step: 129150 | Dataset: 0-964864 | Loss: 0.777 | 913 ms/step , 6886.43 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 06:16:30 | Epoch: 0 | Step: 
129160 | Dataset: 0-965184 | Loss: 0.762 | 913 ms/step , 6891.64 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 06:16:39 | Epoch: 0 | Step: 129170 | Dataset: 0-965504 | Loss: 0.875 | 914 ms/step , 6879.45 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 06:16:48 | Epoch: 0 | Step: 129180 | Dataset: 0-965824 | Loss: 0.662 | 913 ms/step , 6890.83 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 06:16:57 | Epoch: 0 | Step: 129190 | Dataset: 0-966144 | Loss: 0.867 | 913 ms/step , 6887.41 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 06:17:06 | Epoch: 0 | Step: 129200 | Dataset: 0-966464 | Loss: 0.873 | 912 ms/step , 6894.14 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 06:17:08 | Validation | Step: 129200 | Val_loss: 0.715 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:17:17 | Epoch: 0 | Step: 129210 | Dataset: 0-966784 | Loss: 0.801 | 913 ms/step , 6890.17 GFLOP/s , 15283.7 tokens/s INFO:__main__:2024-11-05 06:17:26 | Epoch: 0 | Step: 129220 | Dataset: 0-967104 | Loss: 0.731 | 913 ms/step , 6887.86 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 06:17:35 | Epoch: 0 | Step: 129230 | Dataset: 0-967424 | Loss: 0.844 | 914 ms/step , 6884.87 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 06:17:44 | Epoch: 0 | Step: 129240 | Dataset: 0-967744 | Loss: 0.865 | 913 ms/step , 6891.01 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 06:17:54 | Epoch: 0 | Step: 129250 | Dataset: 0-968064 | Loss: 0.808 | 913 ms/step , 6891.20 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 06:18:03 | Epoch: 0 | Step: 129260 | Dataset: 0-968384 | Loss: 0.787 | 914 ms/step , 6883.95 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 06:18:12 | Epoch: 0 | Step: 129270 | Dataset: 0-968704 | Loss: 0.819 | 914 ms/step , 6884.25 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 06:18:21 | Epoch: 0 | Step: 129280 | Dataset: 0-969024 | Loss: 0.941 | 916 ms/step , 6866.29 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 06:18:30 | Epoch: 0 | Step: 129290 | 
Dataset: 0-969344 | Loss: 0.845 | 913 ms/step , 6885.07 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 06:18:39 | Epoch: 0 | Step: 129300 | Dataset: 0-969664 | Loss: 0.835 | 914 ms/step , 6884.41 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:18:41 | Validation | Step: 129300 | Val_loss: 0.778 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:18:50 | Epoch: 0 | Step: 129310 | Dataset: 0-969984 | Loss: 0.928 | 913 ms/step , 6885.56 GFLOP/s , 15287.0 tokens/s INFO:__main__:2024-11-05 06:18:59 | Epoch: 0 | Step: 129320 | Dataset: 0-970304 | Loss: 0.797 | 914 ms/step , 6883.02 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 06:19:08 | Epoch: 0 | Step: 129330 | Dataset: 0-970624 | Loss: 0.825 | 914 ms/step , 6879.38 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:19:17 | Epoch: 0 | Step: 129340 | Dataset: 0-970944 | Loss: 0.780 | 912 ms/step , 6893.82 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 06:19:27 | Epoch: 0 | Step: 129350 | Dataset: 0-971264 | Loss: 0.898 | 912 ms/step , 6894.22 GFLOP/s , 17907.9 tokens/s INFO:__main__:2024-11-05 06:19:36 | Epoch: 0 | Step: 129360 | Dataset: 0-971584 | Loss: 0.792 | 913 ms/step , 6887.60 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 06:19:45 | Epoch: 0 | Step: 129370 | Dataset: 0-971904 | Loss: 0.637 | 912 ms/step , 6896.48 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 06:19:54 | Epoch: 0 | Step: 129380 | Dataset: 0-972224 | Loss: 0.910 | 914 ms/step , 6881.79 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 06:20:03 | Epoch: 0 | Step: 129390 | Dataset: 0-972544 | Loss: 0.744 | 914 ms/step , 6880.75 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 06:20:12 | Epoch: 0 | Step: 129400 | Dataset: 0-972864 | Loss: 0.888 | 914 ms/step , 6884.47 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 06:20:14 | Validation | Step: 129400 | Val_loss: 0.791 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:20:23 | Epoch: 0 | Step: 129410 | Dataset: 0-973184 | Loss: 0.774 | 912 ms/step , 
6899.01 GFLOP/s , 15275.8 tokens/s INFO:__main__:2024-11-05 06:20:32 | Epoch: 0 | Step: 129420 | Dataset: 0-973504 | Loss: 0.692 | 914 ms/step , 6882.50 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 06:20:41 | Epoch: 0 | Step: 129430 | Dataset: 0-973824 | Loss: 0.697 | 914 ms/step , 6882.48 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 06:20:50 | Epoch: 0 | Step: 129440 | Dataset: 0-974144 | Loss: 0.854 | 914 ms/step , 6884.98 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 06:21:00 | Epoch: 0 | Step: 129450 | Dataset: 0-974464 | Loss: 0.806 | 914 ms/step , 6884.43 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 06:21:09 | Epoch: 0 | Step: 129460 | Dataset: 0-974784 | Loss: 0.843 | 913 ms/step , 6887.12 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 06:21:18 | Epoch: 0 | Step: 129470 | Dataset: 0-975104 | Loss: 0.721 | 913 ms/step , 6887.67 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 06:21:27 | Epoch: 0 | Step: 129480 | Dataset: 0-975424 | Loss: 0.679 | 913 ms/step , 6890.64 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 06:21:36 | Epoch: 0 | Step: 129490 | Dataset: 0-975744 | Loss: 0.708 | 913 ms/step , 6888.23 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 06:21:45 | Epoch: 0 | Step: 129500 | Dataset: 0-976064 | Loss: 0.927 | 914 ms/step , 6881.73 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 06:21:47 | Validation | Step: 129500 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:21:56 | Epoch: 0 | Step: 129510 | Dataset: 0-976384 | Loss: 0.783 | 913 ms/step , 6886.10 GFLOP/s , 15278.6 tokens/s INFO:__main__:2024-11-05 06:22:05 | Epoch: 0 | Step: 129520 | Dataset: 0-976704 | Loss: 0.738 | 913 ms/step , 6891.47 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 06:22:14 | Epoch: 0 | Step: 129530 | Dataset: 0-977024 | Loss: 0.822 | 913 ms/step , 6892.23 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 06:22:23 | Epoch: 0 | Step: 129540 | Dataset: 0-977344 | Loss: 0.887 | 914 ms/step , 6882.55 
GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 06:22:32 | Epoch: 0 | Step: 129550 | Dataset: 0-977664 | Loss: 0.857 | 914 ms/step , 6882.28 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 06:22:42 | Epoch: 0 | Step: 129560 | Dataset: 0-977984 | Loss: 0.855 | 912 ms/step , 6896.57 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 06:22:51 | Epoch: 0 | Step: 129570 | Dataset: 0-978304 | Loss: 0.707 | 912 ms/step , 6893.21 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 06:23:00 | Epoch: 0 | Step: 129580 | Dataset: 0-978624 | Loss: 0.787 | 913 ms/step , 6889.63 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 06:23:09 | Epoch: 0 | Step: 129590 | Dataset: 0-978944 | Loss: 0.783 | 915 ms/step , 6873.44 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 06:23:18 | Epoch: 0 | Step: 129600 | Dataset: 0-979264 | Loss: 0.805 | 913 ms/step , 6890.72 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 06:23:20 | Validation | Step: 129600 | Val_loss: 0.790 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:23:29 | Epoch: 0 | Step: 129610 | Dataset: 0-979584 | Loss: 0.775 | 914 ms/step , 6880.86 GFLOP/s , 15260.2 tokens/s INFO:__main__:2024-11-05 06:23:38 | Epoch: 0 | Step: 129620 | Dataset: 0-979904 | Loss: 0.821 | 916 ms/step , 6868.49 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 06:23:47 | Epoch: 0 | Step: 129630 | Dataset: 0-980224 | Loss: 0.797 | 915 ms/step , 6876.62 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 06:23:56 | Epoch: 0 | Step: 129640 | Dataset: 0-980544 | Loss: 0.771 | 914 ms/step , 6884.40 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 06:24:05 | Epoch: 0 | Step: 129650 | Dataset: 0-980864 | Loss: 0.844 | 913 ms/step , 6892.45 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 06:24:15 | Epoch: 0 | Step: 129660 | Dataset: 0-981184 | Loss: 0.668 | 914 ms/step , 6884.38 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 06:24:24 | Epoch: 0 | Step: 129670 | Dataset: 0-981504 | Loss: 0.820 | 912 ms/step , 6892.74 GFLOP/s , 
17931.8 tokens/s INFO:__main__:2024-11-05 06:24:33 | Epoch: 0 | Step: 129680 | Dataset: 0-981824 | Loss: 0.721 | 914 ms/step , 6882.99 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 06:24:42 | Epoch: 0 | Step: 129690 | Dataset: 0-982144 | Loss: 0.918 | 917 ms/step , 6860.66 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 06:24:51 | Epoch: 0 | Step: 129700 | Dataset: 0-982464 | Loss: 0.765 | 913 ms/step , 6888.61 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 06:24:53 | Validation | Step: 129700 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:25:02 | Epoch: 0 | Step: 129710 | Dataset: 0-982784 | Loss: 0.717 | 913 ms/step , 6887.23 GFLOP/s , 15264.0 tokens/s INFO:__main__:2024-11-05 06:25:11 | Epoch: 0 | Step: 129720 | Dataset: 0-983104 | Loss: 0.737 | 912 ms/step , 6895.59 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:25:20 | Epoch: 0 | Step: 129730 | Dataset: 0-983424 | Loss: 0.782 | 914 ms/step , 6880.18 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 06:25:29 | Epoch: 0 | Step: 129740 | Dataset: 0-983744 | Loss: 0.784 | 913 ms/step , 6886.89 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 06:25:38 | Epoch: 0 | Step: 129750 | Dataset: 0-984064 | Loss: 0.755 | 914 ms/step , 6884.32 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 06:25:48 | Epoch: 0 | Step: 129760 | Dataset: 0-984384 | Loss: 0.792 | 912 ms/step , 6895.19 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 06:25:57 | Epoch: 0 | Step: 129770 | Dataset: 0-984704 | Loss: 0.790 | 913 ms/step , 6885.33 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 06:26:06 | Epoch: 0 | Step: 129780 | Dataset: 0-985024 | Loss: 0.817 | 912 ms/step , 6897.06 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 06:26:15 | Epoch: 0 | Step: 129790 | Dataset: 0-985344 | Loss: 0.894 | 913 ms/step , 6886.92 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 06:26:24 | Epoch: 0 | Step: 129800 | Dataset: 0-985664 | Loss: 0.813 | 914 ms/step , 6878.64 GFLOP/s , 17927.6 
tokens/s INFO:__main__:2024-11-05 06:26:26 | Validation | Step: 129800 | Val_loss: 0.820 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:26:35 | Epoch: 0 | Step: 129810 | Dataset: 0-985984 | Loss: 0.815 | 913 ms/step , 6892.22 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-05 06:26:44 | Epoch: 0 | Step: 129820 | Dataset: 0-986304 | Loss: 0.795 | 914 ms/step , 6881.53 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 06:26:53 | Epoch: 0 | Step: 129830 | Dataset: 0-986624 | Loss: 0.686 | 913 ms/step , 6887.33 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 06:27:02 | Epoch: 0 | Step: 129840 | Dataset: 0-986944 | Loss: 0.771 | 912 ms/step , 6894.68 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 06:27:11 | Epoch: 0 | Step: 129850 | Dataset: 0-987264 | Loss: 0.790 | 914 ms/step , 6877.71 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 06:27:21 | Epoch: 0 | Step: 129860 | Dataset: 0-987584 | Loss: 0.740 | 914 ms/step , 6883.24 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 06:27:30 | Epoch: 0 | Step: 129870 | Dataset: 0-987904 | Loss: 0.756 | 913 ms/step , 6887.52 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 06:27:39 | Epoch: 0 | Step: 129880 | Dataset: 0-988224 | Loss: 0.757 | 915 ms/step , 6870.34 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 06:27:48 | Epoch: 0 | Step: 129890 | Dataset: 0-988544 | Loss: 0.684 | 915 ms/step , 6871.94 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 06:27:57 | Epoch: 0 | Step: 129900 | Dataset: 0-988864 | Loss: 0.824 | 914 ms/step , 6879.99 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 06:27:59 | Validation | Step: 129900 | Val_loss: 0.784 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:28:08 | Epoch: 0 | Step: 129910 | Dataset: 0-989184 | Loss: 0.774 | 913 ms/step , 6886.17 GFLOP/s , 15262.5 tokens/s INFO:__main__:2024-11-05 06:28:17 | Epoch: 0 | Step: 129920 | Dataset: 0-989504 | Loss: 0.749 | 913 ms/step , 6890.14 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 06:28:26 | Epoch: 
0 | Step: 129930 | Dataset: 0-989824 | Loss: 0.694 | 912 ms/step , 6894.35 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 06:28:35 | Epoch: 0 | Step: 129940 | Dataset: 0-990144 | Loss: 0.827 | 915 ms/step , 6870.14 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 06:28:44 | Epoch: 0 | Step: 129950 | Dataset: 0-990464 | Loss: 0.757 | 913 ms/step , 6891.57 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 06:28:54 | Epoch: 0 | Step: 129960 | Dataset: 0-990784 | Loss: 0.771 | 914 ms/step , 6878.60 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 06:29:03 | Epoch: 0 | Step: 129970 | Dataset: 0-991104 | Loss: 0.786 | 914 ms/step , 6880.96 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 06:29:12 | Epoch: 0 | Step: 129980 | Dataset: 0-991424 | Loss: 0.810 | 913 ms/step , 6892.14 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 06:29:21 | Epoch: 0 | Step: 129990 | Dataset: 0-991744 | Loss: 0.653 | 912 ms/step , 6899.03 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 06:29:30 | Epoch: 0 | Step: 130000 | Dataset: 0-992064 | Loss: 0.826 | 914 ms/step , 6885.01 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 06:29:32 | Validation | Step: 130000 | Val_loss: 0.798 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:29:32 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_062932_step_130000.pt` INFO:__main__:2024-11-05 06:29:42 | Epoch: 0 | Step: 130010 | Dataset: 0-992384 | Loss: 0.741 | 912 ms/step , 6896.71 GFLOP/s , 13822.5 tokens/s INFO:__main__:2024-11-05 06:29:51 | Epoch: 0 | Step: 130020 | Dataset: 0-992704 | Loss: 0.787 | 913 ms/step , 6886.43 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 06:30:00 | Epoch: 0 | Step: 130030 | Dataset: 0-993024 | Loss: 0.801 | 913 ms/step , 6885.67 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 06:30:09 | Epoch: 0 | Step: 130040 | Dataset: 0-993344 | Loss: 0.696 | 913 ms/step , 6886.63 GFLOP/s , 17903.6 tokens/s INFO:__main__:2024-11-05 06:30:19 | Epoch: 0 | Step: 
130050 | Dataset: 0-993664 | Loss: 0.898 | 913 ms/step , 6885.76 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 06:30:28 | Epoch: 0 | Step: 130060 | Dataset: 0-993984 | Loss: 0.849 | 914 ms/step , 6884.01 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 06:30:37 | Epoch: 0 | Step: 130070 | Dataset: 0-994304 | Loss: 0.734 | 912 ms/step , 6895.58 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 06:30:46 | Epoch: 0 | Step: 130080 | Dataset: 0-994624 | Loss: 0.782 | 914 ms/step , 6878.55 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 06:30:55 | Epoch: 0 | Step: 130090 | Dataset: 0-994944 | Loss: 0.946 | 914 ms/step , 6882.64 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 06:31:04 | Epoch: 0 | Step: 130100 | Dataset: 0-995264 | Loss: 0.854 | 914 ms/step , 6879.56 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 06:31:06 | Validation | Step: 130100 | Val_loss: 0.761 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:31:15 | Epoch: 0 | Step: 130110 | Dataset: 0-995584 | Loss: 0.876 | 913 ms/step , 6885.72 GFLOP/s , 15267.8 tokens/s INFO:__main__:2024-11-05 06:31:24 | Epoch: 0 | Step: 130120 | Dataset: 0-995904 | Loss: 0.782 | 912 ms/step , 6894.46 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 06:31:33 | Epoch: 0 | Step: 130130 | Dataset: 0-996224 | Loss: 0.723 | 913 ms/step , 6886.65 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 06:31:42 | Epoch: 0 | Step: 130140 | Dataset: 0-996544 | Loss: 0.682 | 913 ms/step , 6887.10 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 06:31:51 | Epoch: 0 | Step: 130150 | Dataset: 0-996864 | Loss: 0.716 | 914 ms/step , 6880.72 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 06:32:01 | Epoch: 0 | Step: 130160 | Dataset: 0-997184 | Loss: 0.647 | 914 ms/step , 6882.35 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 06:32:10 | Epoch: 0 | Step: 130170 | Dataset: 0-997504 | Loss: 0.697 | 913 ms/step , 6887.99 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 06:32:19 | Epoch: 0 | Step: 130180 | 
Dataset: 0-997824 | Loss: 0.752 | 914 ms/step , 6884.38 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 06:32:28 | Epoch: 0 | Step: 130190 | Dataset: 0-998144 | Loss: 0.755 | 915 ms/step , 6875.68 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 06:32:37 | Epoch: 0 | Step: 130200 | Dataset: 0-998464 | Loss: 0.874 | 913 ms/step , 6890.50 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 06:32:39 | Validation | Step: 130200 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:32:48 | Epoch: 0 | Step: 130210 | Dataset: 0-998784 | Loss: 0.780 | 913 ms/step , 6886.66 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 06:32:57 | Epoch: 0 | Step: 130220 | Dataset: 0-999104 | Loss: 0.742 | 913 ms/step , 6889.48 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 06:33:06 | Epoch: 0 | Step: 130230 | Dataset: 0-999424 | Loss: 0.687 | 913 ms/step , 6888.24 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 06:33:15 | Epoch: 0 | Step: 130240 | Dataset: 0-999744 | Loss: 0.778 | 913 ms/step , 6891.79 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 06:33:24 | Epoch: 0 | Step: 130250 | Dataset: 0-1000064 | Loss: 0.635 | 913 ms/step , 6892.29 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 06:33:34 | Epoch: 0 | Step: 130260 | Dataset: 0-1000384 | Loss: 0.751 | 913 ms/step , 6885.05 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 06:33:43 | Epoch: 0 | Step: 130270 | Dataset: 0-1000704 | Loss: 0.733 | 914 ms/step , 6878.64 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 06:33:52 | Epoch: 0 | Step: 130280 | Dataset: 0-1001024 | Loss: 0.771 | 913 ms/step , 6891.99 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 06:34:01 | Epoch: 0 | Step: 130290 | Dataset: 0-1001344 | Loss: 0.780 | 913 ms/step , 6892.32 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 06:34:10 | Epoch: 0 | Step: 130300 | Dataset: 0-1001664 | Loss: 0.650 | 913 ms/step , 6892.57 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 06:34:12 | Validation | Step: 130300 | 
Val_loss: 0.797 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:34:21 | Epoch: 0 | Step: 130310 | Dataset: 0-1001984 | Loss: 0.757 | 913 ms/step , 6885.86 GFLOP/s , 15276.9 tokens/s INFO:__main__:2024-11-05 06:34:30 | Epoch: 0 | Step: 130320 | Dataset: 0-1002304 | Loss: 0.756 | 913 ms/step , 6888.83 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 06:34:39 | Epoch: 0 | Step: 130330 | Dataset: 0-1002624 | Loss: 0.682 | 914 ms/step , 6884.74 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 06:34:48 | Epoch: 0 | Step: 130340 | Dataset: 0-1002944 | Loss: 0.776 | 913 ms/step , 6885.71 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 06:34:57 | Epoch: 0 | Step: 130350 | Dataset: 0-1003264 | Loss: 0.808 | 915 ms/step , 6877.00 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 06:35:07 | Epoch: 0 | Step: 130360 | Dataset: 0-1003584 | Loss: 0.756 | 913 ms/step , 6886.50 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 06:35:16 | Epoch: 0 | Step: 130370 | Dataset: 0-1003904 | Loss: 0.815 | 913 ms/step , 6885.95 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 06:35:25 | Epoch: 0 | Step: 130380 | Dataset: 0-1004224 | Loss: 0.712 | 914 ms/step , 6881.27 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 06:35:34 | Epoch: 0 | Step: 130390 | Dataset: 0-1004544 | Loss: 0.856 | 913 ms/step , 6887.65 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 06:35:43 | Epoch: 0 | Step: 130400 | Dataset: 0-1004864 | Loss: 0.842 | 913 ms/step , 6891.79 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 06:35:45 | Validation | Step: 130400 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:35:54 | Epoch: 0 | Step: 130410 | Dataset: 0-1005184 | Loss: 0.713 | 914 ms/step , 6879.97 GFLOP/s , 15270.2 tokens/s INFO:__main__:2024-11-05 06:36:03 | Epoch: 0 | Step: 130420 | Dataset: 0-1005504 | Loss: 0.863 | 915 ms/step , 6870.41 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 06:36:12 | Epoch: 0 | Step: 130430 | Dataset: 0-1005824 | Loss: 0.763 | 913 
ms/step , 6891.07 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 06:36:21 | Epoch: 0 | Step: 130440 | Dataset: 0-1006144 | Loss: 0.829 | 915 ms/step , 6877.41 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 06:36:30 | Epoch: 0 | Step: 130450 | Dataset: 0-1006464 | Loss: 0.721 | 912 ms/step , 6897.99 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 06:36:40 | Epoch: 0 | Step: 130460 | Dataset: 0-1006784 | Loss: 0.738 | 912 ms/step , 6897.93 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:36:49 | Epoch: 0 | Step: 130470 | Dataset: 0-1007104 | Loss: 0.760 | 912 ms/step , 6894.36 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 06:36:58 | Epoch: 0 | Step: 130480 | Dataset: 0-1007424 | Loss: 0.744 | 914 ms/step , 6884.26 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 06:37:07 | Epoch: 0 | Step: 130490 | Dataset: 0-1007744 | Loss: 0.710 | 912 ms/step , 6894.81 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 06:37:16 | Epoch: 0 | Step: 130500 | Dataset: 0-1008064 | Loss: 0.751 | 915 ms/step , 6876.46 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 06:37:18 | Validation | Step: 130500 | Val_loss: 0.743 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:37:27 | Epoch: 0 | Step: 130510 | Dataset: 0-1008384 | Loss: 0.641 | 912 ms/step , 6893.77 GFLOP/s , 15277.4 tokens/s INFO:__main__:2024-11-05 06:37:36 | Epoch: 0 | Step: 130520 | Dataset: 0-1008704 | Loss: 0.762 | 913 ms/step , 6885.56 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 06:37:45 | Epoch: 0 | Step: 130530 | Dataset: 0-1009024 | Loss: 0.725 | 914 ms/step , 6879.96 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 06:37:54 | Epoch: 0 | Step: 130540 | Dataset: 0-1009344 | Loss: 0.745 | 913 ms/step , 6888.72 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 06:38:03 | Epoch: 0 | Step: 130550 | Dataset: 0-1009664 | Loss: 0.682 | 913 ms/step , 6887.07 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 06:38:12 | Epoch: 0 | Step: 130560 | Dataset: 0-1009984 | Loss: 0.809 | 
912 ms/step , 6893.99 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 06:38:22 | Epoch: 0 | Step: 130570 | Dataset: 0-1010304 | Loss: 0.776 | 913 ms/step , 6886.61 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 06:38:31 | Epoch: 0 | Step: 130580 | Dataset: 0-1010624 | Loss: 0.814 | 912 ms/step , 6893.43 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 06:38:40 | Epoch: 0 | Step: 130590 | Dataset: 0-1010944 | Loss: 0.760 | 913 ms/step , 6885.44 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 06:38:49 | Epoch: 0 | Step: 130600 | Dataset: 0-1011264 | Loss: 0.681 | 913 ms/step , 6888.63 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 06:38:51 | Validation | Step: 130600 | Val_loss: 0.788 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:39:00 | Epoch: 0 | Step: 130610 | Dataset: 0-1011584 | Loss: 0.716 | 915 ms/step , 6876.78 GFLOP/s , 15280.3 tokens/s INFO:__main__:2024-11-05 06:39:09 | Epoch: 0 | Step: 130620 | Dataset: 0-1011904 | Loss: 0.763 | 913 ms/step , 6891.28 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 06:39:18 | Epoch: 0 | Step: 130630 | Dataset: 0-1012224 | Loss: 0.700 | 914 ms/step , 6883.95 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 06:39:27 | Epoch: 0 | Step: 130640 | Dataset: 0-1012544 | Loss: 0.756 | 914 ms/step , 6883.94 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 06:39:36 | Epoch: 0 | Step: 130650 | Dataset: 0-1012864 | Loss: 0.788 | 913 ms/step , 6885.57 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 06:39:45 | Epoch: 0 | Step: 130660 | Dataset: 0-1013184 | Loss: 0.800 | 913 ms/step , 6889.42 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 06:39:55 | Epoch: 0 | Step: 130670 | Dataset: 0-1013504 | Loss: 0.728 | 913 ms/step , 6891.71 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 06:40:04 | Epoch: 0 | Step: 130680 | Dataset: 0-1013824 | Loss: 0.754 | 913 ms/step , 6885.36 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 06:40:13 | Epoch: 0 | Step: 130690 | Dataset: 0-1014144 | Loss: 0.706 
| 913 ms/step , 6890.14 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 06:40:22 | Epoch: 0 | Step: 130700 | Dataset: 0-1014464 | Loss: 0.775 | 912 ms/step , 6896.95 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 06:40:24 | Validation | Step: 130700 | Val_loss: 0.809 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:40:33 | Epoch: 0 | Step: 130710 | Dataset: 0-1014784 | Loss: 0.718 | 913 ms/step , 6889.03 GFLOP/s , 15274.2 tokens/s INFO:__main__:2024-11-05 06:40:42 | Epoch: 0 | Step: 130720 | Dataset: 0-1015104 | Loss: 0.806 | 913 ms/step , 6885.37 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 06:40:51 | Epoch: 0 | Step: 130730 | Dataset: 0-1015424 | Loss: 0.794 | 912 ms/step , 6893.93 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 06:41:00 | Epoch: 0 | Step: 130740 | Dataset: 0-1015744 | Loss: 0.768 | 914 ms/step , 6878.14 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 06:41:09 | Epoch: 0 | Step: 130750 | Dataset: 0-1016064 | Loss: 0.655 | 913 ms/step , 6887.33 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 06:41:18 | Epoch: 0 | Step: 130760 | Dataset: 0-1016384 | Loss: 0.693 | 913 ms/step , 6889.22 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 06:41:28 | Epoch: 0 | Step: 130770 | Dataset: 0-1016704 | Loss: 0.700 | 912 ms/step , 6893.60 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 06:41:37 | Epoch: 0 | Step: 130780 | Dataset: 0-1017024 | Loss: 0.713 | 912 ms/step , 6893.27 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 06:41:46 | Epoch: 0 | Step: 130790 | Dataset: 0-1017344 | Loss: 0.764 | 913 ms/step , 6888.26 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 06:41:55 | Epoch: 0 | Step: 130800 | Dataset: 0-1017664 | Loss: 0.781 | 912 ms/step , 6892.83 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 06:41:57 | Validation | Step: 130800 | Val_loss: 0.800 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:42:06 | Epoch: 0 | Step: 130810 | Dataset: 0-1017984 | Loss: 0.741 | 912 ms/step , 6893.57 GFLOP/s , 
15273.3 tokens/s INFO:__main__:2024-11-05 06:42:15 | Epoch: 0 | Step: 130820 | Dataset: 0-1018304 | Loss: 0.736 | 914 ms/step , 6880.01 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 06:42:24 | Epoch: 0 | Step: 130830 | Dataset: 0-1018624 | Loss: 0.651 | 913 ms/step , 6891.46 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 06:42:33 | Epoch: 0 | Step: 130840 | Dataset: 0-1018944 | Loss: 0.751 | 913 ms/step , 6891.31 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 06:42:42 | Epoch: 0 | Step: 130850 | Dataset: 0-1019264 | Loss: 0.832 | 914 ms/step , 6884.25 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 06:42:51 | Epoch: 0 | Step: 130860 | Dataset: 0-1019584 | Loss: 0.831 | 917 ms/step , 6860.30 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 06:43:00 | Epoch: 0 | Step: 130870 | Dataset: 0-1019904 | Loss: 0.529 | 913 ms/step , 6888.88 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 06:43:10 | Epoch: 0 | Step: 130880 | Dataset: 0-1020224 | Loss: 0.764 | 914 ms/step , 6884.51 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 06:43:19 | Epoch: 0 | Step: 130890 | Dataset: 0-1020544 | Loss: 0.857 | 913 ms/step , 6890.95 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 06:43:28 | Epoch: 0 | Step: 130900 | Dataset: 0-1020864 | Loss: 0.717 | 912 ms/step , 6897.75 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 06:43:29 | Validation | Step: 130900 | Val_loss: 0.710 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:43:39 | Epoch: 0 | Step: 130910 | Dataset: 0-1021184 | Loss: 0.711 | 913 ms/step , 6886.77 GFLOP/s , 15282.8 tokens/s INFO:__main__:2024-11-05 06:43:48 | Epoch: 0 | Step: 130920 | Dataset: 0-1021504 | Loss: 0.879 | 913 ms/step , 6887.12 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 06:43:57 | Epoch: 0 | Step: 130930 | Dataset: 0-1021824 | Loss: 0.817 | 913 ms/step , 6886.96 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 06:44:06 | Epoch: 0 | Step: 130940 | Dataset: 0-1022144 | Loss: 0.782 | 913 ms/step , 6887.59 GFLOP/s 
, 17939.9 tokens/s INFO:__main__:2024-11-05 06:44:15 | Epoch: 0 | Step: 130950 | Dataset: 0-1022464 | Loss: 0.782 | 913 ms/step , 6888.61 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 06:44:24 | Epoch: 0 | Step: 130960 | Dataset: 0-1022784 | Loss: 0.909 | 914 ms/step , 6884.68 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 06:44:33 | Epoch: 0 | Step: 130970 | Dataset: 0-1023104 | Loss: 0.826 | 912 ms/step , 6897.50 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 06:44:43 | Epoch: 0 | Step: 130980 | Dataset: 0-1023424 | Loss: 0.768 | 914 ms/step , 6882.35 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 06:44:52 | Epoch: 0 | Step: 130990 | Dataset: 0-1023744 | Loss: 0.776 | 912 ms/step , 6894.01 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 06:45:01 | Epoch: 0 | Step: 131000 | Dataset: 0-1024064 | Loss: 0.773 | 912 ms/step , 6897.17 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 06:45:02 | Validation | Step: 131000 | Val_loss: 0.664 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:45:02 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_064502_step_131000.pt` INFO:__main__:2024-11-05 06:45:13 | Epoch: 0 | Step: 131010 | Dataset: 0-1024384 | Loss: 0.698 | 914 ms/step , 6880.55 GFLOP/s , 13803.4 tokens/s INFO:__main__:2024-11-05 06:45:22 | Epoch: 0 | Step: 131020 | Dataset: 0-1024704 | Loss: 0.816 | 913 ms/step , 6887.18 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:45:31 | Epoch: 0 | Step: 131030 | Dataset: 0-1025024 | Loss: 0.777 | 913 ms/step , 6886.23 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 06:45:40 | Epoch: 0 | Step: 131040 | Dataset: 0-1025344 | Loss: 0.650 | 914 ms/step , 6885.00 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 06:45:49 | Epoch: 0 | Step: 131050 | Dataset: 0-1025664 | Loss: 0.840 | 913 ms/step , 6886.58 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 06:45:58 | Epoch: 0 | Step: 131060 | Dataset: 0-1025984 | Loss: 0.776 | 914 ms/step , 6884.82 GFLOP/s 
, 17922.2 tokens/s INFO:__main__:2024-11-05 06:46:08 | Epoch: 0 | Step: 131070 | Dataset: 0-1026304 | Loss: 0.778 | 915 ms/step , 6874.13 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 06:46:17 | Epoch: 0 | Step: 131080 | Dataset: 0-1026624 | Loss: 0.606 | 913 ms/step , 6890.20 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 06:46:26 | Epoch: 0 | Step: 131090 | Dataset: 0-1026944 | Loss: 0.902 | 914 ms/step , 6881.67 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 06:46:35 | Epoch: 0 | Step: 131100 | Dataset: 0-1027264 | Loss: 0.847 | 913 ms/step , 6886.70 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 06:46:37 | Validation | Step: 131100 | Val_loss: 0.653 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:46:46 | Epoch: 0 | Step: 131110 | Dataset: 0-1027584 | Loss: 0.810 | 913 ms/step , 6892.32 GFLOP/s , 15279.7 tokens/s INFO:__main__:2024-11-05 06:46:55 | Epoch: 0 | Step: 131120 | Dataset: 0-1027904 | Loss: 0.847 | 913 ms/step , 6889.35 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 06:47:04 | Epoch: 0 | Step: 131130 | Dataset: 0-1028224 | Loss: 0.875 | 913 ms/step , 6890.55 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 06:47:13 | Epoch: 0 | Step: 131140 | Dataset: 0-1028544 | Loss: 0.733 | 913 ms/step , 6888.35 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 06:47:22 | Epoch: 0 | Step: 131150 | Dataset: 0-1028864 | Loss: 0.777 | 914 ms/step , 6882.68 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 06:47:31 | Epoch: 0 | Step: 131160 | Dataset: 0-1029184 | Loss: 0.715 | 913 ms/step , 6889.68 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 06:47:40 | Epoch: 0 | Step: 131170 | Dataset: 0-1029504 | Loss: 0.799 | 915 ms/step , 6876.22 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 06:47:50 | Epoch: 0 | Step: 131180 | Dataset: 0-1029824 | Loss: 0.551 | 914 ms/step , 6884.61 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 06:47:59 | Epoch: 0 | Step: 131190 | Dataset: 0-1030144 | Loss: 0.706 | 913 ms/step , 6887.39 
GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 06:48:08 | Epoch: 0 | Step: 131200 | Dataset: 0-1030464 | Loss: 0.826 | 914 ms/step , 6883.29 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 06:48:09 | Validation | Step: 131200 | Val_loss: 0.678 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:48:19 | Epoch: 0 | Step: 131210 | Dataset: 0-1030784 | Loss: 0.809 | 913 ms/step , 6885.24 GFLOP/s , 15287.4 tokens/s INFO:__main__:2024-11-05 06:48:28 | Epoch: 0 | Step: 131220 | Dataset: 0-1031104 | Loss: 0.837 | 912 ms/step , 6897.85 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 06:48:37 | Epoch: 0 | Step: 131230 | Dataset: 0-1031424 | Loss: 0.671 | 913 ms/step , 6889.45 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 06:48:46 | Epoch: 0 | Step: 131240 | Dataset: 0-1031744 | Loss: 0.800 | 912 ms/step , 6894.66 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 06:48:55 | Epoch: 0 | Step: 131250 | Dataset: 0-1032064 | Loss: 0.763 | 913 ms/step , 6886.87 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 06:49:04 | Epoch: 0 | Step: 131260 | Dataset: 0-1032384 | Loss: 0.568 | 913 ms/step , 6890.58 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 06:49:13 | Epoch: 0 | Step: 131270 | Dataset: 0-1032704 | Loss: 0.906 | 913 ms/step , 6889.60 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 06:49:23 | Epoch: 0 | Step: 131280 | Dataset: 0-1033024 | Loss: 0.801 | 914 ms/step , 6881.21 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 06:49:32 | Epoch: 0 | Step: 131290 | Dataset: 0-1033344 | Loss: 0.758 | 913 ms/step , 6892.30 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 06:49:41 | Epoch: 0 | Step: 131300 | Dataset: 0-1033664 | Loss: 0.664 | 912 ms/step , 6895.52 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 06:49:42 | Validation | Step: 131300 | Val_loss: 0.690 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:49:52 | Epoch: 0 | Step: 131310 | Dataset: 0-1033984 | Loss: 0.844 | 914 ms/step , 6878.72 GFLOP/s , 15272.0 tokens/s 
INFO:__main__:2024-11-05 06:50:01 | Epoch: 0 | Step: 131320 | Dataset: 0-1034304 | Loss: 0.822 | 913 ms/step , 6890.60 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 06:50:10 | Epoch: 0 | Step: 131330 | Dataset: 0-1034624 | Loss: 0.820 | 914 ms/step , 6884.76 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 06:50:19 | Epoch: 0 | Step: 131340 | Dataset: 0-1034944 | Loss: 0.920 | 914 ms/step , 6884.46 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 06:50:28 | Epoch: 0 | Step: 131350 | Dataset: 0-1035264 | Loss: 0.844 | 912 ms/step , 6894.08 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 06:50:37 | Epoch: 0 | Step: 131360 | Dataset: 0-1035584 | Loss: 0.880 | 914 ms/step , 6880.92 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 06:50:46 | Epoch: 0 | Step: 131370 | Dataset: 0-1035904 | Loss: 0.817 | 914 ms/step , 6883.92 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 06:50:55 | Epoch: 0 | Step: 131380 | Dataset: 0-1036224 | Loss: 0.777 | 914 ms/step , 6882.97 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 06:51:05 | Epoch: 0 | Step: 131390 | Dataset: 0-1036544 | Loss: 0.786 | 913 ms/step , 6890.41 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 06:51:14 | Epoch: 0 | Step: 131400 | Dataset: 0-1036864 | Loss: 0.740 | 913 ms/step , 6888.46 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 06:51:15 | Validation | Step: 131400 | Val_loss: 0.678 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:51:24 | Epoch: 0 | Step: 131410 | Dataset: 0-1037184 | Loss: 0.728 | 913 ms/step , 6891.04 GFLOP/s , 15278.7 tokens/s INFO:__main__:2024-11-05 06:51:34 | Epoch: 0 | Step: 131420 | Dataset: 0-1037504 | Loss: 0.789 | 913 ms/step , 6889.89 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 06:51:43 | Epoch: 0 | Step: 131430 | Dataset: 0-1037824 | Loss: 0.893 | 914 ms/step , 6880.29 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 06:51:52 | Epoch: 0 | Step: 131440 | Dataset: 0-1038144 | Loss: 0.885 | 914 ms/step , 6881.70 GFLOP/s , 17946.0 
tokens/s INFO:__main__:2024-11-05 06:52:01 | Epoch: 0 | Step: 131450 | Dataset: 0-1038464 | Loss: 0.767 | 913 ms/step , 6889.43 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 06:52:10 | Epoch: 0 | Step: 131460 | Dataset: 0-1038784 | Loss: 0.750 | 913 ms/step , 6889.66 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 06:52:19 | Epoch: 0 | Step: 131470 | Dataset: 0-1039104 | Loss: 0.746 | 913 ms/step , 6889.18 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 06:52:28 | Epoch: 0 | Step: 131480 | Dataset: 0-1039424 | Loss: 0.712 | 912 ms/step , 6893.26 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 06:52:38 | Epoch: 0 | Step: 131490 | Dataset: 0-1039744 | Loss: 0.783 | 913 ms/step , 6887.30 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 06:52:47 | Epoch: 0 | Step: 131500 | Dataset: 0-1040064 | Loss: 0.868 | 913 ms/step , 6886.27 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 06:52:48 | Validation | Step: 131500 | Val_loss: 0.681 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:52:57 | Epoch: 0 | Step: 131510 | Dataset: 0-1040384 | Loss: 0.821 | 920 ms/step , 6839.24 GFLOP/s , 15240.8 tokens/s INFO:__main__:2024-11-05 06:53:07 | Epoch: 0 | Step: 131520 | Dataset: 0-1040704 | Loss: 0.921 | 922 ms/step , 6824.13 GFLOP/s , 17793.2 tokens/s INFO:__main__:2024-11-05 06:53:16 | Epoch: 0 | Step: 131530 | Dataset: 0-1041024 | Loss: 0.723 | 920 ms/step , 6834.51 GFLOP/s , 17804.3 tokens/s INFO:__main__:2024-11-05 06:53:25 | Epoch: 0 | Step: 131540 | Dataset: 0-1041344 | Loss: 0.720 | 913 ms/step , 6889.67 GFLOP/s , 17902.6 tokens/s INFO:__main__:2024-11-05 06:53:34 | Epoch: 0 | Step: 131550 | Dataset: 0-1041664 | Loss: 0.777 | 913 ms/step , 6888.11 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 06:53:43 | Epoch: 0 | Step: 131560 | Dataset: 0-1041984 | Loss: 0.798 | 912 ms/step , 6893.26 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 06:53:52 | Epoch: 0 | Step: 131570 | Dataset: 0-1042304 | Loss: 0.798 | 914 ms/step , 6882.66 GFLOP/s , 
17925.0 tokens/s INFO:__main__:2024-11-05 06:54:02 | Epoch: 0 | Step: 131580 | Dataset: 0-1042624 | Loss: 0.771 | 913 ms/step , 6885.86 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 06:54:11 | Epoch: 0 | Step: 131590 | Dataset: 0-1042944 | Loss: 0.679 | 912 ms/step , 6895.95 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 06:54:20 | Epoch: 0 | Step: 131600 | Dataset: 0-1043264 | Loss: 0.787 | 913 ms/step , 6889.23 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 06:54:21 | Validation | Step: 131600 | Val_loss: 0.666 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:54:31 | Epoch: 0 | Step: 131610 | Dataset: 0-1043584 | Loss: 0.792 | 913 ms/step , 6885.95 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-05 06:54:40 | Epoch: 0 | Step: 131620 | Dataset: 0-1043904 | Loss: 0.764 | 913 ms/step , 6892.06 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 06:54:49 | Epoch: 0 | Step: 131630 | Dataset: 0-1044224 | Loss: 0.816 | 914 ms/step , 6883.48 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 06:54:58 | Epoch: 0 | Step: 131640 | Dataset: 0-1044544 | Loss: 0.870 | 913 ms/step , 6892.27 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 06:55:07 | Epoch: 0 | Step: 131650 | Dataset: 0-1044864 | Loss: 0.682 | 913 ms/step , 6891.79 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 06:55:16 | Epoch: 0 | Step: 131660 | Dataset: 0-1045184 | Loss: 0.804 | 914 ms/step , 6882.67 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 06:55:25 | Epoch: 0 | Step: 131670 | Dataset: 0-1045504 | Loss: 0.854 | 913 ms/step , 6885.98 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 06:55:35 | Epoch: 0 | Step: 131680 | Dataset: 0-1045824 | Loss: 0.762 | 914 ms/step , 6879.65 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 06:55:44 | Epoch: 0 | Step: 131690 | Dataset: 0-1046144 | Loss: 0.742 | 913 ms/step , 6886.79 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 06:55:53 | Epoch: 0 | Step: 131700 | Dataset: 0-1046464 | Loss: 0.650 | 926 ms/step , 6793.99 GFLOP/s 
, 17869.0 tokens/s INFO:__main__:2024-11-05 06:55:54 | Validation | Step: 131700 | Val_loss: 0.681 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:56:04 | Epoch: 0 | Step: 131710 | Dataset: 0-1046784 | Loss: 0.714 | 912 ms/step , 6893.09 GFLOP/s , 15209.5 tokens/s INFO:__main__:2024-11-05 06:56:13 | Epoch: 0 | Step: 131720 | Dataset: 0-1047104 | Loss: 0.757 | 912 ms/step , 6899.22 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 06:56:22 | Epoch: 0 | Step: 131730 | Dataset: 0-1047424 | Loss: 0.786 | 913 ms/step , 6890.61 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 06:56:31 | Epoch: 0 | Step: 131740 | Dataset: 0-1047744 | Loss: 0.580 | 912 ms/step , 6892.86 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 06:56:40 | Epoch: 0 | Step: 131750 | Dataset: 0-1048064 | Loss: 0.676 | 913 ms/step , 6892.57 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 06:56:49 | Epoch: 0 | Step: 131760 | Dataset: 0-1048384 | Loss: 0.855 | 916 ms/step , 6869.50 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 06:56:58 | Epoch: 0 | Step: 131770 | Dataset: 0-1048704 | Loss: 0.699 | 912 ms/step , 6897.13 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 06:57:08 | Epoch: 0 | Step: 131780 | Dataset: 0-1049024 | Loss: 0.834 | 913 ms/step , 6887.33 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 06:57:17 | Epoch: 0 | Step: 131790 | Dataset: 0-1049344 | Loss: 0.774 | 913 ms/step , 6886.49 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 06:57:26 | Epoch: 0 | Step: 131800 | Dataset: 0-1049664 | Loss: 0.567 | 913 ms/step , 6892.24 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 06:57:27 | Validation | Step: 131800 | Val_loss: 0.670 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:57:37 | Epoch: 0 | Step: 131810 | Dataset: 0-1049984 | Loss: 0.613 | 912 ms/step , 6894.41 GFLOP/s , 15279.2 tokens/s INFO:__main__:2024-11-05 06:57:46 | Epoch: 0 | Step: 131820 | Dataset: 0-1050304 | Loss: 0.864 | 916 ms/step , 6866.07 GFLOP/s , 17919.9 tokens/s 
INFO:__main__:2024-11-05 06:57:55 | Epoch: 0 | Step: 131830 | Dataset: 0-1050624 | Loss: 0.727 | 913 ms/step , 6885.78 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 06:58:04 | Epoch: 0 | Step: 131840 | Dataset: 0-1050944 | Loss: 0.775 | 914 ms/step , 6883.87 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 06:58:13 | Epoch: 0 | Step: 131850 | Dataset: 0-1051264 | Loss: 0.784 | 914 ms/step , 6882.91 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 06:58:22 | Epoch: 0 | Step: 131860 | Dataset: 0-1051584 | Loss: 0.729 | 913 ms/step , 6887.11 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 06:58:31 | Epoch: 0 | Step: 131870 | Dataset: 0-1051904 | Loss: 0.890 | 915 ms/step , 6872.90 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 06:58:41 | Epoch: 0 | Step: 131880 | Dataset: 0-1052224 | Loss: 0.762 | 914 ms/step , 6883.62 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 06:58:50 | Epoch: 0 | Step: 131890 | Dataset: 0-1052544 | Loss: 0.719 | 913 ms/step , 6886.42 GFLOP/s , 17894.3 tokens/s INFO:__main__:2024-11-05 06:58:59 | Epoch: 0 | Step: 131900 | Dataset: 0-1052864 | Loss: 0.729 | 914 ms/step , 6880.49 GFLOP/s , 17895.5 tokens/s INFO:__main__:2024-11-05 06:59:00 | Validation | Step: 131900 | Val_loss: 0.670 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 06:59:10 | Epoch: 0 | Step: 131910 | Dataset: 0-1053184 | Loss: 0.750 | 913 ms/step , 6888.60 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-05 06:59:19 | Epoch: 0 | Step: 131920 | Dataset: 0-1053504 | Loss: 0.807 | 913 ms/step , 6885.70 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 06:59:28 | Epoch: 0 | Step: 131930 | Dataset: 0-1053824 | Loss: 0.747 | 913 ms/step , 6885.85 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 06:59:37 | Epoch: 0 | Step: 131940 | Dataset: 0-1054144 | Loss: 0.576 | 912 ms/step , 6894.36 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 06:59:46 | Epoch: 0 | Step: 131950 | Dataset: 0-1054464 | Loss: 0.728 | 914 ms/step , 6883.74 GFLOP/s , 17932.9 
tokens/s INFO:__main__:2024-11-05 06:59:55 | Epoch: 0 | Step: 131960 | Dataset: 0-1054784 | Loss: 0.735 | 913 ms/step , 6887.59 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 07:00:04 | Epoch: 0 | Step: 131970 | Dataset: 0-1055104 | Loss: 0.755 | 913 ms/step , 6891.65 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 07:00:14 | Epoch: 0 | Step: 131980 | Dataset: 0-1055424 | Loss: 0.848 | 913 ms/step , 6891.40 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 07:00:23 | Epoch: 0 | Step: 131990 | Dataset: 0-1055744 | Loss: 0.728 | 913 ms/step , 6892.17 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 07:00:32 | Epoch: 0 | Step: 132000 | Dataset: 0-1056064 | Loss: 0.859 | 914 ms/step , 6881.14 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 07:00:33 | Validation | Step: 132000 | Val_loss: 0.671 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:00:33 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_070033_step_132000.pt` INFO:__main__:2024-11-05 07:00:44 | Epoch: 0 | Step: 132010 | Dataset: 0-1056384 | Loss: 0.706 | 914 ms/step , 6883.50 GFLOP/s , 13780.2 tokens/s INFO:__main__:2024-11-05 07:00:53 | Epoch: 0 | Step: 132020 | Dataset: 0-1056704 | Loss: 1.002 | 913 ms/step , 6887.22 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 07:01:02 | Epoch: 0 | Step: 132030 | Dataset: 0-1057024 | Loss: 0.791 | 913 ms/step , 6891.90 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 07:01:11 | Epoch: 0 | Step: 132040 | Dataset: 0-1057344 | Loss: 0.827 | 914 ms/step , 6882.61 GFLOP/s , 17873.0 tokens/s INFO:__main__:2024-11-05 07:01:20 | Epoch: 0 | Step: 132050 | Dataset: 0-1057664 | Loss: 0.808 | 913 ms/step , 6889.35 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 07:01:29 | Epoch: 0 | Step: 132060 | Dataset: 0-1057984 | Loss: 0.715 | 912 ms/step , 6893.88 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 07:01:39 | Epoch: 0 | Step: 132070 | Dataset: 0-1058304 | Loss: 0.817 | 914 ms/step , 6879.90 GFLOP/s , 17936.7 
tokens/s INFO:__main__:2024-11-05 07:01:48 | Epoch: 0 | Step: 132080 | Dataset: 0-1058624 | Loss: 0.862 | 912 ms/step , 6896.64 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 07:01:57 | Epoch: 0 | Step: 132090 | Dataset: 0-1058944 | Loss: 0.708 | 913 ms/step , 6887.02 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 07:02:06 | Epoch: 0 | Step: 132100 | Dataset: 0-1059264 | Loss: 0.814 | 914 ms/step , 6884.82 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 07:02:07 | Validation | Step: 132100 | Val_loss: 0.643 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:02:17 | Epoch: 0 | Step: 132110 | Dataset: 0-1059584 | Loss: 0.743 | 913 ms/step , 6892.09 GFLOP/s , 15282.4 tokens/s INFO:__main__:2024-11-05 07:02:26 | Epoch: 0 | Step: 132120 | Dataset: 0-1059904 | Loss: 0.817 | 913 ms/step , 6889.67 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 07:02:35 | Epoch: 0 | Step: 132130 | Dataset: 0-1060224 | Loss: 0.724 | 912 ms/step , 6896.88 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 07:02:44 | Epoch: 0 | Step: 132140 | Dataset: 0-1060544 | Loss: 0.844 | 912 ms/step , 6894.02 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 07:02:53 | Epoch: 0 | Step: 132150 | Dataset: 0-1060864 | Loss: 0.778 | 912 ms/step , 6895.96 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 07:03:02 | Epoch: 0 | Step: 132160 | Dataset: 0-1061184 | Loss: 0.745 | 914 ms/step , 6883.84 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 07:03:11 | Epoch: 0 | Step: 132170 | Dataset: 0-1061504 | Loss: 0.777 | 913 ms/step , 6888.20 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 07:03:21 | Epoch: 0 | Step: 132180 | Dataset: 0-1061824 | Loss: 0.927 | 913 ms/step , 6888.35 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 07:03:30 | Epoch: 0 | Step: 132190 | Dataset: 0-1062144 | Loss: 0.870 | 912 ms/step , 6896.40 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 07:03:39 | Epoch: 0 | Step: 132200 | Dataset: 0-1062464 | Loss: 0.901 | 914 ms/step , 6882.70 GFLOP/s , 
17934.6 tokens/s INFO:__main__:2024-11-05 07:03:40 | Validation | Step: 132200 | Val_loss: 0.702 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:03:50 | Epoch: 0 | Step: 132210 | Dataset: 0-1062784 | Loss: 0.761 | 914 ms/step , 6881.32 GFLOP/s , 15276.9 tokens/s INFO:__main__:2024-11-05 07:03:59 | Epoch: 0 | Step: 132220 | Dataset: 0-1063104 | Loss: 0.640 | 913 ms/step , 6890.51 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 07:04:08 | Epoch: 0 | Step: 132230 | Dataset: 0-1063424 | Loss: 0.846 | 913 ms/step , 6891.88 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 07:04:17 | Epoch: 0 | Step: 132240 | Dataset: 0-1063744 | Loss: 0.890 | 914 ms/step , 6882.84 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 07:04:26 | Epoch: 0 | Step: 132250 | Dataset: 0-1064064 | Loss: 0.717 | 912 ms/step , 6892.94 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 07:04:35 | Epoch: 0 | Step: 132260 | Dataset: 0-1064384 | Loss: 0.559 | 913 ms/step , 6892.47 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 07:04:44 | Epoch: 0 | Step: 132270 | Dataset: 0-1064704 | Loss: 0.798 | 914 ms/step , 6884.22 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 07:04:53 | Epoch: 0 | Step: 132280 | Dataset: 0-1065024 | Loss: 0.724 | 912 ms/step , 6896.37 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 07:05:03 | Epoch: 0 | Step: 132290 | Dataset: 0-1065344 | Loss: 0.670 | 913 ms/step , 6886.56 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 07:05:12 | Epoch: 0 | Step: 132300 | Dataset: 0-1065664 | Loss: 0.783 | 914 ms/step , 6879.76 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 07:05:13 | Validation | Step: 132300 | Val_loss: 0.702 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:05:22 | Epoch: 0 | Step: 132310 | Dataset: 0-1065984 | Loss: 0.799 | 913 ms/step , 6889.21 GFLOP/s , 15281.2 tokens/s INFO:__main__:2024-11-05 07:05:32 | Epoch: 0 | Step: 132320 | Dataset: 0-1066304 | Loss: 0.896 | 913 ms/step , 6887.80 GFLOP/s , 17930.5 tokens/s 
INFO:__main__:2024-11-05 07:05:41 | Epoch: 0 | Step: 132330 | Dataset: 0-1066624 | Loss: 0.861 | 913 ms/step , 6886.31 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 07:05:50 | Epoch: 0 | Step: 132340 | Dataset: 0-1066944 | Loss: 0.953 | 913 ms/step , 6885.76 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 07:05:59 | Epoch: 0 | Step: 132350 | Dataset: 0-1067264 | Loss: 0.814 | 913 ms/step , 6885.25 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 07:06:08 | Epoch: 0 | Step: 132360 | Dataset: 0-1067584 | Loss: 0.845 | 913 ms/step , 6885.33 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 07:06:17 | Epoch: 0 | Step: 132370 | Dataset: 0-1067904 | Loss: 0.901 | 916 ms/step , 6864.84 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 07:06:26 | Epoch: 0 | Step: 132380 | Dataset: 0-1068224 | Loss: 0.842 | 913 ms/step , 6886.70 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 07:06:36 | Epoch: 0 | Step: 132390 | Dataset: 0-1068544 | Loss: 0.704 | 914 ms/step , 6883.02 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 07:06:45 | Epoch: 0 | Step: 132400 | Dataset: 0-1068864 | Loss: 0.741 | 914 ms/step , 6878.90 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 07:06:46 | Validation | Step: 132400 | Val_loss: 0.677 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:06:55 | Epoch: 0 | Step: 132410 | Dataset: 0-1069184 | Loss: 0.848 | 913 ms/step , 6885.92 GFLOP/s , 15297.7 tokens/s INFO:__main__:2024-11-05 07:07:05 | Epoch: 0 | Step: 132420 | Dataset: 0-1069504 | Loss: 0.795 | 915 ms/step , 6873.63 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 07:07:14 | Epoch: 0 | Step: 132430 | Dataset: 0-1069824 | Loss: 0.820 | 913 ms/step , 6885.93 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 07:07:23 | Epoch: 0 | Step: 132440 | Dataset: 0-1070144 | Loss: 0.816 | 913 ms/step , 6891.34 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 07:07:32 | Epoch: 0 | Step: 132450 | Dataset: 0-1070464 | Loss: 0.812 | 914 ms/step , 6882.16 GFLOP/s , 17925.7 
tokens/s INFO:__main__:2024-11-05 07:07:41 | Epoch: 0 | Step: 132460 | Dataset: 0-1070784 | Loss: 0.843 | 914 ms/step , 6884.69 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 07:07:50 | Epoch: 0 | Step: 132470 | Dataset: 0-1071104 | Loss: 0.854 | 915 ms/step , 6877.38 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 07:07:59 | Epoch: 0 | Step: 132480 | Dataset: 0-1071424 | Loss: 0.914 | 915 ms/step , 6870.61 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 07:08:09 | Epoch: 0 | Step: 132490 | Dataset: 0-1071744 | Loss: 0.774 | 915 ms/step , 6873.53 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 07:08:18 | Epoch: 0 | Step: 132500 | Dataset: 0-1072064 | Loss: 0.757 | 915 ms/step , 6876.93 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 07:08:19 | Validation | Step: 132500 | Val_loss: 0.702 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:08:28 | Epoch: 0 | Step: 132510 | Dataset: 0-1072384 | Loss: 0.815 | 913 ms/step , 6887.82 GFLOP/s , 15269.2 tokens/s INFO:__main__:2024-11-05 07:08:38 | Epoch: 0 | Step: 132520 | Dataset: 0-1072704 | Loss: 0.780 | 913 ms/step , 6890.60 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 07:08:47 | Epoch: 0 | Step: 132530 | Dataset: 0-1073024 | Loss: 0.788 | 914 ms/step , 6883.98 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 07:08:56 | Epoch: 0 | Step: 132540 | Dataset: 0-1073344 | Loss: 0.826 | 914 ms/step , 6882.78 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 07:09:05 | Epoch: 0 | Step: 132550 | Dataset: 0-1073664 | Loss: 0.854 | 913 ms/step , 6886.18 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 07:09:14 | Epoch: 0 | Step: 132560 | Dataset: 0-1073984 | Loss: 0.745 | 914 ms/step , 6881.75 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 07:09:23 | Epoch: 0 | Step: 132570 | Dataset: 0-1074304 | Loss: 0.793 | 913 ms/step , 6889.02 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 07:09:32 | Epoch: 0 | Step: 132580 | Dataset: 0-1074624 | Loss: 0.786 | 913 ms/step , 6886.07 GFLOP/s , 
17923.6 tokens/s INFO:__main__:2024-11-05 07:09:42 | Epoch: 0 | Step: 132590 | Dataset: 0-1074944 | Loss: 0.829 | 913 ms/step , 6888.52 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 07:09:51 | Epoch: 0 | Step: 132600 | Dataset: 0-1075264 | Loss: 0.821 | 915 ms/step , 6875.91 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 07:09:52 | Validation | Step: 132600 | Val_loss: 0.674 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:10:01 | Epoch: 0 | Step: 132610 | Dataset: 0-1075584 | Loss: 0.753 | 914 ms/step , 6881.87 GFLOP/s , 15271.4 tokens/s INFO:__main__:2024-11-05 07:10:11 | Epoch: 0 | Step: 132620 | Dataset: 0-1075904 | Loss: 0.857 | 914 ms/step , 6880.32 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 07:10:20 | Epoch: 0 | Step: 132630 | Dataset: 0-1076224 | Loss: 0.798 | 915 ms/step , 6874.19 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 07:10:29 | Epoch: 0 | Step: 132640 | Dataset: 0-1076544 | Loss: 0.859 | 914 ms/step , 6879.29 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 07:10:38 | Epoch: 0 | Step: 132650 | Dataset: 0-1076864 | Loss: 0.903 | 915 ms/step , 6875.12 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 07:10:47 | Epoch: 0 | Step: 132660 | Dataset: 0-1077184 | Loss: 0.800 | 913 ms/step , 6888.18 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 07:10:56 | Epoch: 0 | Step: 132670 | Dataset: 0-1077504 | Loss: 0.772 | 914 ms/step , 6882.04 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 07:11:05 | Epoch: 0 | Step: 132680 | Dataset: 0-1077824 | Loss: 0.845 | 915 ms/step , 6873.75 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 07:11:15 | Epoch: 0 | Step: 132690 | Dataset: 0-1078144 | Loss: 0.822 | 914 ms/step , 6879.12 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 07:11:24 | Epoch: 0 | Step: 132700 | Dataset: 0-1078464 | Loss: 0.780 | 915 ms/step , 6877.10 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 07:11:25 | Validation | Step: 132700 | Val_loss: 0.693 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 07:11:34 | Epoch: 0 | Step: 132710 | Dataset: 0-1078784 | Loss: 0.738 | 913 ms/step , 6890.62 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 07:11:44 | Epoch: 0 | Step: 132720 | Dataset: 0-1079104 | Loss: 0.831 | 914 ms/step , 6879.58 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 07:11:53 | Epoch: 0 | Step: 132730 | Dataset: 0-1079424 | Loss: 0.847 | 914 ms/step , 6884.70 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 07:12:02 | Epoch: 0 | Step: 132740 | Dataset: 0-1079744 | Loss: 0.804 | 914 ms/step , 6883.87 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 07:12:11 | Epoch: 0 | Step: 132750 | Dataset: 0-1080064 | Loss: 0.841 | 914 ms/step , 6878.77 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 07:12:20 | Epoch: 0 | Step: 132760 | Dataset: 0-1080384 | Loss: 0.884 | 913 ms/step , 6889.53 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 07:12:29 | Epoch: 0 | Step: 132770 | Dataset: 0-1080704 | Loss: 0.739 | 913 ms/step , 6886.38 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 07:12:38 | Epoch: 0 | Step: 132780 | Dataset: 0-1081024 | Loss: 0.781 | 914 ms/step , 6882.41 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 07:12:48 | Epoch: 0 | Step: 132790 | Dataset: 0-1081344 | Loss: 0.753 | 914 ms/step , 6881.37 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 07:12:57 | Epoch: 0 | Step: 132800 | Dataset: 0-1081664 | Loss: 0.756 | 914 ms/step , 6881.04 GFLOP/s , 17907.7 tokens/s INFO:__main__:2024-11-05 07:12:58 | Validation | Step: 132800 | Val_loss: 0.671 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:13:07 | Epoch: 0 | Step: 132810 | Dataset: 0-1081984 | Loss: 0.799 | 913 ms/step , 6888.13 GFLOP/s , 15271.1 tokens/s INFO:__main__:2024-11-05 07:13:17 | Epoch: 0 | Step: 132820 | Dataset: 0-1082304 | Loss: 0.850 | 914 ms/step , 6885.02 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 07:13:26 | Epoch: 0 | Step: 132830 | Dataset: 0-1082624 | Loss: 0.866 | 914 ms/step , 6878.73 GFLOP/s , 17918.0 
tokens/s INFO:__main__:2024-11-05 07:13:35 | Epoch: 0 | Step: 132840 | Dataset: 0-1082944 | Loss: 0.809 | 915 ms/step , 6873.92 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 07:13:44 | Epoch: 0 | Step: 132850 | Dataset: 0-1083264 | Loss: 0.776 | 913 ms/step , 6886.30 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 07:13:53 | Epoch: 0 | Step: 132860 | Dataset: 0-1083584 | Loss: 0.774 | 925 ms/step , 6802.76 GFLOP/s , 17844.8 tokens/s INFO:__main__:2024-11-05 07:14:02 | Epoch: 0 | Step: 132870 | Dataset: 0-1083904 | Loss: 0.804 | 913 ms/step , 6885.91 GFLOP/s , 17823.9 tokens/s INFO:__main__:2024-11-05 07:14:12 | Epoch: 0 | Step: 132880 | Dataset: 0-1084224 | Loss: 0.780 | 913 ms/step , 6888.96 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 07:14:21 | Epoch: 0 | Step: 132890 | Dataset: 0-1084544 | Loss: 0.766 | 914 ms/step , 6879.90 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 07:14:30 | Epoch: 0 | Step: 132900 | Dataset: 0-1084864 | Loss: 0.933 | 915 ms/step , 6876.76 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 07:14:31 | Validation | Step: 132900 | Val_loss: 0.674 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:14:41 | Epoch: 0 | Step: 132910 | Dataset: 0-1085184 | Loss: 0.786 | 915 ms/step , 6874.46 GFLOP/s , 15266.2 tokens/s INFO:__main__:2024-11-05 07:14:50 | Epoch: 0 | Step: 132920 | Dataset: 0-1085504 | Loss: 0.793 | 913 ms/step , 6889.14 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 07:14:59 | Epoch: 0 | Step: 132930 | Dataset: 0-1085824 | Loss: 0.748 | 913 ms/step , 6887.93 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 07:15:08 | Epoch: 0 | Step: 132940 | Dataset: 0-1086144 | Loss: 0.824 | 914 ms/step , 6880.90 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 07:15:17 | Epoch: 0 | Step: 132950 | Dataset: 0-1086464 | Loss: 0.844 | 914 ms/step , 6881.21 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 07:15:26 | Epoch: 0 | Step: 132960 | Dataset: 0-1086784 | Loss: 0.831 | 914 ms/step , 6880.72 GFLOP/s , 
17934.3 tokens/s INFO:__main__:2024-11-05 07:15:35 | Epoch: 0 | Step: 132970 | Dataset: 0-1087104 | Loss: 0.771 | 914 ms/step , 6881.41 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 07:15:45 | Epoch: 0 | Step: 132980 | Dataset: 0-1087424 | Loss: 0.799 | 913 ms/step , 6889.02 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 07:15:54 | Epoch: 0 | Step: 132990 | Dataset: 0-1087744 | Loss: 0.752 | 915 ms/step , 6876.97 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 07:16:03 | Epoch: 0 | Step: 133000 | Dataset: 0-1088064 | Loss: 0.825 | 913 ms/step , 6887.57 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 07:16:04 | Validation | Step: 133000 | Val_loss: 0.671 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:16:04 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_071604_step_133000.pt` INFO:__main__:2024-11-05 07:16:15 | Epoch: 0 | Step: 133010 | Dataset: 0-1088384 | Loss: 0.810 | 914 ms/step , 6882.76 GFLOP/s , 13795.5 tokens/s INFO:__main__:2024-11-05 07:16:24 | Epoch: 0 | Step: 133020 | Dataset: 0-1088704 | Loss: 0.794 | 914 ms/step , 6883.21 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 07:16:33 | Epoch: 0 | Step: 133030 | Dataset: 0-1089024 | Loss: 0.771 | 914 ms/step , 6880.98 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 07:16:42 | Epoch: 0 | Step: 133040 | Dataset: 0-1089344 | Loss: 0.850 | 914 ms/step , 6879.38 GFLOP/s , 17838.8 tokens/s INFO:__main__:2024-11-05 07:16:51 | Epoch: 0 | Step: 133050 | Dataset: 0-1089664 | Loss: 0.807 | 914 ms/step , 6877.72 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 07:17:00 | Epoch: 0 | Step: 133060 | Dataset: 0-1089984 | Loss: 0.767 | 913 ms/step , 6889.93 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 07:17:10 | Epoch: 0 | Step: 133070 | Dataset: 0-1090304 | Loss: 0.826 | 914 ms/step , 6878.86 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 07:17:19 | Epoch: 0 | Step: 133080 | Dataset: 0-1090624 | Loss: 0.886 | 914 ms/step , 6883.78 GFLOP/s , 
17923.4 tokens/s INFO:__main__:2024-11-05 07:17:28 | Epoch: 0 | Step: 133090 | Dataset: 0-1090944 | Loss: 0.862 | 914 ms/step , 6878.28 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 07:17:37 | Epoch: 0 | Step: 133100 | Dataset: 0-1091264 | Loss: 0.794 | 914 ms/step , 6883.16 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 07:17:39 | Validation | Step: 133100 | Val_loss: 0.665 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:17:48 | Epoch: 0 | Step: 133110 | Dataset: 0-1091584 | Loss: 0.815 | 913 ms/step , 6886.54 GFLOP/s , 15267.3 tokens/s INFO:__main__:2024-11-05 07:17:57 | Epoch: 0 | Step: 133120 | Dataset: 0-1091904 | Loss: 0.785 | 913 ms/step , 6886.47 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 07:18:06 | Epoch: 0 | Step: 133130 | Dataset: 0-1092224 | Loss: 0.793 | 914 ms/step , 6881.15 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 07:18:15 | Epoch: 0 | Step: 133140 | Dataset: 0-1092544 | Loss: 0.823 | 915 ms/step , 6873.87 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 07:18:24 | Epoch: 0 | Step: 133150 | Dataset: 0-1092864 | Loss: 0.821 | 913 ms/step , 6892.56 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 07:18:33 | Epoch: 0 | Step: 133160 | Dataset: 0-1093184 | Loss: 0.855 | 913 ms/step , 6886.79 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 07:18:43 | Epoch: 0 | Step: 133170 | Dataset: 0-1093504 | Loss: 0.809 | 913 ms/step , 6892.28 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 07:18:52 | Epoch: 0 | Step: 133180 | Dataset: 0-1093824 | Loss: 0.834 | 915 ms/step , 6875.39 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 07:19:01 | Epoch: 0 | Step: 133190 | Dataset: 0-1094144 | Loss: 0.868 | 914 ms/step , 6881.71 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 07:19:10 | Epoch: 0 | Step: 133200 | Dataset: 0-1094464 | Loss: 0.808 | 913 ms/step , 6888.92 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 07:19:12 | Validation | Step: 133200 | Val_loss: 0.694 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 07:19:21 | Epoch: 0 | Step: 133210 | Dataset: 0-1094784 | Loss: 0.854 | 914 ms/step , 6880.74 GFLOP/s , 15277.1 tokens/s INFO:__main__:2024-11-05 07:19:30 | Epoch: 0 | Step: 133220 | Dataset: 0-1095104 | Loss: 0.814 | 914 ms/step , 6883.95 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 07:19:39 | Epoch: 0 | Step: 133230 | Dataset: 0-1095424 | Loss: 0.795 | 913 ms/step , 6889.59 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 07:19:48 | Epoch: 0 | Step: 133240 | Dataset: 0-1095744 | Loss: 0.838 | 915 ms/step , 6877.43 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 07:19:57 | Epoch: 0 | Step: 133250 | Dataset: 0-1096064 | Loss: 0.813 | 912 ms/step , 6894.41 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 07:20:06 | Epoch: 0 | Step: 133260 | Dataset: 0-1096384 | Loss: 0.860 | 915 ms/step , 6875.89 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 07:20:16 | Epoch: 0 | Step: 133270 | Dataset: 0-1096704 | Loss: 0.819 | 912 ms/step , 6892.68 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 07:20:25 | Epoch: 0 | Step: 133280 | Dataset: 0-1097024 | Loss: 0.837 | 914 ms/step , 6877.68 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 07:20:34 | Epoch: 0 | Step: 133290 | Dataset: 0-1097344 | Loss: 0.827 | 916 ms/step , 6868.96 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 07:20:43 | Epoch: 0 | Step: 133300 | Dataset: 0-1097664 | Loss: 0.818 | 913 ms/step , 6887.79 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 07:20:45 | Validation | Step: 133300 | Val_loss: 0.723 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:20:54 | Epoch: 0 | Step: 133310 | Dataset: 0-1097984 | Loss: 0.870 | 914 ms/step , 6884.08 GFLOP/s , 15264.4 tokens/s INFO:__main__:2024-11-05 07:21:03 | Epoch: 0 | Step: 133320 | Dataset: 0-1098304 | Loss: 0.746 | 914 ms/step , 6884.88 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 07:21:12 | Epoch: 0 | Step: 133330 | Dataset: 0-1098624 | Loss: 0.830 | 914 ms/step , 6879.91 GFLOP/s , 17915.6 
tokens/s INFO:__main__:2024-11-05 07:21:21 | Epoch: 0 | Step: 133340 | Dataset: 0-1098944 | Loss: 0.716 | 913 ms/step , 6886.56 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 07:21:30 | Epoch: 0 | Step: 133350 | Dataset: 0-1099264 | Loss: 0.895 | 913 ms/step , 6886.96 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 07:21:40 | Epoch: 0 | Step: 133360 | Dataset: 0-1099584 | Loss: 0.795 | 914 ms/step , 6882.07 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 07:21:49 | Epoch: 0 | Step: 133370 | Dataset: 0-1099904 | Loss: 0.770 | 913 ms/step , 6887.93 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 07:21:58 | Epoch: 0 | Step: 133380 | Dataset: 0-1100224 | Loss: 0.775 | 914 ms/step , 6883.83 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 07:22:07 | Epoch: 0 | Step: 133390 | Dataset: 0-1100544 | Loss: 0.825 | 915 ms/step , 6875.44 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 07:22:16 | Epoch: 0 | Step: 133400 | Dataset: 0-1100864 | Loss: 0.739 | 913 ms/step , 6887.20 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 07:22:18 | Validation | Step: 133400 | Val_loss: 0.685 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:22:27 | Epoch: 0 | Step: 133410 | Dataset: 0-1101184 | Loss: 0.852 | 915 ms/step , 6876.05 GFLOP/s , 15267.6 tokens/s INFO:__main__:2024-11-05 07:22:36 | Epoch: 0 | Step: 133420 | Dataset: 0-1101504 | Loss: 0.797 | 913 ms/step , 6885.28 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 07:22:45 | Epoch: 0 | Step: 133430 | Dataset: 0-1101824 | Loss: 0.897 | 913 ms/step , 6886.55 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 07:22:54 | Epoch: 0 | Step: 133440 | Dataset: 0-1102144 | Loss: 0.787 | 913 ms/step , 6890.60 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 07:23:03 | Epoch: 0 | Step: 133450 | Dataset: 0-1102464 | Loss: 0.810 | 914 ms/step , 6879.60 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 07:23:13 | Epoch: 0 | Step: 133460 | Dataset: 0-1102784 | Loss: 0.802 | 912 ms/step , 6898.17 GFLOP/s , 
17920.7 tokens/s INFO:__main__:2024-11-05 07:23:22 | Epoch: 0 | Step: 133470 | Dataset: 0-1103104 | Loss: 0.773 | 913 ms/step , 6889.70 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 07:23:31 | Epoch: 0 | Step: 133480 | Dataset: 0-1103424 | Loss: 0.890 | 914 ms/step , 6880.47 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 07:23:40 | Epoch: 0 | Step: 133490 | Dataset: 0-1103744 | Loss: 0.810 | 914 ms/step , 6879.80 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 07:23:49 | Epoch: 0 | Step: 133500 | Dataset: 0-1104064 | Loss: 0.889 | 915 ms/step , 6875.90 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 07:23:51 | Validation | Step: 133500 | Val_loss: 0.700 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:24:00 | Epoch: 0 | Step: 133510 | Dataset: 0-1104384 | Loss: 0.764 | 913 ms/step , 6889.50 GFLOP/s , 15273.6 tokens/s INFO:__main__:2024-11-05 07:24:09 | Epoch: 0 | Step: 133520 | Dataset: 0-1104704 | Loss: 0.815 | 912 ms/step , 6896.04 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 07:24:18 | Epoch: 0 | Step: 133530 | Dataset: 0-1105024 | Loss: 0.723 | 912 ms/step , 6894.07 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 07:24:27 | Epoch: 0 | Step: 133540 | Dataset: 0-1105344 | Loss: 0.777 | 913 ms/step , 6892.53 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 07:24:36 | Epoch: 0 | Step: 133550 | Dataset: 0-1105664 | Loss: 0.799 | 914 ms/step , 6880.41 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 07:24:46 | Epoch: 0 | Step: 133560 | Dataset: 0-1105984 | Loss: 0.829 | 914 ms/step , 6881.75 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 07:24:55 | Epoch: 0 | Step: 133570 | Dataset: 0-1106304 | Loss: 0.783 | 913 ms/step , 6887.89 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 07:25:04 | Epoch: 0 | Step: 133580 | Dataset: 0-1106624 | Loss: 0.848 | 914 ms/step , 6878.85 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 07:25:13 | Epoch: 0 | Step: 133590 | Dataset: 0-1106944 | Loss: 0.869 | 913 ms/step , 6886.33 GFLOP/s 
, 17919.8 tokens/s INFO:__main__:2024-11-05 07:25:22 | Epoch: 0 | Step: 133600 | Dataset: 0-1107264 | Loss: 0.860 | 915 ms/step , 6875.51 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 07:25:24 | Validation | Step: 133600 | Val_loss: 0.705 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:25:33 | Epoch: 0 | Step: 133610 | Dataset: 0-1107584 | Loss: 0.837 | 914 ms/step , 6879.04 GFLOP/s , 15276.9 tokens/s INFO:__main__:2024-11-05 07:25:42 | Epoch: 0 | Step: 133620 | Dataset: 0-1107904 | Loss: 0.792 | 914 ms/step , 6879.77 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 07:25:51 | Epoch: 0 | Step: 133630 | Dataset: 0-1108224 | Loss: 0.767 | 915 ms/step , 6874.31 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 07:26:00 | Epoch: 0 | Step: 133640 | Dataset: 0-1108544 | Loss: 0.839 | 914 ms/step , 6880.47 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 07:26:09 | Epoch: 0 | Step: 133650 | Dataset: 0-1108864 | Loss: 0.776 | 915 ms/step , 6874.97 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 07:26:19 | Epoch: 0 | Step: 133660 | Dataset: 0-1109184 | Loss: 0.843 | 914 ms/step , 6882.31 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 07:26:28 | Epoch: 0 | Step: 133670 | Dataset: 0-1109504 | Loss: 0.762 | 913 ms/step , 6888.61 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 07:26:37 | Epoch: 0 | Step: 133680 | Dataset: 0-1109824 | Loss: 0.793 | 914 ms/step , 6879.62 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 07:26:46 | Epoch: 0 | Step: 133690 | Dataset: 0-1110144 | Loss: 0.792 | 914 ms/step , 6879.35 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-05 07:26:55 | Epoch: 0 | Step: 133700 | Dataset: 0-1110464 | Loss: 0.804 | 915 ms/step , 6871.38 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 07:26:57 | Validation | Step: 133700 | Val_loss: 0.687 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:27:06 | Epoch: 0 | Step: 133710 | Dataset: 0-1110784 | Loss: 0.686 | 913 ms/step , 6886.45 GFLOP/s , 15273.2 tokens/s 
INFO:__main__:2024-11-05 07:27:15 | Epoch: 0 | Step: 133720 | Dataset: 0-1111104 | Loss: 0.759 | 914 ms/step , 6881.89 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 07:27:24 | Epoch: 0 | Step: 133730 | Dataset: 0-1111424 | Loss: 0.763 | 914 ms/step , 6882.55 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 07:27:33 | Epoch: 0 | Step: 133740 | Dataset: 0-1111744 | Loss: 0.769 | 915 ms/step , 6876.55 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 07:27:42 | Epoch: 0 | Step: 133750 | Dataset: 0-1112064 | Loss: 0.813 | 914 ms/step , 6881.89 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 07:27:52 | Epoch: 0 | Step: 133760 | Dataset: 0-1112384 | Loss: 0.792 | 914 ms/step , 6879.59 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 07:28:01 | Epoch: 0 | Step: 133770 | Dataset: 0-1112704 | Loss: 0.790 | 912 ms/step , 6894.05 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 07:28:10 | Epoch: 0 | Step: 133780 | Dataset: 0-1113024 | Loss: 0.739 | 913 ms/step , 6886.48 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 07:28:19 | Epoch: 0 | Step: 133790 | Dataset: 0-1113344 | Loss: 0.805 | 912 ms/step , 6893.40 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 07:28:28 | Epoch: 0 | Step: 133800 | Dataset: 0-1113664 | Loss: 0.855 | 914 ms/step , 6881.03 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 07:28:30 | Validation | Step: 133800 | Val_loss: 0.747 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:28:39 | Epoch: 0 | Step: 133810 | Dataset: 0-1113984 | Loss: 0.786 | 914 ms/step , 6883.12 GFLOP/s , 15267.0 tokens/s INFO:__main__:2024-11-05 07:28:48 | Epoch: 0 | Step: 133820 | Dataset: 0-1114304 | Loss: 0.761 | 913 ms/step , 6886.87 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 07:28:57 | Epoch: 0 | Step: 133830 | Dataset: 0-1114624 | Loss: 0.823 | 913 ms/step , 6891.27 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 07:29:06 | Epoch: 0 | Step: 133840 | Dataset: 0-1114944 | Loss: 0.819 | 915 ms/step , 6872.80 GFLOP/s , 17911.7 
tokens/s INFO:__main__:2024-11-05 07:29:15 | Epoch: 0 | Step: 133850 | Dataset: 0-1115264 | Loss: 0.834 | 914 ms/step , 6882.40 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 07:29:25 | Epoch: 0 | Step: 133860 | Dataset: 0-1115584 | Loss: 0.809 | 913 ms/step , 6886.99 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 07:29:34 | Epoch: 0 | Step: 133870 | Dataset: 0-1115904 | Loss: 0.841 | 914 ms/step , 6883.39 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 07:29:43 | Epoch: 0 | Step: 133880 | Dataset: 0-1116224 | Loss: 0.788 | 914 ms/step , 6878.23 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 07:29:52 | Epoch: 0 | Step: 133890 | Dataset: 0-1116544 | Loss: 0.834 | 914 ms/step , 6881.50 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 07:30:01 | Epoch: 0 | Step: 133900 | Dataset: 0-1116864 | Loss: 0.805 | 914 ms/step , 6880.07 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 07:30:03 | Validation | Step: 133900 | Val_loss: 0.707 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:30:12 | Epoch: 0 | Step: 133910 | Dataset: 0-1117184 | Loss: 0.791 | 914 ms/step , 6880.76 GFLOP/s , 15265.3 tokens/s INFO:__main__:2024-11-05 07:30:21 | Epoch: 0 | Step: 133920 | Dataset: 0-1117504 | Loss: 0.845 | 915 ms/step , 6875.26 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 07:30:30 | Epoch: 0 | Step: 133930 | Dataset: 0-1117824 | Loss: 0.741 | 915 ms/step , 6877.48 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 07:30:39 | Epoch: 0 | Step: 133940 | Dataset: 0-1118144 | Loss: 0.773 | 914 ms/step , 6884.09 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 07:30:48 | Epoch: 0 | Step: 133950 | Dataset: 0-1118464 | Loss: 0.764 | 914 ms/step , 6879.99 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 07:30:58 | Epoch: 0 | Step: 133960 | Dataset: 0-1118784 | Loss: 0.794 | 914 ms/step , 6878.77 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 07:31:07 | Epoch: 0 | Step: 133970 | Dataset: 0-1119104 | Loss: 0.816 | 915 ms/step , 6873.98 GFLOP/s , 
17919.0 tokens/s INFO:__main__:2024-11-05 07:31:16 | Epoch: 0 | Step: 133980 | Dataset: 0-1119424 | Loss: 0.880 | 914 ms/step , 6881.62 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 07:31:25 | Epoch: 0 | Step: 133990 | Dataset: 0-1119744 | Loss: 0.829 | 914 ms/step , 6880.88 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 07:31:34 | Epoch: 0 | Step: 134000 | Dataset: 0-1120064 | Loss: 0.807 | 914 ms/step , 6879.65 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 07:31:36 | Validation | Step: 134000 | Val_loss: 0.760 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:31:36 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_073136_step_134000.pt` INFO:__main__:2024-11-05 07:31:46 | Epoch: 0 | Step: 134010 | Dataset: 0-1120384 | Loss: 0.795 | 914 ms/step , 6880.57 GFLOP/s , 13809.3 tokens/s INFO:__main__:2024-11-05 07:31:55 | Epoch: 0 | Step: 134020 | Dataset: 0-1120704 | Loss: 0.819 | 913 ms/step , 6892.21 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 07:32:04 | Epoch: 0 | Step: 134030 | Dataset: 0-1121024 | Loss: 0.795 | 913 ms/step , 6891.83 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 07:32:14 | Epoch: 0 | Step: 134040 | Dataset: 0-1121344 | Loss: 0.767 | 916 ms/step , 6864.96 GFLOP/s , 17859.7 tokens/s INFO:__main__:2024-11-05 07:32:23 | Epoch: 0 | Step: 134050 | Dataset: 0-1121664 | Loss: 0.805 | 914 ms/step , 6880.58 GFLOP/s , 17904.6 tokens/s INFO:__main__:2024-11-05 07:32:32 | Epoch: 0 | Step: 134060 | Dataset: 0-1121984 | Loss: 0.824 | 915 ms/step , 6876.85 GFLOP/s , 17897.8 tokens/s INFO:__main__:2024-11-05 07:32:41 | Epoch: 0 | Step: 134070 | Dataset: 0-1122304 | Loss: 0.828 | 914 ms/step , 6881.94 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 07:32:50 | Epoch: 0 | Step: 134080 | Dataset: 0-1122624 | Loss: 0.805 | 914 ms/step , 6878.28 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 07:32:59 | Epoch: 0 | Step: 134090 | Dataset: 0-1122944 | Loss: 0.817 | 913 ms/step , 6892.38 GFLOP/s , 
17923.7 tokens/s INFO:__main__:2024-11-05 07:33:08 | Epoch: 0 | Step: 134100 | Dataset: 0-1123264 | Loss: 0.773 | 914 ms/step , 6878.96 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-05 07:33:10 | Validation | Step: 134100 | Val_loss: 0.746 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:33:19 | Epoch: 0 | Step: 134110 | Dataset: 0-1123584 | Loss: 0.803 | 914 ms/step , 6882.76 GFLOP/s , 15271.2 tokens/s INFO:__main__:2024-11-05 07:33:28 | Epoch: 0 | Step: 134120 | Dataset: 0-1123904 | Loss: 0.730 | 913 ms/step , 6887.86 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 07:33:37 | Epoch: 0 | Step: 134130 | Dataset: 0-1124224 | Loss: 0.886 | 914 ms/step , 6883.69 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 07:33:47 | Epoch: 0 | Step: 134140 | Dataset: 0-1124544 | Loss: 0.831 | 915 ms/step , 6875.58 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 07:33:56 | Epoch: 0 | Step: 134150 | Dataset: 0-1124864 | Loss: 0.768 | 914 ms/step , 6878.12 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 07:34:05 | Epoch: 0 | Step: 134160 | Dataset: 0-1125184 | Loss: 0.755 | 914 ms/step , 6879.17 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 07:34:14 | Epoch: 0 | Step: 134170 | Dataset: 0-1125504 | Loss: 0.792 | 915 ms/step , 6873.12 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 07:34:23 | Epoch: 0 | Step: 134180 | Dataset: 0-1125824 | Loss: 0.828 | 913 ms/step , 6891.22 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 07:34:32 | Epoch: 0 | Step: 134190 | Dataset: 0-1126144 | Loss: 0.762 | 913 ms/step , 6885.52 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 07:34:41 | Epoch: 0 | Step: 134200 | Dataset: 0-1126464 | Loss: 0.799 | 913 ms/step , 6889.78 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 07:34:43 | Validation | Step: 134200 | Val_loss: 0.729 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:34:52 | Epoch: 0 | Step: 134210 | Dataset: 0-1126784 | Loss: 0.796 | 913 ms/step , 6886.34 GFLOP/s , 15259.5 tokens/s 
INFO:__main__:2024-11-05 07:35:01 | Epoch: 0 | Step: 134220 | Dataset: 0-1127104 | Loss: 0.766 | 915 ms/step , 6873.03 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 07:35:10 | Epoch: 0 | Step: 134230 | Dataset: 0-1127424 | Loss: 0.745 | 913 ms/step , 6888.71 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 07:35:20 | Epoch: 0 | Step: 134240 | Dataset: 0-1127744 | Loss: 0.823 | 913 ms/step , 6892.35 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 07:35:29 | Epoch: 0 | Step: 134250 | Dataset: 0-1128064 | Loss: 0.799 | 914 ms/step , 6883.25 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 07:35:38 | Epoch: 0 | Step: 134260 | Dataset: 0-1128384 | Loss: 0.763 | 914 ms/step , 6884.65 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 07:35:47 | Epoch: 0 | Step: 134270 | Dataset: 0-1128704 | Loss: 0.817 | 914 ms/step , 6884.79 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 07:35:56 | Epoch: 0 | Step: 134280 | Dataset: 0-1129024 | Loss: 0.801 | 913 ms/step , 6885.32 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 07:36:05 | Epoch: 0 | Step: 134290 | Dataset: 0-1129344 | Loss: 0.759 | 915 ms/step , 6876.27 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 07:36:14 | Epoch: 0 | Step: 134300 | Dataset: 0-1129664 | Loss: 0.794 | 912 ms/step , 6896.77 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 07:36:16 | Validation | Step: 134300 | Val_loss: 0.747 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:36:25 | Epoch: 0 | Step: 134310 | Dataset: 0-1129984 | Loss: 0.813 | 914 ms/step , 6881.79 GFLOP/s , 15266.1 tokens/s INFO:__main__:2024-11-05 07:36:34 | Epoch: 0 | Step: 134320 | Dataset: 0-1130304 | Loss: 0.725 | 913 ms/step , 6886.35 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 07:36:43 | Epoch: 0 | Step: 134330 | Dataset: 0-1130624 | Loss: 0.858 | 912 ms/step , 6892.81 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 07:36:53 | Epoch: 0 | Step: 134340 | Dataset: 0-1130944 | Loss: 0.813 | 914 ms/step , 6879.20 GFLOP/s , 17919.4 
tokens/s INFO:__main__:2024-11-05 07:37:02 | Epoch: 0 | Step: 134350 | Dataset: 0-1131264 | Loss: 0.792 | 915 ms/step , 6873.45 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 07:37:11 | Epoch: 0 | Step: 134360 | Dataset: 0-1131584 | Loss: 0.815 | 914 ms/step , 6884.86 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 07:37:20 | Epoch: 0 | Step: 134370 | Dataset: 0-1131904 | Loss: 0.862 | 914 ms/step , 6879.33 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 07:37:29 | Epoch: 0 | Step: 134380 | Dataset: 0-1132224 | Loss: 0.788 | 914 ms/step , 6882.34 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 07:37:38 | Epoch: 0 | Step: 134390 | Dataset: 0-1132544 | Loss: 0.783 | 913 ms/step , 6888.40 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 07:37:47 | Epoch: 0 | Step: 134400 | Dataset: 0-1132864 | Loss: 0.826 | 914 ms/step , 6877.92 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 07:37:49 | Validation | Step: 134400 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:37:58 | Epoch: 0 | Step: 134410 | Dataset: 0-1133184 | Loss: 0.787 | 914 ms/step , 6884.03 GFLOP/s , 15272.9 tokens/s INFO:__main__:2024-11-05 07:38:07 | Epoch: 0 | Step: 134420 | Dataset: 0-1133504 | Loss: 0.794 | 914 ms/step , 6880.48 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 07:38:16 | Epoch: 0 | Step: 134430 | Dataset: 0-1133824 | Loss: 0.803 | 912 ms/step , 6894.29 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 07:38:26 | Epoch: 0 | Step: 134440 | Dataset: 0-1134144 | Loss: 0.851 | 914 ms/step , 6881.53 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 07:38:35 | Epoch: 0 | Step: 134450 | Dataset: 0-1134464 | Loss: 0.687 | 914 ms/step , 6881.53 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 07:38:44 | Epoch: 0 | Step: 134460 | Dataset: 0-1134784 | Loss: 0.818 | 914 ms/step , 6882.33 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 07:38:53 | Epoch: 0 | Step: 134470 | Dataset: 0-1135104 | Loss: 0.815 | 914 ms/step , 6879.60 GFLOP/s , 
17931.3 tokens/s INFO:__main__:2024-11-05 07:39:02 | Epoch: 0 | Step: 134480 | Dataset: 0-1135424 | Loss: 0.721 | 913 ms/step , 6887.70 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 07:39:11 | Epoch: 0 | Step: 134490 | Dataset: 0-1135744 | Loss: 0.870 | 914 ms/step , 6881.56 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 07:39:20 | Epoch: 0 | Step: 134500 | Dataset: 0-1136064 | Loss: 0.817 | 914 ms/step , 6877.62 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 07:39:22 | Validation | Step: 134500 | Val_loss: 0.826 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:39:31 | Epoch: 0 | Step: 134510 | Dataset: 0-1136384 | Loss: 0.807 | 913 ms/step , 6888.94 GFLOP/s , 15265.4 tokens/s INFO:__main__:2024-11-05 07:39:40 | Epoch: 0 | Step: 134520 | Dataset: 0-1136704 | Loss: 0.792 | 914 ms/step , 6878.99 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 07:39:49 | Epoch: 0 | Step: 134530 | Dataset: 0-1137024 | Loss: 0.764 | 914 ms/step , 6877.78 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 07:39:59 | Epoch: 0 | Step: 134540 | Dataset: 0-1137344 | Loss: 0.885 | 913 ms/step , 6888.82 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 07:40:08 | Epoch: 0 | Step: 134550 | Dataset: 0-1137664 | Loss: 0.813 | 913 ms/step , 6886.63 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 07:40:17 | Epoch: 0 | Step: 134560 | Dataset: 0-1137984 | Loss: 0.756 | 914 ms/step , 6881.20 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 07:40:26 | Epoch: 0 | Step: 134570 | Dataset: 0-1138304 | Loss: 0.736 | 914 ms/step , 6879.74 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 07:40:35 | Epoch: 0 | Step: 134580 | Dataset: 0-1138624 | Loss: 0.847 | 915 ms/step , 6873.20 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 07:40:44 | Epoch: 0 | Step: 134590 | Dataset: 0-1138944 | Loss: 0.888 | 914 ms/step , 6878.92 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-05 07:40:53 | Epoch: 0 | Step: 134600 | Dataset: 0-1139264 | Loss: 0.695 | 915 ms/step , 6873.53 GFLOP/s 
, 17924.2 tokens/s INFO:__main__:2024-11-05 07:40:55 | Validation | Step: 134600 | Val_loss: 0.784 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:41:04 | Epoch: 0 | Step: 134610 | Dataset: 0-1139584 | Loss: 0.750 | 915 ms/step , 6873.63 GFLOP/s , 15283.7 tokens/s INFO:__main__:2024-11-05 07:41:13 | Epoch: 0 | Step: 134620 | Dataset: 0-1139904 | Loss: 0.768 | 915 ms/step , 6874.72 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 07:41:22 | Epoch: 0 | Step: 134630 | Dataset: 0-1140224 | Loss: 0.795 | 915 ms/step , 6873.78 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 07:41:32 | Epoch: 0 | Step: 134640 | Dataset: 0-1140544 | Loss: 0.597 | 912 ms/step , 6896.51 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 07:41:41 | Epoch: 0 | Step: 134650 | Dataset: 0-1140864 | Loss: 0.830 | 913 ms/step , 6887.73 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 07:41:50 | Epoch: 0 | Step: 134660 | Dataset: 0-1141184 | Loss: 0.725 | 914 ms/step , 6884.19 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 07:41:59 | Epoch: 0 | Step: 134670 | Dataset: 0-1141504 | Loss: 0.782 | 913 ms/step , 6889.89 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 07:42:08 | Epoch: 0 | Step: 134680 | Dataset: 0-1141824 | Loss: 0.797 | 914 ms/step , 6882.26 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 07:42:17 | Epoch: 0 | Step: 134690 | Dataset: 0-1142144 | Loss: 0.768 | 914 ms/step , 6883.89 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 07:42:26 | Epoch: 0 | Step: 134700 | Dataset: 0-1142464 | Loss: 0.765 | 914 ms/step , 6881.92 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 07:42:28 | Validation | Step: 134700 | Val_loss: 0.855 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:42:37 | Epoch: 0 | Step: 134710 | Dataset: 0-1142784 | Loss: 0.750 | 913 ms/step , 6886.85 GFLOP/s , 15272.8 tokens/s INFO:__main__:2024-11-05 07:42:46 | Epoch: 0 | Step: 134720 | Dataset: 0-1143104 | Loss: 0.856 | 914 ms/step , 6878.69 GFLOP/s , 17940.0 tokens/s 
INFO:__main__:2024-11-05 07:42:55 | Epoch: 0 | Step: 134730 | Dataset: 0-1143424 | Loss: 0.878 | 914 ms/step , 6884.33 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 07:43:05 | Epoch: 0 | Step: 134740 | Dataset: 0-1143744 | Loss: 0.841 | 914 ms/step , 6883.64 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 07:43:14 | Epoch: 0 | Step: 134750 | Dataset: 0-1144064 | Loss: 0.719 | 912 ms/step , 6893.80 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 07:43:23 | Epoch: 0 | Step: 134760 | Dataset: 0-1144384 | Loss: 0.789 | 912 ms/step , 6894.40 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 07:43:32 | Epoch: 0 | Step: 134770 | Dataset: 0-1144704 | Loss: 0.689 | 913 ms/step , 6892.47 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 07:43:41 | Epoch: 0 | Step: 134780 | Dataset: 0-1145024 | Loss: 0.775 | 912 ms/step , 6893.65 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 07:43:50 | Epoch: 0 | Step: 134790 | Dataset: 0-1145344 | Loss: 0.766 | 912 ms/step , 6892.75 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 07:43:59 | Epoch: 0 | Step: 134800 | Dataset: 0-1145664 | Loss: 0.738 | 915 ms/step , 6875.78 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 07:44:01 | Validation | Step: 134800 | Val_loss: 0.844 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:44:10 | Epoch: 0 | Step: 134810 | Dataset: 0-1145984 | Loss: 0.784 | 914 ms/step , 6881.26 GFLOP/s , 15285.6 tokens/s INFO:__main__:2024-11-05 07:44:19 | Epoch: 0 | Step: 134820 | Dataset: 0-1146304 | Loss: 0.768 | 914 ms/step , 6878.53 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 07:44:28 | Epoch: 0 | Step: 134830 | Dataset: 0-1146624 | Loss: 0.818 | 913 ms/step , 6889.44 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 07:44:38 | Epoch: 0 | Step: 134840 | Dataset: 0-1146944 | Loss: 0.733 | 914 ms/step , 6883.10 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 07:44:47 | Epoch: 0 | Step: 134850 | Dataset: 0-1147264 | Loss: 0.775 | 914 ms/step , 6881.39 GFLOP/s , 17932.8 
tokens/s INFO:__main__:2024-11-05 07:44:56 | Epoch: 0 | Step: 134860 | Dataset: 0-1147584 | Loss: 0.789 | 914 ms/step , 6882.36 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 07:45:05 | Epoch: 0 | Step: 134870 | Dataset: 0-1147904 | Loss: 0.842 | 914 ms/step , 6881.27 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 07:45:14 | Epoch: 0 | Step: 134880 | Dataset: 0-1148224 | Loss: 0.735 | 914 ms/step , 6882.56 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 07:45:23 | Epoch: 0 | Step: 134890 | Dataset: 0-1148544 | Loss: 0.787 | 915 ms/step , 6877.52 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 07:45:32 | Epoch: 0 | Step: 134900 | Dataset: 0-1148864 | Loss: 0.752 | 913 ms/step , 6886.83 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 07:45:34 | Validation | Step: 134900 | Val_loss: 0.807 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:45:43 | Epoch: 0 | Step: 134910 | Dataset: 0-1149184 | Loss: 0.654 | 912 ms/step , 6895.16 GFLOP/s , 15277.3 tokens/s INFO:__main__:2024-11-05 07:45:52 | Epoch: 0 | Step: 134920 | Dataset: 0-1149504 | Loss: 0.743 | 914 ms/step , 6881.37 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 07:46:01 | Epoch: 0 | Step: 134930 | Dataset: 0-1149824 | Loss: 0.747 | 914 ms/step , 6880.14 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 07:46:10 | Epoch: 0 | Step: 134940 | Dataset: 0-1150144 | Loss: 0.648 | 913 ms/step , 6885.50 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 07:46:20 | Epoch: 0 | Step: 134950 | Dataset: 0-1150464 | Loss: 0.807 | 913 ms/step , 6891.93 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 07:46:29 | Epoch: 0 | Step: 134960 | Dataset: 0-1150784 | Loss: 0.699 | 915 ms/step , 6875.95 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 07:46:38 | Epoch: 0 | Step: 134970 | Dataset: 0-1151104 | Loss: 0.735 | 912 ms/step , 6895.62 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 07:46:47 | Epoch: 0 | Step: 134980 | Dataset: 0-1151424 | Loss: 0.744 | 913 ms/step , 6886.04 GFLOP/s , 
17927.5 tokens/s INFO:__main__:2024-11-05 07:46:56 | Epoch: 0 | Step: 134990 | Dataset: 0-1151744 | Loss: 0.724 | 913 ms/step , 6890.97 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 07:47:05 | Epoch: 0 | Step: 135000 | Dataset: 0-1152064 | Loss: 0.806 | 913 ms/step , 6891.55 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 07:47:07 | Validation | Step: 135000 | Val_loss: 0.786 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:47:07 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_074707_step_135000.pt` INFO:__main__:2024-11-05 07:47:17 | Epoch: 0 | Step: 135010 | Dataset: 0-1152384 | Loss: 0.798 | 913 ms/step , 6890.44 GFLOP/s , 13830.3 tokens/s INFO:__main__:2024-11-05 07:47:26 | Epoch: 0 | Step: 135020 | Dataset: 0-1152704 | Loss: 0.713 | 912 ms/step , 6893.66 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 07:47:35 | Epoch: 0 | Step: 135030 | Dataset: 0-1153024 | Loss: 0.697 | 915 ms/step , 6875.33 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 07:47:45 | Epoch: 0 | Step: 135040 | Dataset: 0-1153344 | Loss: 0.822 | 913 ms/step , 6889.51 GFLOP/s , 17906.8 tokens/s INFO:__main__:2024-11-05 07:47:54 | Epoch: 0 | Step: 135050 | Dataset: 0-1153664 | Loss: 0.892 | 914 ms/step , 6880.13 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 07:48:03 | Epoch: 0 | Step: 135060 | Dataset: 0-1153984 | Loss: 0.636 | 913 ms/step , 6890.16 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 07:48:12 | Epoch: 0 | Step: 135070 | Dataset: 0-1154304 | Loss: 0.820 | 914 ms/step , 6884.43 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 07:48:21 | Epoch: 0 | Step: 135080 | Dataset: 0-1154624 | Loss: 0.751 | 912 ms/step , 6896.42 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 07:48:30 | Epoch: 0 | Step: 135090 | Dataset: 0-1154944 | Loss: 0.738 | 913 ms/step , 6886.04 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 07:48:39 | Epoch: 0 | Step: 135100 | Dataset: 0-1155264 | Loss: 0.788 | 913 ms/step , 6885.98 GFLOP/s , 
17934.3 tokens/s INFO:__main__:2024-11-05 07:48:41 | Validation | Step: 135100 | Val_loss: 0.863 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:48:50 | Epoch: 0 | Step: 135110 | Dataset: 0-1155584 | Loss: 0.844 | 914 ms/step , 6884.58 GFLOP/s , 15280.6 tokens/s INFO:__main__:2024-11-05 07:48:59 | Epoch: 0 | Step: 135120 | Dataset: 0-1155904 | Loss: 0.702 | 912 ms/step , 6897.92 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 07:49:08 | Epoch: 0 | Step: 135130 | Dataset: 0-1156224 | Loss: 0.786 | 912 ms/step , 6893.68 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 07:49:18 | Epoch: 0 | Step: 135140 | Dataset: 0-1156544 | Loss: 0.665 | 913 ms/step , 6888.14 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 07:49:27 | Epoch: 0 | Step: 135150 | Dataset: 0-1156864 | Loss: 0.697 | 913 ms/step , 6887.48 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 07:49:36 | Epoch: 0 | Step: 135160 | Dataset: 0-1157184 | Loss: 0.831 | 914 ms/step , 6884.86 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 07:49:45 | Epoch: 0 | Step: 135170 | Dataset: 0-1157504 | Loss: 0.729 | 914 ms/step , 6884.64 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 07:49:54 | Epoch: 0 | Step: 135180 | Dataset: 0-1157824 | Loss: 0.829 | 912 ms/step , 6893.44 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 07:50:03 | Epoch: 0 | Step: 135190 | Dataset: 0-1158144 | Loss: 0.736 | 912 ms/step , 6892.71 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 07:50:12 | Epoch: 0 | Step: 135200 | Dataset: 0-1158464 | Loss: 0.733 | 913 ms/step , 6888.68 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 07:50:14 | Validation | Step: 135200 | Val_loss: 0.811 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:50:23 | Epoch: 0 | Step: 135210 | Dataset: 0-1158784 | Loss: 0.715 | 913 ms/step , 6891.76 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-05 07:50:32 | Epoch: 0 | Step: 135220 | Dataset: 0-1159104 | Loss: 0.803 | 913 ms/step , 6892.50 GFLOP/s , 17933.4 tokens/s 
INFO:__main__:2024-11-05 07:50:41 | Epoch: 0 | Step: 135230 | Dataset: 0-1159424 | Loss: 0.753 | 913 ms/step , 6886.74 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 07:50:50 | Epoch: 0 | Step: 135240 | Dataset: 0-1159744 | Loss: 0.723 | 914 ms/step , 6879.65 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 07:51:00 | Epoch: 0 | Step: 135250 | Dataset: 0-1160064 | Loss: 0.830 | 913 ms/step , 6892.52 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 07:51:09 | Epoch: 0 | Step: 135260 | Dataset: 0-1160384 | Loss: 0.735 | 915 ms/step , 6875.23 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 07:51:18 | Epoch: 0 | Step: 135270 | Dataset: 0-1160704 | Loss: 0.824 | 913 ms/step , 6888.20 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 07:51:27 | Epoch: 0 | Step: 135280 | Dataset: 0-1161024 | Loss: 0.783 | 913 ms/step , 6886.67 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 07:51:36 | Epoch: 0 | Step: 135290 | Dataset: 0-1161344 | Loss: 0.825 | 914 ms/step , 6881.20 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 07:51:45 | Epoch: 0 | Step: 135300 | Dataset: 0-1161664 | Loss: 0.812 | 912 ms/step , 6895.42 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 07:51:47 | Validation | Step: 135300 | Val_loss: 0.813 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:51:56 | Epoch: 0 | Step: 135310 | Dataset: 0-1161984 | Loss: 0.738 | 913 ms/step , 6886.18 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-05 07:52:05 | Epoch: 0 | Step: 135320 | Dataset: 0-1162304 | Loss: 0.696 | 914 ms/step , 6881.18 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 07:52:14 | Epoch: 0 | Step: 135330 | Dataset: 0-1162624 | Loss: 0.708 | 912 ms/step , 6898.88 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 07:52:23 | Epoch: 0 | Step: 135340 | Dataset: 0-1162944 | Loss: 0.669 | 913 ms/step , 6892.24 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 07:52:33 | Epoch: 0 | Step: 135350 | Dataset: 0-1163264 | Loss: 0.724 | 913 ms/step , 6891.50 GFLOP/s , 17939.0 
tokens/s INFO:__main__:2024-11-05 07:52:42 | Epoch: 0 | Step: 135360 | Dataset: 0-1163584 | Loss: 0.769 | 911 ms/step , 6900.49 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 07:52:51 | Epoch: 0 | Step: 135370 | Dataset: 0-1163904 | Loss: 0.720 | 914 ms/step , 6878.59 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 07:53:00 | Epoch: 0 | Step: 135380 | Dataset: 0-1164224 | Loss: 0.717 | 925 ms/step , 6801.63 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-05 07:53:09 | Epoch: 0 | Step: 135390 | Dataset: 0-1164544 | Loss: 0.657 | 913 ms/step , 6889.85 GFLOP/s , 17878.7 tokens/s INFO:__main__:2024-11-05 07:53:18 | Epoch: 0 | Step: 135400 | Dataset: 0-1164864 | Loss: 0.786 | 913 ms/step , 6888.98 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 07:53:20 | Validation | Step: 135400 | Val_loss: 0.779 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:53:29 | Epoch: 0 | Step: 135410 | Dataset: 0-1165184 | Loss: 0.813 | 913 ms/step , 6887.65 GFLOP/s , 15286.4 tokens/s INFO:__main__:2024-11-05 07:53:38 | Epoch: 0 | Step: 135420 | Dataset: 0-1165504 | Loss: 0.796 | 914 ms/step , 6884.87 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 07:53:47 | Epoch: 0 | Step: 135430 | Dataset: 0-1165824 | Loss: 0.799 | 913 ms/step , 6887.81 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 07:53:56 | Epoch: 0 | Step: 135440 | Dataset: 0-1166144 | Loss: 0.801 | 914 ms/step , 6884.38 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 07:54:06 | Epoch: 0 | Step: 135450 | Dataset: 0-1166464 | Loss: 0.725 | 914 ms/step , 6882.31 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 07:54:15 | Epoch: 0 | Step: 135460 | Dataset: 0-1166784 | Loss: 0.908 | 914 ms/step , 6882.16 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 07:54:24 | Epoch: 0 | Step: 135470 | Dataset: 0-1167104 | Loss: 0.773 | 913 ms/step , 6888.32 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 07:54:33 | Epoch: 0 | Step: 135480 | Dataset: 0-1167424 | Loss: 0.888 | 914 ms/step , 6880.50 GFLOP/s , 
17930.7 tokens/s INFO:__main__:2024-11-05 07:54:42 | Epoch: 0 | Step: 135490 | Dataset: 0-1167744 | Loss: 0.886 | 912 ms/step , 6893.57 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 07:54:51 | Epoch: 0 | Step: 135500 | Dataset: 0-1168064 | Loss: 0.693 | 914 ms/step , 6884.94 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 07:54:53 | Validation | Step: 135500 | Val_loss: 0.804 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:55:02 | Epoch: 0 | Step: 135510 | Dataset: 0-1168384 | Loss: 0.798 | 914 ms/step , 6877.78 GFLOP/s , 15277.3 tokens/s INFO:__main__:2024-11-05 07:55:11 | Epoch: 0 | Step: 135520 | Dataset: 0-1168704 | Loss: 0.747 | 913 ms/step , 6889.23 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 07:55:20 | Epoch: 0 | Step: 135530 | Dataset: 0-1169024 | Loss: 0.758 | 913 ms/step , 6887.30 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 07:55:29 | Epoch: 0 | Step: 135540 | Dataset: 0-1169344 | Loss: 0.833 | 914 ms/step , 6883.18 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 07:55:38 | Epoch: 0 | Step: 135550 | Dataset: 0-1169664 | Loss: 0.688 | 914 ms/step , 6883.75 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 07:55:48 | Epoch: 0 | Step: 135560 | Dataset: 0-1169984 | Loss: 0.806 | 912 ms/step , 6893.74 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 07:55:57 | Epoch: 0 | Step: 135570 | Dataset: 0-1170304 | Loss: 0.763 | 913 ms/step , 6891.86 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 07:56:06 | Epoch: 0 | Step: 135580 | Dataset: 0-1170624 | Loss: 0.841 | 914 ms/step , 6879.35 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 07:56:15 | Epoch: 0 | Step: 135590 | Dataset: 0-1170944 | Loss: 0.765 | 912 ms/step , 6893.13 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 07:56:24 | Epoch: 0 | Step: 135600 | Dataset: 0-1171264 | Loss: 0.856 | 912 ms/step , 6894.98 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 07:56:26 | Validation | Step: 135600 | Val_loss: 0.774 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 07:56:35 | Epoch: 0 | Step: 135610 | Dataset: 0-1171584 | Loss: 0.706 | 911 ms/step , 6901.05 GFLOP/s , 15279.1 tokens/s INFO:__main__:2024-11-05 07:56:44 | Epoch: 0 | Step: 135620 | Dataset: 0-1171904 | Loss: 0.654 | 913 ms/step , 6890.21 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 07:56:53 | Epoch: 0 | Step: 135630 | Dataset: 0-1172224 | Loss: 0.709 | 913 ms/step , 6892.04 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 07:57:02 | Epoch: 0 | Step: 135640 | Dataset: 0-1172544 | Loss: 0.715 | 913 ms/step , 6887.56 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 07:57:11 | Epoch: 0 | Step: 135650 | Dataset: 0-1172864 | Loss: 0.779 | 914 ms/step , 6880.87 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 07:57:21 | Epoch: 0 | Step: 135660 | Dataset: 0-1173184 | Loss: 0.653 | 912 ms/step , 6896.83 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 07:57:30 | Epoch: 0 | Step: 135670 | Dataset: 0-1173504 | Loss: 0.775 | 913 ms/step , 6889.91 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 07:57:39 | Epoch: 0 | Step: 135680 | Dataset: 0-1173824 | Loss: 0.784 | 915 ms/step , 6875.66 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 07:57:48 | Epoch: 0 | Step: 135690 | Dataset: 0-1174144 | Loss: 0.665 | 914 ms/step , 6884.17 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 07:57:57 | Epoch: 0 | Step: 135700 | Dataset: 0-1174464 | Loss: 0.660 | 913 ms/step , 6887.17 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 07:57:59 | Validation | Step: 135700 | Val_loss: 0.779 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:58:08 | Epoch: 0 | Step: 135710 | Dataset: 0-1174784 | Loss: 0.715 | 913 ms/step , 6890.27 GFLOP/s , 15284.0 tokens/s INFO:__main__:2024-11-05 07:58:17 | Epoch: 0 | Step: 135720 | Dataset: 0-1175104 | Loss: 0.707 | 912 ms/step , 6893.46 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 07:58:26 | Epoch: 0 | Step: 135730 | Dataset: 0-1175424 | Loss: 0.750 | 912 ms/step , 6897.02 GFLOP/s , 17936.1 
tokens/s INFO:__main__:2024-11-05 07:58:35 | Epoch: 0 | Step: 135740 | Dataset: 0-1175744 | Loss: 0.767 | 913 ms/step , 6889.80 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 07:58:44 | Epoch: 0 | Step: 135750 | Dataset: 0-1176064 | Loss: 0.767 | 913 ms/step , 6885.71 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 07:58:53 | Epoch: 0 | Step: 135760 | Dataset: 0-1176384 | Loss: 0.769 | 914 ms/step , 6877.80 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 07:59:03 | Epoch: 0 | Step: 135770 | Dataset: 0-1176704 | Loss: 0.644 | 912 ms/step , 6898.20 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 07:59:12 | Epoch: 0 | Step: 135780 | Dataset: 0-1177024 | Loss: 0.739 | 913 ms/step , 6888.31 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 07:59:21 | Epoch: 0 | Step: 135790 | Dataset: 0-1177344 | Loss: 0.700 | 913 ms/step , 6889.19 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 07:59:30 | Epoch: 0 | Step: 135800 | Dataset: 0-1177664 | Loss: 0.789 | 914 ms/step , 6879.02 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 07:59:32 | Validation | Step: 135800 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 07:59:41 | Epoch: 0 | Step: 135810 | Dataset: 0-1177984 | Loss: 0.801 | 914 ms/step , 6881.70 GFLOP/s , 15278.7 tokens/s INFO:__main__:2024-11-05 07:59:50 | Epoch: 0 | Step: 135820 | Dataset: 0-1178304 | Loss: 0.799 | 913 ms/step , 6886.04 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 07:59:59 | Epoch: 0 | Step: 135830 | Dataset: 0-1178624 | Loss: 0.881 | 913 ms/step , 6890.30 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 08:00:08 | Epoch: 0 | Step: 135840 | Dataset: 0-1178944 | Loss: 0.638 | 913 ms/step , 6886.16 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 08:00:17 | Epoch: 0 | Step: 135850 | Dataset: 0-1179264 | Loss: 0.689 | 914 ms/step , 6884.35 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 08:00:26 | Epoch: 0 | Step: 135860 | Dataset: 0-1179584 | Loss: 0.783 | 914 ms/step , 6879.39 GFLOP/s , 
17932.8 tokens/s INFO:__main__:2024-11-05 08:00:36 | Epoch: 0 | Step: 135870 | Dataset: 0-1179904 | Loss: 0.830 | 913 ms/step , 6887.32 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 08:00:45 | Epoch: 0 | Step: 135880 | Dataset: 0-1180224 | Loss: 0.686 | 913 ms/step , 6890.68 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 08:00:54 | Epoch: 0 | Step: 135890 | Dataset: 0-1180544 | Loss: 0.787 | 914 ms/step , 6883.52 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 08:01:03 | Epoch: 0 | Step: 135900 | Dataset: 0-1180864 | Loss: 0.795 | 913 ms/step , 6888.70 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 08:01:05 | Validation | Step: 135900 | Val_loss: 0.787 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:01:14 | Epoch: 0 | Step: 135910 | Dataset: 0-1181184 | Loss: 0.718 | 912 ms/step , 6893.29 GFLOP/s , 15278.3 tokens/s INFO:__main__:2024-11-05 08:01:23 | Epoch: 0 | Step: 135920 | Dataset: 0-1181504 | Loss: 0.879 | 913 ms/step , 6885.07 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 08:01:32 | Epoch: 0 | Step: 135930 | Dataset: 0-1181824 | Loss: 0.820 | 914 ms/step , 6883.39 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 08:01:41 | Epoch: 0 | Step: 135940 | Dataset: 0-1182144 | Loss: 0.739 | 914 ms/step , 6879.99 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 08:01:50 | Epoch: 0 | Step: 135950 | Dataset: 0-1182464 | Loss: 0.766 | 913 ms/step , 6890.18 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 08:01:59 | Epoch: 0 | Step: 135960 | Dataset: 0-1182784 | Loss: 0.725 | 914 ms/step , 6879.25 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 08:02:09 | Epoch: 0 | Step: 135970 | Dataset: 0-1183104 | Loss: 0.749 | 913 ms/step , 6886.08 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 08:02:18 | Epoch: 0 | Step: 135980 | Dataset: 0-1183424 | Loss: 0.836 | 913 ms/step , 6890.98 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 08:02:27 | Epoch: 0 | Step: 135990 | Dataset: 0-1183744 | Loss: 0.664 | 912 ms/step , 6896.06 GFLOP/s 
, 17934.0 tokens/s INFO:__main__:2024-11-05 08:02:36 | Epoch: 0 | Step: 136000 | Dataset: 0-1184064 | Loss: 0.775 | 913 ms/step , 6887.13 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 08:02:38 | Validation | Step: 136000 | Val_loss: 0.734 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:02:38 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_080238_step_136000.pt` INFO:__main__:2024-11-05 08:02:48 | Epoch: 0 | Step: 136010 | Dataset: 0-1184384 | Loss: 0.731 | 913 ms/step , 6886.58 GFLOP/s , 13795.5 tokens/s INFO:__main__:2024-11-05 08:02:57 | Epoch: 0 | Step: 136020 | Dataset: 0-1184704 | Loss: 0.659 | 912 ms/step , 6893.24 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 08:03:06 | Epoch: 0 | Step: 136030 | Dataset: 0-1185024 | Loss: 0.723 | 913 ms/step , 6890.46 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 08:03:15 | Epoch: 0 | Step: 136040 | Dataset: 0-1185344 | Loss: 0.869 | 912 ms/step , 6897.14 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 08:03:24 | Epoch: 0 | Step: 136050 | Dataset: 0-1185664 | Loss: 0.769 | 914 ms/step , 6883.03 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 08:03:33 | Epoch: 0 | Step: 136060 | Dataset: 0-1185984 | Loss: 0.879 | 914 ms/step , 6884.19 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 08:03:43 | Epoch: 0 | Step: 136070 | Dataset: 0-1186304 | Loss: 0.799 | 914 ms/step , 6883.29 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 08:03:52 | Epoch: 0 | Step: 136080 | Dataset: 0-1186624 | Loss: 0.756 | 912 ms/step , 6896.19 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 08:04:01 | Epoch: 0 | Step: 136090 | Dataset: 0-1186944 | Loss: 0.710 | 912 ms/step , 6896.84 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 08:04:10 | Epoch: 0 | Step: 136100 | Dataset: 0-1187264 | Loss: 0.670 | 913 ms/step , 6889.28 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 08:04:12 | Validation | Step: 136100 | Val_loss: 0.729 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 08:04:21 | Epoch: 0 | Step: 136110 | Dataset: 0-1187584 | Loss: 0.892 | 914 ms/step , 6879.03 GFLOP/s , 15280.2 tokens/s INFO:__main__:2024-11-05 08:04:30 | Epoch: 0 | Step: 136120 | Dataset: 0-1187904 | Loss: 0.788 | 912 ms/step , 6897.29 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 08:04:39 | Epoch: 0 | Step: 136130 | Dataset: 0-1188224 | Loss: 0.823 | 913 ms/step , 6892.00 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 08:04:48 | Epoch: 0 | Step: 136140 | Dataset: 0-1188544 | Loss: 0.726 | 912 ms/step , 6899.84 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 08:04:57 | Epoch: 0 | Step: 136150 | Dataset: 0-1188864 | Loss: 0.775 | 913 ms/step , 6885.15 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 08:05:06 | Epoch: 0 | Step: 136160 | Dataset: 0-1189184 | Loss: 0.803 | 914 ms/step , 6881.94 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 08:05:16 | Epoch: 0 | Step: 136170 | Dataset: 0-1189504 | Loss: 0.738 | 913 ms/step , 6888.01 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 08:05:25 | Epoch: 0 | Step: 136180 | Dataset: 0-1189824 | Loss: 0.740 | 912 ms/step , 6893.12 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 08:05:34 | Epoch: 0 | Step: 136190 | Dataset: 0-1190144 | Loss: 0.860 | 912 ms/step , 6895.24 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 08:05:43 | Epoch: 0 | Step: 136200 | Dataset: 0-1190464 | Loss: 0.776 | 912 ms/step , 6899.38 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 08:05:45 | Validation | Step: 136200 | Val_loss: 0.719 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:05:54 | Epoch: 0 | Step: 136210 | Dataset: 0-1190784 | Loss: 0.777 | 914 ms/step , 6884.29 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-05 08:06:03 | Epoch: 0 | Step: 136220 | Dataset: 0-1191104 | Loss: 0.795 | 912 ms/step , 6894.28 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 08:06:12 | Epoch: 0 | Step: 136230 | Dataset: 0-1191424 | Loss: 0.806 | 913 ms/step , 6891.01 GFLOP/s , 17942.9 
tokens/s INFO:__main__:2024-11-05 08:06:21 | Epoch: 0 | Step: 136240 | Dataset: 0-1191744 | Loss: 0.845 | 914 ms/step , 6881.74 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 08:06:30 | Epoch: 0 | Step: 136250 | Dataset: 0-1192064 | Loss: 0.776 | 913 ms/step , 6888.59 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 08:06:39 | Epoch: 0 | Step: 136260 | Dataset: 0-1192384 | Loss: 0.666 | 912 ms/step , 6893.61 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 08:06:48 | Epoch: 0 | Step: 136270 | Dataset: 0-1192704 | Loss: 0.511 | 912 ms/step , 6896.88 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 08:06:58 | Epoch: 0 | Step: 136280 | Dataset: 0-1193024 | Loss: 0.843 | 915 ms/step , 6874.21 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 08:07:07 | Epoch: 0 | Step: 136290 | Dataset: 0-1193344 | Loss: 0.815 | 913 ms/step , 6892.50 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 08:07:16 | Epoch: 0 | Step: 136300 | Dataset: 0-1193664 | Loss: 0.788 | 913 ms/step , 6886.10 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 08:07:17 | Validation | Step: 136300 | Val_loss: 0.801 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:07:27 | Epoch: 0 | Step: 136310 | Dataset: 0-1193984 | Loss: 0.746 | 913 ms/step , 6885.53 GFLOP/s , 15301.3 tokens/s INFO:__main__:2024-11-05 08:07:36 | Epoch: 0 | Step: 136320 | Dataset: 0-1194304 | Loss: 0.751 | 913 ms/step , 6887.00 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 08:07:45 | Epoch: 0 | Step: 136330 | Dataset: 0-1194624 | Loss: 0.833 | 913 ms/step , 6886.85 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 08:07:54 | Epoch: 0 | Step: 136340 | Dataset: 0-1194944 | Loss: 0.774 | 913 ms/step , 6890.25 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 08:08:03 | Epoch: 0 | Step: 136350 | Dataset: 0-1195264 | Loss: 0.861 | 913 ms/step , 6890.95 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 08:08:12 | Epoch: 0 | Step: 136360 | Dataset: 0-1195584 | Loss: 0.739 | 914 ms/step , 6884.61 GFLOP/s , 
17937.1 tokens/s INFO:__main__:2024-11-05 08:08:21 | Epoch: 0 | Step: 136370 | Dataset: 0-1195904 | Loss: 0.703 | 912 ms/step , 6895.55 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 08:08:30 | Epoch: 0 | Step: 136380 | Dataset: 0-1196224 | Loss: 0.761 | 914 ms/step , 6881.88 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 08:08:40 | Epoch: 0 | Step: 136390 | Dataset: 0-1196544 | Loss: 0.639 | 912 ms/step , 6899.35 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 08:08:49 | Epoch: 0 | Step: 136400 | Dataset: 0-1196864 | Loss: 0.821 | 912 ms/step , 6893.04 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 08:08:50 | Validation | Step: 136400 | Val_loss: 0.713 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:08:59 | Epoch: 0 | Step: 136410 | Dataset: 0-1197184 | Loss: 0.750 | 913 ms/step , 6891.47 GFLOP/s , 15282.5 tokens/s INFO:__main__:2024-11-05 08:09:09 | Epoch: 0 | Step: 136420 | Dataset: 0-1197504 | Loss: 0.785 | 914 ms/step , 6881.36 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 08:09:18 | Epoch: 0 | Step: 136430 | Dataset: 0-1197824 | Loss: 0.866 | 913 ms/step , 6887.88 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 08:09:27 | Epoch: 0 | Step: 136440 | Dataset: 0-1198144 | Loss: 0.711 | 912 ms/step , 6893.43 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 08:09:36 | Epoch: 0 | Step: 136450 | Dataset: 0-1198464 | Loss: 0.708 | 915 ms/step , 6875.56 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 08:09:45 | Epoch: 0 | Step: 136460 | Dataset: 0-1198784 | Loss: 0.865 | 912 ms/step , 6893.20 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 08:09:54 | Epoch: 0 | Step: 136470 | Dataset: 0-1199104 | Loss: 0.768 | 912 ms/step , 6893.13 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 08:10:03 | Epoch: 0 | Step: 136480 | Dataset: 0-1199424 | Loss: 0.772 | 912 ms/step , 6896.91 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 08:10:13 | Epoch: 0 | Step: 136490 | Dataset: 0-1199744 | Loss: 0.853 | 913 ms/step , 6888.44 GFLOP/s 
, 17933.8 tokens/s INFO:__main__:2024-11-05 08:10:22 | Epoch: 0 | Step: 136500 | Dataset: 0-1200064 | Loss: 0.556 | 913 ms/step , 6889.78 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 08:10:23 | Validation | Step: 136500 | Val_loss: 0.764 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:10:32 | Epoch: 0 | Step: 136510 | Dataset: 0-1200384 | Loss: 0.760 | 913 ms/step , 6892.37 GFLOP/s , 15295.0 tokens/s INFO:__main__:2024-11-05 08:10:42 | Epoch: 0 | Step: 136520 | Dataset: 0-1200704 | Loss: 0.682 | 911 ms/step , 6901.88 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 08:10:51 | Epoch: 0 | Step: 136530 | Dataset: 0-1201024 | Loss: 0.852 | 914 ms/step , 6884.40 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 08:11:00 | Epoch: 0 | Step: 136540 | Dataset: 0-1201344 | Loss: 0.737 | 913 ms/step , 6889.43 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 08:11:09 | Epoch: 0 | Step: 136550 | Dataset: 0-1201664 | Loss: 0.534 | 912 ms/step , 6896.42 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 08:11:18 | Epoch: 0 | Step: 136560 | Dataset: 0-1201984 | Loss: 0.778 | 912 ms/step , 6895.02 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 08:11:27 | Epoch: 0 | Step: 136570 | Dataset: 0-1202304 | Loss: 0.520 | 911 ms/step , 6903.53 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 08:11:36 | Epoch: 0 | Step: 136580 | Dataset: 0-1202624 | Loss: 0.840 | 913 ms/step , 6885.20 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 08:11:45 | Epoch: 0 | Step: 136590 | Dataset: 0-1202944 | Loss: 0.817 | 913 ms/step , 6889.56 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 08:11:55 | Epoch: 0 | Step: 136600 | Dataset: 0-1203264 | Loss: 0.696 | 912 ms/step , 6894.83 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 08:11:56 | Validation | Step: 136600 | Val_loss: 0.763 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:12:05 | Epoch: 0 | Step: 136610 | Dataset: 0-1203584 | Loss: 0.810 | 913 ms/step , 6890.06 GFLOP/s , 15282.6 tokens/s 
INFO:__main__:2024-11-05 08:12:14 | Epoch: 0 | Step: 136620 | Dataset: 0-1203904 | Loss: 0.707 | 913 ms/step , 6891.51 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 08:12:24 | Epoch: 0 | Step: 136630 | Dataset: 0-1204224 | Loss: 0.777 | 913 ms/step , 6887.17 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 08:12:33 | Epoch: 0 | Step: 136640 | Dataset: 0-1204544 | Loss: 0.801 | 913 ms/step , 6886.96 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 08:12:42 | Epoch: 0 | Step: 136650 | Dataset: 0-1204864 | Loss: 0.729 | 913 ms/step , 6886.18 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 08:12:51 | Epoch: 0 | Step: 136660 | Dataset: 0-1205184 | Loss: 0.839 | 913 ms/step , 6889.77 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 08:13:00 | Epoch: 0 | Step: 136670 | Dataset: 0-1205504 | Loss: 0.767 | 915 ms/step , 6870.16 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 08:13:09 | Epoch: 0 | Step: 136680 | Dataset: 0-1205824 | Loss: 0.857 | 913 ms/step , 6890.47 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 08:13:18 | Epoch: 0 | Step: 136690 | Dataset: 0-1206144 | Loss: 0.793 | 912 ms/step , 6892.79 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:13:27 | Epoch: 0 | Step: 136700 | Dataset: 0-1206464 | Loss: 0.847 | 914 ms/step , 6879.92 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 08:13:29 | Validation | Step: 136700 | Val_loss: 0.822 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:13:38 | Epoch: 0 | Step: 136710 | Dataset: 0-1206784 | Loss: 0.703 | 913 ms/step , 6888.60 GFLOP/s , 15285.1 tokens/s INFO:__main__:2024-11-05 08:13:47 | Epoch: 0 | Step: 136720 | Dataset: 0-1207104 | Loss: 0.708 | 912 ms/step , 6892.85 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 08:13:56 | Epoch: 0 | Step: 136730 | Dataset: 0-1207424 | Loss: 0.719 | 913 ms/step , 6887.16 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 08:14:06 | Epoch: 0 | Step: 136740 | Dataset: 0-1207744 | Loss: 0.705 | 912 ms/step , 6895.67 GFLOP/s , 17933.7 
tokens/s INFO:__main__:2024-11-05 08:14:15 | Epoch: 0 | Step: 136750 | Dataset: 0-1208064 | Loss: 0.885 | 912 ms/step , 6896.69 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 08:14:24 | Epoch: 0 | Step: 136760 | Dataset: 0-1208384 | Loss: 0.866 | 913 ms/step , 6889.93 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 08:14:33 | Epoch: 0 | Step: 136770 | Dataset: 0-1208704 | Loss: 0.870 | 916 ms/step , 6865.46 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 08:14:42 | Epoch: 0 | Step: 136780 | Dataset: 0-1209024 | Loss: 0.828 | 913 ms/step , 6890.81 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 08:14:51 | Epoch: 0 | Step: 136790 | Dataset: 0-1209344 | Loss: 0.765 | 913 ms/step , 6892.18 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 08:15:00 | Epoch: 0 | Step: 136800 | Dataset: 0-1209664 | Loss: 0.827 | 914 ms/step , 6882.31 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 08:15:02 | Validation | Step: 136800 | Val_loss: 0.808 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:15:11 | Epoch: 0 | Step: 136810 | Dataset: 0-1209984 | Loss: 0.630 | 913 ms/step , 6889.46 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-05 08:15:20 | Epoch: 0 | Step: 136820 | Dataset: 0-1210304 | Loss: 0.959 | 914 ms/step , 6881.60 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 08:15:29 | Epoch: 0 | Step: 136830 | Dataset: 0-1210624 | Loss: 0.781 | 913 ms/step , 6888.46 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 08:15:39 | Epoch: 0 | Step: 136840 | Dataset: 0-1210944 | Loss: 0.744 | 913 ms/step , 6889.74 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 08:15:48 | Epoch: 0 | Step: 136850 | Dataset: 0-1211264 | Loss: 0.671 | 912 ms/step , 6893.09 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 08:15:57 | Epoch: 0 | Step: 136860 | Dataset: 0-1211584 | Loss: 0.820 | 912 ms/step , 6896.70 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 08:16:06 | Epoch: 0 | Step: 136870 | Dataset: 0-1211904 | Loss: 0.739 | 912 ms/step , 6896.36 GFLOP/s , 
17941.0 tokens/s INFO:__main__:2024-11-05 08:16:15 | Epoch: 0 | Step: 136880 | Dataset: 0-1212224 | Loss: 0.594 | 913 ms/step , 6887.26 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 08:16:24 | Epoch: 0 | Step: 136890 | Dataset: 0-1212544 | Loss: 0.898 | 913 ms/step , 6889.44 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 08:16:33 | Epoch: 0 | Step: 136900 | Dataset: 0-1212864 | Loss: 0.587 | 912 ms/step , 6898.74 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 08:16:35 | Validation | Step: 136900 | Val_loss: 0.809 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:16:44 | Epoch: 0 | Step: 136910 | Dataset: 0-1213184 | Loss: 0.825 | 913 ms/step , 6890.72 GFLOP/s , 15282.0 tokens/s INFO:__main__:2024-11-05 08:16:53 | Epoch: 0 | Step: 136920 | Dataset: 0-1213504 | Loss: 0.848 | 912 ms/step , 6897.08 GFLOP/s , 17951.9 tokens/s INFO:__main__:2024-11-05 08:17:02 | Epoch: 0 | Step: 136930 | Dataset: 0-1213824 | Loss: 0.698 | 912 ms/step , 6897.13 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 08:17:11 | Epoch: 0 | Step: 136940 | Dataset: 0-1214144 | Loss: 0.831 | 913 ms/step , 6889.53 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 08:17:21 | Epoch: 0 | Step: 136950 | Dataset: 0-1214464 | Loss: 0.783 | 913 ms/step , 6892.20 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 08:17:30 | Epoch: 0 | Step: 136960 | Dataset: 0-1214784 | Loss: 0.726 | 912 ms/step , 6892.82 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 08:17:39 | Epoch: 0 | Step: 136970 | Dataset: 0-1215104 | Loss: 0.786 | 913 ms/step , 6886.48 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 08:17:48 | Epoch: 0 | Step: 136980 | Dataset: 0-1215424 | Loss: 0.907 | 912 ms/step , 6893.44 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 08:17:57 | Epoch: 0 | Step: 136990 | Dataset: 0-1215744 | Loss: 0.794 | 913 ms/step , 6885.84 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 08:18:06 | Epoch: 0 | Step: 137000 | Dataset: 0-1216064 | Loss: 0.818 | 911 ms/step , 6902.58 GFLOP/s 
, 17942.2 tokens/s INFO:__main__:2024-11-05 08:18:08 | Validation | Step: 137000 | Val_loss: 0.816 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:18:08 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_081808_step_137000.pt` INFO:__main__:2024-11-05 08:18:18 | Epoch: 0 | Step: 137010 | Dataset: 0-1216384 | Loss: 0.814 | 913 ms/step , 6888.30 GFLOP/s , 13805.9 tokens/s INFO:__main__:2024-11-05 08:18:27 | Epoch: 0 | Step: 137020 | Dataset: 0-1216704 | Loss: 0.859 | 913 ms/step , 6888.77 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 08:18:36 | Epoch: 0 | Step: 137030 | Dataset: 0-1217024 | Loss: 0.694 | 912 ms/step , 6893.65 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 08:18:46 | Epoch: 0 | Step: 137040 | Dataset: 0-1217344 | Loss: 0.693 | 914 ms/step , 6884.00 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-05 08:18:55 | Epoch: 0 | Step: 137050 | Dataset: 0-1217664 | Loss: 0.714 | 912 ms/step , 6897.92 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 08:19:04 | Epoch: 0 | Step: 137060 | Dataset: 0-1217984 | Loss: 0.880 | 914 ms/step , 6883.75 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 08:19:13 | Epoch: 0 | Step: 137070 | Dataset: 0-1218304 | Loss: 0.857 | 914 ms/step , 6884.78 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 08:19:22 | Epoch: 0 | Step: 137080 | Dataset: 0-1218624 | Loss: 0.841 | 912 ms/step , 6892.65 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 08:19:31 | Epoch: 0 | Step: 137090 | Dataset: 0-1218944 | Loss: 0.791 | 913 ms/step , 6890.12 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 08:19:40 | Epoch: 0 | Step: 137100 | Dataset: 0-1219264 | Loss: 0.712 | 913 ms/step , 6889.20 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 08:19:42 | Validation | Step: 137100 | Val_loss: 0.800 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:19:51 | Epoch: 0 | Step: 137110 | Dataset: 0-1219584 | Loss: 0.780 | 912 ms/step , 6896.27 GFLOP/s , 15282.3 tokens/s 
INFO:__main__:2024-11-05 08:20:00 | Epoch: 0 | Step: 137120 | Dataset: 0-1219904 | Loss: 0.623 | 913 ms/step , 6890.54 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 08:20:09 | Epoch: 0 | Step: 137130 | Dataset: 0-1220224 | Loss: 0.674 | 913 ms/step , 6890.30 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 08:20:18 | Epoch: 0 | Step: 137140 | Dataset: 0-1220544 | Loss: 0.637 | 912 ms/step , 6899.43 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 08:20:28 | Epoch: 0 | Step: 137150 | Dataset: 0-1220864 | Loss: 0.844 | 913 ms/step , 6889.37 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 08:20:37 | Epoch: 0 | Step: 137160 | Dataset: 0-1221184 | Loss: 0.761 | 911 ms/step , 6901.57 GFLOP/s , 17953.9 tokens/s INFO:__main__:2024-11-05 08:20:46 | Epoch: 0 | Step: 137170 | Dataset: 0-1221504 | Loss: 0.786 | 912 ms/step , 6893.47 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 08:20:55 | Epoch: 0 | Step: 137180 | Dataset: 0-1221824 | Loss: 0.675 | 912 ms/step , 6899.03 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 08:21:04 | Epoch: 0 | Step: 137190 | Dataset: 0-1222144 | Loss: 0.840 | 913 ms/step , 6891.27 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 08:21:13 | Epoch: 0 | Step: 137200 | Dataset: 0-1222464 | Loss: 0.830 | 914 ms/step , 6883.50 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 08:21:15 | Validation | Step: 137200 | Val_loss: 0.795 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:21:24 | Epoch: 0 | Step: 137210 | Dataset: 0-1222784 | Loss: 0.766 | 913 ms/step , 6888.18 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 08:21:33 | Epoch: 0 | Step: 137220 | Dataset: 0-1223104 | Loss: 0.785 | 912 ms/step , 6895.10 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 08:21:42 | Epoch: 0 | Step: 137230 | Dataset: 0-1223424 | Loss: 0.794 | 913 ms/step , 6888.75 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 08:21:51 | Epoch: 0 | Step: 137240 | Dataset: 0-1223744 | Loss: 0.689 | 912 ms/step , 6892.68 GFLOP/s , 17948.9 
tokens/s INFO:__main__:2024-11-05 08:22:00 | Epoch: 0 | Step: 137250 | Dataset: 0-1224064 | Loss: 0.720 | 911 ms/step , 6902.41 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 08:22:10 | Epoch: 0 | Step: 137260 | Dataset: 0-1224384 | Loss: 0.995 | 913 ms/step , 6889.06 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 08:22:19 | Epoch: 0 | Step: 137270 | Dataset: 0-1224704 | Loss: 0.790 | 912 ms/step , 6896.89 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 08:22:28 | Epoch: 0 | Step: 137280 | Dataset: 0-1225024 | Loss: 0.823 | 913 ms/step , 6891.92 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 08:22:37 | Epoch: 0 | Step: 137290 | Dataset: 0-1225344 | Loss: 0.797 | 913 ms/step , 6890.24 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 08:22:46 | Epoch: 0 | Step: 137300 | Dataset: 0-1225664 | Loss: 0.854 | 913 ms/step , 6890.79 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 08:22:48 | Validation | Step: 137300 | Val_loss: 0.808 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:22:57 | Epoch: 0 | Step: 137310 | Dataset: 0-1225984 | Loss: 0.821 | 912 ms/step , 6895.96 GFLOP/s , 15283.0 tokens/s INFO:__main__:2024-11-05 08:23:06 | Epoch: 0 | Step: 137320 | Dataset: 0-1226304 | Loss: 0.770 | 913 ms/step , 6891.86 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-05 08:23:15 | Epoch: 0 | Step: 137330 | Dataset: 0-1226624 | Loss: 0.733 | 913 ms/step , 6887.72 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 08:23:24 | Epoch: 0 | Step: 137340 | Dataset: 0-1226944 | Loss: 0.782 | 912 ms/step , 6899.26 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 08:23:33 | Epoch: 0 | Step: 137350 | Dataset: 0-1227264 | Loss: 0.778 | 914 ms/step , 6881.55 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 08:23:42 | Epoch: 0 | Step: 137360 | Dataset: 0-1227584 | Loss: 0.771 | 913 ms/step , 6887.90 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 08:23:52 | Epoch: 0 | Step: 137370 | Dataset: 0-1227904 | Loss: 0.696 | 913 ms/step , 6888.61 GFLOP/s , 
17938.3 tokens/s INFO:__main__:2024-11-05 08:24:01 | Epoch: 0 | Step: 137380 | Dataset: 0-1228224 | Loss: 0.572 | 912 ms/step , 6896.93 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 08:24:10 | Epoch: 0 | Step: 137390 | Dataset: 0-1228544 | Loss: 0.704 | 911 ms/step , 6901.05 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 08:24:19 | Epoch: 0 | Step: 137400 | Dataset: 0-1228864 | Loss: 0.861 | 913 ms/step , 6891.37 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 08:24:21 | Validation | Step: 137400 | Val_loss: 0.793 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:24:30 | Epoch: 0 | Step: 137410 | Dataset: 0-1229184 | Loss: 0.718 | 913 ms/step , 6888.56 GFLOP/s , 15279.9 tokens/s INFO:__main__:2024-11-05 08:24:39 | Epoch: 0 | Step: 137420 | Dataset: 0-1229504 | Loss: 0.682 | 913 ms/step , 6889.45 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 08:24:48 | Epoch: 0 | Step: 137430 | Dataset: 0-1229824 | Loss: 0.683 | 913 ms/step , 6889.77 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 08:24:57 | Epoch: 0 | Step: 137440 | Dataset: 0-1230144 | Loss: 0.768 | 912 ms/step , 6893.09 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 08:25:06 | Epoch: 0 | Step: 137450 | Dataset: 0-1230464 | Loss: 0.823 | 913 ms/step , 6890.53 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 08:25:15 | Epoch: 0 | Step: 137460 | Dataset: 0-1230784 | Loss: 0.710 | 912 ms/step , 6893.29 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 08:25:25 | Epoch: 0 | Step: 137470 | Dataset: 0-1231104 | Loss: 0.697 | 913 ms/step , 6888.91 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 08:25:34 | Epoch: 0 | Step: 137480 | Dataset: 0-1231424 | Loss: 0.658 | 913 ms/step , 6891.56 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 08:25:43 | Epoch: 0 | Step: 137490 | Dataset: 0-1231744 | Loss: 0.717 | 913 ms/step , 6889.61 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 08:25:52 | Epoch: 0 | Step: 137500 | Dataset: 0-1232064 | Loss: 0.688 | 913 ms/step , 6891.08 GFLOP/s 
, 17934.2 tokens/s INFO:__main__:2024-11-05 08:25:54 | Validation | Step: 137500 | Val_loss: 0.798 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:26:03 | Epoch: 0 | Step: 137510 | Dataset: 0-1232384 | Loss: 0.766 | 913 ms/step , 6888.97 GFLOP/s , 15274.7 tokens/s INFO:__main__:2024-11-05 08:26:12 | Epoch: 0 | Step: 137520 | Dataset: 0-1232704 | Loss: 0.657 | 913 ms/step , 6892.47 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 08:26:21 | Epoch: 0 | Step: 137530 | Dataset: 0-1233024 | Loss: 0.787 | 912 ms/step , 6896.64 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 08:26:30 | Epoch: 0 | Step: 137540 | Dataset: 0-1233344 | Loss: 0.781 | 913 ms/step , 6890.76 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 08:26:39 | Epoch: 0 | Step: 137550 | Dataset: 0-1233664 | Loss: 0.814 | 913 ms/step , 6890.55 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 08:26:48 | Epoch: 0 | Step: 137560 | Dataset: 0-1233984 | Loss: 0.835 | 913 ms/step , 6892.07 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 08:26:57 | Epoch: 0 | Step: 137570 | Dataset: 0-1234304 | Loss: 0.798 | 914 ms/step , 6879.80 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:27:07 | Epoch: 0 | Step: 137580 | Dataset: 0-1234624 | Loss: 0.853 | 913 ms/step , 6891.17 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 08:27:16 | Epoch: 0 | Step: 137590 | Dataset: 0-1234944 | Loss: 0.819 | 912 ms/step , 6894.43 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-05 08:27:25 | Epoch: 0 | Step: 137600 | Dataset: 0-1235264 | Loss: 0.821 | 912 ms/step , 6893.54 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 08:27:26 | Validation | Step: 137600 | Val_loss: 0.800 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:27:36 | Epoch: 0 | Step: 137610 | Dataset: 0-1235584 | Loss: 0.756 | 914 ms/step , 6883.54 GFLOP/s , 15291.7 tokens/s INFO:__main__:2024-11-05 08:27:45 | Epoch: 0 | Step: 137620 | Dataset: 0-1235904 | Loss: 0.722 | 913 ms/step , 6889.95 GFLOP/s , 17940.4 tokens/s 
INFO:__main__:2024-11-05 08:27:54 | Epoch: 0 | Step: 137630 | Dataset: 0-1236224 | Loss: 0.677 | 914 ms/step , 6883.75 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 08:28:03 | Epoch: 0 | Step: 137640 | Dataset: 0-1236544 | Loss: 0.696 | 912 ms/step , 6897.21 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 08:28:12 | Epoch: 0 | Step: 137650 | Dataset: 0-1236864 | Loss: 0.762 | 914 ms/step , 6884.24 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 08:28:21 | Epoch: 0 | Step: 137660 | Dataset: 0-1237184 | Loss: 0.672 | 912 ms/step , 6894.40 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 08:28:30 | Epoch: 0 | Step: 137670 | Dataset: 0-1237504 | Loss: 0.709 | 912 ms/step , 6892.80 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 08:28:40 | Epoch: 0 | Step: 137680 | Dataset: 0-1237824 | Loss: 0.878 | 913 ms/step , 6887.81 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 08:28:49 | Epoch: 0 | Step: 137690 | Dataset: 0-1238144 | Loss: 0.732 | 914 ms/step , 6884.86 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 08:28:58 | Epoch: 0 | Step: 137700 | Dataset: 0-1238464 | Loss: 0.803 | 912 ms/step , 6898.24 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 08:28:59 | Validation | Step: 137700 | Val_loss: 0.791 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:29:09 | Epoch: 0 | Step: 137710 | Dataset: 0-1238784 | Loss: 0.784 | 912 ms/step , 6895.63 GFLOP/s , 15279.8 tokens/s INFO:__main__:2024-11-05 08:29:18 | Epoch: 0 | Step: 137720 | Dataset: 0-1239104 | Loss: 0.682 | 914 ms/step , 6879.27 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 08:29:27 | Epoch: 0 | Step: 137730 | Dataset: 0-1239424 | Loss: 0.730 | 913 ms/step , 6889.70 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 08:29:36 | Epoch: 0 | Step: 137740 | Dataset: 0-1239744 | Loss: 0.774 | 913 ms/step , 6891.17 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 08:29:45 | Epoch: 0 | Step: 137750 | Dataset: 0-1240064 | Loss: 0.817 | 912 ms/step , 6894.96 GFLOP/s , 17935.0 
tokens/s INFO:__main__:2024-11-05 08:29:54 | Epoch: 0 | Step: 137760 | Dataset: 0-1240384 | Loss: 0.784 | 913 ms/step , 6885.18 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 08:30:03 | Epoch: 0 | Step: 137770 | Dataset: 0-1240704 | Loss: 0.734 | 914 ms/step , 6882.85 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 08:30:12 | Epoch: 0 | Step: 137780 | Dataset: 0-1241024 | Loss: 0.717 | 913 ms/step , 6889.66 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 08:30:22 | Epoch: 0 | Step: 137790 | Dataset: 0-1241344 | Loss: 0.746 | 914 ms/step , 6883.90 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 08:30:31 | Epoch: 0 | Step: 137800 | Dataset: 0-1241664 | Loss: 0.780 | 912 ms/step , 6893.36 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 08:30:32 | Validation | Step: 137800 | Val_loss: 0.786 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:30:41 | Epoch: 0 | Step: 137810 | Dataset: 0-1241984 | Loss: 0.761 | 916 ms/step , 6869.93 GFLOP/s , 15279.2 tokens/s INFO:__main__:2024-11-05 08:30:51 | Epoch: 0 | Step: 137820 | Dataset: 0-1242304 | Loss: 0.744 | 914 ms/step , 6883.84 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 08:31:00 | Epoch: 0 | Step: 137830 | Dataset: 0-1242624 | Loss: 0.807 | 914 ms/step , 6883.67 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 08:31:09 | Epoch: 0 | Step: 137840 | Dataset: 0-1242944 | Loss: 0.685 | 913 ms/step , 6886.37 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 08:31:18 | Epoch: 0 | Step: 137850 | Dataset: 0-1243264 | Loss: 0.833 | 914 ms/step , 6878.24 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 08:31:27 | Epoch: 0 | Step: 137860 | Dataset: 0-1243584 | Loss: 0.746 | 913 ms/step , 6885.87 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 08:31:36 | Epoch: 0 | Step: 137870 | Dataset: 0-1243904 | Loss: 0.724 | 913 ms/step , 6889.96 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 08:31:45 | Epoch: 0 | Step: 137880 | Dataset: 0-1244224 | Loss: 0.743 | 912 ms/step , 6892.94 GFLOP/s , 
17937.2 tokens/s INFO:__main__:2024-11-05 08:31:55 | Epoch: 0 | Step: 137890 | Dataset: 0-1244544 | Loss: 0.746 | 913 ms/step , 6891.05 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 08:32:04 | Epoch: 0 | Step: 137900 | Dataset: 0-1244864 | Loss: 0.726 | 913 ms/step , 6888.37 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 08:32:05 | Validation | Step: 137900 | Val_loss: 0.778 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:32:14 | Epoch: 0 | Step: 137910 | Dataset: 0-1245184 | Loss: 0.849 | 913 ms/step , 6889.87 GFLOP/s , 15277.2 tokens/s INFO:__main__:2024-11-05 08:32:24 | Epoch: 0 | Step: 137920 | Dataset: 0-1245504 | Loss: 0.809 | 912 ms/step , 6895.04 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 08:32:33 | Epoch: 0 | Step: 137930 | Dataset: 0-1245824 | Loss: 0.800 | 913 ms/step , 6887.44 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 08:32:42 | Epoch: 0 | Step: 137940 | Dataset: 0-1246144 | Loss: 0.746 | 913 ms/step , 6887.20 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 08:32:51 | Epoch: 0 | Step: 137950 | Dataset: 0-1246464 | Loss: 0.728 | 914 ms/step , 6881.95 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 08:33:00 | Epoch: 0 | Step: 137960 | Dataset: 0-1246784 | Loss: 0.777 | 911 ms/step , 6903.00 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 08:33:09 | Epoch: 0 | Step: 137970 | Dataset: 0-1247104 | Loss: 0.760 | 913 ms/step , 6885.66 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 08:33:18 | Epoch: 0 | Step: 137980 | Dataset: 0-1247424 | Loss: 0.742 | 912 ms/step , 6896.68 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 08:33:27 | Epoch: 0 | Step: 137990 | Dataset: 0-1247744 | Loss: 0.796 | 913 ms/step , 6888.10 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 08:33:37 | Epoch: 0 | Step: 138000 | Dataset: 0-1248064 | Loss: 0.703 | 913 ms/step , 6885.73 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 08:33:38 | Validation | Step: 138000 | Val_loss: 0.780 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 08:33:38 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_083338_step_138000.pt` INFO:__main__:2024-11-05 08:33:48 | Epoch: 0 | Step: 138010 | Dataset: 0-1248384 | Loss: 0.774 | 913 ms/step , 6887.63 GFLOP/s , 13783.3 tokens/s INFO:__main__:2024-11-05 08:33:58 | Epoch: 0 | Step: 138020 | Dataset: 0-1248704 | Loss: 0.773 | 913 ms/step , 6888.81 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 08:34:07 | Epoch: 0 | Step: 138030 | Dataset: 0-1249024 | Loss: 0.779 | 913 ms/step , 6889.40 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 08:34:16 | Epoch: 0 | Step: 138040 | Dataset: 0-1249344 | Loss: 0.752 | 913 ms/step , 6885.09 GFLOP/s , 17907.6 tokens/s INFO:__main__:2024-11-05 08:34:25 | Epoch: 0 | Step: 138050 | Dataset: 0-1249664 | Loss: 0.705 | 914 ms/step , 6879.33 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 08:34:34 | Epoch: 0 | Step: 138060 | Dataset: 0-1249984 | Loss: 0.787 | 913 ms/step , 6887.61 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 08:34:43 | Epoch: 0 | Step: 138070 | Dataset: 0-1250304 | Loss: 0.608 | 912 ms/step , 6897.17 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 08:34:52 | Epoch: 0 | Step: 138080 | Dataset: 0-1250624 | Loss: 0.778 | 913 ms/step , 6891.03 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 08:35:02 | Epoch: 0 | Step: 138090 | Dataset: 0-1250944 | Loss: 0.702 | 913 ms/step , 6887.74 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 08:35:11 | Epoch: 0 | Step: 138100 | Dataset: 0-1251264 | Loss: 0.778 | 914 ms/step , 6884.63 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 08:35:12 | Validation | Step: 138100 | Val_loss: 0.781 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:35:21 | Epoch: 0 | Step: 138110 | Dataset: 0-1251584 | Loss: 0.809 | 913 ms/step , 6888.46 GFLOP/s , 15282.4 tokens/s INFO:__main__:2024-11-05 08:35:31 | Epoch: 0 | Step: 138120 | Dataset: 0-1251904 | Loss: 0.702 | 912 ms/step , 6894.83 GFLOP/s , 17935.0 tokens/s 
INFO:__main__:2024-11-05 08:35:40 | Epoch: 0 | Step: 138130 | Dataset: 0-1252224 | Loss: 0.825 | 913 ms/step , 6889.64 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 08:35:49 | Epoch: 0 | Step: 138140 | Dataset: 0-1252544 | Loss: 0.761 | 913 ms/step , 6891.63 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 08:35:58 | Epoch: 0 | Step: 138150 | Dataset: 0-1252864 | Loss: 0.758 | 913 ms/step , 6886.91 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 08:36:07 | Epoch: 0 | Step: 138160 | Dataset: 0-1253184 | Loss: 0.813 | 914 ms/step , 6883.48 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 08:36:16 | Epoch: 0 | Step: 138170 | Dataset: 0-1253504 | Loss: 0.742 | 913 ms/step , 6885.12 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 08:36:25 | Epoch: 0 | Step: 138180 | Dataset: 0-1253824 | Loss: 0.793 | 913 ms/step , 6885.18 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 08:36:35 | Epoch: 0 | Step: 138190 | Dataset: 0-1254144 | Loss: 0.640 | 912 ms/step , 6894.56 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:36:44 | Epoch: 0 | Step: 138200 | Dataset: 0-1254464 | Loss: 0.705 | 913 ms/step , 6890.12 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 08:36:45 | Validation | Step: 138200 | Val_loss: 0.825 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:36:54 | Epoch: 0 | Step: 138210 | Dataset: 0-1254784 | Loss: 0.785 | 913 ms/step , 6885.86 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-05 08:37:03 | Epoch: 0 | Step: 138220 | Dataset: 0-1255104 | Loss: 0.726 | 913 ms/step , 6886.75 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 08:37:13 | Epoch: 0 | Step: 138230 | Dataset: 0-1255424 | Loss: 0.673 | 913 ms/step , 6885.69 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 08:37:22 | Epoch: 0 | Step: 138240 | Dataset: 0-1255744 | Loss: 0.784 | 912 ms/step , 6894.01 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 08:37:31 | Epoch: 0 | Step: 138250 | Dataset: 0-1256064 | Loss: 0.708 | 912 ms/step , 6895.21 GFLOP/s , 17929.7 
tokens/s INFO:__main__:2024-11-05 08:37:40 | Epoch: 0 | Step: 138260 | Dataset: 0-1256384 | Loss: 0.692 | 914 ms/step , 6884.37 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 08:37:49 | Epoch: 0 | Step: 138270 | Dataset: 0-1256704 | Loss: 0.750 | 913 ms/step , 6891.67 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 08:37:58 | Epoch: 0 | Step: 138280 | Dataset: 0-1257024 | Loss: 0.791 | 913 ms/step , 6889.64 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 08:38:07 | Epoch: 0 | Step: 138290 | Dataset: 0-1257344 | Loss: 0.719 | 913 ms/step , 6892.04 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 08:38:17 | Epoch: 0 | Step: 138300 | Dataset: 0-1257664 | Loss: 0.753 | 911 ms/step , 6905.27 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 08:38:18 | Validation | Step: 138300 | Val_loss: 0.806 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:38:27 | Epoch: 0 | Step: 138310 | Dataset: 0-1257984 | Loss: 0.822 | 912 ms/step , 6892.60 GFLOP/s , 15276.3 tokens/s INFO:__main__:2024-11-05 08:38:36 | Epoch: 0 | Step: 138320 | Dataset: 0-1258304 | Loss: 0.824 | 914 ms/step , 6881.70 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 08:38:46 | Epoch: 0 | Step: 138330 | Dataset: 0-1258624 | Loss: 0.724 | 912 ms/step , 6898.70 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 08:38:55 | Epoch: 0 | Step: 138340 | Dataset: 0-1258944 | Loss: 0.835 | 913 ms/step , 6892.07 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 08:39:04 | Epoch: 0 | Step: 138350 | Dataset: 0-1259264 | Loss: 0.795 | 913 ms/step , 6886.09 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 08:39:13 | Epoch: 0 | Step: 138360 | Dataset: 0-1259584 | Loss: 0.768 | 913 ms/step , 6888.31 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:39:22 | Epoch: 0 | Step: 138370 | Dataset: 0-1259904 | Loss: 0.878 | 914 ms/step , 6882.48 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 08:39:31 | Epoch: 0 | Step: 138380 | Dataset: 0-1260224 | Loss: 0.701 | 913 ms/step , 6890.01 GFLOP/s , 
17931.3 tokens/s INFO:__main__:2024-11-05 08:39:40 | Epoch: 0 | Step: 138390 | Dataset: 0-1260544 | Loss: 0.792 | 912 ms/step , 6898.55 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 08:39:50 | Epoch: 0 | Step: 138400 | Dataset: 0-1260864 | Loss: 0.712 | 914 ms/step , 6879.07 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 08:39:51 | Validation | Step: 138400 | Val_loss: 0.776 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:40:00 | Epoch: 0 | Step: 138410 | Dataset: 0-1261184 | Loss: 0.790 | 913 ms/step , 6888.10 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-05 08:40:09 | Epoch: 0 | Step: 138420 | Dataset: 0-1261504 | Loss: 0.656 | 911 ms/step , 6900.76 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 08:40:19 | Epoch: 0 | Step: 138430 | Dataset: 0-1261824 | Loss: 0.844 | 913 ms/step , 6888.08 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 08:40:28 | Epoch: 0 | Step: 138440 | Dataset: 0-1262144 | Loss: 0.705 | 913 ms/step , 6886.17 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 08:40:37 | Epoch: 0 | Step: 138450 | Dataset: 0-1262464 | Loss: 0.767 | 914 ms/step , 6880.01 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 08:40:46 | Epoch: 0 | Step: 138460 | Dataset: 0-1262784 | Loss: 0.743 | 914 ms/step , 6880.00 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 08:40:55 | Epoch: 0 | Step: 138470 | Dataset: 0-1263104 | Loss: 0.855 | 914 ms/step , 6883.93 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 08:41:04 | Epoch: 0 | Step: 138480 | Dataset: 0-1263424 | Loss: 0.700 | 914 ms/step , 6880.23 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 08:41:13 | Epoch: 0 | Step: 138490 | Dataset: 0-1263744 | Loss: 0.711 | 914 ms/step , 6883.17 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 08:41:22 | Epoch: 0 | Step: 138500 | Dataset: 0-1264064 | Loss: 0.789 | 913 ms/step , 6887.13 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 08:41:24 | Validation | Step: 138500 | Val_loss: 0.774 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 08:41:33 | Epoch: 0 | Step: 138510 | Dataset: 0-1264384 | Loss: 0.770 | 913 ms/step , 6890.92 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 08:41:42 | Epoch: 0 | Step: 138520 | Dataset: 0-1264704 | Loss: 0.676 | 913 ms/step , 6889.74 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 08:41:51 | Epoch: 0 | Step: 138530 | Dataset: 0-1265024 | Loss: 0.766 | 912 ms/step , 6893.20 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 08:42:01 | Epoch: 0 | Step: 138540 | Dataset: 0-1265344 | Loss: 0.842 | 913 ms/step , 6887.72 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 08:42:10 | Epoch: 0 | Step: 138550 | Dataset: 0-1265664 | Loss: 0.682 | 912 ms/step , 6893.62 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 08:42:19 | Epoch: 0 | Step: 138560 | Dataset: 0-1265984 | Loss: 0.789 | 913 ms/step , 6888.06 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 08:42:28 | Epoch: 0 | Step: 138570 | Dataset: 0-1266304 | Loss: 0.803 | 913 ms/step , 6889.02 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 08:42:37 | Epoch: 0 | Step: 138580 | Dataset: 0-1266624 | Loss: 0.879 | 912 ms/step , 6893.43 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 08:42:46 | Epoch: 0 | Step: 138590 | Dataset: 0-1266944 | Loss: 0.823 | 913 ms/step , 6890.94 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 08:42:55 | Epoch: 0 | Step: 138600 | Dataset: 0-1267264 | Loss: 0.665 | 912 ms/step , 6897.15 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 08:42:57 | Validation | Step: 138600 | Val_loss: 0.785 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:43:06 | Epoch: 0 | Step: 138610 | Dataset: 0-1267584 | Loss: 0.817 | 913 ms/step , 6887.01 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-05 08:43:15 | Epoch: 0 | Step: 138620 | Dataset: 0-1267904 | Loss: 0.755 | 913 ms/step , 6886.74 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 08:43:24 | Epoch: 0 | Step: 138630 | Dataset: 0-1268224 | Loss: 0.753 | 916 ms/step , 6866.94 GFLOP/s , 17934.4 
tokens/s INFO:__main__:2024-11-05 08:43:34 | Epoch: 0 | Step: 138640 | Dataset: 0-1268544 | Loss: 0.801 | 914 ms/step , 6882.86 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 08:43:43 | Epoch: 0 | Step: 138650 | Dataset: 0-1268864 | Loss: 0.664 | 912 ms/step , 6893.18 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 08:43:52 | Epoch: 0 | Step: 138660 | Dataset: 0-1269184 | Loss: 0.657 | 913 ms/step , 6887.06 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 08:44:01 | Epoch: 0 | Step: 138670 | Dataset: 0-1269504 | Loss: 0.735 | 913 ms/step , 6887.88 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 08:44:10 | Epoch: 0 | Step: 138680 | Dataset: 0-1269824 | Loss: 0.773 | 913 ms/step , 6887.50 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 08:44:19 | Epoch: 0 | Step: 138690 | Dataset: 0-1270144 | Loss: 0.775 | 914 ms/step , 6883.55 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 08:44:28 | Epoch: 0 | Step: 138700 | Dataset: 0-1270464 | Loss: 0.834 | 913 ms/step , 6886.01 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 08:44:30 | Validation | Step: 138700 | Val_loss: 0.783 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:44:39 | Epoch: 0 | Step: 138710 | Dataset: 0-1270784 | Loss: 0.735 | 914 ms/step , 6884.11 GFLOP/s , 15274.5 tokens/s INFO:__main__:2024-11-05 08:44:48 | Epoch: 0 | Step: 138720 | Dataset: 0-1271104 | Loss: 0.665 | 913 ms/step , 6891.51 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 08:44:57 | Epoch: 0 | Step: 138730 | Dataset: 0-1271424 | Loss: 0.710 | 913 ms/step , 6889.36 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 08:45:06 | Epoch: 0 | Step: 138740 | Dataset: 0-1271744 | Loss: 0.789 | 915 ms/step , 6877.03 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:45:16 | Epoch: 0 | Step: 138750 | Dataset: 0-1272064 | Loss: 0.853 | 914 ms/step , 6883.09 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 08:45:25 | Epoch: 0 | Step: 138760 | Dataset: 0-1272384 | Loss: 0.865 | 913 ms/step , 6886.72 GFLOP/s , 
17933.1 tokens/s INFO:__main__:2024-11-05 08:45:34 | Epoch: 0 | Step: 138770 | Dataset: 0-1272704 | Loss: 0.773 | 914 ms/step , 6884.66 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 08:45:43 | Epoch: 0 | Step: 138780 | Dataset: 0-1273024 | Loss: 0.757 | 914 ms/step , 6879.22 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 08:45:52 | Epoch: 0 | Step: 138790 | Dataset: 0-1273344 | Loss: 0.673 | 913 ms/step , 6891.41 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 08:46:01 | Epoch: 0 | Step: 138800 | Dataset: 0-1273664 | Loss: 0.807 | 913 ms/step , 6891.74 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 08:46:03 | Validation | Step: 138800 | Val_loss: 0.803 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:46:12 | Epoch: 0 | Step: 138810 | Dataset: 0-1273984 | Loss: 0.739 | 912 ms/step , 6895.41 GFLOP/s , 15274.5 tokens/s INFO:__main__:2024-11-05 08:46:21 | Epoch: 0 | Step: 138820 | Dataset: 0-1274304 | Loss: 0.732 | 913 ms/step , 6889.79 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 08:46:30 | Epoch: 0 | Step: 138830 | Dataset: 0-1274624 | Loss: 0.685 | 912 ms/step , 6892.97 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 08:46:39 | Epoch: 0 | Step: 138840 | Dataset: 0-1274944 | Loss: 0.720 | 912 ms/step , 6897.50 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 08:46:49 | Epoch: 0 | Step: 138850 | Dataset: 0-1275264 | Loss: 0.787 | 915 ms/step , 6877.18 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 08:46:58 | Epoch: 0 | Step: 138860 | Dataset: 0-1275584 | Loss: 0.742 | 914 ms/step , 6884.84 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 08:47:07 | Epoch: 0 | Step: 138870 | Dataset: 0-1275904 | Loss: 0.778 | 913 ms/step , 6887.52 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 08:47:16 | Epoch: 0 | Step: 138880 | Dataset: 0-1276224 | Loss: 0.675 | 913 ms/step , 6889.00 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:47:25 | Epoch: 0 | Step: 138890 | Dataset: 0-1276544 | Loss: 0.671 | 913 ms/step , 6886.59 GFLOP/s 
, 17932.6 tokens/s INFO:__main__:2024-11-05 08:47:34 | Epoch: 0 | Step: 138900 | Dataset: 0-1276864 | Loss: 0.705 | 913 ms/step , 6889.22 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 08:47:36 | Validation | Step: 138900 | Val_loss: 0.817 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:47:45 | Epoch: 0 | Step: 138910 | Dataset: 0-1277184 | Loss: 0.728 | 913 ms/step , 6890.04 GFLOP/s , 15277.9 tokens/s INFO:__main__:2024-11-05 08:47:54 | Epoch: 0 | Step: 138920 | Dataset: 0-1277504 | Loss: 0.720 | 912 ms/step , 6892.87 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 08:48:03 | Epoch: 0 | Step: 138930 | Dataset: 0-1277824 | Loss: 0.746 | 914 ms/step , 6879.05 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 08:48:12 | Epoch: 0 | Step: 138940 | Dataset: 0-1278144 | Loss: 0.697 | 913 ms/step , 6890.30 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 08:48:21 | Epoch: 0 | Step: 138950 | Dataset: 0-1278464 | Loss: 0.727 | 912 ms/step , 6896.98 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 08:48:31 | Epoch: 0 | Step: 138960 | Dataset: 0-1278784 | Loss: 0.673 | 913 ms/step , 6889.95 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 08:48:40 | Epoch: 0 | Step: 138970 | Dataset: 0-1279104 | Loss: 0.733 | 914 ms/step , 6882.29 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 08:48:49 | Epoch: 0 | Step: 138980 | Dataset: 0-1279424 | Loss: 0.595 | 913 ms/step , 6890.48 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 08:48:58 | Epoch: 0 | Step: 138990 | Dataset: 0-1279744 | Loss: 0.719 | 913 ms/step , 6890.31 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 08:49:07 | Epoch: 0 | Step: 139000 | Dataset: 0-1280064 | Loss: 0.687 | 912 ms/step , 6892.88 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 08:49:09 | Validation | Step: 139000 | Val_loss: 0.818 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:49:09 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_084909_step_139000.pt` 
INFO:__main__:2024-11-05 08:49:19 | Epoch: 0 | Step: 139010 | Dataset: 0-1280384 | Loss: 0.775 | 914 ms/step , 6884.59 GFLOP/s , 13773.5 tokens/s INFO:__main__:2024-11-05 08:49:28 | Epoch: 0 | Step: 139020 | Dataset: 0-1280704 | Loss: 0.773 | 912 ms/step , 6898.37 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 08:49:37 | Epoch: 0 | Step: 139030 | Dataset: 0-1281024 | Loss: 0.717 | 913 ms/step , 6887.90 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 08:49:46 | Epoch: 0 | Step: 139040 | Dataset: 0-1281344 | Loss: 0.703 | 914 ms/step , 6884.74 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 08:49:56 | Epoch: 0 | Step: 139050 | Dataset: 0-1281664 | Loss: 0.633 | 912 ms/step , 6892.83 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 08:50:05 | Epoch: 0 | Step: 139060 | Dataset: 0-1281984 | Loss: 0.734 | 914 ms/step , 6877.74 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 08:50:14 | Epoch: 0 | Step: 139070 | Dataset: 0-1282304 | Loss: 0.706 | 913 ms/step , 6888.24 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 08:50:23 | Epoch: 0 | Step: 139080 | Dataset: 0-1282624 | Loss: 0.814 | 913 ms/step , 6887.65 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 08:50:32 | Epoch: 0 | Step: 139090 | Dataset: 0-1282944 | Loss: 0.707 | 913 ms/step , 6891.85 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 08:50:41 | Epoch: 0 | Step: 139100 | Dataset: 0-1283264 | Loss: 0.778 | 913 ms/step , 6890.72 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 08:50:43 | Validation | Step: 139100 | Val_loss: 0.781 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:50:52 | Epoch: 0 | Step: 139110 | Dataset: 0-1283584 | Loss: 0.749 | 913 ms/step , 6885.84 GFLOP/s , 15280.8 tokens/s INFO:__main__:2024-11-05 08:51:01 | Epoch: 0 | Step: 139120 | Dataset: 0-1283904 | Loss: 0.761 | 913 ms/step , 6885.77 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 08:51:10 | Epoch: 0 | Step: 139130 | Dataset: 0-1284224 | Loss: 0.681 | 912 ms/step , 6898.37 GFLOP/s , 17940.7 
tokens/s INFO:__main__:2024-11-05 08:51:19 | Epoch: 0 | Step: 139140 | Dataset: 0-1284544 | Loss: 0.811 | 914 ms/step , 6882.03 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 08:51:29 | Epoch: 0 | Step: 139150 | Dataset: 0-1284864 | Loss: 0.775 | 913 ms/step , 6885.92 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 08:51:38 | Epoch: 0 | Step: 139160 | Dataset: 0-1285184 | Loss: 0.755 | 913 ms/step , 6887.46 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 08:51:47 | Epoch: 0 | Step: 139170 | Dataset: 0-1285504 | Loss: 0.833 | 913 ms/step , 6888.88 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 08:51:56 | Epoch: 0 | Step: 139180 | Dataset: 0-1285824 | Loss: 0.727 | 913 ms/step , 6890.29 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 08:52:05 | Epoch: 0 | Step: 139190 | Dataset: 0-1286144 | Loss: 0.708 | 914 ms/step , 6877.97 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 08:52:14 | Epoch: 0 | Step: 139200 | Dataset: 0-1286464 | Loss: 0.637 | 912 ms/step , 6900.13 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 08:52:16 | Validation | Step: 139200 | Val_loss: 0.804 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:52:25 | Epoch: 0 | Step: 139210 | Dataset: 0-1286784 | Loss: 0.626 | 912 ms/step , 6898.93 GFLOP/s , 15281.1 tokens/s INFO:__main__:2024-11-05 08:52:34 | Epoch: 0 | Step: 139220 | Dataset: 0-1287104 | Loss: 0.773 | 914 ms/step , 6884.77 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 08:52:43 | Epoch: 0 | Step: 139230 | Dataset: 0-1287424 | Loss: 0.684 | 912 ms/step , 6895.44 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 08:52:52 | Epoch: 0 | Step: 139240 | Dataset: 0-1287744 | Loss: 0.764 | 912 ms/step , 6893.27 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 08:53:01 | Epoch: 0 | Step: 139250 | Dataset: 0-1288064 | Loss: 0.729 | 913 ms/step , 6891.16 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 08:53:11 | Epoch: 0 | Step: 139260 | Dataset: 0-1288384 | Loss: 0.782 | 914 ms/step , 6884.72 GFLOP/s , 
17933.0 tokens/s INFO:__main__:2024-11-05 08:53:20 | Epoch: 0 | Step: 139270 | Dataset: 0-1288704 | Loss: 0.668 | 914 ms/step , 6881.94 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 08:53:29 | Epoch: 0 | Step: 139280 | Dataset: 0-1289024 | Loss: 0.728 | 914 ms/step , 6881.11 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 08:53:38 | Epoch: 0 | Step: 139290 | Dataset: 0-1289344 | Loss: 0.669 | 913 ms/step , 6887.04 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 08:53:47 | Epoch: 0 | Step: 139300 | Dataset: 0-1289664 | Loss: 0.703 | 912 ms/step , 6896.90 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 08:53:49 | Validation | Step: 139300 | Val_loss: 0.874 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:53:58 | Epoch: 0 | Step: 139310 | Dataset: 0-1289984 | Loss: 0.691 | 911 ms/step , 6901.06 GFLOP/s , 15283.5 tokens/s INFO:__main__:2024-11-05 08:54:07 | Epoch: 0 | Step: 139320 | Dataset: 0-1290304 | Loss: 0.574 | 912 ms/step , 6897.34 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 08:54:16 | Epoch: 0 | Step: 139330 | Dataset: 0-1290624 | Loss: 0.854 | 913 ms/step , 6889.93 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 08:54:25 | Epoch: 0 | Step: 139340 | Dataset: 0-1290944 | Loss: 0.784 | 913 ms/step , 6891.47 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:54:34 | Epoch: 0 | Step: 139350 | Dataset: 0-1291264 | Loss: 0.740 | 912 ms/step , 6897.19 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 08:54:44 | Epoch: 0 | Step: 139360 | Dataset: 0-1291584 | Loss: 0.729 | 914 ms/step , 6878.67 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 08:54:53 | Epoch: 0 | Step: 139370 | Dataset: 0-1291904 | Loss: 0.842 | 912 ms/step , 6895.90 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 08:55:02 | Epoch: 0 | Step: 139380 | Dataset: 0-1292224 | Loss: 0.816 | 914 ms/step , 6883.78 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 08:55:11 | Epoch: 0 | Step: 139390 | Dataset: 0-1292544 | Loss: 0.804 | 912 ms/step , 6893.00 GFLOP/s 
, 17937.8 tokens/s INFO:__main__:2024-11-05 08:55:20 | Epoch: 0 | Step: 139400 | Dataset: 0-1292864 | Loss: 0.752 | 913 ms/step , 6885.84 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 08:55:22 | Validation | Step: 139400 | Val_loss: 0.821 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:55:31 | Epoch: 0 | Step: 139410 | Dataset: 0-1293184 | Loss: 0.828 | 912 ms/step , 6896.27 GFLOP/s , 15281.3 tokens/s INFO:__main__:2024-11-05 08:55:40 | Epoch: 0 | Step: 139420 | Dataset: 0-1293504 | Loss: 0.762 | 915 ms/step , 6876.06 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 08:55:49 | Epoch: 0 | Step: 139430 | Dataset: 0-1293824 | Loss: 0.805 | 915 ms/step , 6873.85 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 08:55:58 | Epoch: 0 | Step: 139440 | Dataset: 0-1294144 | Loss: 0.773 | 912 ms/step , 6894.28 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 08:56:07 | Epoch: 0 | Step: 139450 | Dataset: 0-1294464 | Loss: 0.665 | 913 ms/step , 6888.62 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 08:56:16 | Epoch: 0 | Step: 139460 | Dataset: 0-1294784 | Loss: 0.757 | 913 ms/step , 6886.22 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 08:56:26 | Epoch: 0 | Step: 139470 | Dataset: 0-1295104 | Loss: 0.763 | 912 ms/step , 6893.19 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 08:56:35 | Epoch: 0 | Step: 139480 | Dataset: 0-1295424 | Loss: 0.698 | 912 ms/step , 6892.79 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 08:56:44 | Epoch: 0 | Step: 139490 | Dataset: 0-1295744 | Loss: 0.661 | 914 ms/step , 6880.98 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 08:56:53 | Epoch: 0 | Step: 139500 | Dataset: 0-1296064 | Loss: 0.720 | 913 ms/step , 6886.13 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 08:56:55 | Validation | Step: 139500 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:57:04 | Epoch: 0 | Step: 139510 | Dataset: 0-1296384 | Loss: 0.702 | 912 ms/step , 6900.08 GFLOP/s , 15288.4 tokens/s 
INFO:__main__:2024-11-05 08:57:13 | Epoch: 0 | Step: 139520 | Dataset: 0-1296704 | Loss: 0.714 | 912 ms/step , 6898.81 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 08:57:22 | Epoch: 0 | Step: 139530 | Dataset: 0-1297024 | Loss: 0.734 | 912 ms/step , 6897.86 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 08:57:31 | Epoch: 0 | Step: 139540 | Dataset: 0-1297344 | Loss: 0.823 | 915 ms/step , 6877.25 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 08:57:40 | Epoch: 0 | Step: 139550 | Dataset: 0-1297664 | Loss: 0.760 | 915 ms/step , 6876.84 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 08:57:49 | Epoch: 0 | Step: 139560 | Dataset: 0-1297984 | Loss: 0.737 | 913 ms/step , 6891.40 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 08:57:59 | Epoch: 0 | Step: 139570 | Dataset: 0-1298304 | Loss: 0.684 | 911 ms/step , 6900.95 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 08:58:08 | Epoch: 0 | Step: 139580 | Dataset: 0-1298624 | Loss: 0.641 | 913 ms/step , 6889.94 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 08:58:17 | Epoch: 0 | Step: 139590 | Dataset: 0-1298944 | Loss: 0.732 | 912 ms/step , 6894.23 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 08:58:26 | Epoch: 0 | Step: 139600 | Dataset: 0-1299264 | Loss: 0.839 | 912 ms/step , 6895.09 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 08:58:28 | Validation | Step: 139600 | Val_loss: 0.788 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 08:58:37 | Epoch: 0 | Step: 139610 | Dataset: 0-1299584 | Loss: 0.694 | 912 ms/step , 6895.97 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-05 08:58:46 | Epoch: 0 | Step: 139620 | Dataset: 0-1299904 | Loss: 0.752 | 913 ms/step , 6885.66 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 08:58:55 | Epoch: 0 | Step: 139630 | Dataset: 0-1300224 | Loss: 0.730 | 913 ms/step , 6888.55 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 08:59:04 | Epoch: 0 | Step: 139640 | Dataset: 0-1300544 | Loss: 0.739 | 914 ms/step , 6881.75 GFLOP/s , 17933.0 
tokens/s INFO:__main__:2024-11-05 08:59:13 | Epoch: 0 | Step: 139650 | Dataset: 0-1300864 | Loss: 0.724 | 914 ms/step , 6882.77 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 08:59:22 | Epoch: 0 | Step: 139660 | Dataset: 0-1301184 | Loss: 0.775 | 912 ms/step , 6892.70 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 08:59:31 | Epoch: 0 | Step: 139670 | Dataset: 0-1301504 | Loss: 0.779 | 913 ms/step , 6885.32 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 08:59:41 | Epoch: 0 | Step: 139680 | Dataset: 0-1301824 | Loss: 0.793 | 913 ms/step , 6891.02 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 08:59:50 | Epoch: 0 | Step: 139690 | Dataset: 0-1302144 | Loss: 0.681 | 913 ms/step , 6888.47 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 08:59:59 | Epoch: 0 | Step: 139700 | Dataset: 0-1302464 | Loss: 0.756 | 913 ms/step , 6886.19 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 09:00:00 | Validation | Step: 139700 | Val_loss: 0.769 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:00:10 | Epoch: 0 | Step: 139710 | Dataset: 0-1302784 | Loss: 0.778 | 912 ms/step , 6894.14 GFLOP/s , 15274.7 tokens/s INFO:__main__:2024-11-05 09:00:19 | Epoch: 0 | Step: 139720 | Dataset: 0-1303104 | Loss: 0.761 | 912 ms/step , 6893.76 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 09:00:28 | Epoch: 0 | Step: 139730 | Dataset: 0-1303424 | Loss: 0.774 | 912 ms/step , 6893.02 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 09:00:37 | Epoch: 0 | Step: 139740 | Dataset: 0-1303744 | Loss: 0.741 | 913 ms/step , 6885.56 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 09:00:46 | Epoch: 0 | Step: 139750 | Dataset: 0-1304064 | Loss: 0.755 | 912 ms/step , 6895.52 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 09:00:55 | Epoch: 0 | Step: 139760 | Dataset: 0-1304384 | Loss: 0.636 | 912 ms/step , 6896.18 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 09:01:04 | Epoch: 0 | Step: 139770 | Dataset: 0-1304704 | Loss: 0.779 | 914 ms/step , 6883.18 GFLOP/s , 
17932.7 tokens/s INFO:__main__:2024-11-05 09:01:14 | Epoch: 0 | Step: 139780 | Dataset: 0-1305024 | Loss: 0.714 | 914 ms/step , 6882.16 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 09:01:23 | Epoch: 0 | Step: 139790 | Dataset: 0-1305344 | Loss: 0.875 | 912 ms/step , 6893.01 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 09:01:32 | Epoch: 0 | Step: 139800 | Dataset: 0-1305664 | Loss: 0.748 | 913 ms/step , 6892.05 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 09:01:33 | Validation | Step: 139800 | Val_loss: 0.800 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:01:43 | Epoch: 0 | Step: 139810 | Dataset: 0-1305984 | Loss: 0.742 | 913 ms/step , 6888.86 GFLOP/s , 15278.6 tokens/s INFO:__main__:2024-11-05 09:01:52 | Epoch: 0 | Step: 139820 | Dataset: 0-1306304 | Loss: 0.792 | 913 ms/step , 6890.92 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 09:02:01 | Epoch: 0 | Step: 139830 | Dataset: 0-1306624 | Loss: 0.699 | 912 ms/step , 6897.74 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 09:02:10 | Epoch: 0 | Step: 139840 | Dataset: 0-1306944 | Loss: 0.696 | 913 ms/step , 6889.57 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 09:02:19 | Epoch: 0 | Step: 139850 | Dataset: 0-1307264 | Loss: 0.755 | 913 ms/step , 6889.19 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 09:02:28 | Epoch: 0 | Step: 139860 | Dataset: 0-1307584 | Loss: 0.662 | 913 ms/step , 6890.44 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 09:02:37 | Epoch: 0 | Step: 139870 | Dataset: 0-1307904 | Loss: 0.802 | 913 ms/step , 6888.12 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 09:02:46 | Epoch: 0 | Step: 139880 | Dataset: 0-1308224 | Loss: 0.750 | 913 ms/step , 6890.46 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 09:02:56 | Epoch: 0 | Step: 139890 | Dataset: 0-1308544 | Loss: 0.727 | 912 ms/step , 6897.73 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 09:03:05 | Epoch: 0 | Step: 139900 | Dataset: 0-1308864 | Loss: 0.754 | 913 ms/step , 6890.06 GFLOP/s 
, 17931.6 tokens/s INFO:__main__:2024-11-05 09:03:06 | Validation | Step: 139900 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:03:15 | Epoch: 0 | Step: 139910 | Dataset: 0-1309184 | Loss: 0.737 | 913 ms/step , 6890.25 GFLOP/s , 15278.3 tokens/s INFO:__main__:2024-11-05 09:03:25 | Epoch: 0 | Step: 139920 | Dataset: 0-1309504 | Loss: 0.647 | 912 ms/step , 6895.92 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 09:03:34 | Epoch: 0 | Step: 139930 | Dataset: 0-1309824 | Loss: 0.804 | 914 ms/step , 6884.09 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 09:03:43 | Epoch: 0 | Step: 139940 | Dataset: 0-1310144 | Loss: 0.724 | 912 ms/step , 6898.60 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 09:03:52 | Epoch: 0 | Step: 139950 | Dataset: 0-1310464 | Loss: 0.740 | 913 ms/step , 6887.35 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 09:04:01 | Epoch: 0 | Step: 139960 | Dataset: 0-1310784 | Loss: 0.803 | 913 ms/step , 6888.15 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 09:04:10 | Epoch: 0 | Step: 139970 | Dataset: 0-1311104 | Loss: 0.830 | 912 ms/step , 6894.15 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 09:04:19 | Epoch: 0 | Step: 139980 | Dataset: 0-1311424 | Loss: 0.781 | 913 ms/step , 6886.45 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 09:04:29 | Epoch: 0 | Step: 139990 | Dataset: 0-1311744 | Loss: 0.680 | 912 ms/step , 6895.67 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 09:04:38 | Epoch: 0 | Step: 140000 | Dataset: 0-1312064 | Loss: 0.828 | 914 ms/step , 6878.23 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 09:04:39 | Validation | Step: 140000 | Val_loss: 0.710 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:04:39 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_090439_step_140000.pt` INFO:__main__:2024-11-05 09:04:50 | Epoch: 0 | Step: 140010 | Dataset: 0-1312384 | Loss: 0.823 | 914 ms/step , 6882.75 GFLOP/s , 13784.1 tokens/s 
INFO:__main__:2024-11-05 09:04:59 | Epoch: 0 | Step: 140020 | Dataset: 0-1312704 | Loss: 0.678 | 913 ms/step , 6886.06 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 09:05:08 | Epoch: 0 | Step: 140030 | Dataset: 0-1313024 | Loss: 0.627 | 912 ms/step , 6893.07 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 09:05:17 | Epoch: 0 | Step: 140040 | Dataset: 0-1313344 | Loss: 0.625 | 912 ms/step , 6893.22 GFLOP/s , 17892.5 tokens/s INFO:__main__:2024-11-05 09:05:26 | Epoch: 0 | Step: 140050 | Dataset: 0-1313664 | Loss: 0.761 | 912 ms/step , 6893.34 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 09:05:35 | Epoch: 0 | Step: 140060 | Dataset: 0-1313984 | Loss: 0.649 | 912 ms/step , 6896.55 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 09:05:44 | Epoch: 0 | Step: 140070 | Dataset: 0-1314304 | Loss: 0.661 | 913 ms/step , 6891.00 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 09:05:54 | Epoch: 0 | Step: 140080 | Dataset: 0-1314624 | Loss: 0.721 | 913 ms/step , 6887.33 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 09:06:03 | Epoch: 0 | Step: 140090 | Dataset: 0-1314944 | Loss: 0.723 | 913 ms/step , 6892.27 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 09:06:12 | Epoch: 0 | Step: 140100 | Dataset: 0-1315264 | Loss: 0.748 | 914 ms/step , 6884.81 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 09:06:13 | Validation | Step: 140100 | Val_loss: 0.719 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:06:23 | Epoch: 0 | Step: 140110 | Dataset: 0-1315584 | Loss: 0.749 | 912 ms/step , 6893.22 GFLOP/s , 15290.0 tokens/s INFO:__main__:2024-11-05 09:06:32 | Epoch: 0 | Step: 140120 | Dataset: 0-1315904 | Loss: 0.702 | 912 ms/step , 6896.39 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 09:06:41 | Epoch: 0 | Step: 140130 | Dataset: 0-1316224 | Loss: 0.784 | 912 ms/step , 6895.92 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 09:06:50 | Epoch: 0 | Step: 140140 | Dataset: 0-1316544 | Loss: 0.647 | 912 ms/step , 6896.15 GFLOP/s , 17942.4 
tokens/s INFO:__main__:2024-11-05 09:06:59 | Epoch: 0 | Step: 140150 | Dataset: 0-1316864 | Loss: 0.617 | 912 ms/step , 6899.74 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 09:07:08 | Epoch: 0 | Step: 140160 | Dataset: 0-1317184 | Loss: 0.768 | 914 ms/step , 6883.72 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 09:07:17 | Epoch: 0 | Step: 140170 | Dataset: 0-1317504 | Loss: 0.523 | 912 ms/step , 6893.29 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 09:07:26 | Epoch: 0 | Step: 140180 | Dataset: 0-1317824 | Loss: 0.700 | 912 ms/step , 6893.97 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 09:07:36 | Epoch: 0 | Step: 140190 | Dataset: 0-1318144 | Loss: 0.737 | 913 ms/step , 6891.75 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 09:07:45 | Epoch: 0 | Step: 140200 | Dataset: 0-1318464 | Loss: 0.812 | 912 ms/step , 6897.01 GFLOP/s , 17955.2 tokens/s INFO:__main__:2024-11-05 09:07:46 | Validation | Step: 140200 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:07:55 | Epoch: 0 | Step: 140210 | Dataset: 0-1318784 | Loss: 0.728 | 912 ms/step , 6893.82 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-05 09:08:05 | Epoch: 0 | Step: 140220 | Dataset: 0-1319104 | Loss: 0.875 | 913 ms/step , 6890.30 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 09:08:14 | Epoch: 0 | Step: 140230 | Dataset: 0-1319424 | Loss: 0.865 | 914 ms/step , 6879.28 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 09:08:23 | Epoch: 0 | Step: 140240 | Dataset: 0-1319744 | Loss: 0.868 | 913 ms/step , 6891.59 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 09:08:32 | Epoch: 0 | Step: 140250 | Dataset: 0-1320064 | Loss: 0.640 | 913 ms/step , 6887.88 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 09:08:41 | Epoch: 0 | Step: 140260 | Dataset: 0-1320384 | Loss: 0.836 | 912 ms/step , 6896.07 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 09:08:50 | Epoch: 0 | Step: 140270 | Dataset: 0-1320704 | Loss: 0.788 | 913 ms/step , 6890.26 GFLOP/s , 
17940.0 tokens/s INFO:__main__:2024-11-05 09:08:59 | Epoch: 0 | Step: 140280 | Dataset: 0-1321024 | Loss: 0.822 | 913 ms/step , 6886.79 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 09:09:08 | Epoch: 0 | Step: 140290 | Dataset: 0-1321344 | Loss: 0.730 | 912 ms/step , 6895.30 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 09:09:18 | Epoch: 0 | Step: 140300 | Dataset: 0-1321664 | Loss: 0.768 | 914 ms/step , 6881.54 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 09:09:19 | Validation | Step: 140300 | Val_loss: 0.712 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:09:28 | Epoch: 0 | Step: 140310 | Dataset: 0-1321984 | Loss: 0.892 | 914 ms/step , 6884.91 GFLOP/s , 15286.1 tokens/s INFO:__main__:2024-11-05 09:09:37 | Epoch: 0 | Step: 140320 | Dataset: 0-1322304 | Loss: 0.789 | 912 ms/step , 6894.96 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 09:09:47 | Epoch: 0 | Step: 140330 | Dataset: 0-1322624 | Loss: 0.707 | 911 ms/step , 6900.94 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 09:09:56 | Epoch: 0 | Step: 140340 | Dataset: 0-1322944 | Loss: 0.721 | 912 ms/step , 6897.94 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 09:10:05 | Epoch: 0 | Step: 140350 | Dataset: 0-1323264 | Loss: 0.814 | 912 ms/step , 6893.01 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 09:10:14 | Epoch: 0 | Step: 140360 | Dataset: 0-1323584 | Loss: 0.709 | 912 ms/step , 6893.26 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 09:10:23 | Epoch: 0 | Step: 140370 | Dataset: 0-1323904 | Loss: 0.679 | 913 ms/step , 6887.38 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 09:10:32 | Epoch: 0 | Step: 140380 | Dataset: 0-1324224 | Loss: 0.725 | 913 ms/step , 6892.57 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 09:10:41 | Epoch: 0 | Step: 140390 | Dataset: 0-1324544 | Loss: 0.778 | 913 ms/step , 6891.33 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 09:10:50 | Epoch: 0 | Step: 140400 | Dataset: 0-1324864 | Loss: 0.834 | 912 ms/step , 6893.98 GFLOP/s 
, 17945.3 tokens/s INFO:__main__:2024-11-05 09:10:52 | Validation | Step: 140400 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:11:01 | Epoch: 0 | Step: 140410 | Dataset: 0-1325184 | Loss: 0.701 | 912 ms/step , 6898.37 GFLOP/s , 15282.1 tokens/s INFO:__main__:2024-11-05 09:11:10 | Epoch: 0 | Step: 140420 | Dataset: 0-1325504 | Loss: 0.677 | 913 ms/step , 6885.05 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 09:11:19 | Epoch: 0 | Step: 140430 | Dataset: 0-1325824 | Loss: 0.743 | 913 ms/step , 6891.87 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 09:11:29 | Epoch: 0 | Step: 140440 | Dataset: 0-1326144 | Loss: 0.752 | 913 ms/step , 6892.36 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 09:11:38 | Epoch: 0 | Step: 140450 | Dataset: 0-1326464 | Loss: 0.767 | 913 ms/step , 6887.76 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 09:11:47 | Epoch: 0 | Step: 140460 | Dataset: 0-1326784 | Loss: 0.765 | 914 ms/step , 6884.01 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 09:11:56 | Epoch: 0 | Step: 140470 | Dataset: 0-1327104 | Loss: 0.682 | 914 ms/step , 6884.98 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 09:12:05 | Epoch: 0 | Step: 140480 | Dataset: 0-1327424 | Loss: 0.762 | 913 ms/step , 6888.58 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 09:12:14 | Epoch: 0 | Step: 140490 | Dataset: 0-1327744 | Loss: 0.770 | 912 ms/step , 6893.88 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 09:12:23 | Epoch: 0 | Step: 140500 | Dataset: 0-1328064 | Loss: 0.718 | 913 ms/step , 6891.73 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 09:12:25 | Validation | Step: 140500 | Val_loss: 0.742 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:12:34 | Epoch: 0 | Step: 140510 | Dataset: 0-1328384 | Loss: 0.775 | 913 ms/step , 6885.45 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-05 09:12:43 | Epoch: 0 | Step: 140520 | Dataset: 0-1328704 | Loss: 0.765 | 912 ms/step , 6895.85 GFLOP/s , 17934.3 tokens/s 
INFO:__main__:2024-11-05 09:12:52 | Epoch: 0 | Step: 140530 | Dataset: 0-1329024 | Loss: 0.707 | 912 ms/step , 6894.88 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 09:13:02 | Epoch: 0 | Step: 140540 | Dataset: 0-1329344 | Loss: 0.686 | 913 ms/step , 6890.72 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 09:13:11 | Epoch: 0 | Step: 140550 | Dataset: 0-1329664 | Loss: 0.742 | 912 ms/step , 6893.07 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 09:13:20 | Epoch: 0 | Step: 140560 | Dataset: 0-1329984 | Loss: 0.899 | 913 ms/step , 6887.67 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 09:13:29 | Epoch: 0 | Step: 140570 | Dataset: 0-1330304 | Loss: 0.812 | 912 ms/step , 6894.26 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 09:13:38 | Epoch: 0 | Step: 140580 | Dataset: 0-1330624 | Loss: 0.799 | 913 ms/step , 6885.41 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 09:13:47 | Epoch: 0 | Step: 140590 | Dataset: 0-1330944 | Loss: 0.764 | 915 ms/step , 6874.22 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 09:13:56 | Epoch: 0 | Step: 140600 | Dataset: 0-1331264 | Loss: 0.607 | 912 ms/step , 6895.67 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 09:13:58 | Validation | Step: 140600 | Val_loss: 0.750 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:14:07 | Epoch: 0 | Step: 140610 | Dataset: 0-1331584 | Loss: 0.852 | 915 ms/step , 6874.72 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-05 09:14:16 | Epoch: 0 | Step: 140620 | Dataset: 0-1331904 | Loss: 0.692 | 913 ms/step , 6890.67 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 09:14:25 | Epoch: 0 | Step: 140630 | Dataset: 0-1332224 | Loss: 0.761 | 913 ms/step , 6891.30 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 09:14:34 | Epoch: 0 | Step: 140640 | Dataset: 0-1332544 | Loss: 0.796 | 913 ms/step , 6887.30 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 09:14:44 | Epoch: 0 | Step: 140650 | Dataset: 0-1332864 | Loss: 0.618 | 912 ms/step , 6892.99 GFLOP/s , 17938.2 
tokens/s INFO:__main__:2024-11-05 09:14:53 | Epoch: 0 | Step: 140660 | Dataset: 0-1333184 | Loss: 0.939 | 914 ms/step , 6884.14 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 09:15:02 | Epoch: 0 | Step: 140670 | Dataset: 0-1333504 | Loss: 0.854 | 914 ms/step , 6879.45 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 09:15:11 | Epoch: 0 | Step: 140680 | Dataset: 0-1333824 | Loss: 0.764 | 914 ms/step , 6883.10 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 09:15:20 | Epoch: 0 | Step: 140690 | Dataset: 0-1334144 | Loss: 0.840 | 912 ms/step , 6894.80 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 09:15:29 | Epoch: 0 | Step: 140700 | Dataset: 0-1334464 | Loss: 0.810 | 914 ms/step , 6880.11 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 09:15:31 | Validation | Step: 140700 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:15:40 | Epoch: 0 | Step: 140710 | Dataset: 0-1334784 | Loss: 0.790 | 913 ms/step , 6889.41 GFLOP/s , 15291.6 tokens/s INFO:__main__:2024-11-05 09:15:49 | Epoch: 0 | Step: 140720 | Dataset: 0-1335104 | Loss: 0.613 | 911 ms/step , 6902.90 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 09:15:58 | Epoch: 0 | Step: 140730 | Dataset: 0-1335424 | Loss: 0.731 | 912 ms/step , 6893.21 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 09:16:07 | Epoch: 0 | Step: 140740 | Dataset: 0-1335744 | Loss: 0.749 | 912 ms/step , 6898.22 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 09:16:17 | Epoch: 0 | Step: 140750 | Dataset: 0-1336064 | Loss: 0.927 | 913 ms/step , 6891.32 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 09:16:26 | Epoch: 0 | Step: 140760 | Dataset: 0-1336384 | Loss: 0.730 | 912 ms/step , 6896.88 GFLOP/s , 17950.0 tokens/s INFO:__main__:2024-11-05 09:16:35 | Epoch: 0 | Step: 140770 | Dataset: 0-1336704 | Loss: 0.681 | 912 ms/step , 6898.07 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 09:16:44 | Epoch: 0 | Step: 140780 | Dataset: 0-1337024 | Loss: 0.776 | 913 ms/step , 6890.00 GFLOP/s , 
17936.5 tokens/s INFO:__main__:2024-11-05 09:16:53 | Epoch: 0 | Step: 140790 | Dataset: 0-1337344 | Loss: 0.738 | 913 ms/step , 6889.98 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 09:17:02 | Epoch: 0 | Step: 140800 | Dataset: 0-1337664 | Loss: 0.744 | 914 ms/step , 6883.38 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 09:17:04 | Validation | Step: 140800 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:17:13 | Epoch: 0 | Step: 140810 | Dataset: 0-1337984 | Loss: 0.595 | 912 ms/step , 6899.84 GFLOP/s , 15284.4 tokens/s INFO:__main__:2024-11-05 09:17:22 | Epoch: 0 | Step: 140820 | Dataset: 0-1338304 | Loss: 0.677 | 913 ms/step , 6891.58 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 09:17:31 | Epoch: 0 | Step: 140830 | Dataset: 0-1338624 | Loss: 0.616 | 912 ms/step , 6897.84 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 09:17:40 | Epoch: 0 | Step: 140840 | Dataset: 0-1338944 | Loss: 0.735 | 912 ms/step , 6893.07 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 09:17:49 | Epoch: 0 | Step: 140850 | Dataset: 0-1339264 | Loss: 0.806 | 913 ms/step , 6891.12 GFLOP/s , 17945.3 tokens/s INFO:__main__:2024-11-05 09:17:59 | Epoch: 0 | Step: 140860 | Dataset: 0-1339584 | Loss: 0.466 | 911 ms/step , 6904.62 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 09:18:08 | Epoch: 0 | Step: 140870 | Dataset: 0-1339904 | Loss: 0.645 | 912 ms/step , 6895.17 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 09:18:17 | Epoch: 0 | Step: 140880 | Dataset: 0-1340224 | Loss: 0.734 | 912 ms/step , 6894.74 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 09:18:26 | Epoch: 0 | Step: 140890 | Dataset: 0-1340544 | Loss: 0.747 | 913 ms/step , 6887.47 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 09:18:35 | Epoch: 0 | Step: 140900 | Dataset: 0-1340864 | Loss: 0.706 | 912 ms/step , 6899.10 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 09:18:37 | Validation | Step: 140900 | Val_loss: 0.736 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 09:18:46 | Epoch: 0 | Step: 140910 | Dataset: 0-1341184 | Loss: 0.823 | 913 ms/step , 6890.65 GFLOP/s , 15286.4 tokens/s INFO:__main__:2024-11-05 09:18:55 | Epoch: 0 | Step: 140920 | Dataset: 0-1341504 | Loss: 0.733 | 912 ms/step , 6892.63 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 09:19:04 | Epoch: 0 | Step: 140930 | Dataset: 0-1341824 | Loss: 0.815 | 914 ms/step , 6880.96 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 09:19:13 | Epoch: 0 | Step: 140940 | Dataset: 0-1342144 | Loss: 0.790 | 913 ms/step , 6885.47 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 09:19:22 | Epoch: 0 | Step: 140950 | Dataset: 0-1342464 | Loss: 0.752 | 914 ms/step , 6882.41 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 09:19:31 | Epoch: 0 | Step: 140960 | Dataset: 0-1342784 | Loss: 0.756 | 913 ms/step , 6888.11 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 09:19:41 | Epoch: 0 | Step: 140970 | Dataset: 0-1343104 | Loss: 0.834 | 913 ms/step , 6888.23 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 09:19:50 | Epoch: 0 | Step: 140980 | Dataset: 0-1343424 | Loss: 0.940 | 912 ms/step , 6898.44 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 09:19:59 | Epoch: 0 | Step: 140990 | Dataset: 0-1343744 | Loss: 0.726 | 913 ms/step , 6890.88 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 09:20:08 | Epoch: 0 | Step: 141000 | Dataset: 0-1344064 | Loss: 0.741 | 913 ms/step , 6891.58 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 09:20:10 | Validation | Step: 141000 | Val_loss: 0.754 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:20:10 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_092010_step_141000.pt` INFO:__main__:2024-11-05 09:20:20 | Epoch: 0 | Step: 141010 | Dataset: 0-1344384 | Loss: 0.803 | 914 ms/step , 6883.39 GFLOP/s , 13784.6 tokens/s INFO:__main__:2024-11-05 09:20:29 | Epoch: 0 | Step: 141020 | Dataset: 0-1344704 | Loss: 0.869 | 913 ms/step , 6886.36 GFLOP/s , 17936.4 tokens/s 
INFO:__main__:2024-11-05 09:20:38 | Epoch: 0 | Step: 141030 | Dataset: 0-1345024 | Loss: 0.786 | 915 ms/step , 6872.73 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 09:20:47 | Epoch: 0 | Step: 141040 | Dataset: 0-1345344 | Loss: 0.802 | 914 ms/step , 6884.52 GFLOP/s , 17885.0 tokens/s INFO:__main__:2024-11-05 09:20:56 | Epoch: 0 | Step: 141050 | Dataset: 0-1345664 | Loss: 0.696 | 915 ms/step , 6877.45 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 09:21:06 | Epoch: 0 | Step: 141060 | Dataset: 0-1345984 | Loss: 0.817 | 915 ms/step , 6876.40 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 09:21:15 | Epoch: 0 | Step: 141070 | Dataset: 0-1346304 | Loss: 0.783 | 914 ms/step , 6884.99 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 09:21:24 | Epoch: 0 | Step: 141080 | Dataset: 0-1346624 | Loss: 0.851 | 913 ms/step , 6891.78 GFLOP/s , 17946.2 tokens/s INFO:__main__:2024-11-05 09:21:33 | Epoch: 0 | Step: 141090 | Dataset: 0-1346944 | Loss: 0.748 | 912 ms/step , 6894.17 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 09:21:42 | Epoch: 0 | Step: 141100 | Dataset: 0-1347264 | Loss: 0.669 | 912 ms/step , 6895.66 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 09:21:44 | Validation | Step: 141100 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:21:53 | Epoch: 0 | Step: 141110 | Dataset: 0-1347584 | Loss: 0.712 | 912 ms/step , 6896.61 GFLOP/s , 15288.8 tokens/s INFO:__main__:2024-11-05 09:22:02 | Epoch: 0 | Step: 141120 | Dataset: 0-1347904 | Loss: 0.912 | 913 ms/step , 6887.49 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 09:22:11 | Epoch: 0 | Step: 141130 | Dataset: 0-1348224 | Loss: 0.726 | 912 ms/step , 6894.90 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-05 09:22:20 | Epoch: 0 | Step: 141140 | Dataset: 0-1348544 | Loss: 0.732 | 912 ms/step , 6894.16 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 09:22:29 | Epoch: 0 | Step: 141150 | Dataset: 0-1348864 | Loss: 0.655 | 912 ms/step , 6892.86 GFLOP/s , 17943.7 
tokens/s INFO:__main__:2024-11-05 09:22:38 | Epoch: 0 | Step: 141160 | Dataset: 0-1349184 | Loss: 0.785 | 914 ms/step , 6882.84 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 09:22:48 | Epoch: 0 | Step: 141170 | Dataset: 0-1349504 | Loss: 0.828 | 913 ms/step , 6886.32 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 09:22:57 | Epoch: 0 | Step: 141180 | Dataset: 0-1349824 | Loss: 0.699 | 913 ms/step , 6887.92 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 09:23:06 | Epoch: 0 | Step: 141190 | Dataset: 0-1350144 | Loss: 0.634 | 912 ms/step , 6897.42 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 09:23:15 | Epoch: 0 | Step: 141200 | Dataset: 0-1350464 | Loss: 0.758 | 912 ms/step , 6893.02 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 09:23:17 | Validation | Step: 141200 | Val_loss: 0.794 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:23:26 | Epoch: 0 | Step: 141210 | Dataset: 0-1350784 | Loss: 0.726 | 912 ms/step , 6897.31 GFLOP/s , 15284.0 tokens/s INFO:__main__:2024-11-05 09:23:35 | Epoch: 0 | Step: 141220 | Dataset: 0-1351104 | Loss: 0.807 | 913 ms/step , 6889.82 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 09:23:44 | Epoch: 0 | Step: 141230 | Dataset: 0-1351424 | Loss: 0.661 | 915 ms/step , 6874.24 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 09:23:53 | Epoch: 0 | Step: 141240 | Dataset: 0-1351744 | Loss: 0.876 | 913 ms/step , 6891.82 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 09:24:02 | Epoch: 0 | Step: 141250 | Dataset: 0-1352064 | Loss: 0.783 | 913 ms/step , 6890.59 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 09:24:11 | Epoch: 0 | Step: 141260 | Dataset: 0-1352384 | Loss: 0.872 | 914 ms/step , 6882.88 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 09:24:21 | Epoch: 0 | Step: 141270 | Dataset: 0-1352704 | Loss: 0.826 | 913 ms/step , 6890.44 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 09:24:30 | Epoch: 0 | Step: 141280 | Dataset: 0-1353024 | Loss: 0.891 | 913 ms/step , 6887.88 GFLOP/s , 
17948.1 tokens/s INFO:__main__:2024-11-05 09:24:39 | Epoch: 0 | Step: 141290 | Dataset: 0-1353344 | Loss: 0.689 | 915 ms/step , 6876.54 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 09:24:48 | Epoch: 0 | Step: 141300 | Dataset: 0-1353664 | Loss: 0.687 | 913 ms/step , 6889.80 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 09:24:50 | Validation | Step: 141300 | Val_loss: 0.757 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:24:59 | Epoch: 0 | Step: 141310 | Dataset: 0-1353984 | Loss: 0.681 | 913 ms/step , 6892.21 GFLOP/s , 15281.8 tokens/s INFO:__main__:2024-11-05 09:25:08 | Epoch: 0 | Step: 141320 | Dataset: 0-1354304 | Loss: 0.834 | 913 ms/step , 6886.60 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 09:25:17 | Epoch: 0 | Step: 141330 | Dataset: 0-1354624 | Loss: 0.790 | 913 ms/step , 6886.60 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 09:25:26 | Epoch: 0 | Step: 141340 | Dataset: 0-1354944 | Loss: 0.771 | 913 ms/step , 6888.19 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 09:25:35 | Epoch: 0 | Step: 141350 | Dataset: 0-1355264 | Loss: 0.736 | 912 ms/step , 6893.21 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 09:25:44 | Epoch: 0 | Step: 141360 | Dataset: 0-1355584 | Loss: 0.671 | 913 ms/step , 6891.89 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 09:25:53 | Epoch: 0 | Step: 141370 | Dataset: 0-1355904 | Loss: 0.758 | 913 ms/step , 6888.98 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 09:26:03 | Epoch: 0 | Step: 141380 | Dataset: 0-1356224 | Loss: 0.797 | 914 ms/step , 6884.82 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 09:26:12 | Epoch: 0 | Step: 141390 | Dataset: 0-1356544 | Loss: 0.788 | 913 ms/step , 6890.11 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 09:26:21 | Epoch: 0 | Step: 141400 | Dataset: 0-1356864 | Loss: 0.750 | 914 ms/step , 6878.65 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 09:26:22 | Validation | Step: 141400 | Val_loss: 0.725 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 09:26:32 | Epoch: 0 | Step: 141410 | Dataset: 0-1357184 | Loss: 0.780 | 913 ms/step , 6890.93 GFLOP/s , 15271.1 tokens/s INFO:__main__:2024-11-05 09:26:41 | Epoch: 0 | Step: 141420 | Dataset: 0-1357504 | Loss: 0.780 | 912 ms/step , 6894.66 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 09:26:50 | Epoch: 0 | Step: 141430 | Dataset: 0-1357824 | Loss: 0.815 | 913 ms/step , 6889.07 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 09:26:59 | Epoch: 0 | Step: 141440 | Dataset: 0-1358144 | Loss: 0.797 | 913 ms/step , 6891.32 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 09:27:08 | Epoch: 0 | Step: 141450 | Dataset: 0-1358464 | Loss: 0.880 | 914 ms/step , 6883.98 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 09:27:17 | Epoch: 0 | Step: 141460 | Dataset: 0-1358784 | Loss: 0.805 | 913 ms/step , 6886.18 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 09:27:26 | Epoch: 0 | Step: 141470 | Dataset: 0-1359104 | Loss: 0.818 | 913 ms/step , 6885.90 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 09:27:36 | Epoch: 0 | Step: 141480 | Dataset: 0-1359424 | Loss: 0.759 | 913 ms/step , 6890.61 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 09:27:45 | Epoch: 0 | Step: 141490 | Dataset: 0-1359744 | Loss: 0.752 | 913 ms/step , 6888.84 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 09:27:54 | Epoch: 0 | Step: 141500 | Dataset: 0-1360064 | Loss: 0.729 | 915 ms/step , 6877.10 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 09:27:55 | Validation | Step: 141500 | Val_loss: 0.728 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:28:05 | Epoch: 0 | Step: 141510 | Dataset: 0-1360384 | Loss: 0.852 | 913 ms/step , 6886.28 GFLOP/s , 15269.0 tokens/s INFO:__main__:2024-11-05 09:28:14 | Epoch: 0 | Step: 141520 | Dataset: 0-1360704 | Loss: 0.866 | 914 ms/step , 6878.53 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 09:28:23 | Epoch: 0 | Step: 141530 | Dataset: 0-1361024 | Loss: 0.779 | 913 ms/step , 6890.83 GFLOP/s , 17929.4 
tokens/s INFO:__main__:2024-11-05 09:28:32 | Epoch: 0 | Step: 141540 | Dataset: 0-1361344 | Loss: 0.835 | 914 ms/step , 6884.22 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 09:28:41 | Epoch: 0 | Step: 141550 | Dataset: 0-1361664 | Loss: 0.762 | 912 ms/step , 6897.44 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 09:28:50 | Epoch: 0 | Step: 141560 | Dataset: 0-1361984 | Loss: 0.828 | 913 ms/step , 6886.86 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 09:28:59 | Epoch: 0 | Step: 141570 | Dataset: 0-1362304 | Loss: 0.752 | 913 ms/step , 6885.90 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 09:29:09 | Epoch: 0 | Step: 141580 | Dataset: 0-1362624 | Loss: 0.787 | 914 ms/step , 6881.27 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 09:29:18 | Epoch: 0 | Step: 141590 | Dataset: 0-1362944 | Loss: 0.824 | 914 ms/step , 6879.00 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 09:29:27 | Epoch: 0 | Step: 141600 | Dataset: 0-1363264 | Loss: 0.796 | 913 ms/step , 6886.30 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 09:29:28 | Validation | Step: 141600 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:29:38 | Epoch: 0 | Step: 141610 | Dataset: 0-1363584 | Loss: 0.837 | 913 ms/step , 6889.19 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-05 09:29:47 | Epoch: 0 | Step: 141620 | Dataset: 0-1363904 | Loss: 0.798 | 913 ms/step , 6887.92 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 09:29:56 | Epoch: 0 | Step: 141630 | Dataset: 0-1364224 | Loss: 0.789 | 916 ms/step , 6869.95 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 09:30:05 | Epoch: 0 | Step: 141640 | Dataset: 0-1364544 | Loss: 0.576 | 913 ms/step , 6887.34 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 09:30:14 | Epoch: 0 | Step: 141650 | Dataset: 0-1364864 | Loss: 0.762 | 914 ms/step , 6881.74 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 09:30:23 | Epoch: 0 | Step: 141660 | Dataset: 0-1365184 | Loss: 0.748 | 914 ms/step , 6878.89 GFLOP/s , 
17918.3 tokens/s INFO:__main__:2024-11-05 09:30:32 | Epoch: 0 | Step: 141670 | Dataset: 0-1365504 | Loss: 0.801 | 914 ms/step , 6884.72 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 09:30:42 | Epoch: 0 | Step: 141680 | Dataset: 0-1365824 | Loss: 0.845 | 914 ms/step , 6881.52 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 09:30:51 | Epoch: 0 | Step: 141690 | Dataset: 0-1366144 | Loss: 0.799 | 913 ms/step , 6888.98 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 09:31:00 | Epoch: 0 | Step: 141700 | Dataset: 0-1366464 | Loss: 0.795 | 914 ms/step , 6878.40 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 09:31:01 | Validation | Step: 141700 | Val_loss: 0.753 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:31:11 | Epoch: 0 | Step: 141710 | Dataset: 0-1366784 | Loss: 0.745 | 914 ms/step , 6883.78 GFLOP/s , 15260.2 tokens/s INFO:__main__:2024-11-05 09:31:20 | Epoch: 0 | Step: 141720 | Dataset: 0-1367104 | Loss: 0.801 | 914 ms/step , 6880.64 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 09:31:29 | Epoch: 0 | Step: 141730 | Dataset: 0-1367424 | Loss: 0.874 | 915 ms/step , 6875.48 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 09:31:38 | Epoch: 0 | Step: 141740 | Dataset: 0-1367744 | Loss: 0.790 | 914 ms/step , 6883.92 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 09:31:47 | Epoch: 0 | Step: 141750 | Dataset: 0-1368064 | Loss: 0.852 | 914 ms/step , 6882.98 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 09:31:56 | Epoch: 0 | Step: 141760 | Dataset: 0-1368384 | Loss: 0.735 | 911 ms/step , 6904.90 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 09:32:05 | Epoch: 0 | Step: 141770 | Dataset: 0-1368704 | Loss: 0.789 | 913 ms/step , 6890.74 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 09:32:15 | Epoch: 0 | Step: 141780 | Dataset: 0-1369024 | Loss: 0.839 | 913 ms/step , 6888.36 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 09:32:24 | Epoch: 0 | Step: 141790 | Dataset: 0-1369344 | Loss: 0.807 | 913 ms/step , 6888.70 GFLOP/s 
, 17923.3 tokens/s INFO:__main__:2024-11-05 09:32:33 | Epoch: 0 | Step: 141800 | Dataset: 0-1369664 | Loss: 0.828 | 914 ms/step , 6884.99 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 09:32:34 | Validation | Step: 141800 | Val_loss: 0.786 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:32:44 | Epoch: 0 | Step: 141810 | Dataset: 0-1369984 | Loss: 0.793 | 914 ms/step , 6878.29 GFLOP/s , 15263.6 tokens/s INFO:__main__:2024-11-05 09:32:53 | Epoch: 0 | Step: 141820 | Dataset: 0-1370304 | Loss: 0.792 | 913 ms/step , 6890.27 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 09:33:02 | Epoch: 0 | Step: 141830 | Dataset: 0-1370624 | Loss: 0.810 | 914 ms/step , 6882.88 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 09:33:11 | Epoch: 0 | Step: 141840 | Dataset: 0-1370944 | Loss: 0.787 | 914 ms/step , 6878.81 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 09:33:20 | Epoch: 0 | Step: 141850 | Dataset: 0-1371264 | Loss: 0.821 | 913 ms/step , 6886.01 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 09:33:29 | Epoch: 0 | Step: 141860 | Dataset: 0-1371584 | Loss: 0.710 | 913 ms/step , 6891.89 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 09:33:38 | Epoch: 0 | Step: 141870 | Dataset: 0-1371904 | Loss: 0.840 | 914 ms/step , 6882.33 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 09:33:48 | Epoch: 0 | Step: 141880 | Dataset: 0-1372224 | Loss: 0.792 | 914 ms/step , 6878.74 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 09:33:57 | Epoch: 0 | Step: 141890 | Dataset: 0-1372544 | Loss: 0.817 | 915 ms/step , 6873.74 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 09:34:06 | Epoch: 0 | Step: 141900 | Dataset: 0-1372864 | Loss: 0.777 | 913 ms/step , 6890.73 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 09:34:07 | Validation | Step: 141900 | Val_loss: 0.712 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:34:17 | Epoch: 0 | Step: 141910 | Dataset: 0-1373184 | Loss: 0.815 | 914 ms/step , 6881.87 GFLOP/s , 15279.6 tokens/s 
INFO:__main__:2024-11-05 09:34:26 | Epoch: 0 | Step: 141920 | Dataset: 0-1373504 | Loss: 0.754 | 913 ms/step , 6887.63 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 09:34:35 | Epoch: 0 | Step: 141930 | Dataset: 0-1373824 | Loss: 0.829 | 914 ms/step , 6878.93 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 09:34:44 | Epoch: 0 | Step: 141940 | Dataset: 0-1374144 | Loss: 0.783 | 913 ms/step , 6887.08 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 09:34:53 | Epoch: 0 | Step: 141950 | Dataset: 0-1374464 | Loss: 0.824 | 915 ms/step , 6874.69 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 09:35:02 | Epoch: 0 | Step: 141960 | Dataset: 0-1374784 | Loss: 0.716 | 912 ms/step , 6894.16 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 09:35:11 | Epoch: 0 | Step: 141970 | Dataset: 0-1375104 | Loss: 0.754 | 913 ms/step , 6886.66 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 09:35:21 | Epoch: 0 | Step: 141980 | Dataset: 0-1375424 | Loss: 0.770 | 913 ms/step , 6886.92 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 09:35:30 | Epoch: 0 | Step: 141990 | Dataset: 0-1375744 | Loss: 0.818 | 914 ms/step , 6883.09 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 09:35:39 | Epoch: 0 | Step: 142000 | Dataset: 0-1376064 | Loss: 0.764 | 913 ms/step , 6886.50 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 09:35:40 | Validation | Step: 142000 | Val_loss: 0.736 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:35:40 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_093540_step_142000.pt` INFO:__main__:2024-11-05 09:35:51 | Epoch: 0 | Step: 142010 | Dataset: 0-1376384 | Loss: 0.888 | 914 ms/step , 6883.24 GFLOP/s , 13832.5 tokens/s INFO:__main__:2024-11-05 09:36:00 | Epoch: 0 | Step: 142020 | Dataset: 0-1376704 | Loss: 0.841 | 913 ms/step , 6886.33 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 09:36:09 | Epoch: 0 | Step: 142030 | Dataset: 0-1377024 | Loss: 0.850 | 914 ms/step , 6881.09 GFLOP/s , 17922.6 tokens/s 
INFO:__main__:2024-11-05 09:36:18 | Epoch: 0 | Step: 142040 | Dataset: 0-1377344 | Loss: 0.800 | 915 ms/step , 6875.11 GFLOP/s , 17900.1 tokens/s INFO:__main__:2024-11-05 09:36:27 | Epoch: 0 | Step: 142050 | Dataset: 0-1377664 | Loss: 0.821 | 915 ms/step , 6873.72 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 09:36:36 | Epoch: 0 | Step: 142060 | Dataset: 0-1377984 | Loss: 0.785 | 913 ms/step , 6888.05 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 09:36:46 | Epoch: 0 | Step: 142070 | Dataset: 0-1378304 | Loss: 0.761 | 913 ms/step , 6885.56 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 09:36:55 | Epoch: 0 | Step: 142080 | Dataset: 0-1378624 | Loss: 0.880 | 915 ms/step , 6877.41 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 09:37:04 | Epoch: 0 | Step: 142090 | Dataset: 0-1378944 | Loss: 0.787 | 915 ms/step , 6871.32 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 09:37:13 | Epoch: 0 | Step: 142100 | Dataset: 0-1379264 | Loss: 0.818 | 913 ms/step , 6886.93 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 09:37:15 | Validation | Step: 142100 | Val_loss: 0.721 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:37:24 | Epoch: 0 | Step: 142110 | Dataset: 0-1379584 | Loss: 0.733 | 914 ms/step , 6882.56 GFLOP/s , 15266.1 tokens/s INFO:__main__:2024-11-05 09:37:33 | Epoch: 0 | Step: 142120 | Dataset: 0-1379904 | Loss: 0.690 | 914 ms/step , 6882.52 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 09:37:42 | Epoch: 0 | Step: 142130 | Dataset: 0-1380224 | Loss: 0.763 | 913 ms/step , 6886.83 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 09:37:51 | Epoch: 0 | Step: 142140 | Dataset: 0-1380544 | Loss: 0.889 | 914 ms/step , 6885.04 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 09:38:00 | Epoch: 0 | Step: 142150 | Dataset: 0-1380864 | Loss: 0.813 | 914 ms/step , 6883.56 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 09:38:09 | Epoch: 0 | Step: 142160 | Dataset: 0-1381184 | Loss: 0.823 | 913 ms/step , 6885.72 GFLOP/s , 17934.6 
tokens/s INFO:__main__:2024-11-05 09:38:18 | Epoch: 0 | Step: 142170 | Dataset: 0-1381504 | Loss: 0.787 | 913 ms/step , 6886.04 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 09:38:28 | Epoch: 0 | Step: 142180 | Dataset: 0-1381824 | Loss: 0.814 | 912 ms/step , 6899.49 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 09:38:37 | Epoch: 0 | Step: 142190 | Dataset: 0-1382144 | Loss: 0.756 | 914 ms/step , 6880.62 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 09:38:46 | Epoch: 0 | Step: 142200 | Dataset: 0-1382464 | Loss: 0.726 | 914 ms/step , 6879.58 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 09:38:48 | Validation | Step: 142200 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:38:57 | Epoch: 0 | Step: 142210 | Dataset: 0-1382784 | Loss: 0.789 | 914 ms/step , 6879.83 GFLOP/s , 15264.7 tokens/s INFO:__main__:2024-11-05 09:39:06 | Epoch: 0 | Step: 142220 | Dataset: 0-1383104 | Loss: 0.815 | 914 ms/step , 6878.89 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 09:39:15 | Epoch: 0 | Step: 142230 | Dataset: 0-1383424 | Loss: 0.714 | 913 ms/step , 6889.31 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 09:39:24 | Epoch: 0 | Step: 142240 | Dataset: 0-1383744 | Loss: 0.747 | 913 ms/step , 6885.09 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 09:39:33 | Epoch: 0 | Step: 142250 | Dataset: 0-1384064 | Loss: 0.805 | 914 ms/step , 6884.14 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-05 09:39:42 | Epoch: 0 | Step: 142260 | Dataset: 0-1384384 | Loss: 0.825 | 913 ms/step , 6887.93 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 09:39:52 | Epoch: 0 | Step: 142270 | Dataset: 0-1384704 | Loss: 0.827 | 915 ms/step , 6873.04 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 09:40:01 | Epoch: 0 | Step: 142280 | Dataset: 0-1385024 | Loss: 0.762 | 913 ms/step , 6886.48 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 09:40:10 | Epoch: 0 | Step: 142290 | Dataset: 0-1385344 | Loss: 0.733 | 915 ms/step , 6876.91 GFLOP/s , 
17919.3 tokens/s INFO:__main__:2024-11-05 09:40:19 | Epoch: 0 | Step: 142300 | Dataset: 0-1385664 | Loss: 0.774 | 914 ms/step , 6882.96 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 09:40:21 | Validation | Step: 142300 | Val_loss: 0.821 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:40:30 | Epoch: 0 | Step: 142310 | Dataset: 0-1385984 | Loss: 0.756 | 912 ms/step , 6895.48 GFLOP/s , 15261.2 tokens/s INFO:__main__:2024-11-05 09:40:39 | Epoch: 0 | Step: 142320 | Dataset: 0-1386304 | Loss: 0.730 | 914 ms/step , 6882.45 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 09:40:48 | Epoch: 0 | Step: 142330 | Dataset: 0-1386624 | Loss: 0.816 | 914 ms/step , 6879.66 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 09:40:57 | Epoch: 0 | Step: 142340 | Dataset: 0-1386944 | Loss: 0.797 | 914 ms/step , 6879.60 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 09:41:06 | Epoch: 0 | Step: 142350 | Dataset: 0-1387264 | Loss: 0.798 | 915 ms/step , 6874.60 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 09:41:15 | Epoch: 0 | Step: 142360 | Dataset: 0-1387584 | Loss: 0.920 | 915 ms/step , 6876.65 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 09:41:25 | Epoch: 0 | Step: 142370 | Dataset: 0-1387904 | Loss: 0.764 | 912 ms/step , 6893.43 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 09:41:34 | Epoch: 0 | Step: 142380 | Dataset: 0-1388224 | Loss: 0.782 | 914 ms/step , 6877.55 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 09:41:43 | Epoch: 0 | Step: 142390 | Dataset: 0-1388544 | Loss: 0.757 | 914 ms/step , 6884.25 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 09:41:52 | Epoch: 0 | Step: 142400 | Dataset: 0-1388864 | Loss: 0.708 | 912 ms/step , 6894.71 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 09:41:54 | Validation | Step: 142400 | Val_loss: 0.807 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:42:03 | Epoch: 0 | Step: 142410 | Dataset: 0-1389184 | Loss: 0.779 | 912 ms/step , 6896.14 GFLOP/s , 15277.7 tokens/s 
INFO:__main__:2024-11-05 09:42:12 | Epoch: 0 | Step: 142420 | Dataset: 0-1389504 | Loss: 0.721 | 912 ms/step , 6894.32 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 09:42:21 | Epoch: 0 | Step: 142430 | Dataset: 0-1389824 | Loss: 0.794 | 914 ms/step , 6881.84 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 09:42:30 | Epoch: 0 | Step: 142440 | Dataset: 0-1390144 | Loss: 0.682 | 914 ms/step , 6884.69 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 09:42:39 | Epoch: 0 | Step: 142450 | Dataset: 0-1390464 | Loss: 0.811 | 913 ms/step , 6892.01 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 09:42:48 | Epoch: 0 | Step: 142460 | Dataset: 0-1390784 | Loss: 0.770 | 913 ms/step , 6888.61 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 09:42:57 | Epoch: 0 | Step: 142470 | Dataset: 0-1391104 | Loss: 0.821 | 914 ms/step , 6878.34 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 09:43:07 | Epoch: 0 | Step: 142480 | Dataset: 0-1391424 | Loss: 0.854 | 914 ms/step , 6881.94 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 09:43:16 | Epoch: 0 | Step: 142490 | Dataset: 0-1391744 | Loss: 0.829 | 913 ms/step , 6886.03 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 09:43:25 | Epoch: 0 | Step: 142500 | Dataset: 0-1392064 | Loss: 0.764 | 913 ms/step , 6889.07 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 09:43:26 | Validation | Step: 142500 | Val_loss: 0.831 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:43:36 | Epoch: 0 | Step: 142510 | Dataset: 0-1392384 | Loss: 0.779 | 913 ms/step , 6886.55 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-05 09:43:45 | Epoch: 0 | Step: 142520 | Dataset: 0-1392704 | Loss: 0.792 | 914 ms/step , 6884.73 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 09:43:54 | Epoch: 0 | Step: 142530 | Dataset: 0-1393024 | Loss: 0.778 | 913 ms/step , 6891.63 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 09:44:03 | Epoch: 0 | Step: 142540 | Dataset: 0-1393344 | Loss: 0.760 | 915 ms/step , 6876.78 GFLOP/s , 17931.6 
tokens/s INFO:__main__:2024-11-05 09:44:12 | Epoch: 0 | Step: 142550 | Dataset: 0-1393664 | Loss: 0.743 | 913 ms/step , 6890.70 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 09:44:21 | Epoch: 0 | Step: 142560 | Dataset: 0-1393984 | Loss: 0.774 | 913 ms/step , 6886.27 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 09:44:30 | Epoch: 0 | Step: 142570 | Dataset: 0-1394304 | Loss: 0.777 | 913 ms/step , 6885.07 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 09:44:40 | Epoch: 0 | Step: 142580 | Dataset: 0-1394624 | Loss: 0.820 | 913 ms/step , 6885.84 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 09:44:49 | Epoch: 0 | Step: 142590 | Dataset: 0-1394944 | Loss: 0.784 | 915 ms/step , 6876.32 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 09:44:58 | Epoch: 0 | Step: 142600 | Dataset: 0-1395264 | Loss: 0.794 | 913 ms/step , 6885.17 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 09:44:59 | Validation | Step: 142600 | Val_loss: 0.834 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:45:09 | Epoch: 0 | Step: 142610 | Dataset: 0-1395584 | Loss: 0.742 | 915 ms/step , 6873.24 GFLOP/s , 15266.6 tokens/s INFO:__main__:2024-11-05 09:45:18 | Epoch: 0 | Step: 142620 | Dataset: 0-1395904 | Loss: 0.800 | 914 ms/step , 6882.81 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 09:45:27 | Epoch: 0 | Step: 142630 | Dataset: 0-1396224 | Loss: 0.785 | 913 ms/step , 6891.95 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 09:45:36 | Epoch: 0 | Step: 142640 | Dataset: 0-1396544 | Loss: 0.852 | 914 ms/step , 6880.64 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 09:45:45 | Epoch: 0 | Step: 142650 | Dataset: 0-1396864 | Loss: 0.741 | 913 ms/step , 6887.52 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 09:45:54 | Epoch: 0 | Step: 142660 | Dataset: 0-1397184 | Loss: 0.837 | 915 ms/step , 6872.80 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 09:46:03 | Epoch: 0 | Step: 142670 | Dataset: 0-1397504 | Loss: 0.769 | 913 ms/step , 6885.64 GFLOP/s , 
17923.6 tokens/s INFO:__main__:2024-11-05 09:46:13 | Epoch: 0 | Step: 142680 | Dataset: 0-1397824 | Loss: 0.730 | 913 ms/step , 6885.89 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 09:46:22 | Epoch: 0 | Step: 142690 | Dataset: 0-1398144 | Loss: 0.817 | 914 ms/step , 6883.80 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 09:46:31 | Epoch: 0 | Step: 142700 | Dataset: 0-1398464 | Loss: 0.763 | 914 ms/step , 6881.61 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 09:46:32 | Validation | Step: 142700 | Val_loss: 0.824 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:46:42 | Epoch: 0 | Step: 142710 | Dataset: 0-1398784 | Loss: 0.769 | 914 ms/step , 6878.36 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-05 09:46:51 | Epoch: 0 | Step: 142720 | Dataset: 0-1399104 | Loss: 0.800 | 913 ms/step , 6886.66 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 09:47:00 | Epoch: 0 | Step: 142730 | Dataset: 0-1399424 | Loss: 0.761 | 914 ms/step , 6880.74 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 09:47:09 | Epoch: 0 | Step: 142740 | Dataset: 0-1399744 | Loss: 0.722 | 914 ms/step , 6883.62 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 09:47:18 | Epoch: 0 | Step: 142750 | Dataset: 0-1400064 | Loss: 0.764 | 914 ms/step , 6884.88 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 09:47:27 | Epoch: 0 | Step: 142760 | Dataset: 0-1400384 | Loss: 0.766 | 913 ms/step , 6890.29 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 09:47:36 | Epoch: 0 | Step: 142770 | Dataset: 0-1400704 | Loss: 0.730 | 912 ms/step , 6897.63 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 09:47:46 | Epoch: 0 | Step: 142780 | Dataset: 0-1401024 | Loss: 0.743 | 915 ms/step , 6873.24 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 09:47:55 | Epoch: 0 | Step: 142790 | Dataset: 0-1401344 | Loss: 0.791 | 915 ms/step , 6875.05 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 09:48:04 | Epoch: 0 | Step: 142800 | Dataset: 0-1401664 | Loss: 0.829 | 913 ms/step , 6892.27 GFLOP/s 
, 17931.3 tokens/s INFO:__main__:2024-11-05 09:48:05 | Validation | Step: 142800 | Val_loss: 0.851 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:48:15 | Epoch: 0 | Step: 142810 | Dataset: 0-1401984 | Loss: 0.745 | 913 ms/step , 6889.50 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-05 09:48:24 | Epoch: 0 | Step: 142820 | Dataset: 0-1402304 | Loss: 0.790 | 914 ms/step , 6884.05 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 09:48:33 | Epoch: 0 | Step: 142830 | Dataset: 0-1402624 | Loss: 0.892 | 913 ms/step , 6886.31 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 09:48:42 | Epoch: 0 | Step: 142840 | Dataset: 0-1402944 | Loss: 0.823 | 914 ms/step , 6883.14 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 09:48:51 | Epoch: 0 | Step: 142850 | Dataset: 0-1403264 | Loss: 0.818 | 913 ms/step , 6889.10 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 09:49:00 | Epoch: 0 | Step: 142860 | Dataset: 0-1403584 | Loss: 0.733 | 914 ms/step , 6882.42 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 09:49:09 | Epoch: 0 | Step: 142870 | Dataset: 0-1403904 | Loss: 0.740 | 912 ms/step , 6893.74 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 09:49:19 | Epoch: 0 | Step: 142880 | Dataset: 0-1404224 | Loss: 0.777 | 913 ms/step , 6886.03 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 09:49:28 | Epoch: 0 | Step: 142890 | Dataset: 0-1404544 | Loss: 0.674 | 914 ms/step , 6880.85 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 09:49:37 | Epoch: 0 | Step: 142900 | Dataset: 0-1404864 | Loss: 0.788 | 914 ms/step , 6880.12 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 09:49:38 | Validation | Step: 142900 | Val_loss: 0.850 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:49:48 | Epoch: 0 | Step: 142910 | Dataset: 0-1405184 | Loss: 0.787 | 914 ms/step , 6880.66 GFLOP/s , 15266.8 tokens/s INFO:__main__:2024-11-05 09:49:57 | Epoch: 0 | Step: 142920 | Dataset: 0-1405504 | Loss: 0.687 | 914 ms/step , 6884.32 GFLOP/s , 17923.2 tokens/s 
INFO:__main__:2024-11-05 09:50:06 | Epoch: 0 | Step: 142930 | Dataset: 0-1405824 | Loss: 0.700 | 914 ms/step , 6882.49 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 09:50:15 | Epoch: 0 | Step: 142940 | Dataset: 0-1406144 | Loss: 0.769 | 913 ms/step , 6889.27 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 09:50:24 | Epoch: 0 | Step: 142950 | Dataset: 0-1406464 | Loss: 0.820 | 913 ms/step , 6886.91 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 09:50:33 | Epoch: 0 | Step: 142960 | Dataset: 0-1406784 | Loss: 0.741 | 913 ms/step , 6891.14 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 09:50:42 | Epoch: 0 | Step: 142970 | Dataset: 0-1407104 | Loss: 0.732 | 914 ms/step , 6883.10 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 09:50:52 | Epoch: 0 | Step: 142980 | Dataset: 0-1407424 | Loss: 0.743 | 914 ms/step , 6882.31 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 09:51:01 | Epoch: 0 | Step: 142990 | Dataset: 0-1407744 | Loss: 0.780 | 914 ms/step , 6883.99 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 09:51:10 | Epoch: 0 | Step: 143000 | Dataset: 0-1408064 | Loss: 0.759 | 913 ms/step , 6886.50 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 09:51:11 | Validation | Step: 143000 | Val_loss: 0.860 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:51:11 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_095111_step_143000.pt` INFO:__main__:2024-11-05 09:51:22 | Epoch: 0 | Step: 143010 | Dataset: 0-1408384 | Loss: 0.812 | 914 ms/step , 6882.64 GFLOP/s , 13801.8 tokens/s INFO:__main__:2024-11-05 09:51:31 | Epoch: 0 | Step: 143020 | Dataset: 0-1408704 | Loss: 0.838 | 914 ms/step , 6883.24 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 09:51:40 | Epoch: 0 | Step: 143030 | Dataset: 0-1409024 | Loss: 0.736 | 914 ms/step , 6878.49 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 09:51:49 | Epoch: 0 | Step: 143040 | Dataset: 0-1409344 | Loss: 0.801 | 914 ms/step , 6878.16 GFLOP/s , 17889.4 tokens/s 
INFO:__main__:2024-11-05 09:51:58 | Epoch: 0 | Step: 143050 | Dataset: 0-1409664 | Loss: 0.817 | 914 ms/step , 6880.44 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 09:52:07 | Epoch: 0 | Step: 143060 | Dataset: 0-1409984 | Loss: 0.839 | 914 ms/step , 6882.82 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 09:52:17 | Epoch: 0 | Step: 143070 | Dataset: 0-1410304 | Loss: 0.851 | 913 ms/step , 6889.50 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 09:52:26 | Epoch: 0 | Step: 143080 | Dataset: 0-1410624 | Loss: 0.759 | 912 ms/step , 6895.60 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 09:52:35 | Epoch: 0 | Step: 143090 | Dataset: 0-1410944 | Loss: 0.743 | 914 ms/step , 6883.05 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 09:52:44 | Epoch: 0 | Step: 143100 | Dataset: 0-1411264 | Loss: 0.781 | 913 ms/step , 6886.22 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 09:52:46 | Validation | Step: 143100 | Val_loss: 0.835 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:52:55 | Epoch: 0 | Step: 143110 | Dataset: 0-1411584 | Loss: 0.766 | 913 ms/step , 6886.27 GFLOP/s , 15265.6 tokens/s INFO:__main__:2024-11-05 09:53:04 | Epoch: 0 | Step: 143120 | Dataset: 0-1411904 | Loss: 0.752 | 913 ms/step , 6888.56 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 09:53:13 | Epoch: 0 | Step: 143130 | Dataset: 0-1412224 | Loss: 0.819 | 914 ms/step , 6880.82 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 09:53:22 | Epoch: 0 | Step: 143140 | Dataset: 0-1412544 | Loss: 0.802 | 913 ms/step , 6889.54 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 09:53:31 | Epoch: 0 | Step: 143150 | Dataset: 0-1412864 | Loss: 0.818 | 914 ms/step , 6879.67 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 09:53:40 | Epoch: 0 | Step: 143160 | Dataset: 0-1413184 | Loss: 0.838 | 914 ms/step , 6884.90 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 09:53:50 | Epoch: 0 | Step: 143170 | Dataset: 0-1413504 | Loss: 0.778 | 914 ms/step , 6879.61 GFLOP/s , 17915.9 
tokens/s INFO:__main__:2024-11-05 09:53:59 | Epoch: 0 | Step: 143180 | Dataset: 0-1413824 | Loss: 0.765 | 912 ms/step , 6892.99 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 09:54:08 | Epoch: 0 | Step: 143190 | Dataset: 0-1414144 | Loss: 0.858 | 913 ms/step , 6887.87 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 09:54:17 | Epoch: 0 | Step: 143200 | Dataset: 0-1414464 | Loss: 0.768 | 914 ms/step , 6881.75 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 09:54:19 | Validation | Step: 143200 | Val_loss: 0.781 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:54:28 | Epoch: 0 | Step: 143210 | Dataset: 0-1414784 | Loss: 0.741 | 913 ms/step , 6887.84 GFLOP/s , 15266.4 tokens/s INFO:__main__:2024-11-05 09:54:37 | Epoch: 0 | Step: 143220 | Dataset: 0-1415104 | Loss: 0.777 | 914 ms/step , 6883.32 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 09:54:46 | Epoch: 0 | Step: 143230 | Dataset: 0-1415424 | Loss: 0.787 | 914 ms/step , 6880.41 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 09:54:55 | Epoch: 0 | Step: 143240 | Dataset: 0-1415744 | Loss: 0.752 | 913 ms/step , 6890.25 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 09:55:04 | Epoch: 0 | Step: 143250 | Dataset: 0-1416064 | Loss: 0.680 | 913 ms/step , 6889.59 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 09:55:13 | Epoch: 0 | Step: 143260 | Dataset: 0-1416384 | Loss: 0.699 | 913 ms/step , 6886.50 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 09:55:23 | Epoch: 0 | Step: 143270 | Dataset: 0-1416704 | Loss: 0.827 | 914 ms/step , 6881.05 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 09:55:32 | Epoch: 0 | Step: 143280 | Dataset: 0-1417024 | Loss: 0.698 | 913 ms/step , 6885.38 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 09:55:41 | Epoch: 0 | Step: 143290 | Dataset: 0-1417344 | Loss: 0.715 | 913 ms/step , 6885.78 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 09:55:50 | Epoch: 0 | Step: 143300 | Dataset: 0-1417664 | Loss: 0.836 | 913 ms/step , 6888.89 GFLOP/s , 
17916.7 tokens/s INFO:__main__:2024-11-05 09:55:52 | Validation | Step: 143300 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:56:01 | Epoch: 0 | Step: 143310 | Dataset: 0-1417984 | Loss: 0.676 | 914 ms/step , 6880.89 GFLOP/s , 15264.9 tokens/s INFO:__main__:2024-11-05 09:56:10 | Epoch: 0 | Step: 143320 | Dataset: 0-1418304 | Loss: 0.769 | 914 ms/step , 6881.13 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 09:56:19 | Epoch: 0 | Step: 143330 | Dataset: 0-1418624 | Loss: 0.791 | 915 ms/step , 6876.63 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 09:56:28 | Epoch: 0 | Step: 143340 | Dataset: 0-1418944 | Loss: 0.774 | 914 ms/step , 6880.28 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 09:56:37 | Epoch: 0 | Step: 143350 | Dataset: 0-1419264 | Loss: 0.751 | 915 ms/step , 6877.39 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 09:56:46 | Epoch: 0 | Step: 143360 | Dataset: 0-1419584 | Loss: 0.758 | 913 ms/step , 6887.06 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 09:56:56 | Epoch: 0 | Step: 143370 | Dataset: 0-1419904 | Loss: 0.854 | 913 ms/step , 6889.92 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 09:57:05 | Epoch: 0 | Step: 143380 | Dataset: 0-1420224 | Loss: 0.801 | 914 ms/step , 6879.89 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 09:57:14 | Epoch: 0 | Step: 143390 | Dataset: 0-1420544 | Loss: 0.770 | 914 ms/step , 6884.26 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 09:57:23 | Epoch: 0 | Step: 143400 | Dataset: 0-1420864 | Loss: 0.751 | 914 ms/step , 6882.19 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 09:57:25 | Validation | Step: 143400 | Val_loss: 0.794 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:57:34 | Epoch: 0 | Step: 143410 | Dataset: 0-1421184 | Loss: 0.814 | 914 ms/step , 6880.17 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 09:57:43 | Epoch: 0 | Step: 143420 | Dataset: 0-1421504 | Loss: 0.799 | 915 ms/step , 6874.45 GFLOP/s , 17927.9 tokens/s 
INFO:__main__:2024-11-05 09:57:52 | Epoch: 0 | Step: 143430 | Dataset: 0-1421824 | Loss: 0.724 | 913 ms/step , 6889.66 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 09:58:01 | Epoch: 0 | Step: 143440 | Dataset: 0-1422144 | Loss: 0.784 | 913 ms/step , 6885.16 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 09:58:10 | Epoch: 0 | Step: 143450 | Dataset: 0-1422464 | Loss: 0.797 | 913 ms/step , 6888.41 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 09:58:19 | Epoch: 0 | Step: 143460 | Dataset: 0-1422784 | Loss: 0.797 | 913 ms/step , 6886.66 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 09:58:29 | Epoch: 0 | Step: 143470 | Dataset: 0-1423104 | Loss: 0.733 | 913 ms/step , 6885.64 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 09:58:38 | Epoch: 0 | Step: 143480 | Dataset: 0-1423424 | Loss: 0.826 | 914 ms/step , 6881.98 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 09:58:47 | Epoch: 0 | Step: 143490 | Dataset: 0-1423744 | Loss: 0.865 | 915 ms/step , 6876.47 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 09:58:56 | Epoch: 0 | Step: 143500 | Dataset: 0-1424064 | Loss: 0.750 | 913 ms/step , 6886.78 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 09:58:58 | Validation | Step: 143500 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 09:59:07 | Epoch: 0 | Step: 143510 | Dataset: 0-1424384 | Loss: 0.810 | 913 ms/step , 6889.72 GFLOP/s , 15272.9 tokens/s INFO:__main__:2024-11-05 09:59:16 | Epoch: 0 | Step: 143520 | Dataset: 0-1424704 | Loss: 0.779 | 914 ms/step , 6878.49 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 09:59:25 | Epoch: 0 | Step: 143530 | Dataset: 0-1425024 | Loss: 0.727 | 912 ms/step , 6894.60 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 09:59:34 | Epoch: 0 | Step: 143540 | Dataset: 0-1425344 | Loss: 0.780 | 912 ms/step , 6896.64 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 09:59:43 | Epoch: 0 | Step: 143550 | Dataset: 0-1425664 | Loss: 0.739 | 915 ms/step , 6875.95 GFLOP/s , 17923.0 
tokens/s INFO:__main__:2024-11-05 09:59:52 | Epoch: 0 | Step: 143560 | Dataset: 0-1425984 | Loss: 0.686 | 914 ms/step , 6880.12 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 10:00:02 | Epoch: 0 | Step: 143570 | Dataset: 0-1426304 | Loss: 0.806 | 913 ms/step , 6888.03 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 10:00:11 | Epoch: 0 | Step: 143580 | Dataset: 0-1426624 | Loss: 0.754 | 913 ms/step , 6890.59 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 10:00:20 | Epoch: 0 | Step: 143590 | Dataset: 0-1426944 | Loss: 0.781 | 914 ms/step , 6884.93 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 10:00:29 | Epoch: 0 | Step: 143600 | Dataset: 0-1427264 | Loss: 0.832 | 914 ms/step , 6878.71 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 10:00:31 | Validation | Step: 143600 | Val_loss: 0.797 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:00:40 | Epoch: 0 | Step: 143610 | Dataset: 0-1427584 | Loss: 0.759 | 914 ms/step , 6885.04 GFLOP/s , 15271.5 tokens/s INFO:__main__:2024-11-05 10:00:49 | Epoch: 0 | Step: 143620 | Dataset: 0-1427904 | Loss: 0.753 | 914 ms/step , 6884.74 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 10:00:58 | Epoch: 0 | Step: 143630 | Dataset: 0-1428224 | Loss: 0.807 | 913 ms/step , 6885.51 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 10:01:07 | Epoch: 0 | Step: 143640 | Dataset: 0-1428544 | Loss: 0.780 | 915 ms/step , 6871.58 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 10:01:16 | Epoch: 0 | Step: 143650 | Dataset: 0-1428864 | Loss: 0.764 | 914 ms/step , 6880.14 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 10:01:25 | Epoch: 0 | Step: 143660 | Dataset: 0-1429184 | Loss: 0.716 | 913 ms/step , 6886.63 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 10:01:35 | Epoch: 0 | Step: 143670 | Dataset: 0-1429504 | Loss: 0.748 | 913 ms/step , 6886.23 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 10:01:44 | Epoch: 0 | Step: 143680 | Dataset: 0-1429824 | Loss: 0.840 | 915 ms/step , 6875.58 GFLOP/s , 
17927.2 tokens/s INFO:__main__:2024-11-05 10:01:53 | Epoch: 0 | Step: 143690 | Dataset: 0-1430144 | Loss: 0.749 | 913 ms/step , 6890.32 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 10:02:02 | Epoch: 0 | Step: 143700 | Dataset: 0-1430464 | Loss: 0.847 | 914 ms/step , 6880.13 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 10:02:04 | Validation | Step: 143700 | Val_loss: 0.749 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:02:13 | Epoch: 0 | Step: 143710 | Dataset: 0-1430784 | Loss: 0.725 | 913 ms/step , 6885.13 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-05 10:02:22 | Epoch: 0 | Step: 143720 | Dataset: 0-1431104 | Loss: 0.749 | 912 ms/step , 6892.81 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 10:02:31 | Epoch: 0 | Step: 143730 | Dataset: 0-1431424 | Loss: 0.735 | 914 ms/step , 6884.75 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 10:02:40 | Epoch: 0 | Step: 143740 | Dataset: 0-1431744 | Loss: 0.781 | 913 ms/step , 6890.72 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 10:02:49 | Epoch: 0 | Step: 143750 | Dataset: 0-1432064 | Loss: 0.712 | 913 ms/step , 6892.42 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 10:02:58 | Epoch: 0 | Step: 143760 | Dataset: 0-1432384 | Loss: 0.800 | 913 ms/step , 6886.78 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 10:03:08 | Epoch: 0 | Step: 143770 | Dataset: 0-1432704 | Loss: 0.705 | 914 ms/step , 6884.51 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 10:03:17 | Epoch: 0 | Step: 143780 | Dataset: 0-1433024 | Loss: 0.777 | 913 ms/step , 6887.97 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 10:03:26 | Epoch: 0 | Step: 143790 | Dataset: 0-1433344 | Loss: 0.741 | 912 ms/step , 6894.34 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 10:03:35 | Epoch: 0 | Step: 143800 | Dataset: 0-1433664 | Loss: 0.694 | 913 ms/step , 6892.30 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 10:03:37 | Validation | Step: 143800 | Val_loss: 0.774 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 10:03:46 | Epoch: 0 | Step: 143810 | Dataset: 0-1433984 | Loss: 0.689 | 913 ms/step , 6887.49 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-05 10:03:55 | Epoch: 0 | Step: 143820 | Dataset: 0-1434304 | Loss: 0.772 | 913 ms/step , 6889.08 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 10:04:04 | Epoch: 0 | Step: 143830 | Dataset: 0-1434624 | Loss: 0.681 | 912 ms/step , 6898.92 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 10:04:13 | Epoch: 0 | Step: 143840 | Dataset: 0-1434944 | Loss: 0.723 | 913 ms/step , 6886.08 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 10:04:22 | Epoch: 0 | Step: 143850 | Dataset: 0-1435264 | Loss: 0.635 | 913 ms/step , 6891.91 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 10:04:31 | Epoch: 0 | Step: 143860 | Dataset: 0-1435584 | Loss: 0.768 | 913 ms/step , 6890.11 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 10:04:40 | Epoch: 0 | Step: 143870 | Dataset: 0-1435904 | Loss: 0.786 | 914 ms/step , 6878.04 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 10:04:50 | Epoch: 0 | Step: 143880 | Dataset: 0-1436224 | Loss: 0.677 | 914 ms/step , 6884.53 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 10:04:59 | Epoch: 0 | Step: 143890 | Dataset: 0-1436544 | Loss: 0.674 | 914 ms/step , 6880.92 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 10:05:08 | Epoch: 0 | Step: 143900 | Dataset: 0-1436864 | Loss: 0.702 | 912 ms/step , 6897.35 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 10:05:09 | Validation | Step: 143900 | Val_loss: 0.740 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:05:19 | Epoch: 0 | Step: 143910 | Dataset: 0-1437184 | Loss: 0.742 | 913 ms/step , 6890.55 GFLOP/s , 15282.0 tokens/s INFO:__main__:2024-11-05 10:05:28 | Epoch: 0 | Step: 143920 | Dataset: 0-1437504 | Loss: 0.768 | 914 ms/step , 6882.28 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 10:05:37 | Epoch: 0 | Step: 143930 | Dataset: 0-1437824 | Loss: 0.727 | 914 ms/step , 6883.74 GFLOP/s , 17939.0 
tokens/s INFO:__main__:2024-11-05 10:05:46 | Epoch: 0 | Step: 143940 | Dataset: 0-1438144 | Loss: 0.707 | 912 ms/step , 6898.34 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 10:05:55 | Epoch: 0 | Step: 143950 | Dataset: 0-1438464 | Loss: 0.719 | 914 ms/step , 6882.70 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 10:06:04 | Epoch: 0 | Step: 143960 | Dataset: 0-1438784 | Loss: 0.720 | 913 ms/step , 6892.56 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 10:06:13 | Epoch: 0 | Step: 143970 | Dataset: 0-1439104 | Loss: 0.787 | 911 ms/step , 6900.86 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 10:06:23 | Epoch: 0 | Step: 143980 | Dataset: 0-1439424 | Loss: 0.805 | 913 ms/step , 6891.14 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 10:06:32 | Epoch: 0 | Step: 143990 | Dataset: 0-1439744 | Loss: 0.765 | 914 ms/step , 6877.76 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 10:06:41 | Epoch: 0 | Step: 144000 | Dataset: 0-1440064 | Loss: 0.965 | 914 ms/step , 6878.95 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 10:06:42 | Validation | Step: 144000 | Val_loss: 0.792 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:06:42 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_100642_step_144000.pt` INFO:__main__:2024-11-05 10:06:53 | Epoch: 0 | Step: 144010 | Dataset: 0-1440384 | Loss: 0.819 | 914 ms/step , 6883.53 GFLOP/s , 13786.6 tokens/s INFO:__main__:2024-11-05 10:07:02 | Epoch: 0 | Step: 144020 | Dataset: 0-1440704 | Loss: 0.715 | 913 ms/step , 6887.70 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 10:07:11 | Epoch: 0 | Step: 144030 | Dataset: 0-1441024 | Loss: 0.803 | 914 ms/step , 6878.00 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 10:07:20 | Epoch: 0 | Step: 144040 | Dataset: 0-1441344 | Loss: 0.769 | 914 ms/step , 6884.78 GFLOP/s , 17872.8 tokens/s INFO:__main__:2024-11-05 10:07:29 | Epoch: 0 | Step: 144050 | Dataset: 0-1441664 | Loss: 0.752 | 913 ms/step , 6887.77 GFLOP/s , 17935.3 
tokens/s INFO:__main__:2024-11-05 10:07:38 | Epoch: 0 | Step: 144060 | Dataset: 0-1441984 | Loss: 0.706 | 913 ms/step , 6889.52 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 10:07:48 | Epoch: 0 | Step: 144070 | Dataset: 0-1442304 | Loss: 0.823 | 914 ms/step , 6879.96 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 10:07:57 | Epoch: 0 | Step: 144080 | Dataset: 0-1442624 | Loss: 0.775 | 912 ms/step , 6894.44 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 10:08:06 | Epoch: 0 | Step: 144090 | Dataset: 0-1442944 | Loss: 0.789 | 912 ms/step , 6895.23 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 10:08:15 | Epoch: 0 | Step: 144100 | Dataset: 0-1443264 | Loss: 0.737 | 913 ms/step , 6891.30 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 10:08:17 | Validation | Step: 144100 | Val_loss: 0.747 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:08:26 | Epoch: 0 | Step: 144110 | Dataset: 0-1443584 | Loss: 0.945 | 914 ms/step , 6881.58 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-05 10:08:35 | Epoch: 0 | Step: 144120 | Dataset: 0-1443904 | Loss: 0.733 | 914 ms/step , 6881.38 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 10:08:44 | Epoch: 0 | Step: 144130 | Dataset: 0-1444224 | Loss: 0.724 | 913 ms/step , 6888.12 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 10:08:53 | Epoch: 0 | Step: 144140 | Dataset: 0-1444544 | Loss: 0.710 | 912 ms/step , 6896.80 GFLOP/s , 17947.5 tokens/s INFO:__main__:2024-11-05 10:09:02 | Epoch: 0 | Step: 144150 | Dataset: 0-1444864 | Loss: 0.864 | 912 ms/step , 6897.42 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 10:09:11 | Epoch: 0 | Step: 144160 | Dataset: 0-1445184 | Loss: 0.648 | 912 ms/step , 6893.68 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 10:09:20 | Epoch: 0 | Step: 144170 | Dataset: 0-1445504 | Loss: 0.824 | 913 ms/step , 6890.24 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 10:09:30 | Epoch: 0 | Step: 144180 | Dataset: 0-1445824 | Loss: 0.737 | 912 ms/step , 6897.71 GFLOP/s , 
17941.0 tokens/s INFO:__main__:2024-11-05 10:09:39 | Epoch: 0 | Step: 144190 | Dataset: 0-1446144 | Loss: 0.697 | 913 ms/step , 6888.88 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 10:09:48 | Epoch: 0 | Step: 144200 | Dataset: 0-1446464 | Loss: 0.701 | 912 ms/step , 6894.75 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 10:09:49 | Validation | Step: 144200 | Val_loss: 0.827 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:09:59 | Epoch: 0 | Step: 144210 | Dataset: 0-1446784 | Loss: 0.766 | 912 ms/step , 6897.04 GFLOP/s , 15272.1 tokens/s INFO:__main__:2024-11-05 10:10:08 | Epoch: 0 | Step: 144220 | Dataset: 0-1447104 | Loss: 0.745 | 912 ms/step , 6893.91 GFLOP/s , 17945.3 tokens/s INFO:__main__:2024-11-05 10:10:17 | Epoch: 0 | Step: 144230 | Dataset: 0-1447424 | Loss: 0.780 | 913 ms/step , 6887.57 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 10:10:26 | Epoch: 0 | Step: 144240 | Dataset: 0-1447744 | Loss: 0.842 | 911 ms/step , 6902.19 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 10:10:35 | Epoch: 0 | Step: 144250 | Dataset: 0-1448064 | Loss: 0.784 | 913 ms/step , 6885.30 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 10:10:44 | Epoch: 0 | Step: 144260 | Dataset: 0-1448384 | Loss: 0.701 | 913 ms/step , 6889.08 GFLOP/s , 17953.3 tokens/s INFO:__main__:2024-11-05 10:10:53 | Epoch: 0 | Step: 144270 | Dataset: 0-1448704 | Loss: 0.702 | 912 ms/step , 6896.82 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 10:11:02 | Epoch: 0 | Step: 144280 | Dataset: 0-1449024 | Loss: 0.741 | 913 ms/step , 6892.43 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 10:11:12 | Epoch: 0 | Step: 144290 | Dataset: 0-1449344 | Loss: 0.969 | 913 ms/step , 6890.87 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 10:11:21 | Epoch: 0 | Step: 144300 | Dataset: 0-1449664 | Loss: 0.749 | 913 ms/step , 6888.87 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 10:11:22 | Validation | Step: 144300 | Val_loss: 0.813 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 10:11:31 | Epoch: 0 | Step: 144310 | Dataset: 0-1449984 | Loss: 0.892 | 913 ms/step , 6892.33 GFLOP/s , 15286.4 tokens/s INFO:__main__:2024-11-05 10:11:41 | Epoch: 0 | Step: 144320 | Dataset: 0-1450304 | Loss: 0.803 | 912 ms/step , 6897.47 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 10:11:50 | Epoch: 0 | Step: 144330 | Dataset: 0-1450624 | Loss: 0.831 | 913 ms/step , 6891.13 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 10:11:59 | Epoch: 0 | Step: 144340 | Dataset: 0-1450944 | Loss: 0.809 | 913 ms/step , 6889.30 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 10:12:08 | Epoch: 0 | Step: 144350 | Dataset: 0-1451264 | Loss: 0.799 | 912 ms/step , 6894.76 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 10:12:17 | Epoch: 0 | Step: 144360 | Dataset: 0-1451584 | Loss: 0.806 | 912 ms/step , 6893.12 GFLOP/s , 17949.1 tokens/s INFO:__main__:2024-11-05 10:12:26 | Epoch: 0 | Step: 144370 | Dataset: 0-1451904 | Loss: 0.854 | 913 ms/step , 6889.55 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 10:12:35 | Epoch: 0 | Step: 144380 | Dataset: 0-1452224 | Loss: 0.756 | 912 ms/step , 6895.86 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 10:12:44 | Epoch: 0 | Step: 144390 | Dataset: 0-1452544 | Loss: 0.795 | 912 ms/step , 6892.63 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 10:12:54 | Epoch: 0 | Step: 144400 | Dataset: 0-1452864 | Loss: 0.685 | 912 ms/step , 6894.89 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 10:12:55 | Validation | Step: 144400 | Val_loss: 0.796 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:13:04 | Epoch: 0 | Step: 144410 | Dataset: 0-1453184 | Loss: 0.823 | 913 ms/step , 6889.60 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-05 10:13:13 | Epoch: 0 | Step: 144420 | Dataset: 0-1453504 | Loss: 0.867 | 914 ms/step , 6881.86 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 10:13:23 | Epoch: 0 | Step: 144430 | Dataset: 0-1453824 | Loss: 0.814 | 912 ms/step , 6893.83 GFLOP/s , 17946.5 
tokens/s INFO:__main__:2024-11-05 10:13:32 | Epoch: 0 | Step: 144440 | Dataset: 0-1454144 | Loss: 0.686 | 912 ms/step , 6893.22 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 10:13:41 | Epoch: 0 | Step: 144450 | Dataset: 0-1454464 | Loss: 0.787 | 913 ms/step , 6890.59 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 10:13:50 | Epoch: 0 | Step: 144460 | Dataset: 0-1454784 | Loss: 0.763 | 913 ms/step , 6889.39 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 10:13:59 | Epoch: 0 | Step: 144470 | Dataset: 0-1455104 | Loss: 0.823 | 913 ms/step , 6890.57 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-05 10:14:08 | Epoch: 0 | Step: 144480 | Dataset: 0-1455424 | Loss: 0.747 | 912 ms/step , 6895.00 GFLOP/s , 17951.6 tokens/s INFO:__main__:2024-11-05 10:14:17 | Epoch: 0 | Step: 144490 | Dataset: 0-1455744 | Loss: 0.753 | 913 ms/step , 6892.27 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 10:14:27 | Epoch: 0 | Step: 144500 | Dataset: 0-1456064 | Loss: 0.803 | 913 ms/step , 6886.22 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 10:14:28 | Validation | Step: 144500 | Val_loss: 0.814 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:14:37 | Epoch: 0 | Step: 144510 | Dataset: 0-1456384 | Loss: 0.828 | 912 ms/step , 6892.70 GFLOP/s , 15283.8 tokens/s INFO:__main__:2024-11-05 10:14:46 | Epoch: 0 | Step: 144520 | Dataset: 0-1456704 | Loss: 0.909 | 913 ms/step , 6892.31 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 10:14:55 | Epoch: 0 | Step: 144530 | Dataset: 0-1457024 | Loss: 0.694 | 912 ms/step , 6899.86 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 10:15:05 | Epoch: 0 | Step: 144540 | Dataset: 0-1457344 | Loss: 0.704 | 912 ms/step , 6893.79 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 10:15:14 | Epoch: 0 | Step: 144550 | Dataset: 0-1457664 | Loss: 0.668 | 912 ms/step , 6898.14 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 10:15:23 | Epoch: 0 | Step: 144560 | Dataset: 0-1457984 | Loss: 0.622 | 912 ms/step , 6897.88 GFLOP/s , 
17943.6 tokens/s INFO:__main__:2024-11-05 10:15:32 | Epoch: 0 | Step: 144570 | Dataset: 0-1458304 | Loss: 0.952 | 912 ms/step , 6893.33 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 10:15:41 | Epoch: 0 | Step: 144580 | Dataset: 0-1458624 | Loss: 0.714 | 913 ms/step , 6888.74 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 10:15:50 | Epoch: 0 | Step: 144590 | Dataset: 0-1458944 | Loss: 0.804 | 912 ms/step , 6898.33 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 10:15:59 | Epoch: 0 | Step: 144600 | Dataset: 0-1459264 | Loss: 0.816 | 913 ms/step , 6891.13 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 10:16:01 | Validation | Step: 144600 | Val_loss: 0.772 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:16:10 | Epoch: 0 | Step: 144610 | Dataset: 0-1459584 | Loss: 0.803 | 913 ms/step , 6887.21 GFLOP/s , 15299.9 tokens/s INFO:__main__:2024-11-05 10:16:19 | Epoch: 0 | Step: 144620 | Dataset: 0-1459904 | Loss: 0.782 | 914 ms/step , 6881.57 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 10:16:28 | Epoch: 0 | Step: 144630 | Dataset: 0-1460224 | Loss: 0.756 | 913 ms/step , 6886.24 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 10:16:37 | Epoch: 0 | Step: 144640 | Dataset: 0-1460544 | Loss: 0.746 | 913 ms/step , 6888.97 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 10:16:47 | Epoch: 0 | Step: 144650 | Dataset: 0-1460864 | Loss: 0.670 | 912 ms/step , 6894.74 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 10:16:56 | Epoch: 0 | Step: 144660 | Dataset: 0-1461184 | Loss: 0.770 | 914 ms/step , 6878.71 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 10:17:05 | Epoch: 0 | Step: 144670 | Dataset: 0-1461504 | Loss: 0.820 | 913 ms/step , 6890.37 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 10:17:14 | Epoch: 0 | Step: 144680 | Dataset: 0-1461824 | Loss: 0.877 | 913 ms/step , 6890.36 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 10:17:23 | Epoch: 0 | Step: 144690 | Dataset: 0-1462144 | Loss: 0.795 | 913 ms/step , 6889.00 GFLOP/s 
, 17938.8 tokens/s INFO:__main__:2024-11-05 10:17:32 | Epoch: 0 | Step: 144700 | Dataset: 0-1462464 | Loss: 0.790 | 914 ms/step , 6884.41 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 10:17:34 | Validation | Step: 144700 | Val_loss: 0.803 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:17:43 | Epoch: 0 | Step: 144710 | Dataset: 0-1462784 | Loss: 0.839 | 912 ms/step , 6898.74 GFLOP/s , 15281.3 tokens/s INFO:__main__:2024-11-05 10:17:52 | Epoch: 0 | Step: 144720 | Dataset: 0-1463104 | Loss: 0.794 | 913 ms/step , 6890.63 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 10:18:01 | Epoch: 0 | Step: 144730 | Dataset: 0-1463424 | Loss: 0.820 | 913 ms/step , 6886.61 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 10:18:10 | Epoch: 0 | Step: 144740 | Dataset: 0-1463744 | Loss: 0.725 | 913 ms/step , 6886.92 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 10:18:20 | Epoch: 0 | Step: 144750 | Dataset: 0-1464064 | Loss: 0.756 | 913 ms/step , 6885.83 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 10:18:29 | Epoch: 0 | Step: 144760 | Dataset: 0-1464384 | Loss: 0.849 | 913 ms/step , 6887.05 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 10:18:38 | Epoch: 0 | Step: 144770 | Dataset: 0-1464704 | Loss: 0.641 | 911 ms/step , 6900.74 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 10:18:47 | Epoch: 0 | Step: 144780 | Dataset: 0-1465024 | Loss: 0.791 | 912 ms/step , 6899.34 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 10:18:56 | Epoch: 0 | Step: 144790 | Dataset: 0-1465344 | Loss: 0.843 | 913 ms/step , 6886.35 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 10:19:05 | Epoch: 0 | Step: 144800 | Dataset: 0-1465664 | Loss: 0.830 | 912 ms/step , 6897.87 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 10:19:07 | Validation | Step: 144800 | Val_loss: 0.797 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:19:16 | Epoch: 0 | Step: 144810 | Dataset: 0-1465984 | Loss: 0.704 | 913 ms/step , 6890.80 GFLOP/s , 15281.0 tokens/s 
INFO:__main__:2024-11-05 10:19:25 | Epoch: 0 | Step: 144820 | Dataset: 0-1466304 | Loss: 0.696 | 913 ms/step , 6887.41 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 10:19:34 | Epoch: 0 | Step: 144830 | Dataset: 0-1466624 | Loss: 0.770 | 914 ms/step , 6878.54 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 10:19:43 | Epoch: 0 | Step: 144840 | Dataset: 0-1466944 | Loss: 0.865 | 913 ms/step , 6892.26 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 10:19:52 | Epoch: 0 | Step: 144850 | Dataset: 0-1467264 | Loss: 0.645 | 912 ms/step , 6896.56 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 10:20:02 | Epoch: 0 | Step: 144860 | Dataset: 0-1467584 | Loss: 0.799 | 914 ms/step , 6883.92 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 10:20:11 | Epoch: 0 | Step: 144870 | Dataset: 0-1467904 | Loss: 0.781 | 911 ms/step , 6900.29 GFLOP/s , 17955.2 tokens/s INFO:__main__:2024-11-05 10:20:20 | Epoch: 0 | Step: 144880 | Dataset: 0-1468224 | Loss: 0.663 | 912 ms/step , 6897.58 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 10:20:29 | Epoch: 0 | Step: 144890 | Dataset: 0-1468544 | Loss: 0.653 | 912 ms/step , 6895.94 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 10:20:38 | Epoch: 0 | Step: 144900 | Dataset: 0-1468864 | Loss: 0.801 | 912 ms/step , 6898.26 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 10:20:40 | Validation | Step: 144900 | Val_loss: 0.810 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:20:49 | Epoch: 0 | Step: 144910 | Dataset: 0-1469184 | Loss: 0.806 | 913 ms/step , 6888.08 GFLOP/s , 15279.1 tokens/s INFO:__main__:2024-11-05 10:20:58 | Epoch: 0 | Step: 144920 | Dataset: 0-1469504 | Loss: 0.730 | 912 ms/step , 6899.90 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 10:21:07 | Epoch: 0 | Step: 144930 | Dataset: 0-1469824 | Loss: 0.768 | 913 ms/step , 6892.46 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 10:21:16 | Epoch: 0 | Step: 144940 | Dataset: 0-1470144 | Loss: 0.877 | 913 ms/step , 6887.35 GFLOP/s , 17935.3 
tokens/s INFO:__main__:2024-11-05 10:21:25 | Epoch: 0 | Step: 144950 | Dataset: 0-1470464 | Loss: 0.830 | 913 ms/step , 6889.20 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 10:21:34 | Epoch: 0 | Step: 144960 | Dataset: 0-1470784 | Loss: 0.911 | 913 ms/step , 6892.42 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 10:21:44 | Epoch: 0 | Step: 144970 | Dataset: 0-1471104 | Loss: 0.739 | 914 ms/step , 6884.18 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 10:21:53 | Epoch: 0 | Step: 144980 | Dataset: 0-1471424 | Loss: 0.804 | 912 ms/step , 6896.96 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 10:22:02 | Epoch: 0 | Step: 144990 | Dataset: 0-1471744 | Loss: 0.769 | 913 ms/step , 6890.00 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 10:22:11 | Epoch: 0 | Step: 145000 | Dataset: 0-1472064 | Loss: 0.709 | 912 ms/step , 6893.53 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 10:22:13 | Validation | Step: 145000 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:22:13 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_102213_step_145000.pt` INFO:__main__:2024-11-05 10:22:23 | Epoch: 0 | Step: 145010 | Dataset: 0-1472384 | Loss: 0.759 | 913 ms/step , 6886.30 GFLOP/s , 13704.7 tokens/s INFO:__main__:2024-11-05 10:22:32 | Epoch: 0 | Step: 145020 | Dataset: 0-1472704 | Loss: 0.765 | 913 ms/step , 6889.30 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 10:22:41 | Epoch: 0 | Step: 145030 | Dataset: 0-1473024 | Loss: 0.832 | 914 ms/step , 6884.97 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 10:22:50 | Epoch: 0 | Step: 145040 | Dataset: 0-1473344 | Loss: 0.761 | 914 ms/step , 6882.34 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 10:23:00 | Epoch: 0 | Step: 145050 | Dataset: 0-1473664 | Loss: 0.754 | 913 ms/step , 6890.64 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 10:23:09 | Epoch: 0 | Step: 145060 | Dataset: 0-1473984 | Loss: 0.717 | 913 ms/step , 6891.61 GFLOP/s , 17940.5 
tokens/s INFO:__main__:2024-11-05 10:23:18 | Epoch: 0 | Step: 145070 | Dataset: 0-1474304 | Loss: 0.822 | 913 ms/step , 6885.84 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 10:23:27 | Epoch: 0 | Step: 145080 | Dataset: 0-1474624 | Loss: 0.716 | 912 ms/step , 6892.86 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 10:23:36 | Epoch: 0 | Step: 145090 | Dataset: 0-1474944 | Loss: 0.821 | 913 ms/step , 6888.23 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 10:23:45 | Epoch: 0 | Step: 145100 | Dataset: 0-1475264 | Loss: 0.808 | 914 ms/step , 6882.12 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 10:23:47 | Validation | Step: 145100 | Val_loss: 0.793 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:23:56 | Epoch: 0 | Step: 145110 | Dataset: 0-1475584 | Loss: 0.809 | 913 ms/step , 6887.79 GFLOP/s , 15270.9 tokens/s INFO:__main__:2024-11-05 10:24:05 | Epoch: 0 | Step: 145120 | Dataset: 0-1475904 | Loss: 0.760 | 913 ms/step , 6891.89 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 10:24:14 | Epoch: 0 | Step: 145130 | Dataset: 0-1476224 | Loss: 0.857 | 913 ms/step , 6890.67 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 10:24:23 | Epoch: 0 | Step: 145140 | Dataset: 0-1476544 | Loss: 0.718 | 913 ms/step , 6890.04 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 10:24:32 | Epoch: 0 | Step: 145150 | Dataset: 0-1476864 | Loss: 0.754 | 913 ms/step , 6887.37 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 10:24:42 | Epoch: 0 | Step: 145160 | Dataset: 0-1477184 | Loss: 0.705 | 912 ms/step , 6896.78 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 10:24:51 | Epoch: 0 | Step: 145170 | Dataset: 0-1477504 | Loss: 0.686 | 913 ms/step , 6889.60 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 10:25:00 | Epoch: 0 | Step: 145180 | Dataset: 0-1477824 | Loss: 0.719 | 913 ms/step , 6891.11 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 10:25:09 | Epoch: 0 | Step: 145190 | Dataset: 0-1478144 | Loss: 0.752 | 913 ms/step , 6891.24 GFLOP/s , 
17942.8 tokens/s INFO:__main__:2024-11-05 10:25:18 | Epoch: 0 | Step: 145200 | Dataset: 0-1478464 | Loss: 0.758 | 912 ms/step , 6892.96 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 10:25:20 | Validation | Step: 145200 | Val_loss: 0.796 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:25:29 | Epoch: 0 | Step: 145210 | Dataset: 0-1478784 | Loss: 0.681 | 914 ms/step , 6881.28 GFLOP/s , 15267.6 tokens/s INFO:__main__:2024-11-05 10:25:38 | Epoch: 0 | Step: 145220 | Dataset: 0-1479104 | Loss: 0.726 | 912 ms/step , 6894.76 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 10:25:47 | Epoch: 0 | Step: 145230 | Dataset: 0-1479424 | Loss: 0.819 | 913 ms/step , 6892.58 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 10:25:56 | Epoch: 0 | Step: 145240 | Dataset: 0-1479744 | Loss: 0.817 | 913 ms/step , 6888.68 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 10:26:05 | Epoch: 0 | Step: 145250 | Dataset: 0-1480064 | Loss: 0.774 | 912 ms/step , 6893.72 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 10:26:14 | Epoch: 0 | Step: 145260 | Dataset: 0-1480384 | Loss: 0.922 | 912 ms/step , 6894.76 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 10:26:24 | Epoch: 0 | Step: 145270 | Dataset: 0-1480704 | Loss: 0.788 | 914 ms/step , 6884.30 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 10:26:33 | Epoch: 0 | Step: 145280 | Dataset: 0-1481024 | Loss: 0.701 | 913 ms/step , 6886.89 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 10:26:42 | Epoch: 0 | Step: 145290 | Dataset: 0-1481344 | Loss: 0.760 | 916 ms/step , 6869.34 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 10:26:51 | Epoch: 0 | Step: 145300 | Dataset: 0-1481664 | Loss: 0.747 | 913 ms/step , 6889.82 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 10:26:53 | Validation | Step: 145300 | Val_loss: 0.804 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:27:02 | Epoch: 0 | Step: 145310 | Dataset: 0-1481984 | Loss: 0.845 | 913 ms/step , 6889.41 GFLOP/s , 15274.6 tokens/s 
INFO:__main__:2024-11-05 10:27:11 | Epoch: 0 | Step: 145320 | Dataset: 0-1482304 | Loss: 0.782 | 912 ms/step , 6893.44 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 10:27:20 | Epoch: 0 | Step: 145330 | Dataset: 0-1482624 | Loss: 0.637 | 913 ms/step , 6892.15 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 10:27:29 | Epoch: 0 | Step: 145340 | Dataset: 0-1482944 | Loss: 0.820 | 913 ms/step , 6891.55 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 10:27:38 | Epoch: 0 | Step: 145350 | Dataset: 0-1483264 | Loss: 0.512 | 912 ms/step , 6896.92 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 10:27:47 | Epoch: 0 | Step: 145360 | Dataset: 0-1483584 | Loss: 0.772 | 913 ms/step , 6890.70 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 10:27:57 | Epoch: 0 | Step: 145370 | Dataset: 0-1483904 | Loss: 0.823 | 913 ms/step , 6886.70 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 10:28:06 | Epoch: 0 | Step: 145380 | Dataset: 0-1484224 | Loss: 0.687 | 913 ms/step , 6888.59 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 10:28:15 | Epoch: 0 | Step: 145390 | Dataset: 0-1484544 | Loss: 0.858 | 913 ms/step , 6890.24 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 10:28:24 | Epoch: 0 | Step: 145400 | Dataset: 0-1484864 | Loss: 0.729 | 913 ms/step , 6885.27 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 10:28:26 | Validation | Step: 145400 | Val_loss: 0.812 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:28:35 | Epoch: 0 | Step: 145410 | Dataset: 0-1485184 | Loss: 0.796 | 913 ms/step , 6889.73 GFLOP/s , 15293.1 tokens/s INFO:__main__:2024-11-05 10:28:44 | Epoch: 0 | Step: 145420 | Dataset: 0-1485504 | Loss: 0.799 | 914 ms/step , 6879.24 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 10:28:53 | Epoch: 0 | Step: 145430 | Dataset: 0-1485824 | Loss: 0.816 | 912 ms/step , 6899.37 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 10:29:02 | Epoch: 0 | Step: 145440 | Dataset: 0-1486144 | Loss: 0.733 | 915 ms/step , 6876.60 GFLOP/s , 17936.6 
tokens/s INFO:__main__:2024-11-05 10:29:11 | Epoch: 0 | Step: 145450 | Dataset: 0-1486464 | Loss: 0.620 | 911 ms/step , 6901.09 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 10:29:20 | Epoch: 0 | Step: 145460 | Dataset: 0-1486784 | Loss: 0.744 | 913 ms/step , 6889.38 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 10:29:29 | Epoch: 0 | Step: 145470 | Dataset: 0-1487104 | Loss: 0.679 | 913 ms/step , 6886.64 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 10:29:39 | Epoch: 0 | Step: 145480 | Dataset: 0-1487424 | Loss: 0.696 | 912 ms/step , 6894.61 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 10:29:48 | Epoch: 0 | Step: 145490 | Dataset: 0-1487744 | Loss: 0.802 | 914 ms/step , 6882.52 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 10:29:57 | Epoch: 0 | Step: 145500 | Dataset: 0-1488064 | Loss: 0.709 | 914 ms/step , 6880.08 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 10:29:58 | Validation | Step: 145500 | Val_loss: 0.801 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:30:08 | Epoch: 0 | Step: 145510 | Dataset: 0-1488384 | Loss: 0.765 | 913 ms/step , 6887.35 GFLOP/s , 15279.4 tokens/s INFO:__main__:2024-11-05 10:30:17 | Epoch: 0 | Step: 145520 | Dataset: 0-1488704 | Loss: 0.535 | 912 ms/step , 6896.44 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 10:30:26 | Epoch: 0 | Step: 145530 | Dataset: 0-1489024 | Loss: 0.811 | 914 ms/step , 6880.91 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 10:30:35 | Epoch: 0 | Step: 145540 | Dataset: 0-1489344 | Loss: 0.844 | 913 ms/step , 6890.27 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 10:30:44 | Epoch: 0 | Step: 145550 | Dataset: 0-1489664 | Loss: 0.715 | 913 ms/step , 6892.46 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 10:30:53 | Epoch: 0 | Step: 145560 | Dataset: 0-1489984 | Loss: 0.816 | 912 ms/step , 6895.40 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 10:31:02 | Epoch: 0 | Step: 145570 | Dataset: 0-1490304 | Loss: 0.738 | 914 ms/step , 6877.58 GFLOP/s , 
17936.5 tokens/s INFO:__main__:2024-11-05 10:31:12 | Epoch: 0 | Step: 145580 | Dataset: 0-1490624 | Loss: 0.661 | 912 ms/step , 6894.15 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 10:31:21 | Epoch: 0 | Step: 145590 | Dataset: 0-1490944 | Loss: 0.779 | 914 ms/step , 6884.26 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 10:31:30 | Epoch: 0 | Step: 145600 | Dataset: 0-1491264 | Loss: 0.936 | 914 ms/step , 6880.69 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 10:31:31 | Validation | Step: 145600 | Val_loss: 0.735 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:31:41 | Epoch: 0 | Step: 145610 | Dataset: 0-1491584 | Loss: 0.839 | 915 ms/step , 6874.48 GFLOP/s , 15284.1 tokens/s INFO:__main__:2024-11-05 10:31:50 | Epoch: 0 | Step: 145620 | Dataset: 0-1491904 | Loss: 0.805 | 913 ms/step , 6891.50 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 10:31:59 | Epoch: 0 | Step: 145630 | Dataset: 0-1492224 | Loss: 0.871 | 917 ms/step , 6856.17 GFLOP/s , 17902.7 tokens/s INFO:__main__:2024-11-05 10:32:08 | Epoch: 0 | Step: 145640 | Dataset: 0-1492544 | Loss: 0.910 | 913 ms/step , 6886.37 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 10:32:17 | Epoch: 0 | Step: 145650 | Dataset: 0-1492864 | Loss: 0.653 | 913 ms/step , 6886.86 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 10:32:26 | Epoch: 0 | Step: 145660 | Dataset: 0-1493184 | Loss: 0.688 | 912 ms/step , 6896.58 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 10:32:35 | Epoch: 0 | Step: 145670 | Dataset: 0-1493504 | Loss: 0.611 | 913 ms/step , 6887.86 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 10:32:44 | Epoch: 0 | Step: 145680 | Dataset: 0-1493824 | Loss: 0.863 | 915 ms/step , 6876.01 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 10:32:54 | Epoch: 0 | Step: 145690 | Dataset: 0-1494144 | Loss: 0.823 | 915 ms/step , 6876.02 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 10:33:03 | Epoch: 0 | Step: 145700 | Dataset: 0-1494464 | Loss: 0.609 | 913 ms/step , 6892.06 GFLOP/s 
, 17938.1 tokens/s INFO:__main__:2024-11-05 10:33:04 | Validation | Step: 145700 | Val_loss: 0.799 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:33:13 | Epoch: 0 | Step: 145710 | Dataset: 0-1494784 | Loss: 0.788 | 913 ms/step , 6891.15 GFLOP/s , 15271.4 tokens/s INFO:__main__:2024-11-05 10:33:23 | Epoch: 0 | Step: 145720 | Dataset: 0-1495104 | Loss: 0.758 | 912 ms/step , 6894.05 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 10:33:32 | Epoch: 0 | Step: 145730 | Dataset: 0-1495424 | Loss: 0.791 | 913 ms/step , 6886.27 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 10:33:41 | Epoch: 0 | Step: 145740 | Dataset: 0-1495744 | Loss: 0.738 | 913 ms/step , 6891.42 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 10:33:50 | Epoch: 0 | Step: 145750 | Dataset: 0-1496064 | Loss: 0.776 | 913 ms/step , 6886.12 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 10:33:59 | Epoch: 0 | Step: 145760 | Dataset: 0-1496384 | Loss: 0.755 | 912 ms/step , 6894.17 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 10:34:08 | Epoch: 0 | Step: 145770 | Dataset: 0-1496704 | Loss: 0.728 | 913 ms/step , 6891.34 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 10:34:17 | Epoch: 0 | Step: 145780 | Dataset: 0-1497024 | Loss: 0.828 | 913 ms/step , 6889.87 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 10:34:27 | Epoch: 0 | Step: 145790 | Dataset: 0-1497344 | Loss: 0.748 | 914 ms/step , 6884.45 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 10:34:36 | Epoch: 0 | Step: 145800 | Dataset: 0-1497664 | Loss: 0.755 | 912 ms/step , 6894.84 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 10:34:37 | Validation | Step: 145800 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:34:46 | Epoch: 0 | Step: 145810 | Dataset: 0-1497984 | Loss: 0.554 | 912 ms/step , 6893.54 GFLOP/s , 15292.2 tokens/s INFO:__main__:2024-11-05 10:34:56 | Epoch: 0 | Step: 145820 | Dataset: 0-1498304 | Loss: 0.794 | 913 ms/step , 6891.40 GFLOP/s , 17934.8 tokens/s 
INFO:__main__:2024-11-05 10:35:05 | Epoch: 0 | Step: 145830 | Dataset: 0-1498624 | Loss: 0.813 | 914 ms/step , 6883.77 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 10:35:14 | Epoch: 0 | Step: 145840 | Dataset: 0-1498944 | Loss: 0.832 | 914 ms/step , 6883.59 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 10:35:23 | Epoch: 0 | Step: 145850 | Dataset: 0-1499264 | Loss: 0.859 | 912 ms/step , 6894.06 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 10:35:32 | Epoch: 0 | Step: 145860 | Dataset: 0-1499584 | Loss: 0.727 | 913 ms/step , 6885.25 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 10:35:41 | Epoch: 0 | Step: 145870 | Dataset: 0-1499904 | Loss: 0.800 | 912 ms/step , 6898.27 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 10:35:50 | Epoch: 0 | Step: 145880 | Dataset: 0-1500224 | Loss: 0.693 | 913 ms/step , 6885.05 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 10:35:59 | Epoch: 0 | Step: 145890 | Dataset: 0-1500544 | Loss: 0.799 | 913 ms/step , 6888.67 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 10:36:09 | Epoch: 0 | Step: 145900 | Dataset: 0-1500864 | Loss: 0.746 | 913 ms/step , 6885.66 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 10:36:10 | Validation | Step: 145900 | Val_loss: 0.746 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:36:19 | Epoch: 0 | Step: 145910 | Dataset: 0-1501184 | Loss: 0.731 | 913 ms/step , 6892.52 GFLOP/s , 15283.8 tokens/s INFO:__main__:2024-11-05 10:36:28 | Epoch: 0 | Step: 145920 | Dataset: 0-1501504 | Loss: 0.730 | 913 ms/step , 6889.33 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 10:36:38 | Epoch: 0 | Step: 145930 | Dataset: 0-1501824 | Loss: 0.701 | 912 ms/step , 6894.25 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 10:36:47 | Epoch: 0 | Step: 145940 | Dataset: 0-1502144 | Loss: 0.857 | 912 ms/step , 6893.38 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 10:36:56 | Epoch: 0 | Step: 145950 | Dataset: 0-1502464 | Loss: 0.746 | 913 ms/step , 6888.95 GFLOP/s , 17941.5 
tokens/s INFO:__main__:2024-11-05 10:37:05 | Epoch: 0 | Step: 145960 | Dataset: 0-1502784 | Loss: 0.908 | 913 ms/step , 6891.71 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 10:37:14 | Epoch: 0 | Step: 145970 | Dataset: 0-1503104 | Loss: 0.685 | 913 ms/step , 6889.52 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 10:37:23 | Epoch: 0 | Step: 145980 | Dataset: 0-1503424 | Loss: 0.880 | 913 ms/step , 6886.84 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 10:37:32 | Epoch: 0 | Step: 145990 | Dataset: 0-1503744 | Loss: 0.679 | 912 ms/step , 6894.46 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 10:37:42 | Epoch: 0 | Step: 146000 | Dataset: 0-1504064 | Loss: 0.762 | 913 ms/step , 6891.82 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 10:37:43 | Validation | Step: 146000 | Val_loss: 0.773 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:37:43 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_103743_step_146000.pt` INFO:__main__:2024-11-05 10:37:53 | Epoch: 0 | Step: 146010 | Dataset: 0-1504384 | Loss: 0.837 | 914 ms/step , 6884.05 GFLOP/s , 13815.2 tokens/s INFO:__main__:2024-11-05 10:38:02 | Epoch: 0 | Step: 146020 | Dataset: 0-1504704 | Loss: 0.881 | 912 ms/step , 6894.89 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 10:38:12 | Epoch: 0 | Step: 146030 | Dataset: 0-1505024 | Loss: 0.744 | 911 ms/step , 6900.18 GFLOP/s , 17951.4 tokens/s INFO:__main__:2024-11-05 10:38:21 | Epoch: 0 | Step: 146040 | Dataset: 0-1505344 | Loss: 0.668 | 913 ms/step , 6887.63 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 10:38:30 | Epoch: 0 | Step: 146050 | Dataset: 0-1505664 | Loss: 0.708 | 913 ms/step , 6890.24 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 10:38:39 | Epoch: 0 | Step: 146060 | Dataset: 0-1505984 | Loss: 0.766 | 913 ms/step , 6892.07 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 10:38:48 | Epoch: 0 | Step: 146070 | Dataset: 0-1506304 | Loss: 0.717 | 914 ms/step , 6883.00 GFLOP/s , 17941.7 
tokens/s INFO:__main__:2024-11-05 10:38:57 | Epoch: 0 | Step: 146080 | Dataset: 0-1506624 | Loss: 0.828 | 914 ms/step , 6883.74 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 10:39:06 | Epoch: 0 | Step: 146090 | Dataset: 0-1506944 | Loss: 0.782 | 914 ms/step , 6884.24 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 10:39:16 | Epoch: 0 | Step: 146100 | Dataset: 0-1507264 | Loss: 0.789 | 913 ms/step , 6885.88 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 10:39:17 | Validation | Step: 146100 | Val_loss: 0.665 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:39:26 | Epoch: 0 | Step: 146110 | Dataset: 0-1507584 | Loss: 0.776 | 912 ms/step , 6898.13 GFLOP/s , 15304.4 tokens/s INFO:__main__:2024-11-05 10:39:35 | Epoch: 0 | Step: 146120 | Dataset: 0-1507904 | Loss: 0.818 | 912 ms/step , 6895.09 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 10:39:45 | Epoch: 0 | Step: 146130 | Dataset: 0-1508224 | Loss: 0.681 | 912 ms/step , 6897.49 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 10:39:54 | Epoch: 0 | Step: 146140 | Dataset: 0-1508544 | Loss: 0.774 | 913 ms/step , 6890.92 GFLOP/s , 17955.0 tokens/s INFO:__main__:2024-11-05 10:40:03 | Epoch: 0 | Step: 146150 | Dataset: 0-1508864 | Loss: 0.840 | 912 ms/step , 6894.69 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 10:40:12 | Epoch: 0 | Step: 146160 | Dataset: 0-1509184 | Loss: 0.801 | 912 ms/step , 6896.71 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 10:40:21 | Epoch: 0 | Step: 146170 | Dataset: 0-1509504 | Loss: 0.706 | 913 ms/step , 6891.97 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 10:40:30 | Epoch: 0 | Step: 146180 | Dataset: 0-1509824 | Loss: 0.754 | 913 ms/step , 6889.75 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 10:40:39 | Epoch: 0 | Step: 146190 | Dataset: 0-1510144 | Loss: 0.689 | 914 ms/step , 6883.89 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 10:40:48 | Epoch: 0 | Step: 146200 | Dataset: 0-1510464 | Loss: 0.785 | 914 ms/step , 6884.42 GFLOP/s , 
17940.3 tokens/s INFO:__main__:2024-11-05 10:40:50 | Validation | Step: 146200 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:40:59 | Epoch: 0 | Step: 146210 | Dataset: 0-1510784 | Loss: 0.770 | 912 ms/step , 6897.84 GFLOP/s , 15281.3 tokens/s INFO:__main__:2024-11-05 10:41:08 | Epoch: 0 | Step: 146220 | Dataset: 0-1511104 | Loss: 0.831 | 914 ms/step , 6881.43 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 10:41:17 | Epoch: 0 | Step: 146230 | Dataset: 0-1511424 | Loss: 0.768 | 915 ms/step , 6877.12 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 10:41:27 | Epoch: 0 | Step: 146240 | Dataset: 0-1511744 | Loss: 0.785 | 912 ms/step , 6897.51 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 10:41:36 | Epoch: 0 | Step: 146250 | Dataset: 0-1512064 | Loss: 0.776 | 913 ms/step , 6887.70 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 10:41:45 | Epoch: 0 | Step: 146260 | Dataset: 0-1512384 | Loss: 0.385 | 912 ms/step , 6892.88 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 10:41:54 | Epoch: 0 | Step: 146270 | Dataset: 0-1512704 | Loss: 0.680 | 914 ms/step , 6882.73 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 10:42:03 | Epoch: 0 | Step: 146280 | Dataset: 0-1513024 | Loss: 0.528 | 912 ms/step , 6897.70 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 10:42:12 | Epoch: 0 | Step: 146290 | Dataset: 0-1513344 | Loss: 0.694 | 913 ms/step , 6889.22 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 10:42:21 | Epoch: 0 | Step: 146300 | Dataset: 0-1513664 | Loss: 0.851 | 914 ms/step , 6879.45 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 10:42:23 | Validation | Step: 146300 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:42:32 | Epoch: 0 | Step: 146310 | Dataset: 0-1513984 | Loss: 0.782 | 913 ms/step , 6890.38 GFLOP/s , 15278.6 tokens/s INFO:__main__:2024-11-05 10:42:41 | Epoch: 0 | Step: 146320 | Dataset: 0-1514304 | Loss: 0.759 | 913 ms/step , 6889.63 GFLOP/s , 17938.8 tokens/s 
INFO:__main__:2024-11-05 10:42:50 | Epoch: 0 | Step: 146330 | Dataset: 0-1514624 | Loss: 0.972 | 915 ms/step , 6874.77 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 10:43:00 | Epoch: 0 | Step: 146340 | Dataset: 0-1514944 | Loss: 0.836 | 913 ms/step , 6886.73 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 10:43:09 | Epoch: 0 | Step: 146350 | Dataset: 0-1515264 | Loss: 0.712 | 913 ms/step , 6889.00 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 10:43:18 | Epoch: 0 | Step: 146360 | Dataset: 0-1515584 | Loss: 0.769 | 913 ms/step , 6891.72 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 10:43:27 | Epoch: 0 | Step: 146370 | Dataset: 0-1515904 | Loss: 0.717 | 912 ms/step , 6896.10 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-05 10:43:36 | Epoch: 0 | Step: 146380 | Dataset: 0-1516224 | Loss: 0.738 | 912 ms/step , 6897.27 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 10:43:45 | Epoch: 0 | Step: 146390 | Dataset: 0-1516544 | Loss: 0.714 | 913 ms/step , 6890.28 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 10:43:54 | Epoch: 0 | Step: 146400 | Dataset: 0-1516864 | Loss: 0.697 | 912 ms/step , 6895.87 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 10:43:56 | Validation | Step: 146400 | Val_loss: 0.764 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:44:05 | Epoch: 0 | Step: 146410 | Dataset: 0-1517184 | Loss: 0.762 | 912 ms/step , 6895.45 GFLOP/s , 15291.5 tokens/s INFO:__main__:2024-11-05 10:44:14 | Epoch: 0 | Step: 146420 | Dataset: 0-1517504 | Loss: 0.787 | 912 ms/step , 6897.31 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 10:44:23 | Epoch: 0 | Step: 146430 | Dataset: 0-1517824 | Loss: 0.646 | 912 ms/step , 6894.04 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 10:44:32 | Epoch: 0 | Step: 146440 | Dataset: 0-1518144 | Loss: 0.797 | 912 ms/step , 6897.43 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 10:44:42 | Epoch: 0 | Step: 146450 | Dataset: 0-1518464 | Loss: 0.805 | 913 ms/step , 6885.08 GFLOP/s , 17941.8 
tokens/s INFO:__main__:2024-11-05 10:44:51 | Epoch: 0 | Step: 146460 | Dataset: 0-1518784 | Loss: 0.733 | 914 ms/step , 6884.51 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-05 10:45:00 | Epoch: 0 | Step: 146470 | Dataset: 0-1519104 | Loss: 0.793 | 913 ms/step , 6890.28 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-05 10:45:09 | Epoch: 0 | Step: 146480 | Dataset: 0-1519424 | Loss: 0.736 | 913 ms/step , 6886.18 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 10:45:18 | Epoch: 0 | Step: 146490 | Dataset: 0-1519744 | Loss: 0.667 | 911 ms/step , 6901.89 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 10:45:27 | Epoch: 0 | Step: 146500 | Dataset: 0-1520064 | Loss: 0.745 | 913 ms/step , 6891.25 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 10:45:29 | Validation | Step: 146500 | Val_loss: 0.787 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:45:38 | Epoch: 0 | Step: 146510 | Dataset: 0-1520384 | Loss: 0.768 | 913 ms/step , 6885.55 GFLOP/s , 15284.2 tokens/s INFO:__main__:2024-11-05 10:45:47 | Epoch: 0 | Step: 146520 | Dataset: 0-1520704 | Loss: 0.829 | 911 ms/step , 6901.84 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 10:45:56 | Epoch: 0 | Step: 146530 | Dataset: 0-1521024 | Loss: 0.780 | 912 ms/step , 6897.85 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 10:46:05 | Epoch: 0 | Step: 146540 | Dataset: 0-1521344 | Loss: 0.759 | 913 ms/step , 6890.68 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 10:46:14 | Epoch: 0 | Step: 146550 | Dataset: 0-1521664 | Loss: 0.892 | 913 ms/step , 6885.07 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 10:46:24 | Epoch: 0 | Step: 146560 | Dataset: 0-1521984 | Loss: 0.784 | 913 ms/step , 6891.90 GFLOP/s , 17950.4 tokens/s INFO:__main__:2024-11-05 10:46:33 | Epoch: 0 | Step: 146570 | Dataset: 0-1522304 | Loss: 0.727 | 912 ms/step , 6897.89 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 10:46:42 | Epoch: 0 | Step: 146580 | Dataset: 0-1522624 | Loss: 0.697 | 911 ms/step , 6902.51 GFLOP/s , 
17955.4 tokens/s INFO:__main__:2024-11-05 10:46:51 | Epoch: 0 | Step: 146590 | Dataset: 0-1522944 | Loss: 0.771 | 912 ms/step , 6898.10 GFLOP/s , 17951.6 tokens/s INFO:__main__:2024-11-05 10:47:00 | Epoch: 0 | Step: 146600 | Dataset: 0-1523264 | Loss: 0.850 | 913 ms/step , 6885.55 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 10:47:02 | Validation | Step: 146600 | Val_loss: 0.709 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:47:11 | Epoch: 0 | Step: 146610 | Dataset: 0-1523584 | Loss: 0.763 | 912 ms/step , 6894.40 GFLOP/s , 15287.0 tokens/s INFO:__main__:2024-11-05 10:47:20 | Epoch: 0 | Step: 146620 | Dataset: 0-1523904 | Loss: 0.878 | 914 ms/step , 6878.85 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 10:47:29 | Epoch: 0 | Step: 146630 | Dataset: 0-1524224 | Loss: 0.835 | 912 ms/step , 6894.10 GFLOP/s , 17951.3 tokens/s INFO:__main__:2024-11-05 10:47:38 | Epoch: 0 | Step: 146640 | Dataset: 0-1524544 | Loss: 0.678 | 912 ms/step , 6896.90 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 10:47:47 | Epoch: 0 | Step: 146650 | Dataset: 0-1524864 | Loss: 0.824 | 913 ms/step , 6892.11 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 10:47:56 | Epoch: 0 | Step: 146660 | Dataset: 0-1525184 | Loss: 0.883 | 914 ms/step , 6884.34 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 10:48:06 | Epoch: 0 | Step: 146670 | Dataset: 0-1525504 | Loss: 0.619 | 912 ms/step , 6899.03 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 10:48:15 | Epoch: 0 | Step: 146680 | Dataset: 0-1525824 | Loss: 0.682 | 913 ms/step , 6885.47 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 10:48:24 | Epoch: 0 | Step: 146690 | Dataset: 0-1526144 | Loss: 0.789 | 915 ms/step , 6874.23 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 10:48:33 | Epoch: 0 | Step: 146700 | Dataset: 0-1526464 | Loss: 0.770 | 913 ms/step , 6890.75 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 10:48:35 | Validation | Step: 146700 | Val_loss: 0.768 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 10:48:44 | Epoch: 0 | Step: 146710 | Dataset: 0-1526784 | Loss: 0.784 | 915 ms/step , 6875.87 GFLOP/s , 15285.7 tokens/s INFO:__main__:2024-11-05 10:48:53 | Epoch: 0 | Step: 146720 | Dataset: 0-1527104 | Loss: 0.763 | 912 ms/step , 6894.94 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 10:49:02 | Epoch: 0 | Step: 146730 | Dataset: 0-1527424 | Loss: 0.856 | 913 ms/step , 6888.83 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 10:49:11 | Epoch: 0 | Step: 146740 | Dataset: 0-1527744 | Loss: 0.775 | 913 ms/step , 6889.03 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 10:49:20 | Epoch: 0 | Step: 146750 | Dataset: 0-1528064 | Loss: 0.806 | 912 ms/step , 6896.54 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 10:49:29 | Epoch: 0 | Step: 146760 | Dataset: 0-1528384 | Loss: 0.863 | 912 ms/step , 6895.44 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 10:49:38 | Epoch: 0 | Step: 146770 | Dataset: 0-1528704 | Loss: 0.695 | 912 ms/step , 6895.54 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 10:49:48 | Epoch: 0 | Step: 146780 | Dataset: 0-1529024 | Loss: 0.745 | 912 ms/step , 6893.28 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 10:49:57 | Epoch: 0 | Step: 146790 | Dataset: 0-1529344 | Loss: 0.826 | 912 ms/step , 6896.09 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 10:50:06 | Epoch: 0 | Step: 146800 | Dataset: 0-1529664 | Loss: 0.813 | 913 ms/step , 6887.61 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 10:50:07 | Validation | Step: 146800 | Val_loss: 0.759 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:50:17 | Epoch: 0 | Step: 146810 | Dataset: 0-1529984 | Loss: 0.699 | 913 ms/step , 6886.97 GFLOP/s , 15279.9 tokens/s INFO:__main__:2024-11-05 10:50:26 | Epoch: 0 | Step: 146820 | Dataset: 0-1530304 | Loss: 0.710 | 914 ms/step , 6881.74 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 10:50:35 | Epoch: 0 | Step: 146830 | Dataset: 0-1530624 | Loss: 0.758 | 913 ms/step , 6890.63 GFLOP/s , 17940.6 
tokens/s INFO:__main__:2024-11-05 10:50:44 | Epoch: 0 | Step: 146840 | Dataset: 0-1530944 | Loss: 0.750 | 913 ms/step , 6891.52 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 10:50:53 | Epoch: 0 | Step: 146850 | Dataset: 0-1531264 | Loss: 0.747 | 912 ms/step , 6894.59 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 10:51:02 | Epoch: 0 | Step: 146860 | Dataset: 0-1531584 | Loss: 0.788 | 913 ms/step , 6887.12 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 10:51:11 | Epoch: 0 | Step: 146870 | Dataset: 0-1531904 | Loss: 0.748 | 911 ms/step , 6900.16 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 10:51:21 | Epoch: 0 | Step: 146880 | Dataset: 0-1532224 | Loss: 0.814 | 913 ms/step , 6885.94 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 10:51:30 | Epoch: 0 | Step: 146890 | Dataset: 0-1532544 | Loss: 0.754 | 913 ms/step , 6888.91 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 10:51:39 | Epoch: 0 | Step: 146900 | Dataset: 0-1532864 | Loss: 0.658 | 913 ms/step , 6890.27 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 10:51:40 | Validation | Step: 146900 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:51:49 | Epoch: 0 | Step: 146910 | Dataset: 0-1533184 | Loss: 0.843 | 913 ms/step , 6892.44 GFLOP/s , 15281.1 tokens/s INFO:__main__:2024-11-05 10:51:59 | Epoch: 0 | Step: 146920 | Dataset: 0-1533504 | Loss: 0.719 | 912 ms/step , 6899.52 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-05 10:52:08 | Epoch: 0 | Step: 146930 | Dataset: 0-1533824 | Loss: 0.709 | 912 ms/step , 6895.96 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 10:52:17 | Epoch: 0 | Step: 146940 | Dataset: 0-1534144 | Loss: 0.860 | 915 ms/step , 6876.12 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 10:52:26 | Epoch: 0 | Step: 146950 | Dataset: 0-1534464 | Loss: 0.780 | 913 ms/step , 6889.74 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 10:52:35 | Epoch: 0 | Step: 146960 | Dataset: 0-1534784 | Loss: 0.811 | 911 ms/step , 6903.84 GFLOP/s , 
17950.2 tokens/s INFO:__main__:2024-11-05 10:52:44 | Epoch: 0 | Step: 146970 | Dataset: 0-1535104 | Loss: 0.795 | 913 ms/step , 6889.27 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 10:52:53 | Epoch: 0 | Step: 146980 | Dataset: 0-1535424 | Loss: 0.803 | 911 ms/step , 6900.32 GFLOP/s , 17951.1 tokens/s INFO:__main__:2024-11-05 10:53:03 | Epoch: 0 | Step: 146990 | Dataset: 0-1535744 | Loss: 0.693 | 912 ms/step , 6897.76 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 10:53:12 | Epoch: 0 | Step: 147000 | Dataset: 0-1536064 | Loss: 0.756 | 914 ms/step , 6880.59 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 10:53:13 | Validation | Step: 147000 | Val_loss: 0.784 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:53:13 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_105313_step_147000.pt` INFO:__main__:2024-11-05 10:53:24 | Epoch: 0 | Step: 147010 | Dataset: 0-1536384 | Loss: 0.769 | 914 ms/step , 6883.79 GFLOP/s , 13832.4 tokens/s INFO:__main__:2024-11-05 10:53:33 | Epoch: 0 | Step: 147020 | Dataset: 0-1536704 | Loss: 0.873 | 912 ms/step , 6896.36 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 10:53:42 | Epoch: 0 | Step: 147030 | Dataset: 0-1537024 | Loss: 0.833 | 913 ms/step , 6891.11 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 10:53:51 | Epoch: 0 | Step: 147040 | Dataset: 0-1537344 | Loss: 0.739 | 911 ms/step , 6902.16 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 10:54:00 | Epoch: 0 | Step: 147050 | Dataset: 0-1537664 | Loss: 0.738 | 913 ms/step , 6886.98 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 10:54:09 | Epoch: 0 | Step: 147060 | Dataset: 0-1537984 | Loss: 0.748 | 912 ms/step , 6893.99 GFLOP/s , 17947.2 tokens/s INFO:__main__:2024-11-05 10:54:18 | Epoch: 0 | Step: 147070 | Dataset: 0-1538304 | Loss: 0.717 | 913 ms/step , 6889.04 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 10:54:27 | Epoch: 0 | Step: 147080 | Dataset: 0-1538624 | Loss: 0.701 | 912 ms/step , 6894.72 GFLOP/s , 
17951.7 tokens/s INFO:__main__:2024-11-05 10:54:37 | Epoch: 0 | Step: 147090 | Dataset: 0-1538944 | Loss: 0.680 | 912 ms/step , 6896.38 GFLOP/s , 17955.5 tokens/s INFO:__main__:2024-11-05 10:54:46 | Epoch: 0 | Step: 147100 | Dataset: 0-1539264 | Loss: 0.820 | 912 ms/step , 6897.15 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 10:54:47 | Validation | Step: 147100 | Val_loss: 0.761 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:54:56 | Epoch: 0 | Step: 147110 | Dataset: 0-1539584 | Loss: 0.727 | 913 ms/step , 6886.60 GFLOP/s , 15296.5 tokens/s INFO:__main__:2024-11-05 10:55:06 | Epoch: 0 | Step: 147120 | Dataset: 0-1539904 | Loss: 0.676 | 911 ms/step , 6901.33 GFLOP/s , 17951.2 tokens/s INFO:__main__:2024-11-05 10:55:15 | Epoch: 0 | Step: 147130 | Dataset: 0-1540224 | Loss: 0.766 | 912 ms/step , 6896.71 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 10:55:24 | Epoch: 0 | Step: 147140 | Dataset: 0-1540544 | Loss: 0.851 | 912 ms/step , 6893.22 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 10:55:33 | Epoch: 0 | Step: 147150 | Dataset: 0-1540864 | Loss: 0.772 | 912 ms/step , 6893.28 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 10:55:42 | Epoch: 0 | Step: 147160 | Dataset: 0-1541184 | Loss: 0.770 | 912 ms/step , 6893.58 GFLOP/s , 17949.2 tokens/s INFO:__main__:2024-11-05 10:55:51 | Epoch: 0 | Step: 147170 | Dataset: 0-1541504 | Loss: 0.698 | 913 ms/step , 6885.79 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 10:56:00 | Epoch: 0 | Step: 147180 | Dataset: 0-1541824 | Loss: 0.793 | 911 ms/step , 6902.84 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 10:56:09 | Epoch: 0 | Step: 147190 | Dataset: 0-1542144 | Loss: 0.903 | 913 ms/step , 6886.71 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 10:56:19 | Epoch: 0 | Step: 147200 | Dataset: 0-1542464 | Loss: 0.715 | 912 ms/step , 6898.75 GFLOP/s , 17957.1 tokens/s INFO:__main__:2024-11-05 10:56:20 | Validation | Step: 147200 | Val_loss: 0.824 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 10:56:29 | Epoch: 0 | Step: 147210 | Dataset: 0-1542784 | Loss: 0.774 | 912 ms/step , 6895.91 GFLOP/s , 15287.3 tokens/s INFO:__main__:2024-11-05 10:56:38 | Epoch: 0 | Step: 147220 | Dataset: 0-1543104 | Loss: 0.478 | 912 ms/step , 6899.70 GFLOP/s , 17961.1 tokens/s INFO:__main__:2024-11-05 10:56:48 | Epoch: 0 | Step: 147230 | Dataset: 0-1543424 | Loss: 0.517 | 912 ms/step , 6896.22 GFLOP/s , 17960.8 tokens/s INFO:__main__:2024-11-05 10:56:57 | Epoch: 0 | Step: 147240 | Dataset: 0-1543744 | Loss: 0.415 | 913 ms/step , 6890.53 GFLOP/s , 17956.6 tokens/s INFO:__main__:2024-11-05 10:57:06 | Epoch: 0 | Step: 147250 | Dataset: 0-1544064 | Loss: 0.497 | 913 ms/step , 6888.65 GFLOP/s , 17953.4 tokens/s INFO:__main__:2024-11-05 10:57:15 | Epoch: 0 | Step: 147260 | Dataset: 0-1544384 | Loss: 0.427 | 911 ms/step , 6902.16 GFLOP/s , 17959.6 tokens/s INFO:__main__:2024-11-05 10:57:24 | Epoch: 0 | Step: 147270 | Dataset: 0-1544704 | Loss: 0.443 | 912 ms/step , 6898.35 GFLOP/s , 17957.4 tokens/s INFO:__main__:2024-11-05 10:57:33 | Epoch: 0 | Step: 147280 | Dataset: 0-1545024 | Loss: 0.423 | 913 ms/step , 6885.20 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-05 10:57:42 | Epoch: 0 | Step: 147290 | Dataset: 0-1545344 | Loss: 0.578 | 912 ms/step , 6893.90 GFLOP/s , 17953.2 tokens/s INFO:__main__:2024-11-05 10:57:51 | Epoch: 0 | Step: 147300 | Dataset: 0-1545664 | Loss: 0.487 | 913 ms/step , 6892.24 GFLOP/s , 17961.6 tokens/s INFO:__main__:2024-11-05 10:57:53 | Validation | Step: 147300 | Val_loss: 0.763 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:58:02 | Epoch: 0 | Step: 147310 | Dataset: 0-1545984 | Loss: 0.305 | 912 ms/step , 6897.73 GFLOP/s , 15305.4 tokens/s INFO:__main__:2024-11-05 10:58:11 | Epoch: 0 | Step: 147320 | Dataset: 0-1546304 | Loss: 0.316 | 912 ms/step , 6899.68 GFLOP/s , 17959.7 tokens/s INFO:__main__:2024-11-05 10:58:20 | Epoch: 0 | Step: 147330 | Dataset: 0-1546624 | Loss: 0.400 | 911 ms/step , 6907.07 GFLOP/s , 17967.9 
tokens/s INFO:__main__:2024-11-05 10:58:29 | Epoch: 0 | Step: 147340 | Dataset: 0-1546944 | Loss: 0.397 | 911 ms/step , 6903.20 GFLOP/s , 17958.1 tokens/s INFO:__main__:2024-11-05 10:58:39 | Epoch: 0 | Step: 147350 | Dataset: 0-1547264 | Loss: 0.408 | 912 ms/step , 6899.82 GFLOP/s , 17963.7 tokens/s INFO:__main__:2024-11-05 10:58:48 | Epoch: 0 | Step: 147360 | Dataset: 0-1547584 | Loss: 0.366 | 911 ms/step , 6904.96 GFLOP/s , 17966.4 tokens/s INFO:__main__:2024-11-05 10:58:57 | Epoch: 0 | Step: 147370 | Dataset: 0-1547904 | Loss: 0.409 | 911 ms/step , 6904.61 GFLOP/s , 17959.3 tokens/s INFO:__main__:2024-11-05 10:59:06 | Epoch: 0 | Step: 147380 | Dataset: 0-1548224 | Loss: 0.454 | 912 ms/step , 6894.26 GFLOP/s , 17962.7 tokens/s INFO:__main__:2024-11-05 10:59:15 | Epoch: 0 | Step: 147390 | Dataset: 0-1548544 | Loss: 0.543 | 911 ms/step , 6904.42 GFLOP/s , 17964.3 tokens/s INFO:__main__:2024-11-05 10:59:24 | Epoch: 0 | Step: 147400 | Dataset: 0-1548864 | Loss: 0.309 | 912 ms/step , 6894.18 GFLOP/s , 17959.7 tokens/s INFO:__main__:2024-11-05 10:59:26 | Validation | Step: 147400 | Val_loss: 0.765 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 10:59:35 | Epoch: 0 | Step: 147410 | Dataset: 0-1549184 | Loss: 0.440 | 912 ms/step , 6896.04 GFLOP/s , 15288.6 tokens/s INFO:__main__:2024-11-05 10:59:44 | Epoch: 0 | Step: 147420 | Dataset: 0-1549504 | Loss: 0.462 | 912 ms/step , 6894.63 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-05 10:59:53 | Epoch: 0 | Step: 147430 | Dataset: 0-1549824 | Loss: 0.347 | 911 ms/step , 6903.01 GFLOP/s , 17961.1 tokens/s INFO:__main__:2024-11-05 11:00:02 | Epoch: 0 | Step: 147440 | Dataset: 0-1550144 | Loss: 0.332 | 912 ms/step , 6894.84 GFLOP/s , 17954.7 tokens/s INFO:__main__:2024-11-05 11:00:11 | Epoch: 0 | Step: 147450 | Dataset: 0-1550464 | Loss: 0.399 | 912 ms/step , 6895.99 GFLOP/s , 17959.1 tokens/s INFO:__main__:2024-11-05 11:00:21 | Epoch: 0 | Step: 147460 | Dataset: 0-1550784 | Loss: 0.615 | 912 ms/step , 6898.79 GFLOP/s , 
17960.5 tokens/s INFO:__main__:2024-11-05 11:00:30 | Epoch: 0 | Step: 147470 | Dataset: 0-1551104 | Loss: 0.492 | 912 ms/step , 6899.10 GFLOP/s , 17962.9 tokens/s INFO:__main__:2024-11-05 11:00:39 | Epoch: 0 | Step: 147480 | Dataset: 0-1551424 | Loss: 0.406 | 913 ms/step , 6891.71 GFLOP/s , 17952.6 tokens/s INFO:__main__:2024-11-05 11:00:48 | Epoch: 0 | Step: 147490 | Dataset: 0-1551744 | Loss: 0.360 | 912 ms/step , 6896.12 GFLOP/s , 17956.2 tokens/s INFO:__main__:2024-11-05 11:00:57 | Epoch: 0 | Step: 147500 | Dataset: 0-1552064 | Loss: 0.371 | 913 ms/step , 6885.33 GFLOP/s , 17958.7 tokens/s INFO:__main__:2024-11-05 11:00:59 | Validation | Step: 147500 | Val_loss: 0.759 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:01:08 | Epoch: 0 | Step: 147510 | Dataset: 0-1552384 | Loss: 0.411 | 913 ms/step , 6889.48 GFLOP/s , 15288.3 tokens/s INFO:__main__:2024-11-05 11:01:17 | Epoch: 0 | Step: 147520 | Dataset: 0-1552704 | Loss: 0.732 | 912 ms/step , 6898.04 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-05 11:01:26 | Epoch: 0 | Step: 147530 | Dataset: 0-1553024 | Loss: 0.598 | 913 ms/step , 6892.56 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 11:01:35 | Epoch: 0 | Step: 147540 | Dataset: 0-1553344 | Loss: 0.692 | 913 ms/step , 6885.43 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 11:01:44 | Epoch: 0 | Step: 147550 | Dataset: 0-1553664 | Loss: 0.782 | 913 ms/step , 6891.71 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 11:01:53 | Epoch: 0 | Step: 147560 | Dataset: 0-1553984 | Loss: 0.765 | 912 ms/step , 6892.77 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 11:02:03 | Epoch: 0 | Step: 147570 | Dataset: 0-1554304 | Loss: 0.564 | 912 ms/step , 6894.10 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 11:02:12 | Epoch: 0 | Step: 147580 | Dataset: 0-1554624 | Loss: 0.567 | 913 ms/step , 6892.32 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 11:02:21 | Epoch: 0 | Step: 147590 | Dataset: 0-1554944 | Loss: 0.721 | 914 ms/step , 6882.94 GFLOP/s 
, 17932.8 tokens/s INFO:__main__:2024-11-05 11:02:30 | Epoch: 0 | Step: 147600 | Dataset: 0-1555264 | Loss: 0.732 | 912 ms/step , 6895.25 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 11:02:32 | Validation | Step: 147600 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:02:41 | Epoch: 0 | Step: 147610 | Dataset: 0-1555584 | Loss: 0.756 | 914 ms/step , 6882.75 GFLOP/s , 15277.0 tokens/s INFO:__main__:2024-11-05 11:02:50 | Epoch: 0 | Step: 147620 | Dataset: 0-1555904 | Loss: 0.687 | 912 ms/step , 6894.41 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 11:02:59 | Epoch: 0 | Step: 147630 | Dataset: 0-1556224 | Loss: 0.689 | 911 ms/step , 6901.68 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 11:03:08 | Epoch: 0 | Step: 147640 | Dataset: 0-1556544 | Loss: 0.739 | 912 ms/step , 6894.30 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 11:03:17 | Epoch: 0 | Step: 147650 | Dataset: 0-1556864 | Loss: 0.840 | 915 ms/step , 6870.30 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 11:03:26 | Epoch: 0 | Step: 147660 | Dataset: 0-1557184 | Loss: 0.812 | 913 ms/step , 6892.18 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 11:03:35 | Epoch: 0 | Step: 147670 | Dataset: 0-1557504 | Loss: 0.802 | 913 ms/step , 6886.00 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 11:03:45 | Epoch: 0 | Step: 147680 | Dataset: 0-1557824 | Loss: 0.748 | 914 ms/step , 6884.03 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 11:03:54 | Epoch: 0 | Step: 147690 | Dataset: 0-1558144 | Loss: 0.746 | 912 ms/step , 6892.90 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 11:04:03 | Epoch: 0 | Step: 147700 | Dataset: 0-1558464 | Loss: 0.811 | 913 ms/step , 6886.01 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 11:04:04 | Validation | Step: 147700 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:04:14 | Epoch: 0 | Step: 147710 | Dataset: 0-1558784 | Loss: 0.784 | 913 ms/step , 6890.33 GFLOP/s , 15277.8 tokens/s 
INFO:__main__:2024-11-05 11:04:23 | Epoch: 0 | Step: 147720 | Dataset: 0-1559104 | Loss: 0.591 | 911 ms/step , 6900.76 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 11:04:32 | Epoch: 0 | Step: 147730 | Dataset: 0-1559424 | Loss: 0.789 | 914 ms/step , 6882.68 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 11:04:41 | Epoch: 0 | Step: 147740 | Dataset: 0-1559744 | Loss: 0.696 | 912 ms/step , 6897.00 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 11:04:50 | Epoch: 0 | Step: 147750 | Dataset: 0-1560064 | Loss: 0.758 | 912 ms/step , 6894.08 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 11:04:59 | Epoch: 0 | Step: 147760 | Dataset: 0-1560384 | Loss: 0.765 | 913 ms/step , 6886.83 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 11:05:08 | Epoch: 0 | Step: 147770 | Dataset: 0-1560704 | Loss: 0.720 | 913 ms/step , 6887.59 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 11:05:18 | Epoch: 0 | Step: 147780 | Dataset: 0-1561024 | Loss: 0.746 | 912 ms/step , 6894.52 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 11:05:27 | Epoch: 0 | Step: 147790 | Dataset: 0-1561344 | Loss: 0.767 | 913 ms/step , 6885.61 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 11:05:36 | Epoch: 0 | Step: 147800 | Dataset: 0-1561664 | Loss: 0.726 | 913 ms/step , 6891.49 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 11:05:37 | Validation | Step: 147800 | Val_loss: 0.753 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:05:47 | Epoch: 0 | Step: 147810 | Dataset: 0-1561984 | Loss: 0.752 | 914 ms/step , 6878.83 GFLOP/s , 15263.5 tokens/s INFO:__main__:2024-11-05 11:05:56 | Epoch: 0 | Step: 147820 | Dataset: 0-1562304 | Loss: 0.656 | 914 ms/step , 6880.63 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 11:06:05 | Epoch: 0 | Step: 147830 | Dataset: 0-1562624 | Loss: 0.777 | 914 ms/step , 6882.88 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 11:06:14 | Epoch: 0 | Step: 147840 | Dataset: 0-1562944 | Loss: 0.579 | 912 ms/step , 6894.07 GFLOP/s , 17934.1 
tokens/s INFO:__main__:2024-11-05 11:06:23 | Epoch: 0 | Step: 147850 | Dataset: 0-1563264 | Loss: 0.656 | 912 ms/step , 6894.53 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 11:06:32 | Epoch: 0 | Step: 147860 | Dataset: 0-1563584 | Loss: 0.744 | 913 ms/step , 6891.53 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 11:06:41 | Epoch: 0 | Step: 147870 | Dataset: 0-1563904 | Loss: 0.695 | 914 ms/step , 6880.01 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 11:06:50 | Epoch: 0 | Step: 147880 | Dataset: 0-1564224 | Loss: 0.716 | 913 ms/step , 6890.51 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 11:07:00 | Epoch: 0 | Step: 147890 | Dataset: 0-1564544 | Loss: 0.672 | 914 ms/step , 6881.09 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 11:07:09 | Epoch: 0 | Step: 147900 | Dataset: 0-1564864 | Loss: 0.730 | 913 ms/step , 6888.34 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 11:07:10 | Validation | Step: 147900 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:07:19 | Epoch: 0 | Step: 147910 | Dataset: 0-1565184 | Loss: 0.724 | 912 ms/step , 6894.49 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 11:07:29 | Epoch: 0 | Step: 147920 | Dataset: 0-1565504 | Loss: 0.765 | 913 ms/step , 6888.99 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 11:07:38 | Epoch: 0 | Step: 147930 | Dataset: 0-1565824 | Loss: 0.790 | 913 ms/step , 6886.12 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 11:07:47 | Epoch: 0 | Step: 147940 | Dataset: 0-1566144 | Loss: 0.666 | 912 ms/step , 6896.90 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 11:07:56 | Epoch: 0 | Step: 147950 | Dataset: 0-1566464 | Loss: 0.735 | 912 ms/step , 6893.15 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 11:08:05 | Epoch: 0 | Step: 147960 | Dataset: 0-1566784 | Loss: 0.833 | 913 ms/step , 6890.86 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 11:08:14 | Epoch: 0 | Step: 147970 | Dataset: 0-1567104 | Loss: 0.665 | 913 ms/step , 6885.31 GFLOP/s , 
17938.3 tokens/s INFO:__main__:2024-11-05 11:08:23 | Epoch: 0 | Step: 147980 | Dataset: 0-1567424 | Loss: 0.697 | 912 ms/step , 6892.77 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 11:08:33 | Epoch: 0 | Step: 147990 | Dataset: 0-1567744 | Loss: 0.722 | 913 ms/step , 6889.09 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 11:08:42 | Epoch: 0 | Step: 148000 | Dataset: 0-1568064 | Loss: 0.763 | 914 ms/step , 6884.19 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 11:08:43 | Validation | Step: 148000 | Val_loss: 0.707 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:08:43 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_110843_step_148000.pt` INFO:__main__:2024-11-05 11:08:54 | Epoch: 0 | Step: 148010 | Dataset: 0-1568384 | Loss: 0.756 | 913 ms/step , 6889.77 GFLOP/s , 13767.5 tokens/s INFO:__main__:2024-11-05 11:09:03 | Epoch: 0 | Step: 148020 | Dataset: 0-1568704 | Loss: 0.697 | 912 ms/step , 6898.45 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 11:09:12 | Epoch: 0 | Step: 148030 | Dataset: 0-1569024 | Loss: 0.630 | 914 ms/step , 6883.64 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 11:09:21 | Epoch: 0 | Step: 148040 | Dataset: 0-1569344 | Loss: 0.660 | 914 ms/step , 6880.08 GFLOP/s , 17890.9 tokens/s INFO:__main__:2024-11-05 11:09:30 | Epoch: 0 | Step: 148050 | Dataset: 0-1569664 | Loss: 0.630 | 914 ms/step , 6883.37 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 11:09:39 | Epoch: 0 | Step: 148060 | Dataset: 0-1569984 | Loss: 0.728 | 913 ms/step , 6886.88 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 11:09:48 | Epoch: 0 | Step: 148070 | Dataset: 0-1570304 | Loss: 0.716 | 914 ms/step , 6880.60 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 11:09:58 | Epoch: 0 | Step: 148080 | Dataset: 0-1570624 | Loss: 0.737 | 913 ms/step , 6890.83 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 11:10:07 | Epoch: 0 | Step: 148090 | Dataset: 0-1570944 | Loss: 0.532 | 913 ms/step , 6888.32 GFLOP/s , 
17932.5 tokens/s INFO:__main__:2024-11-05 11:10:16 | Epoch: 0 | Step: 148100 | Dataset: 0-1571264 | Loss: 0.797 | 914 ms/step , 6879.44 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 11:10:17 | Validation | Step: 148100 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:10:27 | Epoch: 0 | Step: 148110 | Dataset: 0-1571584 | Loss: 0.743 | 914 ms/step , 6882.15 GFLOP/s , 15268.2 tokens/s INFO:__main__:2024-11-05 11:10:36 | Epoch: 0 | Step: 148120 | Dataset: 0-1571904 | Loss: 0.796 | 912 ms/step , 6893.46 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 11:10:45 | Epoch: 0 | Step: 148130 | Dataset: 0-1572224 | Loss: 0.629 | 913 ms/step , 6890.11 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 11:10:54 | Epoch: 0 | Step: 148140 | Dataset: 0-1572544 | Loss: 0.706 | 912 ms/step , 6892.82 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 11:11:03 | Epoch: 0 | Step: 148150 | Dataset: 0-1572864 | Loss: 0.745 | 913 ms/step , 6890.34 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 11:11:12 | Epoch: 0 | Step: 148160 | Dataset: 0-1573184 | Loss: 0.803 | 914 ms/step , 6881.75 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 11:11:21 | Epoch: 0 | Step: 148170 | Dataset: 0-1573504 | Loss: 0.819 | 913 ms/step , 6890.09 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 11:11:31 | Epoch: 0 | Step: 148180 | Dataset: 0-1573824 | Loss: 0.697 | 913 ms/step , 6891.57 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 11:11:40 | Epoch: 0 | Step: 148190 | Dataset: 0-1574144 | Loss: 0.811 | 913 ms/step , 6888.67 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 11:11:49 | Epoch: 0 | Step: 148200 | Dataset: 0-1574464 | Loss: 0.720 | 913 ms/step , 6890.57 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 11:11:50 | Validation | Step: 148200 | Val_loss: 0.715 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:12:00 | Epoch: 0 | Step: 148210 | Dataset: 0-1574784 | Loss: 0.785 | 913 ms/step , 6891.37 GFLOP/s , 15275.7 tokens/s 
INFO:__main__:2024-11-05 11:12:09 | Epoch: 0 | Step: 148220 | Dataset: 0-1575104 | Loss: 0.834 | 915 ms/step , 6875.98 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 11:12:18 | Epoch: 0 | Step: 148230 | Dataset: 0-1575424 | Loss: 0.706 | 912 ms/step , 6894.40 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 11:12:27 | Epoch: 0 | Step: 148240 | Dataset: 0-1575744 | Loss: 0.719 | 913 ms/step , 6887.31 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 11:12:36 | Epoch: 0 | Step: 148250 | Dataset: 0-1576064 | Loss: 0.647 | 914 ms/step , 6885.02 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 11:12:45 | Epoch: 0 | Step: 148260 | Dataset: 0-1576384 | Loss: 0.668 | 913 ms/step , 6886.30 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 11:12:54 | Epoch: 0 | Step: 148270 | Dataset: 0-1576704 | Loss: 0.765 | 913 ms/step , 6891.05 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 11:13:03 | Epoch: 0 | Step: 148280 | Dataset: 0-1577024 | Loss: 0.767 | 914 ms/step , 6881.66 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 11:13:13 | Epoch: 0 | Step: 148290 | Dataset: 0-1577344 | Loss: 0.843 | 915 ms/step , 6876.84 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 11:13:22 | Epoch: 0 | Step: 148300 | Dataset: 0-1577664 | Loss: 0.731 | 914 ms/step , 6882.56 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 11:13:23 | Validation | Step: 148300 | Val_loss: 0.721 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:13:32 | Epoch: 0 | Step: 148310 | Dataset: 0-1577984 | Loss: 0.759 | 913 ms/step , 6890.81 GFLOP/s , 15282.3 tokens/s INFO:__main__:2024-11-05 11:13:42 | Epoch: 0 | Step: 148320 | Dataset: 0-1578304 | Loss: 0.774 | 913 ms/step , 6892.43 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 11:13:51 | Epoch: 0 | Step: 148330 | Dataset: 0-1578624 | Loss: 0.864 | 913 ms/step , 6887.12 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 11:14:00 | Epoch: 0 | Step: 148340 | Dataset: 0-1578944 | Loss: 0.702 | 912 ms/step , 6894.36 GFLOP/s , 17937.5 
tokens/s INFO:__main__:2024-11-05 11:14:09 | Epoch: 0 | Step: 148350 | Dataset: 0-1579264 | Loss: 0.682 | 913 ms/step , 6886.58 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 11:14:18 | Epoch: 0 | Step: 148360 | Dataset: 0-1579584 | Loss: 0.670 | 913 ms/step , 6887.41 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 11:14:27 | Epoch: 0 | Step: 148370 | Dataset: 0-1579904 | Loss: 0.761 | 913 ms/step , 6887.92 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 11:14:36 | Epoch: 0 | Step: 148380 | Dataset: 0-1580224 | Loss: 0.775 | 914 ms/step , 6884.39 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 11:14:46 | Epoch: 0 | Step: 148390 | Dataset: 0-1580544 | Loss: 0.727 | 912 ms/step , 6899.32 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 11:14:55 | Epoch: 0 | Step: 148400 | Dataset: 0-1580864 | Loss: 0.805 | 913 ms/step , 6886.69 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 11:14:56 | Validation | Step: 148400 | Val_loss: 0.710 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:15:05 | Epoch: 0 | Step: 148410 | Dataset: 0-1581184 | Loss: 0.743 | 913 ms/step , 6885.90 GFLOP/s , 15285.5 tokens/s INFO:__main__:2024-11-05 11:15:15 | Epoch: 0 | Step: 148420 | Dataset: 0-1581504 | Loss: 0.637 | 912 ms/step , 6893.46 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 11:15:24 | Epoch: 0 | Step: 148430 | Dataset: 0-1581824 | Loss: 0.738 | 913 ms/step , 6888.52 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 11:15:33 | Epoch: 0 | Step: 148440 | Dataset: 0-1582144 | Loss: 0.785 | 913 ms/step , 6888.50 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 11:15:42 | Epoch: 0 | Step: 148450 | Dataset: 0-1582464 | Loss: 0.703 | 914 ms/step , 6883.42 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 11:15:51 | Epoch: 0 | Step: 148460 | Dataset: 0-1582784 | Loss: 0.655 | 914 ms/step , 6883.81 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 11:16:00 | Epoch: 0 | Step: 148470 | Dataset: 0-1583104 | Loss: 0.751 | 915 ms/step , 6876.89 GFLOP/s , 
17942.9 tokens/s INFO:__main__:2024-11-05 11:16:09 | Epoch: 0 | Step: 148480 | Dataset: 0-1583424 | Loss: 0.667 | 913 ms/step , 6890.24 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 11:16:18 | Epoch: 0 | Step: 148490 | Dataset: 0-1583744 | Loss: 0.825 | 914 ms/step , 6879.30 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 11:16:28 | Epoch: 0 | Step: 148500 | Dataset: 0-1584064 | Loss: 0.725 | 913 ms/step , 6892.20 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 11:16:29 | Validation | Step: 148500 | Val_loss: 0.729 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:16:38 | Epoch: 0 | Step: 148510 | Dataset: 0-1584384 | Loss: 0.742 | 913 ms/step , 6891.37 GFLOP/s , 15281.4 tokens/s INFO:__main__:2024-11-05 11:16:47 | Epoch: 0 | Step: 148520 | Dataset: 0-1584704 | Loss: 0.735 | 912 ms/step , 6894.69 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 11:16:57 | Epoch: 0 | Step: 148530 | Dataset: 0-1585024 | Loss: 0.706 | 912 ms/step , 6896.31 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 11:17:06 | Epoch: 0 | Step: 148540 | Dataset: 0-1585344 | Loss: 0.711 | 912 ms/step , 6897.70 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 11:17:15 | Epoch: 0 | Step: 148550 | Dataset: 0-1585664 | Loss: 0.704 | 915 ms/step , 6876.69 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 11:17:24 | Epoch: 0 | Step: 148560 | Dataset: 0-1585984 | Loss: 0.858 | 913 ms/step , 6890.28 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 11:17:33 | Epoch: 0 | Step: 148570 | Dataset: 0-1586304 | Loss: 0.676 | 913 ms/step , 6886.34 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 11:17:42 | Epoch: 0 | Step: 148580 | Dataset: 0-1586624 | Loss: 0.765 | 914 ms/step , 6882.77 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 11:17:51 | Epoch: 0 | Step: 148590 | Dataset: 0-1586944 | Loss: 0.788 | 914 ms/step , 6879.53 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 11:18:01 | Epoch: 0 | Step: 148600 | Dataset: 0-1587264 | Loss: 0.708 | 914 ms/step , 6884.54 GFLOP/s 
, 17931.9 tokens/s INFO:__main__:2024-11-05 11:18:02 | Validation | Step: 148600 | Val_loss: 0.791 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:18:11 | Epoch: 0 | Step: 148610 | Dataset: 0-1587584 | Loss: 0.642 | 912 ms/step , 6894.83 GFLOP/s , 15287.8 tokens/s INFO:__main__:2024-11-05 11:18:20 | Epoch: 0 | Step: 148620 | Dataset: 0-1587904 | Loss: 0.722 | 913 ms/step , 6891.05 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 11:18:30 | Epoch: 0 | Step: 148630 | Dataset: 0-1588224 | Loss: 0.701 | 913 ms/step , 6892.40 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 11:18:39 | Epoch: 0 | Step: 148640 | Dataset: 0-1588544 | Loss: 0.707 | 914 ms/step , 6878.39 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 11:18:48 | Epoch: 0 | Step: 148650 | Dataset: 0-1588864 | Loss: 0.705 | 913 ms/step , 6887.99 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 11:18:57 | Epoch: 0 | Step: 148660 | Dataset: 0-1589184 | Loss: 0.646 | 912 ms/step , 6893.63 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 11:19:06 | Epoch: 0 | Step: 148670 | Dataset: 0-1589504 | Loss: 0.803 | 914 ms/step , 6884.25 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 11:19:15 | Epoch: 0 | Step: 148680 | Dataset: 0-1589824 | Loss: 0.847 | 914 ms/step , 6883.50 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 11:19:24 | Epoch: 0 | Step: 148690 | Dataset: 0-1590144 | Loss: 0.630 | 913 ms/step , 6891.27 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 11:19:34 | Epoch: 0 | Step: 148700 | Dataset: 0-1590464 | Loss: 0.713 | 914 ms/step , 6883.55 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 11:19:35 | Validation | Step: 148700 | Val_loss: 0.693 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:19:44 | Epoch: 0 | Step: 148710 | Dataset: 0-1590784 | Loss: 0.766 | 912 ms/step , 6895.31 GFLOP/s , 15274.8 tokens/s INFO:__main__:2024-11-05 11:19:53 | Epoch: 0 | Step: 148720 | Dataset: 0-1591104 | Loss: 0.702 | 913 ms/step , 6889.47 GFLOP/s , 17928.1 tokens/s 
INFO:__main__:2024-11-05 11:20:03 | Epoch: 0 | Step: 148730 | Dataset: 0-1591424 | Loss: 0.744 | 913 ms/step , 6889.30 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 11:20:12 | Epoch: 0 | Step: 148740 | Dataset: 0-1591744 | Loss: 0.763 | 913 ms/step , 6891.94 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 11:20:21 | Epoch: 0 | Step: 148750 | Dataset: 0-1592064 | Loss: 0.748 | 912 ms/step , 6896.98 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 11:20:30 | Epoch: 0 | Step: 148760 | Dataset: 0-1592384 | Loss: 0.693 | 913 ms/step , 6885.90 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 11:20:39 | Epoch: 0 | Step: 148770 | Dataset: 0-1592704 | Loss: 0.740 | 913 ms/step , 6891.58 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 11:20:48 | Epoch: 0 | Step: 148780 | Dataset: 0-1593024 | Loss: 0.703 | 914 ms/step , 6884.38 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 11:20:57 | Epoch: 0 | Step: 148790 | Dataset: 0-1593344 | Loss: 0.721 | 912 ms/step , 6895.98 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 11:21:06 | Epoch: 0 | Step: 148800 | Dataset: 0-1593664 | Loss: 0.712 | 915 ms/step , 6874.83 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 11:21:08 | Validation | Step: 148800 | Val_loss: 0.740 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:21:17 | Epoch: 0 | Step: 148810 | Dataset: 0-1593984 | Loss: 0.740 | 913 ms/step , 6885.42 GFLOP/s , 15278.0 tokens/s INFO:__main__:2024-11-05 11:21:26 | Epoch: 0 | Step: 148820 | Dataset: 0-1594304 | Loss: 0.791 | 913 ms/step , 6886.35 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 11:21:35 | Epoch: 0 | Step: 148830 | Dataset: 0-1594624 | Loss: 0.746 | 913 ms/step , 6891.55 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 11:21:45 | Epoch: 0 | Step: 148840 | Dataset: 0-1594944 | Loss: 0.747 | 912 ms/step , 6897.68 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 11:21:54 | Epoch: 0 | Step: 148850 | Dataset: 0-1595264 | Loss: 0.749 | 914 ms/step , 6884.77 GFLOP/s , 17925.9 
tokens/s INFO:__main__:2024-11-05 11:22:03 | Epoch: 0 | Step: 148860 | Dataset: 0-1595584 | Loss: 0.796 | 916 ms/step , 6865.31 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 11:22:12 | Epoch: 0 | Step: 148870 | Dataset: 0-1595904 | Loss: 0.822 | 913 ms/step , 6885.80 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 11:22:21 | Epoch: 0 | Step: 148880 | Dataset: 0-1596224 | Loss: 0.771 | 912 ms/step , 6893.29 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 11:22:30 | Epoch: 0 | Step: 148890 | Dataset: 0-1596544 | Loss: 0.673 | 913 ms/step , 6891.63 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 11:22:39 | Epoch: 0 | Step: 148900 | Dataset: 0-1596864 | Loss: 0.778 | 912 ms/step , 6892.89 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 11:22:41 | Validation | Step: 148900 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:22:50 | Epoch: 0 | Step: 148910 | Dataset: 0-1597184 | Loss: 0.705 | 913 ms/step , 6886.47 GFLOP/s , 15281.0 tokens/s INFO:__main__:2024-11-05 11:22:59 | Epoch: 0 | Step: 148920 | Dataset: 0-1597504 | Loss: 0.740 | 913 ms/step , 6886.70 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 11:23:08 | Epoch: 0 | Step: 148930 | Dataset: 0-1597824 | Loss: 0.752 | 914 ms/step , 6883.74 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 11:23:18 | Epoch: 0 | Step: 148940 | Dataset: 0-1598144 | Loss: 0.824 | 914 ms/step , 6877.60 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 11:23:27 | Epoch: 0 | Step: 148950 | Dataset: 0-1598464 | Loss: 0.759 | 913 ms/step , 6887.60 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 11:23:36 | Epoch: 0 | Step: 148960 | Dataset: 0-1598784 | Loss: 0.817 | 913 ms/step , 6886.02 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 11:23:45 | Epoch: 0 | Step: 148970 | Dataset: 0-1599104 | Loss: 0.777 | 913 ms/step , 6887.71 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 11:23:54 | Epoch: 0 | Step: 148980 | Dataset: 0-1599424 | Loss: 0.621 | 912 ms/step , 6896.65 GFLOP/s , 
17935.0 tokens/s INFO:__main__:2024-11-05 11:24:03 | Epoch: 0 | Step: 148990 | Dataset: 0-1599744 | Loss: 0.692 | 913 ms/step , 6890.47 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 11:24:12 | Epoch: 0 | Step: 149000 | Dataset: 0-1600064 | Loss: 0.765 | 913 ms/step , 6887.71 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 11:24:14 | Validation | Step: 149000 | Val_loss: 0.756 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:24:14 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_112414_step_149000.pt` INFO:__main__:2024-11-05 11:24:24 | Epoch: 0 | Step: 149010 | Dataset: 0-1600384 | Loss: 0.655 | 919 ms/step , 6844.30 GFLOP/s , 13807.5 tokens/s INFO:__main__:2024-11-05 11:24:33 | Epoch: 0 | Step: 149020 | Dataset: 0-1600704 | Loss: 0.649 | 913 ms/step , 6887.84 GFLOP/s , 17889.8 tokens/s INFO:__main__:2024-11-05 11:24:43 | Epoch: 0 | Step: 149030 | Dataset: 0-1601024 | Loss: 0.784 | 915 ms/step , 6874.93 GFLOP/s , 17904.8 tokens/s INFO:__main__:2024-11-05 11:24:52 | Epoch: 0 | Step: 149040 | Dataset: 0-1601344 | Loss: 0.703 | 915 ms/step , 6876.47 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 11:25:01 | Epoch: 0 | Step: 149050 | Dataset: 0-1601664 | Loss: 0.726 | 912 ms/step , 6896.34 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 11:25:10 | Epoch: 0 | Step: 149060 | Dataset: 0-1601984 | Loss: 0.743 | 913 ms/step , 6886.27 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 11:25:19 | Epoch: 0 | Step: 149070 | Dataset: 0-1602304 | Loss: 0.768 | 913 ms/step , 6889.81 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 11:25:28 | Epoch: 0 | Step: 149080 | Dataset: 0-1602624 | Loss: 0.635 | 913 ms/step , 6887.41 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 11:25:37 | Epoch: 0 | Step: 149090 | Dataset: 0-1602944 | Loss: 0.665 | 914 ms/step , 6879.08 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 11:25:46 | Epoch: 0 | Step: 149100 | Dataset: 0-1603264 | Loss: 0.724 | 913 ms/step , 6886.52 GFLOP/s , 
17933.5 tokens/s INFO:__main__:2024-11-05 11:25:48 | Validation | Step: 149100 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:25:57 | Epoch: 0 | Step: 149110 | Dataset: 0-1603584 | Loss: 0.732 | 913 ms/step , 6888.74 GFLOP/s , 15271.1 tokens/s INFO:__main__:2024-11-05 11:26:06 | Epoch: 0 | Step: 149120 | Dataset: 0-1603904 | Loss: 0.735 | 913 ms/step , 6887.54 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 11:26:15 | Epoch: 0 | Step: 149130 | Dataset: 0-1604224 | Loss: 0.723 | 912 ms/step , 6896.53 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 11:26:25 | Epoch: 0 | Step: 149140 | Dataset: 0-1604544 | Loss: 0.783 | 914 ms/step , 6881.60 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 11:26:34 | Epoch: 0 | Step: 149150 | Dataset: 0-1604864 | Loss: 0.751 | 913 ms/step , 6888.48 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 11:26:43 | Epoch: 0 | Step: 149160 | Dataset: 0-1605184 | Loss: 0.670 | 912 ms/step , 6892.96 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 11:26:52 | Epoch: 0 | Step: 149170 | Dataset: 0-1605504 | Loss: 0.772 | 913 ms/step , 6889.78 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 11:27:01 | Epoch: 0 | Step: 149180 | Dataset: 0-1605824 | Loss: 0.726 | 913 ms/step , 6889.03 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 11:27:10 | Epoch: 0 | Step: 149190 | Dataset: 0-1606144 | Loss: 0.698 | 913 ms/step , 6888.00 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 11:27:19 | Epoch: 0 | Step: 149200 | Dataset: 0-1606464 | Loss: 0.695 | 913 ms/step , 6887.52 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 11:27:21 | Validation | Step: 149200 | Val_loss: 0.730 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:27:30 | Epoch: 0 | Step: 149210 | Dataset: 0-1606784 | Loss: 0.661 | 912 ms/step , 6898.94 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 11:27:39 | Epoch: 0 | Step: 149220 | Dataset: 0-1607104 | Loss: 0.682 | 914 ms/step , 6880.23 GFLOP/s , 17925.9 tokens/s 
INFO:__main__:2024-11-05 11:27:48 | Epoch: 0 | Step: 149230 | Dataset: 0-1607424 | Loss: 0.737 | 913 ms/step , 6888.65 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 11:27:58 | Epoch: 0 | Step: 149240 | Dataset: 0-1607744 | Loss: 0.593 | 913 ms/step , 6891.92 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 11:28:07 | Epoch: 0 | Step: 149250 | Dataset: 0-1608064 | Loss: 0.788 | 914 ms/step , 6884.92 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 11:28:16 | Epoch: 0 | Step: 149260 | Dataset: 0-1608384 | Loss: 0.675 | 914 ms/step , 6879.48 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 11:28:25 | Epoch: 0 | Step: 149270 | Dataset: 0-1608704 | Loss: 0.816 | 914 ms/step , 6880.05 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 11:28:34 | Epoch: 0 | Step: 149280 | Dataset: 0-1609024 | Loss: 0.723 | 913 ms/step , 6891.85 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 11:28:43 | Epoch: 0 | Step: 149290 | Dataset: 0-1609344 | Loss: 0.682 | 914 ms/step , 6884.49 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 11:28:52 | Epoch: 0 | Step: 149300 | Dataset: 0-1609664 | Loss: 0.662 | 914 ms/step , 6878.93 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 11:28:54 | Validation | Step: 149300 | Val_loss: 0.765 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:29:03 | Epoch: 0 | Step: 149310 | Dataset: 0-1609984 | Loss: 0.709 | 912 ms/step , 6895.55 GFLOP/s , 15274.9 tokens/s INFO:__main__:2024-11-05 11:29:12 | Epoch: 0 | Step: 149320 | Dataset: 0-1610304 | Loss: 0.755 | 913 ms/step , 6892.39 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 11:29:21 | Epoch: 0 | Step: 149330 | Dataset: 0-1610624 | Loss: 0.718 | 913 ms/step , 6890.79 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 11:29:31 | Epoch: 0 | Step: 149340 | Dataset: 0-1610944 | Loss: 0.807 | 914 ms/step , 6878.58 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 11:29:40 | Epoch: 0 | Step: 149350 | Dataset: 0-1611264 | Loss: 0.725 | 913 ms/step , 6891.90 GFLOP/s , 17925.6 
tokens/s INFO:__main__:2024-11-05 11:29:49 | Epoch: 0 | Step: 149360 | Dataset: 0-1611584 | Loss: 0.653 | 912 ms/step , 6893.05 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 11:29:58 | Epoch: 0 | Step: 149370 | Dataset: 0-1611904 | Loss: 0.742 | 914 ms/step , 6881.48 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 11:30:07 | Epoch: 0 | Step: 149380 | Dataset: 0-1612224 | Loss: 0.758 | 915 ms/step , 6875.42 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 11:30:16 | Epoch: 0 | Step: 149390 | Dataset: 0-1612544 | Loss: 0.648 | 913 ms/step , 6891.83 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 11:30:25 | Epoch: 0 | Step: 149400 | Dataset: 0-1612864 | Loss: 0.781 | 913 ms/step , 6892.04 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 11:30:27 | Validation | Step: 149400 | Val_loss: 0.688 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:30:36 | Epoch: 0 | Step: 149410 | Dataset: 0-1613184 | Loss: 0.663 | 913 ms/step , 6887.02 GFLOP/s , 15274.8 tokens/s INFO:__main__:2024-11-05 11:30:45 | Epoch: 0 | Step: 149420 | Dataset: 0-1613504 | Loss: 0.622 | 912 ms/step , 6897.03 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 11:30:54 | Epoch: 0 | Step: 149430 | Dataset: 0-1613824 | Loss: 0.748 | 913 ms/step , 6885.89 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 11:31:04 | Epoch: 0 | Step: 149440 | Dataset: 0-1614144 | Loss: 0.749 | 913 ms/step , 6889.01 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 11:31:13 | Epoch: 0 | Step: 149450 | Dataset: 0-1614464 | Loss: 0.711 | 914 ms/step , 6879.36 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 11:31:22 | Epoch: 0 | Step: 149460 | Dataset: 0-1614784 | Loss: 0.833 | 912 ms/step , 6893.31 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 11:31:31 | Epoch: 0 | Step: 149470 | Dataset: 0-1615104 | Loss: 0.775 | 914 ms/step , 6882.35 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 11:31:40 | Epoch: 0 | Step: 149480 | Dataset: 0-1615424 | Loss: 0.755 | 913 ms/step , 6888.71 GFLOP/s , 
17924.8 tokens/s INFO:__main__:2024-11-05 11:31:49 | Epoch: 0 | Step: 149490 | Dataset: 0-1615744 | Loss: 0.785 | 914 ms/step , 6880.89 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 11:31:58 | Epoch: 0 | Step: 149500 | Dataset: 0-1616064 | Loss: 0.865 | 913 ms/step , 6889.85 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 11:32:00 | Validation | Step: 149500 | Val_loss: 0.783 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:32:09 | Epoch: 0 | Step: 149510 | Dataset: 0-1616384 | Loss: 0.841 | 913 ms/step , 6888.12 GFLOP/s , 15281.4 tokens/s INFO:__main__:2024-11-05 11:32:18 | Epoch: 0 | Step: 149520 | Dataset: 0-1616704 | Loss: 0.697 | 912 ms/step , 6894.90 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 11:32:27 | Epoch: 0 | Step: 149530 | Dataset: 0-1617024 | Loss: 0.715 | 912 ms/step , 6895.25 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 11:32:36 | Epoch: 0 | Step: 149540 | Dataset: 0-1617344 | Loss: 0.690 | 913 ms/step , 6891.33 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 11:32:46 | Epoch: 0 | Step: 149550 | Dataset: 0-1617664 | Loss: 0.734 | 913 ms/step , 6892.20 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 11:32:55 | Epoch: 0 | Step: 149560 | Dataset: 0-1617984 | Loss: 0.737 | 912 ms/step , 6893.87 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 11:33:04 | Epoch: 0 | Step: 149570 | Dataset: 0-1618304 | Loss: 0.734 | 913 ms/step , 6890.02 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 11:33:13 | Epoch: 0 | Step: 149580 | Dataset: 0-1618624 | Loss: 0.679 | 913 ms/step , 6885.14 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 11:33:22 | Epoch: 0 | Step: 149590 | Dataset: 0-1618944 | Loss: 0.643 | 912 ms/step , 6897.38 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 11:33:31 | Epoch: 0 | Step: 149600 | Dataset: 0-1619264 | Loss: 0.724 | 914 ms/step , 6878.42 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 11:33:33 | Validation | Step: 149600 | Val_loss: 0.734 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 11:33:42 | Epoch: 0 | Step: 149610 | Dataset: 0-1619584 | Loss: 0.791 | 914 ms/step , 6882.57 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-05 11:33:51 | Epoch: 0 | Step: 149620 | Dataset: 0-1619904 | Loss: 0.681 | 913 ms/step , 6888.66 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 11:34:00 | Epoch: 0 | Step: 149630 | Dataset: 0-1620224 | Loss: 0.734 | 914 ms/step , 6881.71 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 11:34:09 | Epoch: 0 | Step: 149640 | Dataset: 0-1620544 | Loss: 0.793 | 913 ms/step , 6887.68 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 11:34:19 | Epoch: 0 | Step: 149650 | Dataset: 0-1620864 | Loss: 0.729 | 913 ms/step , 6885.96 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 11:34:28 | Epoch: 0 | Step: 149660 | Dataset: 0-1621184 | Loss: 0.729 | 913 ms/step , 6885.20 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 11:34:37 | Epoch: 0 | Step: 149670 | Dataset: 0-1621504 | Loss: 0.768 | 914 ms/step , 6879.44 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 11:34:46 | Epoch: 0 | Step: 149680 | Dataset: 0-1621824 | Loss: 0.775 | 913 ms/step , 6886.51 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 11:34:55 | Epoch: 0 | Step: 149690 | Dataset: 0-1622144 | Loss: 0.700 | 913 ms/step , 6885.12 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 11:35:04 | Epoch: 0 | Step: 149700 | Dataset: 0-1622464 | Loss: 0.741 | 913 ms/step , 6889.80 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 11:35:06 | Validation | Step: 149700 | Val_loss: 0.769 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:35:15 | Epoch: 0 | Step: 149710 | Dataset: 0-1622784 | Loss: 0.789 | 914 ms/step , 6884.65 GFLOP/s , 15283.2 tokens/s INFO:__main__:2024-11-05 11:35:24 | Epoch: 0 | Step: 149720 | Dataset: 0-1623104 | Loss: 0.754 | 915 ms/step , 6873.06 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 11:35:33 | Epoch: 0 | Step: 149730 | Dataset: 0-1623424 | Loss: 0.781 | 912 ms/step , 6897.49 GFLOP/s , 17939.8 
tokens/s INFO:__main__:2024-11-05 11:35:42 | Epoch: 0 | Step: 149740 | Dataset: 0-1623744 | Loss: 0.672 | 914 ms/step , 6884.88 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 11:35:52 | Epoch: 0 | Step: 149750 | Dataset: 0-1624064 | Loss: 0.689 | 914 ms/step , 6882.65 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 11:36:01 | Epoch: 0 | Step: 149760 | Dataset: 0-1624384 | Loss: 0.797 | 913 ms/step , 6889.12 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 11:36:10 | Epoch: 0 | Step: 149770 | Dataset: 0-1624704 | Loss: 0.826 | 913 ms/step , 6887.24 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 11:36:19 | Epoch: 0 | Step: 149780 | Dataset: 0-1625024 | Loss: 0.673 | 913 ms/step , 6890.30 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 11:36:28 | Epoch: 0 | Step: 149790 | Dataset: 0-1625344 | Loss: 0.690 | 913 ms/step , 6887.80 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 11:36:37 | Epoch: 0 | Step: 149800 | Dataset: 0-1625664 | Loss: 0.756 | 914 ms/step , 6878.81 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 11:36:39 | Validation | Step: 149800 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:36:48 | Epoch: 0 | Step: 149810 | Dataset: 0-1625984 | Loss: 0.709 | 914 ms/step , 6883.92 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-05 11:36:57 | Epoch: 0 | Step: 149820 | Dataset: 0-1626304 | Loss: 0.754 | 915 ms/step , 6874.46 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 11:37:06 | Epoch: 0 | Step: 149830 | Dataset: 0-1626624 | Loss: 0.656 | 913 ms/step , 6888.21 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 11:37:15 | Epoch: 0 | Step: 149840 | Dataset: 0-1626944 | Loss: 0.808 | 914 ms/step , 6884.06 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 11:37:25 | Epoch: 0 | Step: 149850 | Dataset: 0-1627264 | Loss: 0.696 | 913 ms/step , 6889.33 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 11:37:34 | Epoch: 0 | Step: 149860 | Dataset: 0-1627584 | Loss: 0.733 | 912 ms/step , 6897.43 GFLOP/s , 
17926.9 tokens/s INFO:__main__:2024-11-05 11:37:43 | Epoch: 0 | Step: 149870 | Dataset: 0-1627904 | Loss: 0.795 | 913 ms/step , 6886.74 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 11:37:52 | Epoch: 0 | Step: 149880 | Dataset: 0-1628224 | Loss: 0.710 | 913 ms/step , 6889.45 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 11:38:01 | Epoch: 0 | Step: 149890 | Dataset: 0-1628544 | Loss: 0.732 | 913 ms/step , 6887.87 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 11:38:10 | Epoch: 0 | Step: 149900 | Dataset: 0-1628864 | Loss: 0.698 | 915 ms/step , 6872.31 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 11:38:12 | Validation | Step: 149900 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:38:21 | Epoch: 0 | Step: 149910 | Dataset: 0-1629184 | Loss: 0.697 | 913 ms/step , 6888.20 GFLOP/s , 15274.7 tokens/s INFO:__main__:2024-11-05 11:38:30 | Epoch: 0 | Step: 149920 | Dataset: 0-1629504 | Loss: 0.723 | 913 ms/step , 6889.47 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 11:38:39 | Epoch: 0 | Step: 149930 | Dataset: 0-1629824 | Loss: 0.826 | 915 ms/step , 6876.78 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 11:38:48 | Epoch: 0 | Step: 149940 | Dataset: 0-1630144 | Loss: 0.697 | 913 ms/step , 6887.93 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 11:38:58 | Epoch: 0 | Step: 149950 | Dataset: 0-1630464 | Loss: 0.676 | 915 ms/step , 6875.72 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 11:39:07 | Epoch: 0 | Step: 149960 | Dataset: 0-1630784 | Loss: 0.673 | 913 ms/step , 6891.75 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 11:39:16 | Epoch: 0 | Step: 149970 | Dataset: 0-1631104 | Loss: 0.783 | 914 ms/step , 6884.39 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 11:39:25 | Epoch: 0 | Step: 149980 | Dataset: 0-1631424 | Loss: 0.836 | 914 ms/step , 6879.63 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 11:39:34 | Epoch: 0 | Step: 149990 | Dataset: 0-1631744 | Loss: 0.855 | 914 ms/step , 6878.60 GFLOP/s 
, 17927.0 tokens/s INFO:__main__:2024-11-05 11:39:43 | Epoch: 0 | Step: 150000 | Dataset: 0-1632064 | Loss: 0.728 | 913 ms/step , 6890.36 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 11:39:45 | Validation | Step: 150000 | Val_loss: 0.734 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:39:45 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_113945_step_150000.pt` INFO:__main__:2024-11-05 11:39:55 | Epoch: 0 | Step: 150010 | Dataset: 0-1632384 | Loss: 0.695 | 925 ms/step , 6797.93 GFLOP/s , 13798.3 tokens/s INFO:__main__:2024-11-05 11:40:04 | Epoch: 0 | Step: 150020 | Dataset: 0-1632704 | Loss: 0.704 | 913 ms/step , 6885.21 GFLOP/s , 17899.5 tokens/s INFO:__main__:2024-11-05 11:40:13 | Epoch: 0 | Step: 150030 | Dataset: 0-1633024 | Loss: 0.648 | 914 ms/step , 6877.97 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 11:40:22 | Epoch: 0 | Step: 150040 | Dataset: 0-1633344 | Loss: 0.729 | 915 ms/step , 6876.54 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 11:40:32 | Epoch: 0 | Step: 150050 | Dataset: 0-1633664 | Loss: 0.823 | 916 ms/step , 6866.34 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 11:40:41 | Epoch: 0 | Step: 150060 | Dataset: 0-1633984 | Loss: 0.734 | 914 ms/step , 6883.03 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 11:40:50 | Epoch: 0 | Step: 150070 | Dataset: 0-1634304 | Loss: 0.718 | 913 ms/step , 6886.19 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 11:40:59 | Epoch: 0 | Step: 150080 | Dataset: 0-1634624 | Loss: 0.724 | 913 ms/step , 6889.73 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 11:41:08 | Epoch: 0 | Step: 150090 | Dataset: 0-1634944 | Loss: 0.709 | 914 ms/step , 6878.23 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 11:41:17 | Epoch: 0 | Step: 150100 | Dataset: 0-1635264 | Loss: 0.778 | 914 ms/step , 6881.29 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 11:41:19 | Validation | Step: 150100 | Val_loss: 0.779 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 11:41:28 | Epoch: 0 | Step: 150110 | Dataset: 0-1635584 | Loss: 0.722 | 913 ms/step , 6888.38 GFLOP/s , 15272.0 tokens/s INFO:__main__:2024-11-05 11:41:37 | Epoch: 0 | Step: 150120 | Dataset: 0-1635904 | Loss: 0.724 | 912 ms/step , 6892.88 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 11:41:46 | Epoch: 0 | Step: 150130 | Dataset: 0-1636224 | Loss: 0.740 | 913 ms/step , 6889.94 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 11:41:55 | Epoch: 0 | Step: 150140 | Dataset: 0-1636544 | Loss: 0.888 | 914 ms/step , 6880.86 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 11:42:05 | Epoch: 0 | Step: 150150 | Dataset: 0-1636864 | Loss: 0.795 | 912 ms/step , 6894.52 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 11:42:14 | Epoch: 0 | Step: 150160 | Dataset: 0-1637184 | Loss: 0.828 | 913 ms/step , 6892.30 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 11:42:23 | Epoch: 0 | Step: 150170 | Dataset: 0-1637504 | Loss: 0.709 | 912 ms/step , 6895.09 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 11:42:32 | Epoch: 0 | Step: 150180 | Dataset: 0-1637824 | Loss: 0.839 | 912 ms/step , 6893.65 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 11:42:41 | Epoch: 0 | Step: 150190 | Dataset: 0-1638144 | Loss: 0.774 | 913 ms/step , 6891.59 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 11:42:50 | Epoch: 0 | Step: 150200 | Dataset: 0-1638464 | Loss: 0.836 | 913 ms/step , 6887.35 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 11:42:52 | Validation | Step: 150200 | Val_loss: 0.756 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:43:01 | Epoch: 0 | Step: 150210 | Dataset: 0-1638784 | Loss: 0.694 | 913 ms/step , 6891.03 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-05 11:43:10 | Epoch: 0 | Step: 150220 | Dataset: 0-1639104 | Loss: 0.783 | 914 ms/step , 6881.47 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 11:43:19 | Epoch: 0 | Step: 150230 | Dataset: 0-1639424 | Loss: 0.780 | 913 ms/step , 6887.26 GFLOP/s , 17931.5 
tokens/s INFO:__main__:2024-11-05 11:43:28 | Epoch: 0 | Step: 150240 | Dataset: 0-1639744 | Loss: 0.855 | 914 ms/step , 6879.50 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 11:43:38 | Epoch: 0 | Step: 150250 | Dataset: 0-1640064 | Loss: 0.640 | 914 ms/step , 6882.30 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 11:43:47 | Epoch: 0 | Step: 150260 | Dataset: 0-1640384 | Loss: 0.762 | 914 ms/step , 6884.19 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 11:43:56 | Epoch: 0 | Step: 150270 | Dataset: 0-1640704 | Loss: 0.691 | 913 ms/step , 6886.94 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 11:44:05 | Epoch: 0 | Step: 150280 | Dataset: 0-1641024 | Loss: 0.650 | 913 ms/step , 6886.74 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 11:44:14 | Epoch: 0 | Step: 150290 | Dataset: 0-1641344 | Loss: 0.708 | 914 ms/step , 6883.48 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 11:44:23 | Epoch: 0 | Step: 150300 | Dataset: 0-1641664 | Loss: 0.778 | 915 ms/step , 6874.73 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 11:44:25 | Validation | Step: 150300 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:44:34 | Epoch: 0 | Step: 150310 | Dataset: 0-1641984 | Loss: 0.791 | 912 ms/step , 6893.45 GFLOP/s , 15280.6 tokens/s INFO:__main__:2024-11-05 11:44:43 | Epoch: 0 | Step: 150320 | Dataset: 0-1642304 | Loss: 0.807 | 912 ms/step , 6896.24 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 11:44:52 | Epoch: 0 | Step: 150330 | Dataset: 0-1642624 | Loss: 0.685 | 913 ms/step , 6891.81 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 11:45:01 | Epoch: 0 | Step: 150340 | Dataset: 0-1642944 | Loss: 0.759 | 913 ms/step , 6885.09 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 11:45:11 | Epoch: 0 | Step: 150350 | Dataset: 0-1643264 | Loss: 0.826 | 913 ms/step , 6887.72 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 11:45:20 | Epoch: 0 | Step: 150360 | Dataset: 0-1643584 | Loss: 0.723 | 913 ms/step , 6885.79 GFLOP/s , 
17939.7 tokens/s INFO:__main__:2024-11-05 11:45:29 | Epoch: 0 | Step: 150370 | Dataset: 0-1643904 | Loss: 0.713 | 913 ms/step , 6891.48 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 11:45:38 | Epoch: 0 | Step: 150380 | Dataset: 0-1644224 | Loss: 0.792 | 913 ms/step , 6888.39 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 11:45:47 | Epoch: 0 | Step: 150390 | Dataset: 0-1644544 | Loss: 0.809 | 912 ms/step , 6897.43 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 11:45:56 | Epoch: 0 | Step: 150400 | Dataset: 0-1644864 | Loss: 0.661 | 912 ms/step , 6895.49 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 11:45:58 | Validation | Step: 150400 | Val_loss: 0.746 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:46:07 | Epoch: 0 | Step: 150410 | Dataset: 0-1645184 | Loss: 0.820 | 914 ms/step , 6883.20 GFLOP/s , 15273.6 tokens/s INFO:__main__:2024-11-05 11:46:16 | Epoch: 0 | Step: 150420 | Dataset: 0-1645504 | Loss: 0.678 | 913 ms/step , 6891.22 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 11:46:25 | Epoch: 0 | Step: 150430 | Dataset: 0-1645824 | Loss: 0.744 | 912 ms/step , 6894.41 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 11:46:34 | Epoch: 0 | Step: 150440 | Dataset: 0-1646144 | Loss: 0.792 | 913 ms/step , 6890.03 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 11:46:43 | Epoch: 0 | Step: 150450 | Dataset: 0-1646464 | Loss: 0.759 | 913 ms/step , 6887.74 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 11:46:53 | Epoch: 0 | Step: 150460 | Dataset: 0-1646784 | Loss: 0.515 | 914 ms/step , 6883.20 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 11:47:02 | Epoch: 0 | Step: 150470 | Dataset: 0-1647104 | Loss: 0.803 | 914 ms/step , 6884.27 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 11:47:11 | Epoch: 0 | Step: 150480 | Dataset: 0-1647424 | Loss: 0.576 | 912 ms/step , 6899.99 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 11:47:20 | Epoch: 0 | Step: 150490 | Dataset: 0-1647744 | Loss: 0.763 | 914 ms/step , 6883.12 GFLOP/s 
, 17933.9 tokens/s INFO:__main__:2024-11-05 11:47:29 | Epoch: 0 | Step: 150500 | Dataset: 0-1648064 | Loss: 0.797 | 913 ms/step , 6887.69 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 11:47:31 | Validation | Step: 150500 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:47:40 | Epoch: 0 | Step: 150510 | Dataset: 0-1648384 | Loss: 0.751 | 912 ms/step , 6894.02 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-05 11:47:49 | Epoch: 0 | Step: 150520 | Dataset: 0-1648704 | Loss: 0.743 | 913 ms/step , 6887.73 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 11:47:58 | Epoch: 0 | Step: 150530 | Dataset: 0-1649024 | Loss: 0.829 | 913 ms/step , 6887.94 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 11:48:07 | Epoch: 0 | Step: 150540 | Dataset: 0-1649344 | Loss: 0.792 | 913 ms/step , 6889.56 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 11:48:16 | Epoch: 0 | Step: 150550 | Dataset: 0-1649664 | Loss: 0.731 | 913 ms/step , 6888.02 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 11:48:26 | Epoch: 0 | Step: 150560 | Dataset: 0-1649984 | Loss: 0.711 | 913 ms/step , 6888.34 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 11:48:35 | Epoch: 0 | Step: 150570 | Dataset: 0-1650304 | Loss: 0.587 | 912 ms/step , 6893.46 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 11:48:44 | Epoch: 0 | Step: 150580 | Dataset: 0-1650624 | Loss: 0.807 | 914 ms/step , 6878.11 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 11:48:53 | Epoch: 0 | Step: 150590 | Dataset: 0-1650944 | Loss: 0.717 | 912 ms/step , 6893.75 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 11:49:02 | Epoch: 0 | Step: 150600 | Dataset: 0-1651264 | Loss: 0.792 | 912 ms/step , 6893.47 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 11:49:04 | Validation | Step: 150600 | Val_loss: 0.728 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:49:13 | Epoch: 0 | Step: 150610 | Dataset: 0-1651584 | Loss: 0.751 | 913 ms/step , 6886.90 GFLOP/s , 15278.9 tokens/s 
INFO:__main__:2024-11-05 11:49:22 | Epoch: 0 | Step: 150620 | Dataset: 0-1651904 | Loss: 0.684 | 914 ms/step , 6884.83 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 11:49:31 | Epoch: 0 | Step: 150630 | Dataset: 0-1652224 | Loss: 0.853 | 913 ms/step , 6886.21 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 11:49:40 | Epoch: 0 | Step: 150640 | Dataset: 0-1652544 | Loss: 0.667 | 912 ms/step , 6896.17 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 11:49:49 | Epoch: 0 | Step: 150650 | Dataset: 0-1652864 | Loss: 0.814 | 913 ms/step , 6892.16 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 11:49:58 | Epoch: 0 | Step: 150660 | Dataset: 0-1653184 | Loss: 0.656 | 912 ms/step , 6895.38 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 11:50:08 | Epoch: 0 | Step: 150670 | Dataset: 0-1653504 | Loss: 0.909 | 912 ms/step , 6893.98 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 11:50:17 | Epoch: 0 | Step: 150680 | Dataset: 0-1653824 | Loss: 0.717 | 914 ms/step , 6882.69 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 11:50:26 | Epoch: 0 | Step: 150690 | Dataset: 0-1654144 | Loss: 0.737 | 913 ms/step , 6885.35 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 11:50:35 | Epoch: 0 | Step: 150700 | Dataset: 0-1654464 | Loss: 0.748 | 912 ms/step , 6895.90 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 11:50:37 | Validation | Step: 150700 | Val_loss: 0.767 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:50:46 | Epoch: 0 | Step: 150710 | Dataset: 0-1654784 | Loss: 0.767 | 914 ms/step , 6881.73 GFLOP/s , 15269.2 tokens/s INFO:__main__:2024-11-05 11:50:55 | Epoch: 0 | Step: 150720 | Dataset: 0-1655104 | Loss: 0.807 | 914 ms/step , 6878.10 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 11:51:04 | Epoch: 0 | Step: 150730 | Dataset: 0-1655424 | Loss: 0.856 | 913 ms/step , 6890.81 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 11:51:13 | Epoch: 0 | Step: 150740 | Dataset: 0-1655744 | Loss: 0.815 | 914 ms/step , 6881.02 GFLOP/s , 17929.9 
tokens/s INFO:__main__:2024-11-05 11:51:22 | Epoch: 0 | Step: 150750 | Dataset: 0-1656064 | Loss: 0.792 | 913 ms/step , 6886.56 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 11:51:31 | Epoch: 0 | Step: 150760 | Dataset: 0-1656384 | Loss: 0.761 | 913 ms/step , 6890.61 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 11:51:41 | Epoch: 0 | Step: 150770 | Dataset: 0-1656704 | Loss: 0.744 | 913 ms/step , 6889.62 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 11:51:50 | Epoch: 0 | Step: 150780 | Dataset: 0-1657024 | Loss: 0.662 | 912 ms/step , 6894.98 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 11:51:59 | Epoch: 0 | Step: 150790 | Dataset: 0-1657344 | Loss: 0.755 | 912 ms/step , 6892.70 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 11:52:08 | Epoch: 0 | Step: 150800 | Dataset: 0-1657664 | Loss: 0.781 | 913 ms/step , 6887.94 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 11:52:10 | Validation | Step: 150800 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:52:19 | Epoch: 0 | Step: 150810 | Dataset: 0-1657984 | Loss: 0.765 | 914 ms/step , 6883.17 GFLOP/s , 15275.9 tokens/s INFO:__main__:2024-11-05 11:52:28 | Epoch: 0 | Step: 150820 | Dataset: 0-1658304 | Loss: 0.761 | 912 ms/step , 6899.14 GFLOP/s , 17951.3 tokens/s INFO:__main__:2024-11-05 11:52:37 | Epoch: 0 | Step: 150830 | Dataset: 0-1658624 | Loss: 0.751 | 913 ms/step , 6885.70 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 11:52:46 | Epoch: 0 | Step: 150840 | Dataset: 0-1658944 | Loss: 0.832 | 913 ms/step , 6890.89 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 11:52:55 | Epoch: 0 | Step: 150850 | Dataset: 0-1659264 | Loss: 0.784 | 912 ms/step , 6895.34 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 11:53:04 | Epoch: 0 | Step: 150860 | Dataset: 0-1659584 | Loss: 0.532 | 913 ms/step , 6886.23 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 11:53:13 | Epoch: 0 | Step: 150870 | Dataset: 0-1659904 | Loss: 0.729 | 911 ms/step , 6900.27 GFLOP/s , 
17937.2 tokens/s INFO:__main__:2024-11-05 11:53:23 | Epoch: 0 | Step: 150880 | Dataset: 0-1660224 | Loss: 0.591 | 913 ms/step , 6888.89 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 11:53:32 | Epoch: 0 | Step: 150890 | Dataset: 0-1660544 | Loss: 0.852 | 914 ms/step , 6881.48 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 11:53:41 | Epoch: 0 | Step: 150900 | Dataset: 0-1660864 | Loss: 0.647 | 913 ms/step , 6889.76 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 11:53:42 | Validation | Step: 150900 | Val_loss: 0.771 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:53:52 | Epoch: 0 | Step: 150910 | Dataset: 0-1661184 | Loss: 0.868 | 913 ms/step , 6885.39 GFLOP/s , 15279.5 tokens/s INFO:__main__:2024-11-05 11:54:01 | Epoch: 0 | Step: 150920 | Dataset: 0-1661504 | Loss: 0.470 | 912 ms/step , 6896.22 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 11:54:10 | Epoch: 0 | Step: 150930 | Dataset: 0-1661824 | Loss: 0.641 | 912 ms/step , 6894.75 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 11:54:19 | Epoch: 0 | Step: 150940 | Dataset: 0-1662144 | Loss: 0.806 | 913 ms/step , 6888.62 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 11:54:28 | Epoch: 0 | Step: 150950 | Dataset: 0-1662464 | Loss: 0.818 | 913 ms/step , 6892.19 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 11:54:37 | Epoch: 0 | Step: 150960 | Dataset: 0-1662784 | Loss: 0.852 | 913 ms/step , 6890.03 GFLOP/s , 17946.2 tokens/s INFO:__main__:2024-11-05 11:54:46 | Epoch: 0 | Step: 150970 | Dataset: 0-1663104 | Loss: 0.682 | 913 ms/step , 6891.89 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 11:54:56 | Epoch: 0 | Step: 150980 | Dataset: 0-1663424 | Loss: 0.810 | 913 ms/step , 6888.70 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 11:55:05 | Epoch: 0 | Step: 150990 | Dataset: 0-1663744 | Loss: 0.662 | 912 ms/step , 6897.04 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 11:55:14 | Epoch: 0 | Step: 151000 | Dataset: 0-1664064 | Loss: 0.685 | 912 ms/step , 6894.23 GFLOP/s 
, 17938.5 tokens/s INFO:__main__:2024-11-05 11:55:15 | Validation | Step: 151000 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:55:15 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_115515_step_151000.pt` INFO:__main__:2024-11-05 11:55:26 | Epoch: 0 | Step: 151010 | Dataset: 0-1664384 | Loss: 0.654 | 912 ms/step , 6893.32 GFLOP/s , 13773.2 tokens/s INFO:__main__:2024-11-05 11:55:35 | Epoch: 0 | Step: 151020 | Dataset: 0-1664704 | Loss: 0.738 | 913 ms/step , 6891.81 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 11:55:44 | Epoch: 0 | Step: 151030 | Dataset: 0-1665024 | Loss: 0.715 | 912 ms/step , 6896.53 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 11:55:53 | Epoch: 0 | Step: 151040 | Dataset: 0-1665344 | Loss: 0.744 | 915 ms/step , 6875.05 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 11:56:02 | Epoch: 0 | Step: 151050 | Dataset: 0-1665664 | Loss: 0.893 | 913 ms/step , 6887.80 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 11:56:11 | Epoch: 0 | Step: 151060 | Dataset: 0-1665984 | Loss: 0.734 | 913 ms/step , 6890.11 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 11:56:20 | Epoch: 0 | Step: 151070 | Dataset: 0-1666304 | Loss: 0.639 | 914 ms/step , 6883.29 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 11:56:30 | Epoch: 0 | Step: 151080 | Dataset: 0-1666624 | Loss: 0.728 | 912 ms/step , 6894.05 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 11:56:39 | Epoch: 0 | Step: 151090 | Dataset: 0-1666944 | Loss: 0.758 | 914 ms/step , 6883.50 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 11:56:48 | Epoch: 0 | Step: 151100 | Dataset: 0-1667264 | Loss: 0.736 | 913 ms/step , 6885.33 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 11:56:49 | Validation | Step: 151100 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:56:59 | Epoch: 0 | Step: 151110 | Dataset: 0-1667584 | Loss: 0.539 | 912 ms/step , 6899.36 GFLOP/s , 15282.8 tokens/s 
INFO:__main__:2024-11-05 11:57:08 | Epoch: 0 | Step: 151120 | Dataset: 0-1667904 | Loss: 0.635 | 914 ms/step , 6883.36 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 11:57:17 | Epoch: 0 | Step: 151130 | Dataset: 0-1668224 | Loss: 0.827 | 913 ms/step , 6885.58 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 11:57:26 | Epoch: 0 | Step: 151140 | Dataset: 0-1668544 | Loss: 0.754 | 913 ms/step , 6891.67 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 11:57:35 | Epoch: 0 | Step: 151150 | Dataset: 0-1668864 | Loss: 0.779 | 912 ms/step , 6895.20 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 11:57:44 | Epoch: 0 | Step: 151160 | Dataset: 0-1669184 | Loss: 0.817 | 913 ms/step , 6889.42 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 11:57:53 | Epoch: 0 | Step: 151170 | Dataset: 0-1669504 | Loss: 0.747 | 911 ms/step , 6902.79 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 11:58:03 | Epoch: 0 | Step: 151180 | Dataset: 0-1669824 | Loss: 0.725 | 912 ms/step , 6893.66 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 11:58:12 | Epoch: 0 | Step: 151190 | Dataset: 0-1670144 | Loss: 0.741 | 912 ms/step , 6899.42 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 11:58:21 | Epoch: 0 | Step: 151200 | Dataset: 0-1670464 | Loss: 0.775 | 914 ms/step , 6883.88 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 11:58:22 | Validation | Step: 151200 | Val_loss: 0.763 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 11:58:32 | Epoch: 0 | Step: 151210 | Dataset: 0-1670784 | Loss: 0.740 | 913 ms/step , 6886.62 GFLOP/s , 15279.2 tokens/s INFO:__main__:2024-11-05 11:58:41 | Epoch: 0 | Step: 151220 | Dataset: 0-1671104 | Loss: 0.797 | 913 ms/step , 6889.64 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 11:58:50 | Epoch: 0 | Step: 151230 | Dataset: 0-1671424 | Loss: 0.846 | 914 ms/step , 6884.03 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 11:58:59 | Epoch: 0 | Step: 151240 | Dataset: 0-1671744 | Loss: 0.653 | 912 ms/step , 6894.18 GFLOP/s , 17928.2 
tokens/s INFO:__main__:2024-11-05 11:59:08 | Epoch: 0 | Step: 151250 | Dataset: 0-1672064 | Loss: 0.777 | 913 ms/step , 6887.49 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 11:59:17 | Epoch: 0 | Step: 151260 | Dataset: 0-1672384 | Loss: 0.624 | 913 ms/step , 6891.09 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 11:59:26 | Epoch: 0 | Step: 151270 | Dataset: 0-1672704 | Loss: 0.734 | 914 ms/step , 6882.02 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 11:59:36 | Epoch: 0 | Step: 151280 | Dataset: 0-1673024 | Loss: 0.739 | 913 ms/step , 6888.98 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 11:59:45 | Epoch: 0 | Step: 151290 | Dataset: 0-1673344 | Loss: 0.799 | 914 ms/step , 6881.48 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 11:59:54 | Epoch: 0 | Step: 151300 | Dataset: 0-1673664 | Loss: 0.894 | 913 ms/step , 6888.22 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 11:59:55 | Validation | Step: 151300 | Val_loss: 0.694 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:00:04 | Epoch: 0 | Step: 151310 | Dataset: 0-1673984 | Loss: 0.712 | 912 ms/step , 6893.61 GFLOP/s , 15279.0 tokens/s INFO:__main__:2024-11-05 12:00:14 | Epoch: 0 | Step: 151320 | Dataset: 0-1674304 | Loss: 0.707 | 914 ms/step , 6881.05 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 12:00:23 | Epoch: 0 | Step: 151330 | Dataset: 0-1674624 | Loss: 0.680 | 914 ms/step , 6882.25 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 12:00:32 | Epoch: 0 | Step: 151340 | Dataset: 0-1674944 | Loss: 0.648 | 912 ms/step , 6893.87 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 12:00:41 | Epoch: 0 | Step: 151350 | Dataset: 0-1675264 | Loss: 0.723 | 912 ms/step , 6898.63 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 12:00:50 | Epoch: 0 | Step: 151360 | Dataset: 0-1675584 | Loss: 0.796 | 912 ms/step , 6892.79 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 12:00:59 | Epoch: 0 | Step: 151370 | Dataset: 0-1675904 | Loss: 0.854 | 913 ms/step , 6892.20 GFLOP/s , 
17932.1 tokens/s INFO:__main__:2024-11-05 12:01:08 | Epoch: 0 | Step: 151380 | Dataset: 0-1676224 | Loss: 0.657 | 914 ms/step , 6882.33 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 12:01:18 | Epoch: 0 | Step: 151390 | Dataset: 0-1676544 | Loss: 0.808 | 913 ms/step , 6885.64 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 12:01:27 | Epoch: 0 | Step: 151400 | Dataset: 0-1676864 | Loss: 0.805 | 913 ms/step , 6887.22 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 12:01:28 | Validation | Step: 151400 | Val_loss: 0.704 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:01:37 | Epoch: 0 | Step: 151410 | Dataset: 0-1677184 | Loss: 0.710 | 913 ms/step , 6889.01 GFLOP/s , 15281.2 tokens/s INFO:__main__:2024-11-05 12:01:47 | Epoch: 0 | Step: 151420 | Dataset: 0-1677504 | Loss: 0.704 | 914 ms/step , 6883.23 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 12:01:56 | Epoch: 0 | Step: 151430 | Dataset: 0-1677824 | Loss: 0.620 | 913 ms/step , 6889.38 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 12:02:05 | Epoch: 0 | Step: 151440 | Dataset: 0-1678144 | Loss: 0.819 | 914 ms/step , 6880.55 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 12:02:14 | Epoch: 0 | Step: 151450 | Dataset: 0-1678464 | Loss: 0.711 | 913 ms/step , 6891.68 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 12:02:23 | Epoch: 0 | Step: 151460 | Dataset: 0-1678784 | Loss: 0.777 | 914 ms/step , 6882.07 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 12:02:32 | Epoch: 0 | Step: 151470 | Dataset: 0-1679104 | Loss: 0.569 | 913 ms/step , 6888.20 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 12:02:41 | Epoch: 0 | Step: 151480 | Dataset: 0-1679424 | Loss: 0.840 | 913 ms/step , 6885.55 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 12:02:51 | Epoch: 0 | Step: 151490 | Dataset: 0-1679744 | Loss: 0.761 | 913 ms/step , 6889.55 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 12:03:00 | Epoch: 0 | Step: 151500 | Dataset: 0-1680064 | Loss: 0.741 | 913 ms/step , 6887.39 GFLOP/s 
, 17935.0 tokens/s INFO:__main__:2024-11-05 12:03:01 | Validation | Step: 151500 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:03:10 | Epoch: 0 | Step: 151510 | Dataset: 0-1680384 | Loss: 0.629 | 913 ms/step , 6888.23 GFLOP/s , 15274.7 tokens/s INFO:__main__:2024-11-05 12:03:20 | Epoch: 0 | Step: 151520 | Dataset: 0-1680704 | Loss: 0.810 | 916 ms/step , 6862.93 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 12:03:29 | Epoch: 0 | Step: 151530 | Dataset: 0-1681024 | Loss: 0.763 | 913 ms/step , 6887.13 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 12:03:38 | Epoch: 0 | Step: 151540 | Dataset: 0-1681344 | Loss: 0.704 | 922 ms/step , 6821.17 GFLOP/s , 17870.0 tokens/s INFO:__main__:2024-11-05 12:03:47 | Epoch: 0 | Step: 151550 | Dataset: 0-1681664 | Loss: 0.753 | 915 ms/step , 6875.22 GFLOP/s , 17899.7 tokens/s INFO:__main__:2024-11-05 12:03:56 | Epoch: 0 | Step: 151560 | Dataset: 0-1681984 | Loss: 0.764 | 913 ms/step , 6891.88 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 12:04:05 | Epoch: 0 | Step: 151570 | Dataset: 0-1682304 | Loss: 0.822 | 912 ms/step , 6893.24 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 12:04:14 | Epoch: 0 | Step: 151580 | Dataset: 0-1682624 | Loss: 0.719 | 912 ms/step , 6894.92 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 12:04:24 | Epoch: 0 | Step: 151590 | Dataset: 0-1682944 | Loss: 0.732 | 913 ms/step , 6887.05 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 12:04:33 | Epoch: 0 | Step: 151600 | Dataset: 0-1683264 | Loss: 0.725 | 914 ms/step , 6881.93 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 12:04:34 | Validation | Step: 151600 | Val_loss: 0.781 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:04:43 | Epoch: 0 | Step: 151610 | Dataset: 0-1683584 | Loss: 0.723 | 913 ms/step , 6888.24 GFLOP/s , 15281.2 tokens/s INFO:__main__:2024-11-05 12:04:53 | Epoch: 0 | Step: 151620 | Dataset: 0-1683904 | Loss: 0.711 | 913 ms/step , 6885.56 GFLOP/s , 17930.5 tokens/s 
INFO:__main__:2024-11-05 12:05:02 | Epoch: 0 | Step: 151630 | Dataset: 0-1684224 | Loss: 0.824 | 914 ms/step , 6882.97 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 12:05:11 | Epoch: 0 | Step: 151640 | Dataset: 0-1684544 | Loss: 0.728 | 913 ms/step , 6889.69 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 12:05:20 | Epoch: 0 | Step: 151650 | Dataset: 0-1684864 | Loss: 0.662 | 913 ms/step , 6890.53 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 12:05:29 | Epoch: 0 | Step: 151660 | Dataset: 0-1685184 | Loss: 0.772 | 912 ms/step , 6894.25 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 12:05:38 | Epoch: 0 | Step: 151670 | Dataset: 0-1685504 | Loss: 0.692 | 913 ms/step , 6889.79 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 12:05:47 | Epoch: 0 | Step: 151680 | Dataset: 0-1685824 | Loss: 0.805 | 913 ms/step , 6886.00 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 12:05:56 | Epoch: 0 | Step: 151690 | Dataset: 0-1686144 | Loss: 0.792 | 913 ms/step , 6890.22 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 12:06:06 | Epoch: 0 | Step: 151700 | Dataset: 0-1686464 | Loss: 0.775 | 913 ms/step , 6888.70 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 12:06:07 | Validation | Step: 151700 | Val_loss: 0.798 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:06:16 | Epoch: 0 | Step: 151710 | Dataset: 0-1686784 | Loss: 0.765 | 912 ms/step , 6892.97 GFLOP/s , 15277.2 tokens/s INFO:__main__:2024-11-05 12:06:25 | Epoch: 0 | Step: 151720 | Dataset: 0-1687104 | Loss: 0.771 | 914 ms/step , 6883.40 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 12:06:35 | Epoch: 0 | Step: 151730 | Dataset: 0-1687424 | Loss: 0.760 | 912 ms/step , 6893.58 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 12:06:44 | Epoch: 0 | Step: 151740 | Dataset: 0-1687744 | Loss: 0.600 | 914 ms/step , 6882.48 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 12:06:53 | Epoch: 0 | Step: 151750 | Dataset: 0-1688064 | Loss: 0.708 | 913 ms/step , 6885.88 GFLOP/s , 17932.5 
tokens/s INFO:__main__:2024-11-05 12:07:02 | Epoch: 0 | Step: 151760 | Dataset: 0-1688384 | Loss: 0.678 | 913 ms/step , 6889.70 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 12:07:11 | Epoch: 0 | Step: 151770 | Dataset: 0-1688704 | Loss: 0.681 | 912 ms/step , 6896.71 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 12:07:20 | Epoch: 0 | Step: 151780 | Dataset: 0-1689024 | Loss: 0.771 | 914 ms/step , 6884.55 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 12:07:29 | Epoch: 0 | Step: 151790 | Dataset: 0-1689344 | Loss: 0.758 | 913 ms/step , 6886.91 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 12:07:39 | Epoch: 0 | Step: 151800 | Dataset: 0-1689664 | Loss: 0.779 | 913 ms/step , 6889.57 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 12:07:40 | Validation | Step: 151800 | Val_loss: 0.774 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:07:49 | Epoch: 0 | Step: 151810 | Dataset: 0-1689984 | Loss: 0.759 | 914 ms/step , 6878.88 GFLOP/s , 15268.0 tokens/s INFO:__main__:2024-11-05 12:07:58 | Epoch: 0 | Step: 151820 | Dataset: 0-1690304 | Loss: 0.650 | 913 ms/step , 6889.70 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 12:08:08 | Epoch: 0 | Step: 151830 | Dataset: 0-1690624 | Loss: 0.695 | 913 ms/step , 6890.36 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 12:08:17 | Epoch: 0 | Step: 151840 | Dataset: 0-1690944 | Loss: 0.744 | 914 ms/step , 6883.20 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 12:08:26 | Epoch: 0 | Step: 151850 | Dataset: 0-1691264 | Loss: 0.695 | 913 ms/step , 6889.48 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 12:08:35 | Epoch: 0 | Step: 151860 | Dataset: 0-1691584 | Loss: 0.560 | 912 ms/step , 6898.02 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 12:08:44 | Epoch: 0 | Step: 151870 | Dataset: 0-1691904 | Loss: 0.654 | 914 ms/step , 6883.79 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 12:08:53 | Epoch: 0 | Step: 151880 | Dataset: 0-1692224 | Loss: 0.685 | 912 ms/step , 6894.94 GFLOP/s , 
17921.1 tokens/s INFO:__main__:2024-11-05 12:09:02 | Epoch: 0 | Step: 151890 | Dataset: 0-1692544 | Loss: 0.709 | 913 ms/step , 6886.43 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 12:09:12 | Epoch: 0 | Step: 151900 | Dataset: 0-1692864 | Loss: 0.713 | 914 ms/step , 6882.36 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 12:09:13 | Validation | Step: 151900 | Val_loss: 0.767 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:09:22 | Epoch: 0 | Step: 151910 | Dataset: 0-1693184 | Loss: 0.780 | 914 ms/step , 6883.08 GFLOP/s , 15278.3 tokens/s INFO:__main__:2024-11-05 12:09:31 | Epoch: 0 | Step: 151920 | Dataset: 0-1693504 | Loss: 0.618 | 914 ms/step , 6880.45 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 12:09:41 | Epoch: 0 | Step: 151930 | Dataset: 0-1693824 | Loss: 0.718 | 913 ms/step , 6888.01 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 12:09:50 | Epoch: 0 | Step: 151940 | Dataset: 0-1694144 | Loss: 0.802 | 912 ms/step , 6893.86 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 12:09:59 | Epoch: 0 | Step: 151950 | Dataset: 0-1694464 | Loss: 0.776 | 914 ms/step , 6884.63 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 12:10:08 | Epoch: 0 | Step: 151960 | Dataset: 0-1694784 | Loss: 0.720 | 913 ms/step , 6889.17 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 12:10:17 | Epoch: 0 | Step: 151970 | Dataset: 0-1695104 | Loss: 0.732 | 914 ms/step , 6877.91 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 12:10:26 | Epoch: 0 | Step: 151980 | Dataset: 0-1695424 | Loss: 0.750 | 914 ms/step , 6880.60 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 12:10:35 | Epoch: 0 | Step: 151990 | Dataset: 0-1695744 | Loss: 0.701 | 912 ms/step , 6897.24 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 12:10:44 | Epoch: 0 | Step: 152000 | Dataset: 0-1696064 | Loss: 0.665 | 912 ms/step , 6892.80 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 12:10:46 | Validation | Step: 152000 | Val_loss: 0.781 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 12:10:46 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_121046_step_152000.pt` INFO:__main__:2024-11-05 12:10:56 | Epoch: 0 | Step: 152010 | Dataset: 0-1696384 | Loss: 0.694 | 912 ms/step , 6894.68 GFLOP/s , 13745.6 tokens/s INFO:__main__:2024-11-05 12:11:06 | Epoch: 0 | Step: 152020 | Dataset: 0-1696704 | Loss: 0.747 | 912 ms/step , 6893.68 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 12:11:15 | Epoch: 0 | Step: 152030 | Dataset: 0-1697024 | Loss: 0.705 | 914 ms/step , 6879.10 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 12:11:24 | Epoch: 0 | Step: 152040 | Dataset: 0-1697344 | Loss: 0.734 | 914 ms/step , 6885.03 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 12:11:33 | Epoch: 0 | Step: 152050 | Dataset: 0-1697664 | Loss: 0.689 | 914 ms/step , 6878.67 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 12:11:42 | Epoch: 0 | Step: 152060 | Dataset: 0-1697984 | Loss: 0.662 | 913 ms/step , 6886.95 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 12:11:51 | Epoch: 0 | Step: 152070 | Dataset: 0-1698304 | Loss: 0.698 | 915 ms/step , 6874.24 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 12:12:00 | Epoch: 0 | Step: 152080 | Dataset: 0-1698624 | Loss: 0.817 | 912 ms/step , 6893.60 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 12:12:10 | Epoch: 0 | Step: 152090 | Dataset: 0-1698944 | Loss: 0.732 | 914 ms/step , 6881.87 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 12:12:19 | Epoch: 0 | Step: 152100 | Dataset: 0-1699264 | Loss: 0.705 | 913 ms/step , 6886.62 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 12:12:20 | Validation | Step: 152100 | Val_loss: 0.728 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:12:29 | Epoch: 0 | Step: 152110 | Dataset: 0-1699584 | Loss: 0.682 | 912 ms/step , 6896.56 GFLOP/s , 15282.5 tokens/s INFO:__main__:2024-11-05 12:12:39 | Epoch: 0 | Step: 152120 | Dataset: 0-1699904 | Loss: 0.772 | 914 ms/step , 6879.48 GFLOP/s , 17930.6 tokens/s 
INFO:__main__:2024-11-05 12:12:48 | Epoch: 0 | Step: 152130 | Dataset: 0-1700224 | Loss: 0.677 | 913 ms/step , 6889.12 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 12:12:57 | Epoch: 0 | Step: 152140 | Dataset: 0-1700544 | Loss: 0.767 | 913 ms/step , 6892.58 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 12:13:06 | Epoch: 0 | Step: 152150 | Dataset: 0-1700864 | Loss: 0.776 | 913 ms/step , 6887.13 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 12:13:15 | Epoch: 0 | Step: 152160 | Dataset: 0-1701184 | Loss: 0.694 | 913 ms/step , 6888.31 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 12:13:24 | Epoch: 0 | Step: 152170 | Dataset: 0-1701504 | Loss: 0.722 | 912 ms/step , 6897.86 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 12:13:33 | Epoch: 0 | Step: 152180 | Dataset: 0-1701824 | Loss: 0.782 | 914 ms/step , 6882.99 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 12:13:42 | Epoch: 0 | Step: 152190 | Dataset: 0-1702144 | Loss: 0.599 | 912 ms/step , 6899.67 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 12:13:52 | Epoch: 0 | Step: 152200 | Dataset: 0-1702464 | Loss: 0.657 | 913 ms/step , 6887.08 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 12:13:53 | Validation | Step: 152200 | Val_loss: 0.670 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:14:02 | Epoch: 0 | Step: 152210 | Dataset: 0-1702784 | Loss: 0.712 | 913 ms/step , 6889.52 GFLOP/s , 15276.8 tokens/s INFO:__main__:2024-11-05 12:14:11 | Epoch: 0 | Step: 152220 | Dataset: 0-1703104 | Loss: 0.692 | 914 ms/step , 6881.83 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 12:14:21 | Epoch: 0 | Step: 152230 | Dataset: 0-1703424 | Loss: 0.714 | 912 ms/step , 6893.13 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 12:14:30 | Epoch: 0 | Step: 152240 | Dataset: 0-1703744 | Loss: 0.835 | 914 ms/step , 6881.78 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 12:14:39 | Epoch: 0 | Step: 152250 | Dataset: 0-1704064 | Loss: 0.764 | 913 ms/step , 6886.33 GFLOP/s , 17933.9 
tokens/s INFO:__main__:2024-11-05 12:14:48 | Epoch: 0 | Step: 152260 | Dataset: 0-1704384 | Loss: 0.702 | 913 ms/step , 6891.31 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 12:14:57 | Epoch: 0 | Step: 152270 | Dataset: 0-1704704 | Loss: 0.681 | 914 ms/step , 6883.60 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 12:15:06 | Epoch: 0 | Step: 152280 | Dataset: 0-1705024 | Loss: 0.624 | 913 ms/step , 6891.22 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 12:15:15 | Epoch: 0 | Step: 152290 | Dataset: 0-1705344 | Loss: 0.681 | 913 ms/step , 6888.84 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 12:15:25 | Epoch: 0 | Step: 152300 | Dataset: 0-1705664 | Loss: 0.812 | 913 ms/step , 6891.36 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 12:15:26 | Validation | Step: 152300 | Val_loss: 0.779 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:15:35 | Epoch: 0 | Step: 152310 | Dataset: 0-1705984 | Loss: 0.748 | 913 ms/step , 6885.23 GFLOP/s , 15277.7 tokens/s INFO:__main__:2024-11-05 12:15:44 | Epoch: 0 | Step: 152320 | Dataset: 0-1706304 | Loss: 0.513 | 912 ms/step , 6896.06 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 12:15:54 | Epoch: 0 | Step: 152330 | Dataset: 0-1706624 | Loss: 0.742 | 913 ms/step , 6890.20 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 12:16:03 | Epoch: 0 | Step: 152340 | Dataset: 0-1706944 | Loss: 0.667 | 915 ms/step , 6875.29 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 12:16:12 | Epoch: 0 | Step: 152350 | Dataset: 0-1707264 | Loss: 0.753 | 914 ms/step , 6883.15 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 12:16:21 | Epoch: 0 | Step: 152360 | Dataset: 0-1707584 | Loss: 0.681 | 913 ms/step , 6885.35 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 12:16:30 | Epoch: 0 | Step: 152370 | Dataset: 0-1707904 | Loss: 0.742 | 913 ms/step , 6886.24 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 12:16:39 | Epoch: 0 | Step: 152380 | Dataset: 0-1708224 | Loss: 0.820 | 914 ms/step , 6881.51 GFLOP/s , 
17926.3 tokens/s INFO:__main__:2024-11-05 12:16:48 | Epoch: 0 | Step: 152390 | Dataset: 0-1708544 | Loss: 0.721 | 913 ms/step , 6891.05 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 12:16:58 | Epoch: 0 | Step: 152400 | Dataset: 0-1708864 | Loss: 0.704 | 912 ms/step , 6893.98 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 12:16:59 | Validation | Step: 152400 | Val_loss: 0.776 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:17:08 | Epoch: 0 | Step: 152410 | Dataset: 0-1709184 | Loss: 0.786 | 913 ms/step , 6887.47 GFLOP/s , 15283.2 tokens/s INFO:__main__:2024-11-05 12:17:17 | Epoch: 0 | Step: 152420 | Dataset: 0-1709504 | Loss: 0.707 | 914 ms/step , 6878.09 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 12:17:27 | Epoch: 0 | Step: 152430 | Dataset: 0-1709824 | Loss: 0.744 | 914 ms/step , 6882.57 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 12:17:36 | Epoch: 0 | Step: 152440 | Dataset: 0-1710144 | Loss: 0.674 | 913 ms/step , 6891.20 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 12:17:45 | Epoch: 0 | Step: 152450 | Dataset: 0-1710464 | Loss: 0.756 | 914 ms/step , 6883.54 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 12:17:54 | Epoch: 0 | Step: 152460 | Dataset: 0-1710784 | Loss: 0.863 | 915 ms/step , 6877.22 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 12:18:03 | Epoch: 0 | Step: 152470 | Dataset: 0-1711104 | Loss: 0.790 | 913 ms/step , 6890.78 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 12:18:12 | Epoch: 0 | Step: 152480 | Dataset: 0-1711424 | Loss: 0.706 | 914 ms/step , 6878.12 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 12:18:21 | Epoch: 0 | Step: 152490 | Dataset: 0-1711744 | Loss: 0.715 | 913 ms/step , 6886.80 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 12:18:30 | Epoch: 0 | Step: 152500 | Dataset: 0-1712064 | Loss: 0.695 | 914 ms/step , 6882.91 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 12:18:32 | Validation | Step: 152500 | Val_loss: 0.817 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 12:18:41 | Epoch: 0 | Step: 152510 | Dataset: 0-1712384 | Loss: 0.523 | 913 ms/step , 6885.67 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-05 12:18:50 | Epoch: 0 | Step: 152520 | Dataset: 0-1712704 | Loss: 0.757 | 915 ms/step , 6876.99 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 12:18:59 | Epoch: 0 | Step: 152530 | Dataset: 0-1713024 | Loss: 0.724 | 912 ms/step , 6896.48 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 12:19:09 | Epoch: 0 | Step: 152540 | Dataset: 0-1713344 | Loss: 0.745 | 914 ms/step , 6884.49 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 12:19:18 | Epoch: 0 | Step: 152550 | Dataset: 0-1713664 | Loss: 0.691 | 914 ms/step , 6883.09 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 12:19:27 | Epoch: 0 | Step: 152560 | Dataset: 0-1713984 | Loss: 0.733 | 914 ms/step , 6884.10 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 12:19:36 | Epoch: 0 | Step: 152570 | Dataset: 0-1714304 | Loss: 0.745 | 913 ms/step , 6891.69 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 12:19:45 | Epoch: 0 | Step: 152580 | Dataset: 0-1714624 | Loss: 0.722 | 914 ms/step , 6884.74 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 12:19:54 | Epoch: 0 | Step: 152590 | Dataset: 0-1714944 | Loss: 0.765 | 914 ms/step , 6880.52 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 12:20:03 | Epoch: 0 | Step: 152600 | Dataset: 0-1715264 | Loss: 0.799 | 915 ms/step , 6875.38 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 12:20:05 | Validation | Step: 152600 | Val_loss: 0.760 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:20:14 | Epoch: 0 | Step: 152610 | Dataset: 0-1715584 | Loss: 0.651 | 913 ms/step , 6885.90 GFLOP/s , 15270.0 tokens/s INFO:__main__:2024-11-05 12:20:23 | Epoch: 0 | Step: 152620 | Dataset: 0-1715904 | Loss: 0.785 | 915 ms/step , 6875.83 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 12:20:32 | Epoch: 0 | Step: 152630 | Dataset: 0-1716224 | Loss: 0.795 | 916 ms/step , 6865.17 GFLOP/s , 17909.9 
tokens/s INFO:__main__:2024-11-05 12:20:42 | Epoch: 0 | Step: 152640 | Dataset: 0-1716544 | Loss: 0.732 | 914 ms/step , 6880.35 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 12:20:51 | Epoch: 0 | Step: 152650 | Dataset: 0-1716864 | Loss: 0.629 | 915 ms/step , 6874.05 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 12:21:00 | Epoch: 0 | Step: 152660 | Dataset: 0-1717184 | Loss: 0.789 | 915 ms/step , 6872.08 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 12:21:09 | Epoch: 0 | Step: 152670 | Dataset: 0-1717504 | Loss: 0.812 | 914 ms/step , 6880.13 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 12:21:18 | Epoch: 0 | Step: 152680 | Dataset: 0-1717824 | Loss: 0.732 | 914 ms/step , 6881.71 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 12:21:27 | Epoch: 0 | Step: 152690 | Dataset: 0-1718144 | Loss: 0.865 | 915 ms/step , 6877.46 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 12:21:36 | Epoch: 0 | Step: 152700 | Dataset: 0-1718464 | Loss: 0.678 | 914 ms/step , 6883.18 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 12:21:38 | Validation | Step: 152700 | Val_loss: 0.806 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:21:47 | Epoch: 0 | Step: 152710 | Dataset: 0-1718784 | Loss: 0.771 | 914 ms/step , 6883.92 GFLOP/s , 15282.4 tokens/s INFO:__main__:2024-11-05 12:21:56 | Epoch: 0 | Step: 152720 | Dataset: 0-1719104 | Loss: 0.711 | 912 ms/step , 6895.68 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 12:22:05 | Epoch: 0 | Step: 152730 | Dataset: 0-1719424 | Loss: 0.759 | 914 ms/step , 6884.18 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 12:22:15 | Epoch: 0 | Step: 152740 | Dataset: 0-1719744 | Loss: 0.820 | 914 ms/step , 6878.97 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 12:22:24 | Epoch: 0 | Step: 152750 | Dataset: 0-1720064 | Loss: 0.669 | 913 ms/step , 6888.99 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 12:22:33 | Epoch: 0 | Step: 152760 | Dataset: 0-1720384 | Loss: 0.801 | 914 ms/step , 6883.94 GFLOP/s , 
17925.2 tokens/s INFO:__main__:2024-11-05 12:22:42 | Epoch: 0 | Step: 152770 | Dataset: 0-1720704 | Loss: 0.688 | 912 ms/step , 6892.79 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 12:22:51 | Epoch: 0 | Step: 152780 | Dataset: 0-1721024 | Loss: 0.801 | 914 ms/step , 6883.98 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 12:23:00 | Epoch: 0 | Step: 152790 | Dataset: 0-1721344 | Loss: 0.708 | 913 ms/step , 6889.38 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 12:23:09 | Epoch: 0 | Step: 152800 | Dataset: 0-1721664 | Loss: 0.676 | 913 ms/step , 6888.02 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 12:23:11 | Validation | Step: 152800 | Val_loss: 0.708 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:23:20 | Epoch: 0 | Step: 152810 | Dataset: 0-1721984 | Loss: 0.655 | 914 ms/step , 6878.65 GFLOP/s , 15279.4 tokens/s INFO:__main__:2024-11-05 12:23:29 | Epoch: 0 | Step: 152820 | Dataset: 0-1722304 | Loss: 0.728 | 914 ms/step , 6883.81 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 12:23:38 | Epoch: 0 | Step: 152830 | Dataset: 0-1722624 | Loss: 0.752 | 913 ms/step , 6885.76 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 12:23:48 | Epoch: 0 | Step: 152840 | Dataset: 0-1722944 | Loss: 0.621 | 913 ms/step , 6887.32 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 12:23:57 | Epoch: 0 | Step: 152850 | Dataset: 0-1723264 | Loss: 0.738 | 913 ms/step , 6886.64 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 12:24:06 | Epoch: 0 | Step: 152860 | Dataset: 0-1723584 | Loss: 0.633 | 914 ms/step , 6884.88 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 12:24:15 | Epoch: 0 | Step: 152870 | Dataset: 0-1723904 | Loss: 0.785 | 913 ms/step , 6887.14 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 12:24:24 | Epoch: 0 | Step: 152880 | Dataset: 0-1724224 | Loss: 0.776 | 913 ms/step , 6885.36 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 12:24:33 | Epoch: 0 | Step: 152890 | Dataset: 0-1724544 | Loss: 0.776 | 913 ms/step , 6887.99 GFLOP/s 
, 17931.8 tokens/s INFO:__main__:2024-11-05 12:24:42 | Epoch: 0 | Step: 152900 | Dataset: 0-1724864 | Loss: 0.634 | 913 ms/step , 6888.46 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 12:24:44 | Validation | Step: 152900 | Val_loss: 0.767 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:24:53 | Epoch: 0 | Step: 152910 | Dataset: 0-1725184 | Loss: 0.832 | 913 ms/step , 6886.38 GFLOP/s , 15289.2 tokens/s INFO:__main__:2024-11-05 12:25:02 | Epoch: 0 | Step: 152920 | Dataset: 0-1725504 | Loss: 0.681 | 913 ms/step , 6886.10 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 12:25:11 | Epoch: 0 | Step: 152930 | Dataset: 0-1725824 | Loss: 0.749 | 912 ms/step , 6894.04 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 12:25:21 | Epoch: 0 | Step: 152940 | Dataset: 0-1726144 | Loss: 0.720 | 914 ms/step , 6878.98 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 12:25:30 | Epoch: 0 | Step: 152950 | Dataset: 0-1726464 | Loss: 0.751 | 913 ms/step , 6890.52 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 12:25:39 | Epoch: 0 | Step: 152960 | Dataset: 0-1726784 | Loss: 0.698 | 913 ms/step , 6886.17 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 12:25:48 | Epoch: 0 | Step: 152970 | Dataset: 0-1727104 | Loss: 0.756 | 914 ms/step , 6884.47 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 12:25:57 | Epoch: 0 | Step: 152980 | Dataset: 0-1727424 | Loss: 0.752 | 912 ms/step , 6895.95 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 12:26:06 | Epoch: 0 | Step: 152990 | Dataset: 0-1727744 | Loss: 0.831 | 913 ms/step , 6892.31 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 12:26:15 | Epoch: 0 | Step: 153000 | Dataset: 0-1728064 | Loss: 0.718 | 913 ms/step , 6889.62 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 12:26:17 | Validation | Step: 153000 | Val_loss: 0.754 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:26:17 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_122617_step_153000.pt` 
INFO:__main__:2024-11-05 12:26:27 | Epoch: 0 | Step: 153010 | Dataset: 0-1728384 | Loss: 0.834 | 913 ms/step , 6887.98 GFLOP/s , 13780.3 tokens/s INFO:__main__:2024-11-05 12:26:36 | Epoch: 0 | Step: 153020 | Dataset: 0-1728704 | Loss: 0.758 | 913 ms/step , 6886.81 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 12:26:45 | Epoch: 0 | Step: 153030 | Dataset: 0-1729024 | Loss: 0.767 | 913 ms/step , 6885.33 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 12:26:55 | Epoch: 0 | Step: 153040 | Dataset: 0-1729344 | Loss: 0.819 | 913 ms/step , 6891.60 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 12:27:04 | Epoch: 0 | Step: 153050 | Dataset: 0-1729664 | Loss: 0.859 | 913 ms/step , 6888.57 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 12:27:13 | Epoch: 0 | Step: 153060 | Dataset: 0-1729984 | Loss: 0.837 | 913 ms/step , 6886.81 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 12:27:22 | Epoch: 0 | Step: 153070 | Dataset: 0-1730304 | Loss: 0.666 | 913 ms/step , 6885.22 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 12:27:31 | Epoch: 0 | Step: 153080 | Dataset: 0-1730624 | Loss: 0.846 | 912 ms/step , 6895.57 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 12:27:40 | Epoch: 0 | Step: 153090 | Dataset: 0-1730944 | Loss: 0.721 | 913 ms/step , 6886.08 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 12:27:49 | Epoch: 0 | Step: 153100 | Dataset: 0-1731264 | Loss: 0.714 | 912 ms/step , 6896.20 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 12:27:51 | Validation | Step: 153100 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:28:00 | Epoch: 0 | Step: 153110 | Dataset: 0-1731584 | Loss: 0.713 | 914 ms/step , 6884.62 GFLOP/s , 15277.9 tokens/s INFO:__main__:2024-11-05 12:28:09 | Epoch: 0 | Step: 153120 | Dataset: 0-1731904 | Loss: 0.717 | 914 ms/step , 6884.86 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 12:28:18 | Epoch: 0 | Step: 153130 | Dataset: 0-1732224 | Loss: 0.802 | 913 ms/step , 6892.18 GFLOP/s , 17934.1 
tokens/s INFO:__main__:2024-11-05 12:28:28 | Epoch: 0 | Step: 153140 | Dataset: 0-1732544 | Loss: 0.721 | 913 ms/step , 6889.80 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 12:28:37 | Epoch: 0 | Step: 153150 | Dataset: 0-1732864 | Loss: 0.819 | 913 ms/step , 6889.64 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 12:28:46 | Epoch: 0 | Step: 153160 | Dataset: 0-1733184 | Loss: 0.820 | 913 ms/step , 6888.78 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 12:28:55 | Epoch: 0 | Step: 153170 | Dataset: 0-1733504 | Loss: 0.843 | 913 ms/step , 6887.95 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 12:29:04 | Epoch: 0 | Step: 153180 | Dataset: 0-1733824 | Loss: 0.797 | 913 ms/step , 6892.11 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 12:29:13 | Epoch: 0 | Step: 153190 | Dataset: 0-1734144 | Loss: 0.561 | 913 ms/step , 6886.80 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 12:29:22 | Epoch: 0 | Step: 153200 | Dataset: 0-1734464 | Loss: 0.677 | 913 ms/step , 6888.78 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 12:29:24 | Validation | Step: 153200 | Val_loss: 0.766 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:29:33 | Epoch: 0 | Step: 153210 | Dataset: 0-1734784 | Loss: 0.773 | 914 ms/step , 6883.81 GFLOP/s , 15287.2 tokens/s INFO:__main__:2024-11-05 12:29:42 | Epoch: 0 | Step: 153220 | Dataset: 0-1735104 | Loss: 0.754 | 913 ms/step , 6888.96 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 12:29:51 | Epoch: 0 | Step: 153230 | Dataset: 0-1735424 | Loss: 0.710 | 913 ms/step , 6885.08 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 12:30:01 | Epoch: 0 | Step: 153240 | Dataset: 0-1735744 | Loss: 0.884 | 913 ms/step , 6891.94 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 12:30:10 | Epoch: 0 | Step: 153250 | Dataset: 0-1736064 | Loss: 0.686 | 912 ms/step , 6894.65 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 12:30:19 | Epoch: 0 | Step: 153260 | Dataset: 0-1736384 | Loss: 0.716 | 913 ms/step , 6885.39 GFLOP/s , 
17938.0 tokens/s INFO:__main__:2024-11-05 12:30:28 | Epoch: 0 | Step: 153270 | Dataset: 0-1736704 | Loss: 0.720 | 914 ms/step , 6883.36 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 12:30:37 | Epoch: 0 | Step: 153280 | Dataset: 0-1737024 | Loss: 0.799 | 914 ms/step , 6880.66 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 12:30:46 | Epoch: 0 | Step: 153290 | Dataset: 0-1737344 | Loss: 0.801 | 912 ms/step , 6898.66 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 12:30:55 | Epoch: 0 | Step: 153300 | Dataset: 0-1737664 | Loss: 0.729 | 913 ms/step , 6886.92 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 12:30:57 | Validation | Step: 153300 | Val_loss: 0.788 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:31:06 | Epoch: 0 | Step: 153310 | Dataset: 0-1737984 | Loss: 0.766 | 913 ms/step , 6890.64 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 12:31:15 | Epoch: 0 | Step: 153320 | Dataset: 0-1738304 | Loss: 0.727 | 913 ms/step , 6892.36 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 12:31:24 | Epoch: 0 | Step: 153330 | Dataset: 0-1738624 | Loss: 0.787 | 913 ms/step , 6886.71 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 12:31:33 | Epoch: 0 | Step: 153340 | Dataset: 0-1738944 | Loss: 0.806 | 914 ms/step , 6883.29 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 12:31:43 | Epoch: 0 | Step: 153350 | Dataset: 0-1739264 | Loss: 0.601 | 912 ms/step , 6894.54 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 12:31:52 | Epoch: 0 | Step: 153360 | Dataset: 0-1739584 | Loss: 0.680 | 912 ms/step , 6895.06 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 12:32:01 | Epoch: 0 | Step: 153370 | Dataset: 0-1739904 | Loss: 0.701 | 912 ms/step , 6894.15 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 12:32:10 | Epoch: 0 | Step: 153380 | Dataset: 0-1740224 | Loss: 0.838 | 913 ms/step , 6887.26 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 12:32:19 | Epoch: 0 | Step: 153390 | Dataset: 0-1740544 | Loss: 0.663 | 913 ms/step , 6891.87 GFLOP/s 
, 17940.8 tokens/s INFO:__main__:2024-11-05 12:32:28 | Epoch: 0 | Step: 153400 | Dataset: 0-1740864 | Loss: 0.817 | 913 ms/step , 6889.69 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 12:32:30 | Validation | Step: 153400 | Val_loss: 0.698 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:32:39 | Epoch: 0 | Step: 153410 | Dataset: 0-1741184 | Loss: 0.774 | 914 ms/step , 6881.84 GFLOP/s , 15284.7 tokens/s INFO:__main__:2024-11-05 12:32:48 | Epoch: 0 | Step: 153420 | Dataset: 0-1741504 | Loss: 0.761 | 914 ms/step , 6879.70 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 12:32:57 | Epoch: 0 | Step: 153430 | Dataset: 0-1741824 | Loss: 0.658 | 911 ms/step , 6900.90 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 12:33:06 | Epoch: 0 | Step: 153440 | Dataset: 0-1742144 | Loss: 0.724 | 912 ms/step , 6893.95 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 12:33:15 | Epoch: 0 | Step: 153450 | Dataset: 0-1742464 | Loss: 0.777 | 913 ms/step , 6891.11 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 12:33:25 | Epoch: 0 | Step: 153460 | Dataset: 0-1742784 | Loss: 0.782 | 913 ms/step , 6887.04 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 12:33:34 | Epoch: 0 | Step: 153470 | Dataset: 0-1743104 | Loss: 0.781 | 913 ms/step , 6886.56 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 12:33:43 | Epoch: 0 | Step: 153480 | Dataset: 0-1743424 | Loss: 0.821 | 913 ms/step , 6885.92 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 12:33:52 | Epoch: 0 | Step: 153490 | Dataset: 0-1743744 | Loss: 0.677 | 913 ms/step , 6890.42 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 12:34:01 | Epoch: 0 | Step: 153500 | Dataset: 0-1744064 | Loss: 0.798 | 914 ms/step , 6883.81 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 12:34:03 | Validation | Step: 153500 | Val_loss: 0.798 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:34:12 | Epoch: 0 | Step: 153510 | Dataset: 0-1744384 | Loss: 0.499 | 912 ms/step , 6894.23 GFLOP/s , 15273.6 tokens/s 
INFO:__main__:2024-11-05 12:34:21 | Epoch: 0 | Step: 153520 | Dataset: 0-1744704 | Loss: 0.811 | 913 ms/step , 6885.88 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 12:34:30 | Epoch: 0 | Step: 153530 | Dataset: 0-1745024 | Loss: 0.805 | 914 ms/step , 6884.02 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 12:34:39 | Epoch: 0 | Step: 153540 | Dataset: 0-1745344 | Loss: 0.738 | 914 ms/step , 6884.37 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 12:34:48 | Epoch: 0 | Step: 153550 | Dataset: 0-1745664 | Loss: 0.718 | 912 ms/step , 6897.16 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 12:34:58 | Epoch: 0 | Step: 153560 | Dataset: 0-1745984 | Loss: 0.667 | 912 ms/step , 6896.14 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 12:35:07 | Epoch: 0 | Step: 153570 | Dataset: 0-1746304 | Loss: 0.613 | 914 ms/step , 6883.41 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 12:35:16 | Epoch: 0 | Step: 153580 | Dataset: 0-1746624 | Loss: 0.766 | 913 ms/step , 6890.91 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 12:35:25 | Epoch: 0 | Step: 153590 | Dataset: 0-1746944 | Loss: 0.728 | 913 ms/step , 6888.95 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 12:35:34 | Epoch: 0 | Step: 153600 | Dataset: 0-1747264 | Loss: 0.703 | 913 ms/step , 6892.59 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 12:35:36 | Validation | Step: 153600 | Val_loss: 0.753 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:35:45 | Epoch: 0 | Step: 153610 | Dataset: 0-1747584 | Loss: 0.660 | 912 ms/step , 6899.57 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-05 12:35:54 | Epoch: 0 | Step: 153620 | Dataset: 0-1747904 | Loss: 0.568 | 913 ms/step , 6891.79 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 12:36:03 | Epoch: 0 | Step: 153630 | Dataset: 0-1748224 | Loss: 0.799 | 913 ms/step , 6887.37 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 12:36:12 | Epoch: 0 | Step: 153640 | Dataset: 0-1748544 | Loss: 0.864 | 914 ms/step , 6879.60 GFLOP/s , 17934.5 
tokens/s INFO:__main__:2024-11-05 12:36:21 | Epoch: 0 | Step: 153650 | Dataset: 0-1748864 | Loss: 0.763 | 912 ms/step , 6894.30 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 12:36:30 | Epoch: 0 | Step: 153660 | Dataset: 0-1749184 | Loss: 0.809 | 912 ms/step , 6897.51 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 12:36:40 | Epoch: 0 | Step: 153670 | Dataset: 0-1749504 | Loss: 0.660 | 913 ms/step , 6891.81 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 12:36:49 | Epoch: 0 | Step: 153680 | Dataset: 0-1749824 | Loss: 0.677 | 912 ms/step , 6895.80 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 12:36:58 | Epoch: 0 | Step: 153690 | Dataset: 0-1750144 | Loss: 0.720 | 913 ms/step , 6890.89 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 12:37:07 | Epoch: 0 | Step: 153700 | Dataset: 0-1750464 | Loss: 0.737 | 914 ms/step , 6884.45 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 12:37:09 | Validation | Step: 153700 | Val_loss: 0.802 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:37:18 | Epoch: 0 | Step: 153710 | Dataset: 0-1750784 | Loss: 0.916 | 913 ms/step , 6887.33 GFLOP/s , 15282.7 tokens/s INFO:__main__:2024-11-05 12:37:27 | Epoch: 0 | Step: 153720 | Dataset: 0-1751104 | Loss: 0.769 | 913 ms/step , 6891.13 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 12:37:36 | Epoch: 0 | Step: 153730 | Dataset: 0-1751424 | Loss: 0.719 | 913 ms/step , 6887.25 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 12:37:45 | Epoch: 0 | Step: 153740 | Dataset: 0-1751744 | Loss: 0.805 | 913 ms/step , 6889.97 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 12:37:54 | Epoch: 0 | Step: 153750 | Dataset: 0-1752064 | Loss: 0.871 | 915 ms/step , 6874.61 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 12:38:03 | Epoch: 0 | Step: 153760 | Dataset: 0-1752384 | Loss: 0.475 | 912 ms/step , 6895.88 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 12:38:13 | Epoch: 0 | Step: 153770 | Dataset: 0-1752704 | Loss: 0.451 | 912 ms/step , 6895.19 GFLOP/s , 
17945.7 tokens/s INFO:__main__:2024-11-05 12:38:22 | Epoch: 0 | Step: 153780 | Dataset: 0-1753024 | Loss: 0.430 | 914 ms/step , 6882.30 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 12:38:31 | Epoch: 0 | Step: 153790 | Dataset: 0-1753344 | Loss: 0.481 | 913 ms/step , 6891.01 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 12:38:40 | Epoch: 0 | Step: 153800 | Dataset: 0-1753664 | Loss: 0.434 | 912 ms/step , 6899.43 GFLOP/s , 17954.2 tokens/s INFO:__main__:2024-11-05 12:38:42 | Validation | Step: 153800 | Val_loss: 0.791 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:38:51 | Epoch: 0 | Step: 153810 | Dataset: 0-1753984 | Loss: 0.489 | 913 ms/step , 6891.63 GFLOP/s , 15284.8 tokens/s INFO:__main__:2024-11-05 12:39:00 | Epoch: 0 | Step: 153820 | Dataset: 0-1754304 | Loss: 0.426 | 912 ms/step , 6896.92 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-05 12:39:09 | Epoch: 0 | Step: 153830 | Dataset: 0-1754624 | Loss: 0.441 | 912 ms/step , 6897.72 GFLOP/s , 17949.7 tokens/s INFO:__main__:2024-11-05 12:39:18 | Epoch: 0 | Step: 153840 | Dataset: 0-1754944 | Loss: 0.382 | 912 ms/step , 6896.59 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 12:39:27 | Epoch: 0 | Step: 153850 | Dataset: 0-1755264 | Loss: 0.391 | 913 ms/step , 6887.67 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 12:39:36 | Epoch: 0 | Step: 153860 | Dataset: 0-1755584 | Loss: 0.593 | 912 ms/step , 6897.30 GFLOP/s , 17947.1 tokens/s INFO:__main__:2024-11-05 12:39:45 | Epoch: 0 | Step: 153870 | Dataset: 0-1755904 | Loss: 0.538 | 913 ms/step , 6886.73 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 12:39:55 | Epoch: 0 | Step: 153880 | Dataset: 0-1756224 | Loss: 0.405 | 912 ms/step , 6896.81 GFLOP/s , 17950.4 tokens/s INFO:__main__:2024-11-05 12:40:04 | Epoch: 0 | Step: 153890 | Dataset: 0-1756544 | Loss: 0.359 | 914 ms/step , 6878.81 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 12:40:13 | Epoch: 0 | Step: 153900 | Dataset: 0-1756864 | Loss: 0.507 | 912 ms/step , 6897.40 GFLOP/s 
, 17950.1 tokens/s INFO:__main__:2024-11-05 12:40:14 | Validation | Step: 153900 | Val_loss: 0.690 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:40:24 | Epoch: 0 | Step: 153910 | Dataset: 0-1757184 | Loss: 0.389 | 912 ms/step , 6897.98 GFLOP/s , 15288.3 tokens/s INFO:__main__:2024-11-05 12:40:33 | Epoch: 0 | Step: 153920 | Dataset: 0-1757504 | Loss: 0.513 | 911 ms/step , 6905.59 GFLOP/s , 17952.4 tokens/s INFO:__main__:2024-11-05 12:40:42 | Epoch: 0 | Step: 153930 | Dataset: 0-1757824 | Loss: 0.488 | 914 ms/step , 6877.88 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 12:40:51 | Epoch: 0 | Step: 153940 | Dataset: 0-1758144 | Loss: 0.531 | 911 ms/step , 6902.08 GFLOP/s , 17961.9 tokens/s INFO:__main__:2024-11-05 12:41:00 | Epoch: 0 | Step: 153950 | Dataset: 0-1758464 | Loss: 0.479 | 911 ms/step , 6903.41 GFLOP/s , 17957.9 tokens/s INFO:__main__:2024-11-05 12:41:09 | Epoch: 0 | Step: 153960 | Dataset: 0-1758784 | Loss: 0.542 | 912 ms/step , 6898.79 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 12:41:18 | Epoch: 0 | Step: 153970 | Dataset: 0-1759104 | Loss: 0.365 | 912 ms/step , 6899.22 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-05 12:41:27 | Epoch: 0 | Step: 153980 | Dataset: 0-1759424 | Loss: 0.335 | 912 ms/step , 6895.97 GFLOP/s , 17954.6 tokens/s INFO:__main__:2024-11-05 12:41:37 | Epoch: 0 | Step: 153990 | Dataset: 0-1759744 | Loss: 0.327 | 912 ms/step , 6894.03 GFLOP/s , 17959.8 tokens/s INFO:__main__:2024-11-05 12:41:46 | Epoch: 0 | Step: 154000 | Dataset: 0-1760064 | Loss: 0.460 | 911 ms/step , 6901.93 GFLOP/s , 17955.1 tokens/s INFO:__main__:2024-11-05 12:41:47 | Validation | Step: 154000 | Val_loss: 0.784 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:41:47 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_124147_step_154000.pt` INFO:__main__:2024-11-05 12:41:58 | Epoch: 0 | Step: 154010 | Dataset: 0-1760384 | Loss: 0.374 | 912 ms/step , 6893.30 GFLOP/s , 13757.7 tokens/s 
INFO:__main__:2024-11-05 12:42:07 | Epoch: 0 | Step: 154020 | Dataset: 0-1760704 | Loss: 0.389 | 914 ms/step , 6883.51 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 12:42:16 | Epoch: 0 | Step: 154030 | Dataset: 0-1761024 | Loss: 0.281 | 912 ms/step , 6895.71 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-05 12:42:25 | Epoch: 0 | Step: 154040 | Dataset: 0-1761344 | Loss: 0.384 | 913 ms/step , 6890.76 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 12:42:34 | Epoch: 0 | Step: 154050 | Dataset: 0-1761664 | Loss: 0.535 | 914 ms/step , 6884.57 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 12:42:43 | Epoch: 0 | Step: 154060 | Dataset: 0-1761984 | Loss: 0.746 | 915 ms/step , 6874.78 GFLOP/s , 17907.1 tokens/s INFO:__main__:2024-11-05 12:42:52 | Epoch: 0 | Step: 154070 | Dataset: 0-1762304 | Loss: 0.717 | 913 ms/step , 6890.81 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 12:43:02 | Epoch: 0 | Step: 154080 | Dataset: 0-1762624 | Loss: 0.764 | 914 ms/step , 6878.02 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 12:43:11 | Epoch: 0 | Step: 154090 | Dataset: 0-1762944 | Loss: 0.718 | 914 ms/step , 6880.65 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 12:43:20 | Epoch: 0 | Step: 154100 | Dataset: 0-1763264 | Loss: 0.705 | 913 ms/step , 6885.49 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 12:43:21 | Validation | Step: 154100 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:43:31 | Epoch: 0 | Step: 154110 | Dataset: 0-1763584 | Loss: 0.639 | 912 ms/step , 6895.22 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-05 12:43:40 | Epoch: 0 | Step: 154120 | Dataset: 0-1763904 | Loss: 0.818 | 914 ms/step , 6879.18 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 12:43:49 | Epoch: 0 | Step: 154130 | Dataset: 0-1764224 | Loss: 0.835 | 913 ms/step , 6888.90 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 12:43:58 | Epoch: 0 | Step: 154140 | Dataset: 0-1764544 | Loss: 0.734 | 913 ms/step , 6885.23 GFLOP/s , 17933.0 
tokens/s INFO:__main__:2024-11-05 12:44:07 | Epoch: 0 | Step: 154150 | Dataset: 0-1764864 | Loss: 0.696 | 912 ms/step , 6899.40 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 12:44:16 | Epoch: 0 | Step: 154160 | Dataset: 0-1765184 | Loss: 0.698 | 912 ms/step , 6896.22 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 12:44:25 | Epoch: 0 | Step: 154170 | Dataset: 0-1765504 | Loss: 0.668 | 914 ms/step , 6882.09 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 12:44:34 | Epoch: 0 | Step: 154180 | Dataset: 0-1765824 | Loss: 0.637 | 913 ms/step , 6887.20 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 12:44:44 | Epoch: 0 | Step: 154190 | Dataset: 0-1766144 | Loss: 0.793 | 912 ms/step , 6897.74 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 12:44:53 | Epoch: 0 | Step: 154200 | Dataset: 0-1766464 | Loss: 0.734 | 912 ms/step , 6893.14 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 12:44:54 | Validation | Step: 154200 | Val_loss: 0.643 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:45:03 | Epoch: 0 | Step: 154210 | Dataset: 0-1766784 | Loss: 0.786 | 913 ms/step , 6889.70 GFLOP/s , 15277.6 tokens/s INFO:__main__:2024-11-05 12:45:13 | Epoch: 0 | Step: 154220 | Dataset: 0-1767104 | Loss: 0.776 | 913 ms/step , 6890.56 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 12:45:22 | Epoch: 0 | Step: 154230 | Dataset: 0-1767424 | Loss: 0.642 | 915 ms/step , 6877.42 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 12:45:31 | Epoch: 0 | Step: 154240 | Dataset: 0-1767744 | Loss: 0.719 | 914 ms/step , 6884.62 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 12:45:40 | Epoch: 0 | Step: 154250 | Dataset: 0-1768064 | Loss: 0.764 | 914 ms/step , 6878.73 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 12:45:49 | Epoch: 0 | Step: 154260 | Dataset: 0-1768384 | Loss: 0.725 | 913 ms/step , 6889.64 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 12:45:58 | Epoch: 0 | Step: 154270 | Dataset: 0-1768704 | Loss: 0.753 | 913 ms/step , 6885.38 GFLOP/s , 
17931.0 tokens/s INFO:__main__:2024-11-05 12:46:07 | Epoch: 0 | Step: 154280 | Dataset: 0-1769024 | Loss: 0.784 | 914 ms/step , 6883.11 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 12:46:17 | Epoch: 0 | Step: 154290 | Dataset: 0-1769344 | Loss: 0.732 | 915 ms/step , 6873.01 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 12:46:26 | Epoch: 0 | Step: 154300 | Dataset: 0-1769664 | Loss: 0.704 | 915 ms/step , 6870.27 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 12:46:27 | Validation | Step: 154300 | Val_loss: 0.712 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:46:36 | Epoch: 0 | Step: 154310 | Dataset: 0-1769984 | Loss: 0.664 | 913 ms/step , 6890.86 GFLOP/s , 15269.6 tokens/s INFO:__main__:2024-11-05 12:46:46 | Epoch: 0 | Step: 154320 | Dataset: 0-1770304 | Loss: 0.720 | 912 ms/step , 6895.82 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 12:46:55 | Epoch: 0 | Step: 154330 | Dataset: 0-1770624 | Loss: 0.642 | 914 ms/step , 6880.44 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 12:47:04 | Epoch: 0 | Step: 154340 | Dataset: 0-1770944 | Loss: 0.709 | 915 ms/step , 6876.64 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 12:47:13 | Epoch: 0 | Step: 154350 | Dataset: 0-1771264 | Loss: 0.814 | 915 ms/step , 6875.71 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 12:47:22 | Epoch: 0 | Step: 154360 | Dataset: 0-1771584 | Loss: 0.688 | 913 ms/step , 6887.86 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 12:47:31 | Epoch: 0 | Step: 154370 | Dataset: 0-1771904 | Loss: 0.758 | 912 ms/step , 6893.37 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 12:47:40 | Epoch: 0 | Step: 154380 | Dataset: 0-1772224 | Loss: 0.681 | 913 ms/step , 6890.06 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 12:47:50 | Epoch: 0 | Step: 154390 | Dataset: 0-1772544 | Loss: 0.780 | 913 ms/step , 6891.33 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 12:47:59 | Epoch: 0 | Step: 154400 | Dataset: 0-1772864 | Loss: 0.765 | 913 ms/step , 6892.44 GFLOP/s 
, 17929.4 tokens/s INFO:__main__:2024-11-05 12:48:00 | Validation | Step: 154400 | Val_loss: 0.711 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:48:09 | Epoch: 0 | Step: 154410 | Dataset: 0-1773184 | Loss: 0.778 | 913 ms/step , 6886.15 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 12:48:19 | Epoch: 0 | Step: 154420 | Dataset: 0-1773504 | Loss: 0.802 | 914 ms/step , 6883.97 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 12:48:28 | Epoch: 0 | Step: 154430 | Dataset: 0-1773824 | Loss: 0.784 | 913 ms/step , 6892.56 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 12:48:37 | Epoch: 0 | Step: 154440 | Dataset: 0-1774144 | Loss: 0.584 | 914 ms/step , 6884.02 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 12:48:46 | Epoch: 0 | Step: 154450 | Dataset: 0-1774464 | Loss: 0.743 | 914 ms/step , 6882.29 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 12:48:55 | Epoch: 0 | Step: 154460 | Dataset: 0-1774784 | Loss: 0.726 | 915 ms/step , 6874.11 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 12:49:04 | Epoch: 0 | Step: 154470 | Dataset: 0-1775104 | Loss: 0.812 | 913 ms/step , 6889.12 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 12:49:13 | Epoch: 0 | Step: 154480 | Dataset: 0-1775424 | Loss: 0.717 | 914 ms/step , 6880.73 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 12:49:23 | Epoch: 0 | Step: 154490 | Dataset: 0-1775744 | Loss: 0.718 | 913 ms/step , 6889.63 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 12:49:32 | Epoch: 0 | Step: 154500 | Dataset: 0-1776064 | Loss: 0.680 | 914 ms/step , 6882.36 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 12:49:33 | Validation | Step: 154500 | Val_loss: 0.790 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:49:42 | Epoch: 0 | Step: 154510 | Dataset: 0-1776384 | Loss: 0.721 | 913 ms/step , 6890.88 GFLOP/s , 15288.3 tokens/s INFO:__main__:2024-11-05 12:49:52 | Epoch: 0 | Step: 154520 | Dataset: 0-1776704 | Loss: 0.768 | 914 ms/step , 6883.98 GFLOP/s , 17928.5 tokens/s 
INFO:__main__:2024-11-05 12:50:01 | Epoch: 0 | Step: 154530 | Dataset: 0-1777024 | Loss: 0.766 | 914 ms/step , 6882.90 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 12:50:10 | Epoch: 0 | Step: 154540 | Dataset: 0-1777344 | Loss: 0.725 | 915 ms/step , 6877.35 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 12:50:19 | Epoch: 0 | Step: 154550 | Dataset: 0-1777664 | Loss: 0.800 | 913 ms/step , 6889.67 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 12:50:28 | Epoch: 0 | Step: 154560 | Dataset: 0-1777984 | Loss: 0.800 | 914 ms/step , 6883.66 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 12:50:37 | Epoch: 0 | Step: 154570 | Dataset: 0-1778304 | Loss: 0.686 | 913 ms/step , 6889.24 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 12:50:46 | Epoch: 0 | Step: 154580 | Dataset: 0-1778624 | Loss: 0.696 | 912 ms/step , 6893.75 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 12:50:55 | Epoch: 0 | Step: 154590 | Dataset: 0-1778944 | Loss: 0.681 | 913 ms/step , 6885.16 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 12:51:05 | Epoch: 0 | Step: 154600 | Dataset: 0-1779264 | Loss: 0.784 | 914 ms/step , 6878.38 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 12:51:06 | Validation | Step: 154600 | Val_loss: 0.721 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:51:15 | Epoch: 0 | Step: 154610 | Dataset: 0-1779584 | Loss: 0.812 | 914 ms/step , 6881.07 GFLOP/s , 15267.9 tokens/s INFO:__main__:2024-11-05 12:51:24 | Epoch: 0 | Step: 154620 | Dataset: 0-1779904 | Loss: 0.737 | 914 ms/step , 6884.08 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 12:51:34 | Epoch: 0 | Step: 154630 | Dataset: 0-1780224 | Loss: 0.773 | 916 ms/step , 6869.24 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 12:51:43 | Epoch: 0 | Step: 154640 | Dataset: 0-1780544 | Loss: 0.590 | 914 ms/step , 6884.69 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 12:51:52 | Epoch: 0 | Step: 154650 | Dataset: 0-1780864 | Loss: 0.625 | 912 ms/step , 6892.63 GFLOP/s , 17917.0 
tokens/s INFO:__main__:2024-11-05 12:52:01 | Epoch: 0 | Step: 154660 | Dataset: 0-1781184 | Loss: 0.651 | 915 ms/step , 6877.22 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 12:52:10 | Epoch: 0 | Step: 154670 | Dataset: 0-1781504 | Loss: 0.626 | 913 ms/step , 6885.55 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 12:52:19 | Epoch: 0 | Step: 154680 | Dataset: 0-1781824 | Loss: 0.762 | 915 ms/step , 6877.32 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 12:52:28 | Epoch: 0 | Step: 154690 | Dataset: 0-1782144 | Loss: 0.728 | 912 ms/step , 6895.87 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 12:52:38 | Epoch: 0 | Step: 154700 | Dataset: 0-1782464 | Loss: 0.705 | 913 ms/step , 6891.57 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 12:52:39 | Validation | Step: 154700 | Val_loss: 0.682 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:52:48 | Epoch: 0 | Step: 154710 | Dataset: 0-1782784 | Loss: 0.787 | 913 ms/step , 6886.84 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-05 12:52:57 | Epoch: 0 | Step: 154720 | Dataset: 0-1783104 | Loss: 0.627 | 913 ms/step , 6889.48 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 12:53:07 | Epoch: 0 | Step: 154730 | Dataset: 0-1783424 | Loss: 0.725 | 913 ms/step , 6890.02 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 12:53:16 | Epoch: 0 | Step: 154740 | Dataset: 0-1783744 | Loss: 0.682 | 913 ms/step , 6889.99 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 12:53:25 | Epoch: 0 | Step: 154750 | Dataset: 0-1784064 | Loss: 0.752 | 913 ms/step , 6892.49 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 12:53:34 | Epoch: 0 | Step: 154760 | Dataset: 0-1784384 | Loss: 0.706 | 914 ms/step , 6883.20 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 12:53:43 | Epoch: 0 | Step: 154770 | Dataset: 0-1784704 | Loss: 0.749 | 913 ms/step , 6890.46 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 12:53:52 | Epoch: 0 | Step: 154780 | Dataset: 0-1785024 | Loss: 0.744 | 913 ms/step , 6889.88 GFLOP/s , 
17930.0 tokens/s INFO:__main__:2024-11-05 12:54:01 | Epoch: 0 | Step: 154790 | Dataset: 0-1785344 | Loss: 0.697 | 913 ms/step , 6887.41 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 12:54:11 | Epoch: 0 | Step: 154800 | Dataset: 0-1785664 | Loss: 0.710 | 912 ms/step , 6896.24 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 12:54:12 | Validation | Step: 154800 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:54:21 | Epoch: 0 | Step: 154810 | Dataset: 0-1785984 | Loss: 0.704 | 913 ms/step , 6889.56 GFLOP/s , 15273.2 tokens/s INFO:__main__:2024-11-05 12:54:30 | Epoch: 0 | Step: 154820 | Dataset: 0-1786304 | Loss: 0.702 | 914 ms/step , 6879.56 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 12:54:40 | Epoch: 0 | Step: 154830 | Dataset: 0-1786624 | Loss: 0.726 | 914 ms/step , 6878.66 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 12:54:49 | Epoch: 0 | Step: 154840 | Dataset: 0-1786944 | Loss: 0.667 | 912 ms/step , 6894.43 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 12:54:58 | Epoch: 0 | Step: 154850 | Dataset: 0-1787264 | Loss: 0.666 | 913 ms/step , 6890.21 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 12:55:07 | Epoch: 0 | Step: 154860 | Dataset: 0-1787584 | Loss: 0.724 | 914 ms/step , 6883.65 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 12:55:16 | Epoch: 0 | Step: 154870 | Dataset: 0-1787904 | Loss: 0.766 | 914 ms/step , 6880.48 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 12:55:25 | Epoch: 0 | Step: 154880 | Dataset: 0-1788224 | Loss: 0.701 | 915 ms/step , 6872.47 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 12:55:34 | Epoch: 0 | Step: 154890 | Dataset: 0-1788544 | Loss: 0.659 | 913 ms/step , 6890.07 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 12:55:44 | Epoch: 0 | Step: 154900 | Dataset: 0-1788864 | Loss: 0.680 | 914 ms/step , 6880.10 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 12:55:45 | Validation | Step: 154900 | Val_loss: 0.719 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 12:55:54 | Epoch: 0 | Step: 154910 | Dataset: 0-1789184 | Loss: 0.788 | 913 ms/step , 6887.25 GFLOP/s , 15264.5 tokens/s INFO:__main__:2024-11-05 12:56:03 | Epoch: 0 | Step: 154920 | Dataset: 0-1789504 | Loss: 0.655 | 913 ms/step , 6889.12 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 12:56:13 | Epoch: 0 | Step: 154930 | Dataset: 0-1789824 | Loss: 0.700 | 914 ms/step , 6879.79 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 12:56:22 | Epoch: 0 | Step: 154940 | Dataset: 0-1790144 | Loss: 0.664 | 912 ms/step , 6896.35 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 12:56:31 | Epoch: 0 | Step: 154950 | Dataset: 0-1790464 | Loss: 0.809 | 913 ms/step , 6891.81 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 12:56:40 | Epoch: 0 | Step: 154960 | Dataset: 0-1790784 | Loss: 0.676 | 913 ms/step , 6888.21 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 12:56:49 | Epoch: 0 | Step: 154970 | Dataset: 0-1791104 | Loss: 0.789 | 915 ms/step , 6874.56 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 12:56:58 | Epoch: 0 | Step: 154980 | Dataset: 0-1791424 | Loss: 0.624 | 914 ms/step , 6883.16 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 12:57:07 | Epoch: 0 | Step: 154990 | Dataset: 0-1791744 | Loss: 0.721 | 915 ms/step , 6876.89 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 12:57:17 | Epoch: 0 | Step: 155000 | Dataset: 0-1792064 | Loss: 0.651 | 913 ms/step , 6887.02 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 12:57:18 | Validation | Step: 155000 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:57:18 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_125718_step_155000.pt` INFO:__main__:2024-11-05 12:57:28 | Epoch: 0 | Step: 155010 | Dataset: 0-1792384 | Loss: 0.595 | 914 ms/step , 6878.33 GFLOP/s , 13749.6 tokens/s INFO:__main__:2024-11-05 12:57:38 | Epoch: 0 | Step: 155020 | Dataset: 0-1792704 | Loss: 0.661 | 914 ms/step , 6882.32 GFLOP/s , 17929.5 tokens/s 
INFO:__main__:2024-11-05 12:57:47 | Epoch: 0 | Step: 155030 | Dataset: 0-1793024 | Loss: 0.814 | 913 ms/step , 6889.74 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 12:57:56 | Epoch: 0 | Step: 155040 | Dataset: 0-1793344 | Loss: 0.630 | 913 ms/step , 6891.13 GFLOP/s , 17892.7 tokens/s INFO:__main__:2024-11-05 12:58:05 | Epoch: 0 | Step: 155050 | Dataset: 0-1793664 | Loss: 0.805 | 913 ms/step , 6885.76 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 12:58:14 | Epoch: 0 | Step: 155060 | Dataset: 0-1793984 | Loss: 0.739 | 913 ms/step , 6885.89 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 12:58:23 | Epoch: 0 | Step: 155070 | Dataset: 0-1794304 | Loss: 0.761 | 914 ms/step , 6881.15 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 12:58:32 | Epoch: 0 | Step: 155080 | Dataset: 0-1794624 | Loss: 0.681 | 914 ms/step , 6881.73 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 12:58:42 | Epoch: 0 | Step: 155090 | Dataset: 0-1794944 | Loss: 0.752 | 913 ms/step , 6885.61 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 12:58:51 | Epoch: 0 | Step: 155100 | Dataset: 0-1795264 | Loss: 0.809 | 914 ms/step , 6880.77 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 12:58:52 | Validation | Step: 155100 | Val_loss: 0.749 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 12:59:01 | Epoch: 0 | Step: 155110 | Dataset: 0-1795584 | Loss: 0.638 | 913 ms/step , 6891.20 GFLOP/s , 15277.6 tokens/s INFO:__main__:2024-11-05 12:59:11 | Epoch: 0 | Step: 155120 | Dataset: 0-1795904 | Loss: 0.730 | 912 ms/step , 6894.65 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 12:59:20 | Epoch: 0 | Step: 155130 | Dataset: 0-1796224 | Loss: 0.703 | 913 ms/step , 6887.65 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 12:59:29 | Epoch: 0 | Step: 155140 | Dataset: 0-1796544 | Loss: 0.689 | 914 ms/step , 6881.32 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 12:59:38 | Epoch: 0 | Step: 155150 | Dataset: 0-1796864 | Loss: 0.580 | 913 ms/step , 6892.38 GFLOP/s , 17924.9 
tokens/s INFO:__main__:2024-11-05 12:59:47 | Epoch: 0 | Step: 155160 | Dataset: 0-1797184 | Loss: 0.693 | 914 ms/step , 6880.27 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 12:59:56 | Epoch: 0 | Step: 155170 | Dataset: 0-1797504 | Loss: 0.524 | 913 ms/step , 6885.80 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 13:00:05 | Epoch: 0 | Step: 155180 | Dataset: 0-1797824 | Loss: 0.721 | 914 ms/step , 6884.76 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 13:00:15 | Epoch: 0 | Step: 155190 | Dataset: 0-1798144 | Loss: 0.775 | 914 ms/step , 6882.52 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 13:00:24 | Epoch: 0 | Step: 155200 | Dataset: 0-1798464 | Loss: 0.770 | 913 ms/step , 6886.95 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 13:00:25 | Validation | Step: 155200 | Val_loss: 0.687 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:00:34 | Epoch: 0 | Step: 155210 | Dataset: 0-1798784 | Loss: 0.643 | 914 ms/step , 6884.13 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 13:00:44 | Epoch: 0 | Step: 155220 | Dataset: 0-1799104 | Loss: 0.685 | 915 ms/step , 6876.58 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 13:00:53 | Epoch: 0 | Step: 155230 | Dataset: 0-1799424 | Loss: 0.753 | 916 ms/step , 6868.90 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 13:01:02 | Epoch: 0 | Step: 155240 | Dataset: 0-1799744 | Loss: 0.641 | 914 ms/step , 6882.96 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 13:01:11 | Epoch: 0 | Step: 155250 | Dataset: 0-1800064 | Loss: 0.784 | 912 ms/step , 6894.69 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 13:01:20 | Epoch: 0 | Step: 155260 | Dataset: 0-1800384 | Loss: 0.672 | 914 ms/step , 6879.77 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 13:01:29 | Epoch: 0 | Step: 155270 | Dataset: 0-1800704 | Loss: 0.606 | 914 ms/step , 6882.81 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 13:01:38 | Epoch: 0 | Step: 155280 | Dataset: 0-1801024 | Loss: 0.771 | 913 ms/step , 6887.93 GFLOP/s , 
17910.4 tokens/s INFO:__main__:2024-11-05 13:01:48 | Epoch: 0 | Step: 155290 | Dataset: 0-1801344 | Loss: 0.731 | 914 ms/step , 6879.76 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 13:01:57 | Epoch: 0 | Step: 155300 | Dataset: 0-1801664 | Loss: 0.758 | 913 ms/step , 6885.30 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 13:01:58 | Validation | Step: 155300 | Val_loss: 0.731 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:02:07 | Epoch: 0 | Step: 155310 | Dataset: 0-1801984 | Loss: 0.772 | 914 ms/step , 6879.47 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-05 13:02:17 | Epoch: 0 | Step: 155320 | Dataset: 0-1802304 | Loss: 0.582 | 915 ms/step , 6876.43 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 13:02:26 | Epoch: 0 | Step: 155330 | Dataset: 0-1802624 | Loss: 0.771 | 913 ms/step , 6892.57 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 13:02:35 | Epoch: 0 | Step: 155340 | Dataset: 0-1802944 | Loss: 0.687 | 913 ms/step , 6887.02 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 13:02:44 | Epoch: 0 | Step: 155350 | Dataset: 0-1803264 | Loss: 0.853 | 912 ms/step , 6892.95 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 13:02:53 | Epoch: 0 | Step: 155360 | Dataset: 0-1803584 | Loss: 0.780 | 914 ms/step , 6884.57 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 13:03:02 | Epoch: 0 | Step: 155370 | Dataset: 0-1803904 | Loss: 0.910 | 914 ms/step , 6881.17 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 13:03:11 | Epoch: 0 | Step: 155380 | Dataset: 0-1804224 | Loss: 0.788 | 914 ms/step , 6879.62 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 13:03:21 | Epoch: 0 | Step: 155390 | Dataset: 0-1804544 | Loss: 0.722 | 913 ms/step , 6888.60 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 13:03:30 | Epoch: 0 | Step: 155400 | Dataset: 0-1804864 | Loss: 0.772 | 914 ms/step , 6880.89 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 13:03:31 | Validation | Step: 155400 | Val_loss: 0.762 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 13:03:40 | Epoch: 0 | Step: 155410 | Dataset: 0-1805184 | Loss: 0.676 | 913 ms/step , 6887.16 GFLOP/s , 15279.0 tokens/s INFO:__main__:2024-11-05 13:03:50 | Epoch: 0 | Step: 155420 | Dataset: 0-1805504 | Loss: 0.861 | 915 ms/step , 6870.24 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 13:03:59 | Epoch: 0 | Step: 155430 | Dataset: 0-1805824 | Loss: 0.732 | 914 ms/step , 6878.62 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 13:04:08 | Epoch: 0 | Step: 155440 | Dataset: 0-1806144 | Loss: 0.834 | 914 ms/step , 6879.66 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 13:04:17 | Epoch: 0 | Step: 155450 | Dataset: 0-1806464 | Loss: 0.660 | 914 ms/step , 6883.68 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 13:04:26 | Epoch: 0 | Step: 155460 | Dataset: 0-1806784 | Loss: 0.813 | 914 ms/step , 6883.72 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 13:04:35 | Epoch: 0 | Step: 155470 | Dataset: 0-1807104 | Loss: 0.657 | 913 ms/step , 6886.78 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 13:04:44 | Epoch: 0 | Step: 155480 | Dataset: 0-1807424 | Loss: 0.683 | 913 ms/step , 6888.24 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 13:04:54 | Epoch: 0 | Step: 155490 | Dataset: 0-1807744 | Loss: 0.693 | 913 ms/step , 6891.44 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 13:05:03 | Epoch: 0 | Step: 155500 | Dataset: 0-1808064 | Loss: 0.783 | 913 ms/step , 6885.13 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 13:05:04 | Validation | Step: 155500 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:05:13 | Epoch: 0 | Step: 155510 | Dataset: 0-1808384 | Loss: 0.565 | 913 ms/step , 6887.92 GFLOP/s , 15273.2 tokens/s INFO:__main__:2024-11-05 13:05:23 | Epoch: 0 | Step: 155520 | Dataset: 0-1808704 | Loss: 0.715 | 913 ms/step , 6885.40 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 13:05:32 | Epoch: 0 | Step: 155530 | Dataset: 0-1809024 | Loss: 0.775 | 915 ms/step , 6876.70 GFLOP/s , 17930.1 
tokens/s INFO:__main__:2024-11-05 13:05:41 | Epoch: 0 | Step: 155540 | Dataset: 0-1809344 | Loss: 0.722 | 913 ms/step , 6888.17 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 13:05:50 | Epoch: 0 | Step: 155550 | Dataset: 0-1809664 | Loss: 0.794 | 913 ms/step , 6890.53 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 13:05:59 | Epoch: 0 | Step: 155560 | Dataset: 0-1809984 | Loss: 0.725 | 913 ms/step , 6890.70 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 13:06:08 | Epoch: 0 | Step: 155570 | Dataset: 0-1810304 | Loss: 0.606 | 912 ms/step , 6897.34 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 13:06:17 | Epoch: 0 | Step: 155580 | Dataset: 0-1810624 | Loss: 0.648 | 913 ms/step , 6887.00 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 13:06:26 | Epoch: 0 | Step: 155590 | Dataset: 0-1810944 | Loss: 0.767 | 914 ms/step , 6884.34 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 13:06:36 | Epoch: 0 | Step: 155600 | Dataset: 0-1811264 | Loss: 0.825 | 913 ms/step , 6885.34 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 13:06:37 | Validation | Step: 155600 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:06:46 | Epoch: 0 | Step: 155610 | Dataset: 0-1811584 | Loss: 0.808 | 913 ms/step , 6886.74 GFLOP/s , 15263.6 tokens/s INFO:__main__:2024-11-05 13:06:56 | Epoch: 0 | Step: 155620 | Dataset: 0-1811904 | Loss: 0.801 | 914 ms/step , 6881.11 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 13:07:05 | Epoch: 0 | Step: 155630 | Dataset: 0-1812224 | Loss: 0.683 | 913 ms/step , 6885.84 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 13:07:14 | Epoch: 0 | Step: 155640 | Dataset: 0-1812544 | Loss: 0.675 | 914 ms/step , 6884.35 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 13:07:23 | Epoch: 0 | Step: 155650 | Dataset: 0-1812864 | Loss: 0.828 | 914 ms/step , 6880.87 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 13:07:32 | Epoch: 0 | Step: 155660 | Dataset: 0-1813184 | Loss: 0.760 | 914 ms/step , 6882.82 GFLOP/s , 
17928.3 tokens/s INFO:__main__:2024-11-05 13:07:41 | Epoch: 0 | Step: 155670 | Dataset: 0-1813504 | Loss: 0.696 | 914 ms/step , 6883.70 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 13:07:50 | Epoch: 0 | Step: 155680 | Dataset: 0-1813824 | Loss: 0.597 | 914 ms/step , 6883.24 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 13:07:59 | Epoch: 0 | Step: 155690 | Dataset: 0-1814144 | Loss: 0.820 | 914 ms/step , 6883.45 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 13:08:09 | Epoch: 0 | Step: 155700 | Dataset: 0-1814464 | Loss: 0.875 | 914 ms/step , 6881.33 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 13:08:10 | Validation | Step: 155700 | Val_loss: 0.703 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:08:19 | Epoch: 0 | Step: 155710 | Dataset: 0-1814784 | Loss: 0.824 | 913 ms/step , 6891.08 GFLOP/s , 15271.0 tokens/s INFO:__main__:2024-11-05 13:08:28 | Epoch: 0 | Step: 155720 | Dataset: 0-1815104 | Loss: 0.700 | 913 ms/step , 6890.57 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 13:08:38 | Epoch: 0 | Step: 155730 | Dataset: 0-1815424 | Loss: 0.763 | 913 ms/step , 6891.70 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 13:08:47 | Epoch: 0 | Step: 155740 | Dataset: 0-1815744 | Loss: 0.617 | 913 ms/step , 6892.49 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 13:08:56 | Epoch: 0 | Step: 155750 | Dataset: 0-1816064 | Loss: 0.802 | 915 ms/step , 6872.59 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 13:09:05 | Epoch: 0 | Step: 155760 | Dataset: 0-1816384 | Loss: 0.716 | 914 ms/step , 6883.20 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 13:09:14 | Epoch: 0 | Step: 155770 | Dataset: 0-1816704 | Loss: 0.743 | 913 ms/step , 6886.91 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 13:09:23 | Epoch: 0 | Step: 155780 | Dataset: 0-1817024 | Loss: 0.793 | 913 ms/step , 6887.87 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 13:09:32 | Epoch: 0 | Step: 155790 | Dataset: 0-1817344 | Loss: 0.716 | 912 ms/step , 6893.92 GFLOP/s 
, 17923.7 tokens/s INFO:__main__:2024-11-05 13:09:42 | Epoch: 0 | Step: 155800 | Dataset: 0-1817664 | Loss: 0.843 | 914 ms/step , 6884.63 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 13:09:43 | Validation | Step: 155800 | Val_loss: 0.770 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:09:52 | Epoch: 0 | Step: 155810 | Dataset: 0-1817984 | Loss: 0.758 | 913 ms/step , 6889.84 GFLOP/s , 15267.9 tokens/s INFO:__main__:2024-11-05 13:10:01 | Epoch: 0 | Step: 155820 | Dataset: 0-1818304 | Loss: 0.934 | 914 ms/step , 6878.79 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 13:10:11 | Epoch: 0 | Step: 155830 | Dataset: 0-1818624 | Loss: 0.791 | 914 ms/step , 6880.53 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 13:10:20 | Epoch: 0 | Step: 155840 | Dataset: 0-1818944 | Loss: 0.758 | 914 ms/step , 6882.58 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 13:10:29 | Epoch: 0 | Step: 155850 | Dataset: 0-1819264 | Loss: 0.695 | 914 ms/step , 6883.18 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 13:10:38 | Epoch: 0 | Step: 155860 | Dataset: 0-1819584 | Loss: 0.799 | 914 ms/step , 6884.76 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 13:10:47 | Epoch: 0 | Step: 155870 | Dataset: 0-1819904 | Loss: 0.777 | 912 ms/step , 6894.95 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 13:10:56 | Epoch: 0 | Step: 155880 | Dataset: 0-1820224 | Loss: 0.739 | 913 ms/step , 6890.55 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 13:11:05 | Epoch: 0 | Step: 155890 | Dataset: 0-1820544 | Loss: 0.715 | 913 ms/step , 6890.27 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 13:11:15 | Epoch: 0 | Step: 155900 | Dataset: 0-1820864 | Loss: 0.755 | 913 ms/step , 6886.51 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 13:11:16 | Validation | Step: 155900 | Val_loss: 0.759 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:11:25 | Epoch: 0 | Step: 155910 | Dataset: 0-1821184 | Loss: 0.864 | 914 ms/step , 6879.30 GFLOP/s , 15276.5 tokens/s 
INFO:__main__:2024-11-05 13:11:34 | Epoch: 0 | Step: 155920 | Dataset: 0-1821504 | Loss: 0.649 | 912 ms/step , 6893.16 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 13:11:44 | Epoch: 0 | Step: 155930 | Dataset: 0-1821824 | Loss: 0.786 | 913 ms/step , 6892.51 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 13:11:53 | Epoch: 0 | Step: 155940 | Dataset: 0-1822144 | Loss: 0.678 | 914 ms/step , 6883.11 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 13:12:02 | Epoch: 0 | Step: 155950 | Dataset: 0-1822464 | Loss: 0.707 | 913 ms/step , 6886.79 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 13:12:11 | Epoch: 0 | Step: 155960 | Dataset: 0-1822784 | Loss: 0.784 | 914 ms/step , 6882.99 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 13:12:20 | Epoch: 0 | Step: 155970 | Dataset: 0-1823104 | Loss: 0.744 | 914 ms/step , 6881.66 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 13:12:29 | Epoch: 0 | Step: 155980 | Dataset: 0-1823424 | Loss: 0.760 | 913 ms/step , 6885.13 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 13:12:38 | Epoch: 0 | Step: 155990 | Dataset: 0-1823744 | Loss: 0.653 | 913 ms/step , 6890.57 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 13:12:48 | Epoch: 0 | Step: 156000 | Dataset: 0-1824064 | Loss: 0.736 | 912 ms/step , 6893.93 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 13:12:49 | Validation | Step: 156000 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:12:49 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_131249_step_156000.pt` INFO:__main__:2024-11-05 13:12:59 | Epoch: 0 | Step: 156010 | Dataset: 0-1824384 | Loss: 0.860 | 915 ms/step , 6874.14 GFLOP/s , 13792.5 tokens/s INFO:__main__:2024-11-05 13:13:09 | Epoch: 0 | Step: 156020 | Dataset: 0-1824704 | Loss: 0.803 | 912 ms/step , 6895.44 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 13:13:18 | Epoch: 0 | Step: 156030 | Dataset: 0-1825024 | Loss: 0.685 | 913 ms/step , 6887.36 GFLOP/s , 17924.3 tokens/s 
INFO:__main__:2024-11-05 13:13:27 | Epoch: 0 | Step: 156040 | Dataset: 0-1825344 | Loss: 0.806 | 914 ms/step , 6880.81 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 13:13:36 | Epoch: 0 | Step: 156050 | Dataset: 0-1825664 | Loss: 0.846 | 913 ms/step , 6887.14 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 13:13:45 | Epoch: 0 | Step: 156060 | Dataset: 0-1825984 | Loss: 0.927 | 915 ms/step , 6877.45 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 13:13:54 | Epoch: 0 | Step: 156070 | Dataset: 0-1826304 | Loss: 0.764 | 913 ms/step , 6887.10 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 13:14:03 | Epoch: 0 | Step: 156080 | Dataset: 0-1826624 | Loss: 0.751 | 913 ms/step , 6889.71 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 13:14:13 | Epoch: 0 | Step: 156090 | Dataset: 0-1826944 | Loss: 0.719 | 914 ms/step , 6882.90 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 13:14:22 | Epoch: 0 | Step: 156100 | Dataset: 0-1827264 | Loss: 0.757 | 914 ms/step , 6881.26 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 13:14:23 | Validation | Step: 156100 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:14:32 | Epoch: 0 | Step: 156110 | Dataset: 0-1827584 | Loss: 0.683 | 913 ms/step , 6885.60 GFLOP/s , 15285.0 tokens/s INFO:__main__:2024-11-05 13:14:42 | Epoch: 0 | Step: 156120 | Dataset: 0-1827904 | Loss: 0.679 | 914 ms/step , 6879.85 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 13:14:51 | Epoch: 0 | Step: 156130 | Dataset: 0-1828224 | Loss: 0.712 | 912 ms/step , 6897.86 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 13:15:00 | Epoch: 0 | Step: 156140 | Dataset: 0-1828544 | Loss: 0.814 | 914 ms/step , 6880.13 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 13:15:09 | Epoch: 0 | Step: 156150 | Dataset: 0-1828864 | Loss: 0.736 | 913 ms/step , 6889.45 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 13:15:18 | Epoch: 0 | Step: 156160 | Dataset: 0-1829184 | Loss: 0.873 | 915 ms/step , 6876.85 GFLOP/s , 17928.8 
tokens/s INFO:__main__:2024-11-05 13:15:27 | Epoch: 0 | Step: 156170 | Dataset: 0-1829504 | Loss: 0.770 | 913 ms/step , 6891.13 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 13:15:36 | Epoch: 0 | Step: 156180 | Dataset: 0-1829824 | Loss: 0.788 | 914 ms/step , 6884.61 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 13:15:46 | Epoch: 0 | Step: 156190 | Dataset: 0-1830144 | Loss: 0.720 | 913 ms/step , 6891.83 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 13:15:55 | Epoch: 0 | Step: 156200 | Dataset: 0-1830464 | Loss: 0.703 | 912 ms/step , 6896.16 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 13:15:56 | Validation | Step: 156200 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:16:05 | Epoch: 0 | Step: 156210 | Dataset: 0-1830784 | Loss: 0.806 | 915 ms/step , 6876.35 GFLOP/s , 15267.6 tokens/s INFO:__main__:2024-11-05 13:16:15 | Epoch: 0 | Step: 156220 | Dataset: 0-1831104 | Loss: 0.719 | 913 ms/step , 6890.67 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 13:16:24 | Epoch: 0 | Step: 156230 | Dataset: 0-1831424 | Loss: 0.647 | 913 ms/step , 6889.09 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 13:16:33 | Epoch: 0 | Step: 156240 | Dataset: 0-1831744 | Loss: 0.853 | 914 ms/step , 6879.96 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 13:16:42 | Epoch: 0 | Step: 156250 | Dataset: 0-1832064 | Loss: 0.663 | 913 ms/step , 6887.93 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 13:16:51 | Epoch: 0 | Step: 156260 | Dataset: 0-1832384 | Loss: 0.763 | 913 ms/step , 6885.68 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 13:17:00 | Epoch: 0 | Step: 156270 | Dataset: 0-1832704 | Loss: 0.796 | 915 ms/step , 6876.84 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 13:17:09 | Epoch: 0 | Step: 156280 | Dataset: 0-1833024 | Loss: 0.748 | 917 ms/step , 6861.62 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 13:17:19 | Epoch: 0 | Step: 156290 | Dataset: 0-1833344 | Loss: 0.663 | 912 ms/step , 6894.54 GFLOP/s , 
17934.3 tokens/s INFO:__main__:2024-11-05 13:17:28 | Epoch: 0 | Step: 156300 | Dataset: 0-1833664 | Loss: 0.696 | 914 ms/step , 6883.59 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 13:17:29 | Validation | Step: 156300 | Val_loss: 0.696 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:17:38 | Epoch: 0 | Step: 156310 | Dataset: 0-1833984 | Loss: 0.868 | 914 ms/step , 6884.31 GFLOP/s , 15279.0 tokens/s INFO:__main__:2024-11-05 13:17:48 | Epoch: 0 | Step: 156320 | Dataset: 0-1834304 | Loss: 0.762 | 913 ms/step , 6888.73 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 13:17:57 | Epoch: 0 | Step: 156330 | Dataset: 0-1834624 | Loss: 0.739 | 913 ms/step , 6890.68 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 13:18:06 | Epoch: 0 | Step: 156340 | Dataset: 0-1834944 | Loss: 0.818 | 912 ms/step , 6895.29 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 13:18:15 | Epoch: 0 | Step: 156350 | Dataset: 0-1835264 | Loss: 0.884 | 916 ms/step , 6867.66 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 13:18:24 | Epoch: 0 | Step: 156360 | Dataset: 0-1835584 | Loss: 0.743 | 914 ms/step , 6879.02 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 13:18:33 | Epoch: 0 | Step: 156370 | Dataset: 0-1835904 | Loss: 0.745 | 914 ms/step , 6884.16 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 13:18:42 | Epoch: 0 | Step: 156380 | Dataset: 0-1836224 | Loss: 0.694 | 913 ms/step , 6887.72 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 13:18:51 | Epoch: 0 | Step: 156390 | Dataset: 0-1836544 | Loss: 0.749 | 914 ms/step , 6884.63 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 13:19:01 | Epoch: 0 | Step: 156400 | Dataset: 0-1836864 | Loss: 0.728 | 913 ms/step , 6887.75 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 13:19:02 | Validation | Step: 156400 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:19:11 | Epoch: 0 | Step: 156410 | Dataset: 0-1837184 | Loss: 0.738 | 914 ms/step , 6884.19 GFLOP/s , 15275.3 tokens/s 
INFO:__main__:2024-11-05 13:19:20 | Epoch: 0 | Step: 156420 | Dataset: 0-1837504 | Loss: 0.734 | 915 ms/step , 6876.42 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 13:19:30 | Epoch: 0 | Step: 156430 | Dataset: 0-1837824 | Loss: 0.702 | 914 ms/step , 6880.38 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 13:19:39 | Epoch: 0 | Step: 156440 | Dataset: 0-1838144 | Loss: 0.792 | 913 ms/step , 6886.93 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 13:19:48 | Epoch: 0 | Step: 156450 | Dataset: 0-1838464 | Loss: 0.645 | 913 ms/step , 6889.80 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 13:19:57 | Epoch: 0 | Step: 156460 | Dataset: 0-1838784 | Loss: 0.744 | 914 ms/step , 6883.59 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 13:20:06 | Epoch: 0 | Step: 156470 | Dataset: 0-1839104 | Loss: 0.668 | 912 ms/step , 6893.34 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 13:20:15 | Epoch: 0 | Step: 156480 | Dataset: 0-1839424 | Loss: 0.695 | 912 ms/step , 6896.71 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 13:20:24 | Epoch: 0 | Step: 156490 | Dataset: 0-1839744 | Loss: 0.711 | 913 ms/step , 6887.78 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 13:20:34 | Epoch: 0 | Step: 156500 | Dataset: 0-1840064 | Loss: 0.833 | 914 ms/step , 6884.13 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 13:20:35 | Validation | Step: 156500 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:20:44 | Epoch: 0 | Step: 156510 | Dataset: 0-1840384 | Loss: 0.654 | 912 ms/step , 6892.87 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-05 13:20:53 | Epoch: 0 | Step: 156520 | Dataset: 0-1840704 | Loss: 0.814 | 913 ms/step , 6889.20 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 13:21:03 | Epoch: 0 | Step: 156530 | Dataset: 0-1841024 | Loss: 0.759 | 912 ms/step , 6895.61 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 13:21:12 | Epoch: 0 | Step: 156540 | Dataset: 0-1841344 | Loss: 0.842 | 913 ms/step , 6890.31 GFLOP/s , 17936.3 
tokens/s INFO:__main__:2024-11-05 13:21:21 | Epoch: 0 | Step: 156550 | Dataset: 0-1841664 | Loss: 0.661 | 913 ms/step , 6887.55 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 13:21:30 | Epoch: 0 | Step: 156560 | Dataset: 0-1841984 | Loss: 0.698 | 913 ms/step , 6887.52 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 13:21:39 | Epoch: 0 | Step: 156570 | Dataset: 0-1842304 | Loss: 0.606 | 913 ms/step , 6890.92 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 13:21:48 | Epoch: 0 | Step: 156580 | Dataset: 0-1842624 | Loss: 0.846 | 913 ms/step , 6892.12 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 13:21:57 | Epoch: 0 | Step: 156590 | Dataset: 0-1842944 | Loss: 0.750 | 912 ms/step , 6897.29 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 13:22:07 | Epoch: 0 | Step: 156600 | Dataset: 0-1843264 | Loss: 0.792 | 913 ms/step , 6887.10 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 13:22:08 | Validation | Step: 156600 | Val_loss: 0.809 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:22:17 | Epoch: 0 | Step: 156610 | Dataset: 0-1843584 | Loss: 0.744 | 915 ms/step , 6870.36 GFLOP/s , 15289.7 tokens/s INFO:__main__:2024-11-05 13:22:26 | Epoch: 0 | Step: 156620 | Dataset: 0-1843904 | Loss: 0.682 | 913 ms/step , 6890.53 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 13:22:36 | Epoch: 0 | Step: 156630 | Dataset: 0-1844224 | Loss: 0.733 | 915 ms/step , 6873.53 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 13:22:45 | Epoch: 0 | Step: 156640 | Dataset: 0-1844544 | Loss: 0.694 | 913 ms/step , 6888.52 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 13:22:54 | Epoch: 0 | Step: 156650 | Dataset: 0-1844864 | Loss: 0.796 | 913 ms/step , 6887.97 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 13:23:03 | Epoch: 0 | Step: 156660 | Dataset: 0-1845184 | Loss: 0.775 | 913 ms/step , 6886.95 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 13:23:12 | Epoch: 0 | Step: 156670 | Dataset: 0-1845504 | Loss: 0.772 | 913 ms/step , 6889.79 GFLOP/s , 
17930.3 tokens/s INFO:__main__:2024-11-05 13:23:21 | Epoch: 0 | Step: 156680 | Dataset: 0-1845824 | Loss: 0.661 | 914 ms/step , 6879.56 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 13:23:30 | Epoch: 0 | Step: 156690 | Dataset: 0-1846144 | Loss: 0.550 | 914 ms/step , 6884.42 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 13:23:39 | Epoch: 0 | Step: 156700 | Dataset: 0-1846464 | Loss: 0.768 | 914 ms/step , 6883.54 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 13:23:41 | Validation | Step: 156700 | Val_loss: 0.733 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:23:50 | Epoch: 0 | Step: 156710 | Dataset: 0-1846784 | Loss: 0.805 | 912 ms/step , 6893.76 GFLOP/s , 15275.9 tokens/s INFO:__main__:2024-11-05 13:23:59 | Epoch: 0 | Step: 156720 | Dataset: 0-1847104 | Loss: 0.889 | 913 ms/step , 6887.90 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 13:24:08 | Epoch: 0 | Step: 156730 | Dataset: 0-1847424 | Loss: 0.753 | 913 ms/step , 6891.04 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 13:24:18 | Epoch: 0 | Step: 156740 | Dataset: 0-1847744 | Loss: 0.671 | 913 ms/step , 6887.53 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 13:24:27 | Epoch: 0 | Step: 156750 | Dataset: 0-1848064 | Loss: 0.736 | 914 ms/step , 6883.21 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 13:24:36 | Epoch: 0 | Step: 156760 | Dataset: 0-1848384 | Loss: 0.772 | 914 ms/step , 6884.50 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 13:24:45 | Epoch: 0 | Step: 156770 | Dataset: 0-1848704 | Loss: 0.767 | 913 ms/step , 6885.62 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 13:24:54 | Epoch: 0 | Step: 156780 | Dataset: 0-1849024 | Loss: 0.893 | 913 ms/step , 6888.75 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 13:25:03 | Epoch: 0 | Step: 156790 | Dataset: 0-1849344 | Loss: 0.843 | 913 ms/step , 6888.04 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 13:25:12 | Epoch: 0 | Step: 156800 | Dataset: 0-1849664 | Loss: 0.757 | 913 ms/step , 6889.66 GFLOP/s 
, 17932.5 tokens/s INFO:__main__:2024-11-05 13:25:14 | Validation | Step: 156800 | Val_loss: 0.696 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:25:23 | Epoch: 0 | Step: 156810 | Dataset: 0-1849984 | Loss: 0.630 | 912 ms/step , 6896.42 GFLOP/s , 15283.4 tokens/s INFO:__main__:2024-11-05 13:25:32 | Epoch: 0 | Step: 156820 | Dataset: 0-1850304 | Loss: 0.835 | 914 ms/step , 6882.83 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 13:25:41 | Epoch: 0 | Step: 156830 | Dataset: 0-1850624 | Loss: 0.750 | 914 ms/step , 6883.70 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 13:25:51 | Epoch: 0 | Step: 156840 | Dataset: 0-1850944 | Loss: 0.758 | 912 ms/step , 6892.72 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 13:26:00 | Epoch: 0 | Step: 156850 | Dataset: 0-1851264 | Loss: 0.641 | 914 ms/step , 6882.06 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 13:26:09 | Epoch: 0 | Step: 156860 | Dataset: 0-1851584 | Loss: 0.779 | 913 ms/step , 6885.82 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 13:26:18 | Epoch: 0 | Step: 156870 | Dataset: 0-1851904 | Loss: 0.691 | 912 ms/step , 6893.58 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 13:26:27 | Epoch: 0 | Step: 156880 | Dataset: 0-1852224 | Loss: 0.756 | 912 ms/step , 6893.40 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 13:26:36 | Epoch: 0 | Step: 156890 | Dataset: 0-1852544 | Loss: 0.940 | 912 ms/step , 6896.99 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 13:26:45 | Epoch: 0 | Step: 156900 | Dataset: 0-1852864 | Loss: 0.667 | 913 ms/step , 6887.51 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 13:26:47 | Validation | Step: 156900 | Val_loss: 0.777 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:26:56 | Epoch: 0 | Step: 156910 | Dataset: 0-1853184 | Loss: 0.794 | 914 ms/step , 6878.77 GFLOP/s , 15270.7 tokens/s INFO:__main__:2024-11-05 13:27:05 | Epoch: 0 | Step: 156920 | Dataset: 0-1853504 | Loss: 0.744 | 912 ms/step , 6893.19 GFLOP/s , 17928.1 tokens/s 
INFO:__main__:2024-11-05 13:27:14 | Epoch: 0 | Step: 156930 | Dataset: 0-1853824 | Loss: 0.680 | 913 ms/step , 6888.82 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 13:27:24 | Epoch: 0 | Step: 156940 | Dataset: 0-1854144 | Loss: 0.713 | 913 ms/step , 6891.12 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 13:27:33 | Epoch: 0 | Step: 156950 | Dataset: 0-1854464 | Loss: 0.681 | 913 ms/step , 6888.52 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 13:27:42 | Epoch: 0 | Step: 156960 | Dataset: 0-1854784 | Loss: 0.732 | 914 ms/step , 6879.73 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 13:27:51 | Epoch: 0 | Step: 156970 | Dataset: 0-1855104 | Loss: 0.830 | 915 ms/step , 6875.08 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 13:28:00 | Epoch: 0 | Step: 156980 | Dataset: 0-1855424 | Loss: 0.761 | 913 ms/step , 6886.54 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 13:28:09 | Epoch: 0 | Step: 156990 | Dataset: 0-1855744 | Loss: 0.830 | 914 ms/step , 6882.29 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 13:28:18 | Epoch: 0 | Step: 157000 | Dataset: 0-1856064 | Loss: 0.792 | 912 ms/step , 6895.20 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 13:28:20 | Validation | Step: 157000 | Val_loss: 0.766 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:28:20 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_132820_step_157000.pt` INFO:__main__:2024-11-05 13:28:30 | Epoch: 0 | Step: 157010 | Dataset: 0-1856384 | Loss: 0.747 | 913 ms/step , 6885.40 GFLOP/s , 13798.7 tokens/s INFO:__main__:2024-11-05 13:28:39 | Epoch: 0 | Step: 157020 | Dataset: 0-1856704 | Loss: 0.636 | 913 ms/step , 6891.04 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 13:28:48 | Epoch: 0 | Step: 157030 | Dataset: 0-1857024 | Loss: 0.772 | 913 ms/step , 6887.73 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 13:28:58 | Epoch: 0 | Step: 157040 | Dataset: 0-1857344 | Loss: 0.835 | 914 ms/step , 6884.29 GFLOP/s , 17892.9 tokens/s 
INFO:__main__:2024-11-05 13:29:07 | Epoch: 0 | Step: 157050 | Dataset: 0-1857664 | Loss: 0.870 | 914 ms/step , 6882.45 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 13:29:16 | Epoch: 0 | Step: 157060 | Dataset: 0-1857984 | Loss: 0.839 | 914 ms/step , 6878.33 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 13:29:25 | Epoch: 0 | Step: 157070 | Dataset: 0-1858304 | Loss: 0.820 | 914 ms/step , 6884.48 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 13:29:34 | Epoch: 0 | Step: 157080 | Dataset: 0-1858624 | Loss: 0.766 | 914 ms/step , 6880.20 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 13:29:43 | Epoch: 0 | Step: 157090 | Dataset: 0-1858944 | Loss: 0.751 | 913 ms/step , 6888.96 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 13:29:52 | Epoch: 0 | Step: 157100 | Dataset: 0-1859264 | Loss: 0.672 | 913 ms/step , 6889.83 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 13:29:54 | Validation | Step: 157100 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:30:03 | Epoch: 0 | Step: 157110 | Dataset: 0-1859584 | Loss: 0.762 | 913 ms/step , 6887.17 GFLOP/s , 15266.8 tokens/s INFO:__main__:2024-11-05 13:30:12 | Epoch: 0 | Step: 157120 | Dataset: 0-1859904 | Loss: 0.757 | 912 ms/step , 6893.77 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 13:30:21 | Epoch: 0 | Step: 157130 | Dataset: 0-1860224 | Loss: 0.778 | 912 ms/step , 6893.92 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 13:30:31 | Epoch: 0 | Step: 157140 | Dataset: 0-1860544 | Loss: 0.686 | 913 ms/step , 6885.98 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 13:30:40 | Epoch: 0 | Step: 157150 | Dataset: 0-1860864 | Loss: 0.815 | 914 ms/step , 6883.94 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 13:30:49 | Epoch: 0 | Step: 157160 | Dataset: 0-1861184 | Loss: 0.726 | 912 ms/step , 6894.13 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 13:30:58 | Epoch: 0 | Step: 157170 | Dataset: 0-1861504 | Loss: 0.792 | 914 ms/step , 6885.00 GFLOP/s , 17931.9 
tokens/s INFO:__main__:2024-11-05 13:31:07 | Epoch: 0 | Step: 157180 | Dataset: 0-1861824 | Loss: 0.719 | 914 ms/step , 6882.74 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 13:31:16 | Epoch: 0 | Step: 157190 | Dataset: 0-1862144 | Loss: 0.835 | 914 ms/step , 6884.20 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 13:31:25 | Epoch: 0 | Step: 157200 | Dataset: 0-1862464 | Loss: 0.704 | 913 ms/step , 6887.07 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 13:31:27 | Validation | Step: 157200 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:31:36 | Epoch: 0 | Step: 157210 | Dataset: 0-1862784 | Loss: 0.687 | 914 ms/step , 6882.77 GFLOP/s , 15282.5 tokens/s INFO:__main__:2024-11-05 13:31:45 | Epoch: 0 | Step: 157220 | Dataset: 0-1863104 | Loss: 0.723 | 913 ms/step , 6889.30 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 13:31:54 | Epoch: 0 | Step: 157230 | Dataset: 0-1863424 | Loss: 0.691 | 913 ms/step , 6888.02 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 13:32:04 | Epoch: 0 | Step: 157240 | Dataset: 0-1863744 | Loss: 0.626 | 913 ms/step , 6887.95 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 13:32:13 | Epoch: 0 | Step: 157250 | Dataset: 0-1864064 | Loss: 0.686 | 912 ms/step , 6895.59 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 13:32:22 | Epoch: 0 | Step: 157260 | Dataset: 0-1864384 | Loss: 0.701 | 913 ms/step , 6889.11 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 13:32:31 | Epoch: 0 | Step: 157270 | Dataset: 0-1864704 | Loss: 0.773 | 915 ms/step , 6873.06 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 13:32:40 | Epoch: 0 | Step: 157280 | Dataset: 0-1865024 | Loss: 0.627 | 913 ms/step , 6889.91 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 13:32:49 | Epoch: 0 | Step: 157290 | Dataset: 0-1865344 | Loss: 0.743 | 914 ms/step , 6880.57 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 13:32:58 | Epoch: 0 | Step: 157300 | Dataset: 0-1865664 | Loss: 0.690 | 914 ms/step , 6883.57 GFLOP/s , 
17935.7 tokens/s INFO:__main__:2024-11-05 13:33:00 | Validation | Step: 157300 | Val_loss: 0.727 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:33:09 | Epoch: 0 | Step: 157310 | Dataset: 0-1865984 | Loss: 0.711 | 913 ms/step , 6887.56 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-05 13:33:18 | Epoch: 0 | Step: 157320 | Dataset: 0-1866304 | Loss: 0.674 | 913 ms/step , 6887.01 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 13:33:27 | Epoch: 0 | Step: 157330 | Dataset: 0-1866624 | Loss: 0.725 | 914 ms/step , 6882.99 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 13:33:36 | Epoch: 0 | Step: 157340 | Dataset: 0-1866944 | Loss: 0.828 | 914 ms/step , 6880.53 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 13:33:46 | Epoch: 0 | Step: 157350 | Dataset: 0-1867264 | Loss: 0.805 | 913 ms/step , 6886.74 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 13:33:55 | Epoch: 0 | Step: 157360 | Dataset: 0-1867584 | Loss: 0.581 | 913 ms/step , 6888.82 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 13:34:04 | Epoch: 0 | Step: 157370 | Dataset: 0-1867904 | Loss: 0.816 | 913 ms/step , 6887.22 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 13:34:13 | Epoch: 0 | Step: 157380 | Dataset: 0-1868224 | Loss: 0.677 | 912 ms/step , 6899.91 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 13:34:22 | Epoch: 0 | Step: 157390 | Dataset: 0-1868544 | Loss: 0.807 | 914 ms/step , 6883.45 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 13:34:31 | Epoch: 0 | Step: 157400 | Dataset: 0-1868864 | Loss: 0.818 | 914 ms/step , 6882.48 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 13:34:33 | Validation | Step: 157400 | Val_loss: 0.730 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:34:42 | Epoch: 0 | Step: 157410 | Dataset: 0-1869184 | Loss: 0.407 | 913 ms/step , 6891.08 GFLOP/s , 15273.2 tokens/s INFO:__main__:2024-11-05 13:34:51 | Epoch: 0 | Step: 157420 | Dataset: 0-1869504 | Loss: 0.725 | 913 ms/step , 6888.64 GFLOP/s , 17930.3 tokens/s 
INFO:__main__:2024-11-05 13:35:00 | Epoch: 0 | Step: 157430 | Dataset: 0-1869824 | Loss: 0.827 | 914 ms/step , 6880.31 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 13:35:09 | Epoch: 0 | Step: 157440 | Dataset: 0-1870144 | Loss: 0.631 | 914 ms/step , 6882.04 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 13:35:19 | Epoch: 0 | Step: 157450 | Dataset: 0-1870464 | Loss: 0.795 | 914 ms/step , 6883.21 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 13:35:28 | Epoch: 0 | Step: 157460 | Dataset: 0-1870784 | Loss: 0.770 | 913 ms/step , 6890.86 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 13:35:37 | Epoch: 0 | Step: 157470 | Dataset: 0-1871104 | Loss: 0.719 | 913 ms/step , 6890.59 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 13:35:46 | Epoch: 0 | Step: 157480 | Dataset: 0-1871424 | Loss: 0.776 | 913 ms/step , 6887.36 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 13:35:55 | Epoch: 0 | Step: 157490 | Dataset: 0-1871744 | Loss: 0.785 | 914 ms/step , 6881.99 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 13:36:04 | Epoch: 0 | Step: 157500 | Dataset: 0-1872064 | Loss: 0.707 | 913 ms/step , 6889.30 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 13:36:06 | Validation | Step: 157500 | Val_loss: 0.748 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:36:15 | Epoch: 0 | Step: 157510 | Dataset: 0-1872384 | Loss: 0.805 | 914 ms/step , 6884.00 GFLOP/s , 15282.1 tokens/s INFO:__main__:2024-11-05 13:36:24 | Epoch: 0 | Step: 157520 | Dataset: 0-1872704 | Loss: 0.650 | 912 ms/step , 6899.91 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 13:36:33 | Epoch: 0 | Step: 157530 | Dataset: 0-1873024 | Loss: 0.760 | 914 ms/step , 6880.62 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 13:36:42 | Epoch: 0 | Step: 157540 | Dataset: 0-1873344 | Loss: 0.730 | 913 ms/step , 6887.32 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 13:36:52 | Epoch: 0 | Step: 157550 | Dataset: 0-1873664 | Loss: 0.747 | 913 ms/step , 6888.05 GFLOP/s , 17932.1 
tokens/s INFO:__main__:2024-11-05 13:37:01 | Epoch: 0 | Step: 157560 | Dataset: 0-1873984 | Loss: 0.787 | 913 ms/step , 6887.33 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 13:37:10 | Epoch: 0 | Step: 157570 | Dataset: 0-1874304 | Loss: 0.462 | 912 ms/step , 6898.48 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 13:37:19 | Epoch: 0 | Step: 157580 | Dataset: 0-1874624 | Loss: 0.827 | 915 ms/step , 6875.36 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 13:37:28 | Epoch: 0 | Step: 157590 | Dataset: 0-1874944 | Loss: 0.648 | 912 ms/step , 6893.09 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 13:37:37 | Epoch: 0 | Step: 157600 | Dataset: 0-1875264 | Loss: 0.778 | 913 ms/step , 6889.06 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 13:37:39 | Validation | Step: 157600 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:37:48 | Epoch: 0 | Step: 157610 | Dataset: 0-1875584 | Loss: 0.782 | 913 ms/step , 6889.42 GFLOP/s , 15279.6 tokens/s INFO:__main__:2024-11-05 13:37:57 | Epoch: 0 | Step: 157620 | Dataset: 0-1875904 | Loss: 0.696 | 912 ms/step , 6893.69 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 13:38:06 | Epoch: 0 | Step: 157630 | Dataset: 0-1876224 | Loss: 0.721 | 913 ms/step , 6886.32 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 13:38:15 | Epoch: 0 | Step: 157640 | Dataset: 0-1876544 | Loss: 0.755 | 913 ms/step , 6887.59 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 13:38:24 | Epoch: 0 | Step: 157650 | Dataset: 0-1876864 | Loss: 0.669 | 915 ms/step , 6874.44 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 13:38:34 | Epoch: 0 | Step: 157660 | Dataset: 0-1877184 | Loss: 0.682 | 912 ms/step , 6893.53 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 13:38:43 | Epoch: 0 | Step: 157670 | Dataset: 0-1877504 | Loss: 0.729 | 913 ms/step , 6885.66 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 13:38:52 | Epoch: 0 | Step: 157680 | Dataset: 0-1877824 | Loss: 0.830 | 913 ms/step , 6892.51 GFLOP/s , 
17931.5 tokens/s INFO:__main__:2024-11-05 13:39:01 | Epoch: 0 | Step: 157690 | Dataset: 0-1878144 | Loss: 0.605 | 912 ms/step , 6893.84 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 13:39:10 | Epoch: 0 | Step: 157700 | Dataset: 0-1878464 | Loss: 0.782 | 915 ms/step , 6870.69 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 13:39:12 | Validation | Step: 157700 | Val_loss: 0.761 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:39:21 | Epoch: 0 | Step: 157710 | Dataset: 0-1878784 | Loss: 0.766 | 913 ms/step , 6892.50 GFLOP/s , 15270.7 tokens/s INFO:__main__:2024-11-05 13:39:30 | Epoch: 0 | Step: 157720 | Dataset: 0-1879104 | Loss: 0.758 | 913 ms/step , 6890.08 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 13:39:39 | Epoch: 0 | Step: 157730 | Dataset: 0-1879424 | Loss: 0.841 | 913 ms/step , 6885.27 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 13:39:48 | Epoch: 0 | Step: 157740 | Dataset: 0-1879744 | Loss: 0.703 | 914 ms/step , 6884.01 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 13:39:57 | Epoch: 0 | Step: 157750 | Dataset: 0-1880064 | Loss: 0.703 | 914 ms/step , 6881.65 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 13:40:07 | Epoch: 0 | Step: 157760 | Dataset: 0-1880384 | Loss: 0.792 | 915 ms/step , 6873.54 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 13:40:16 | Epoch: 0 | Step: 157770 | Dataset: 0-1880704 | Loss: 0.726 | 913 ms/step , 6889.30 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 13:40:25 | Epoch: 0 | Step: 157780 | Dataset: 0-1881024 | Loss: 0.759 | 913 ms/step , 6885.08 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 13:40:34 | Epoch: 0 | Step: 157790 | Dataset: 0-1881344 | Loss: 0.733 | 912 ms/step , 6896.46 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 13:40:43 | Epoch: 0 | Step: 157800 | Dataset: 0-1881664 | Loss: 0.731 | 913 ms/step , 6885.05 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 13:40:45 | Validation | Step: 157800 | Val_loss: 0.776 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 13:40:54 | Epoch: 0 | Step: 157810 | Dataset: 0-1881984 | Loss: 0.783 | 915 ms/step , 6877.46 GFLOP/s , 15285.7 tokens/s INFO:__main__:2024-11-05 13:41:03 | Epoch: 0 | Step: 157820 | Dataset: 0-1882304 | Loss: 0.874 | 912 ms/step , 6894.13 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 13:41:12 | Epoch: 0 | Step: 157830 | Dataset: 0-1882624 | Loss: 0.612 | 912 ms/step , 6894.58 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 13:41:21 | Epoch: 0 | Step: 157840 | Dataset: 0-1882944 | Loss: 0.797 | 914 ms/step , 6877.85 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 13:41:30 | Epoch: 0 | Step: 157850 | Dataset: 0-1883264 | Loss: 0.728 | 913 ms/step , 6885.13 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 13:41:40 | Epoch: 0 | Step: 157860 | Dataset: 0-1883584 | Loss: 0.706 | 913 ms/step , 6891.44 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 13:41:49 | Epoch: 0 | Step: 157870 | Dataset: 0-1883904 | Loss: 0.833 | 914 ms/step , 6884.86 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 13:41:58 | Epoch: 0 | Step: 157880 | Dataset: 0-1884224 | Loss: 0.836 | 914 ms/step , 6882.94 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 13:42:07 | Epoch: 0 | Step: 157890 | Dataset: 0-1884544 | Loss: 0.654 | 912 ms/step , 6892.98 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 13:42:16 | Epoch: 0 | Step: 157900 | Dataset: 0-1884864 | Loss: 0.819 | 913 ms/step , 6891.57 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 13:42:18 | Validation | Step: 157900 | Val_loss: 0.767 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:42:27 | Epoch: 0 | Step: 157910 | Dataset: 0-1885184 | Loss: 0.838 | 913 ms/step , 6889.39 GFLOP/s , 15276.8 tokens/s INFO:__main__:2024-11-05 13:42:36 | Epoch: 0 | Step: 157920 | Dataset: 0-1885504 | Loss: 0.766 | 913 ms/step , 6889.03 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 13:42:45 | Epoch: 0 | Step: 157930 | Dataset: 0-1885824 | Loss: 0.821 | 913 ms/step , 6889.02 GFLOP/s , 17930.9 
tokens/s INFO:__main__:2024-11-05 13:42:54 | Epoch: 0 | Step: 157940 | Dataset: 0-1886144 | Loss: 0.772 | 913 ms/step , 6888.39 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 13:43:03 | Epoch: 0 | Step: 157950 | Dataset: 0-1886464 | Loss: 0.916 | 914 ms/step , 6882.48 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 13:43:12 | Epoch: 0 | Step: 157960 | Dataset: 0-1886784 | Loss: 0.690 | 914 ms/step , 6881.85 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 13:43:22 | Epoch: 0 | Step: 157970 | Dataset: 0-1887104 | Loss: 0.642 | 914 ms/step , 6884.36 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 13:43:31 | Epoch: 0 | Step: 157980 | Dataset: 0-1887424 | Loss: 0.769 | 914 ms/step , 6880.97 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 13:43:40 | Epoch: 0 | Step: 157990 | Dataset: 0-1887744 | Loss: 0.793 | 914 ms/step , 6880.00 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 13:43:49 | Epoch: 0 | Step: 158000 | Dataset: 0-1888064 | Loss: 0.769 | 913 ms/step , 6891.11 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 13:43:51 | Validation | Step: 158000 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:43:51 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_134351_step_158000.pt` INFO:__main__:2024-11-05 13:44:01 | Epoch: 0 | Step: 158010 | Dataset: 0-1888384 | Loss: 0.683 | 914 ms/step , 6881.93 GFLOP/s , 13803.2 tokens/s INFO:__main__:2024-11-05 13:44:10 | Epoch: 0 | Step: 158020 | Dataset: 0-1888704 | Loss: 0.538 | 913 ms/step , 6886.60 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 13:44:19 | Epoch: 0 | Step: 158030 | Dataset: 0-1889024 | Loss: 0.773 | 913 ms/step , 6885.39 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 13:44:28 | Epoch: 0 | Step: 158040 | Dataset: 0-1889344 | Loss: 0.673 | 912 ms/step , 6892.66 GFLOP/s , 17882.9 tokens/s INFO:__main__:2024-11-05 13:44:37 | Epoch: 0 | Step: 158050 | Dataset: 0-1889664 | Loss: 0.760 | 913 ms/step , 6888.14 GFLOP/s , 17926.1 
tokens/s INFO:__main__:2024-11-05 13:44:47 | Epoch: 0 | Step: 158060 | Dataset: 0-1889984 | Loss: 0.676 | 914 ms/step , 6881.06 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 13:44:56 | Epoch: 0 | Step: 158070 | Dataset: 0-1890304 | Loss: 0.769 | 915 ms/step , 6877.19 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 13:45:05 | Epoch: 0 | Step: 158080 | Dataset: 0-1890624 | Loss: 0.683 | 912 ms/step , 6895.51 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 13:45:14 | Epoch: 0 | Step: 158090 | Dataset: 0-1890944 | Loss: 0.790 | 914 ms/step , 6884.91 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 13:45:23 | Epoch: 0 | Step: 158100 | Dataset: 0-1891264 | Loss: 0.690 | 912 ms/step , 6897.32 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 13:45:25 | Validation | Step: 158100 | Val_loss: 0.769 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:45:34 | Epoch: 0 | Step: 158110 | Dataset: 0-1891584 | Loss: 0.782 | 913 ms/step , 6888.30 GFLOP/s , 15268.4 tokens/s INFO:__main__:2024-11-05 13:45:43 | Epoch: 0 | Step: 158120 | Dataset: 0-1891904 | Loss: 0.791 | 914 ms/step , 6884.66 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 13:45:52 | Epoch: 0 | Step: 158130 | Dataset: 0-1892224 | Loss: 0.965 | 914 ms/step , 6878.42 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 13:46:01 | Epoch: 0 | Step: 158140 | Dataset: 0-1892544 | Loss: 0.713 | 913 ms/step , 6891.23 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 13:46:10 | Epoch: 0 | Step: 158150 | Dataset: 0-1892864 | Loss: 0.759 | 914 ms/step , 6880.76 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 13:46:20 | Epoch: 0 | Step: 158160 | Dataset: 0-1893184 | Loss: 0.852 | 913 ms/step , 6885.42 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 13:46:29 | Epoch: 0 | Step: 158170 | Dataset: 0-1893504 | Loss: 0.857 | 914 ms/step , 6883.33 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 13:46:38 | Epoch: 0 | Step: 158180 | Dataset: 0-1893824 | Loss: 0.601 | 913 ms/step , 6887.60 GFLOP/s , 
17934.1 tokens/s INFO:__main__:2024-11-05 13:46:47 | Epoch: 0 | Step: 158190 | Dataset: 0-1894144 | Loss: 0.741 | 913 ms/step , 6890.67 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 13:46:56 | Epoch: 0 | Step: 158200 | Dataset: 0-1894464 | Loss: 0.676 | 913 ms/step , 6887.16 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 13:46:58 | Validation | Step: 158200 | Val_loss: 0.824 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:47:07 | Epoch: 0 | Step: 158210 | Dataset: 0-1894784 | Loss: 0.805 | 915 ms/step , 6874.33 GFLOP/s , 15274.5 tokens/s INFO:__main__:2024-11-05 13:47:16 | Epoch: 0 | Step: 158220 | Dataset: 0-1895104 | Loss: 0.821 | 912 ms/step , 6893.75 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 13:47:25 | Epoch: 0 | Step: 158230 | Dataset: 0-1895424 | Loss: 0.691 | 912 ms/step , 6893.59 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 13:47:34 | Epoch: 0 | Step: 158240 | Dataset: 0-1895744 | Loss: 0.752 | 913 ms/step , 6886.89 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 13:47:43 | Epoch: 0 | Step: 158250 | Dataset: 0-1896064 | Loss: 0.772 | 914 ms/step , 6881.63 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 13:47:53 | Epoch: 0 | Step: 158260 | Dataset: 0-1896384 | Loss: 0.758 | 914 ms/step , 6878.74 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 13:48:02 | Epoch: 0 | Step: 158270 | Dataset: 0-1896704 | Loss: 0.652 | 913 ms/step , 6892.42 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 13:48:11 | Epoch: 0 | Step: 158280 | Dataset: 0-1897024 | Loss: 0.613 | 913 ms/step , 6885.97 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 13:48:20 | Epoch: 0 | Step: 158290 | Dataset: 0-1897344 | Loss: 0.635 | 913 ms/step , 6891.71 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 13:48:29 | Epoch: 0 | Step: 158300 | Dataset: 0-1897664 | Loss: 0.765 | 914 ms/step , 6882.98 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 13:48:31 | Validation | Step: 158300 | Val_loss: 0.802 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 13:48:40 | Epoch: 0 | Step: 158310 | Dataset: 0-1897984 | Loss: 0.543 | 913 ms/step , 6891.52 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-05 13:48:49 | Epoch: 0 | Step: 158320 | Dataset: 0-1898304 | Loss: 0.749 | 915 ms/step , 6876.52 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 13:48:58 | Epoch: 0 | Step: 158330 | Dataset: 0-1898624 | Loss: 0.735 | 913 ms/step , 6888.91 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 13:49:07 | Epoch: 0 | Step: 158340 | Dataset: 0-1898944 | Loss: 0.713 | 913 ms/step , 6889.29 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 13:49:16 | Epoch: 0 | Step: 158350 | Dataset: 0-1899264 | Loss: 0.711 | 914 ms/step , 6881.94 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 13:49:26 | Epoch: 0 | Step: 158360 | Dataset: 0-1899584 | Loss: 0.775 | 913 ms/step , 6885.84 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 13:49:35 | Epoch: 0 | Step: 158370 | Dataset: 0-1899904 | Loss: 0.672 | 913 ms/step , 6886.48 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 13:49:44 | Epoch: 0 | Step: 158380 | Dataset: 0-1900224 | Loss: 0.793 | 913 ms/step , 6891.10 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 13:49:53 | Epoch: 0 | Step: 158390 | Dataset: 0-1900544 | Loss: 0.772 | 915 ms/step , 6871.62 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 13:50:02 | Epoch: 0 | Step: 158400 | Dataset: 0-1900864 | Loss: 0.800 | 914 ms/step , 6881.93 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 13:50:04 | Validation | Step: 158400 | Val_loss: 0.689 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:50:13 | Epoch: 0 | Step: 158410 | Dataset: 0-1901184 | Loss: 0.673 | 912 ms/step , 6898.65 GFLOP/s , 15273.5 tokens/s INFO:__main__:2024-11-05 13:50:22 | Epoch: 0 | Step: 158420 | Dataset: 0-1901504 | Loss: 0.751 | 913 ms/step , 6886.29 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 13:50:31 | Epoch: 0 | Step: 158430 | Dataset: 0-1901824 | Loss: 0.794 | 913 ms/step , 6887.71 GFLOP/s , 17932.9 
tokens/s INFO:__main__:2024-11-05 13:50:40 | Epoch: 0 | Step: 158440 | Dataset: 0-1902144 | Loss: 0.705 | 913 ms/step , 6888.59 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 13:50:49 | Epoch: 0 | Step: 158450 | Dataset: 0-1902464 | Loss: 0.749 | 914 ms/step , 6879.52 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 13:50:58 | Epoch: 0 | Step: 158460 | Dataset: 0-1902784 | Loss: 0.810 | 913 ms/step , 6891.39 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 13:51:08 | Epoch: 0 | Step: 158470 | Dataset: 0-1903104 | Loss: 0.680 | 913 ms/step , 6892.26 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 13:51:17 | Epoch: 0 | Step: 158480 | Dataset: 0-1903424 | Loss: 0.802 | 913 ms/step , 6886.45 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 13:51:26 | Epoch: 0 | Step: 158490 | Dataset: 0-1903744 | Loss: 0.683 | 913 ms/step , 6891.28 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 13:51:35 | Epoch: 0 | Step: 158500 | Dataset: 0-1904064 | Loss: 0.708 | 914 ms/step , 6882.33 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 13:51:37 | Validation | Step: 158500 | Val_loss: 0.723 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:51:46 | Epoch: 0 | Step: 158510 | Dataset: 0-1904384 | Loss: 0.610 | 913 ms/step , 6890.60 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-05 13:51:55 | Epoch: 0 | Step: 158520 | Dataset: 0-1904704 | Loss: 0.621 | 912 ms/step , 6895.46 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 13:52:04 | Epoch: 0 | Step: 158530 | Dataset: 0-1905024 | Loss: 0.770 | 914 ms/step , 6880.13 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 13:52:13 | Epoch: 0 | Step: 158540 | Dataset: 0-1905344 | Loss: 0.740 | 912 ms/step , 6896.04 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 13:52:22 | Epoch: 0 | Step: 158550 | Dataset: 0-1905664 | Loss: 0.694 | 913 ms/step , 6889.57 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 13:52:31 | Epoch: 0 | Step: 158560 | Dataset: 0-1905984 | Loss: 0.824 | 913 ms/step , 6892.02 GFLOP/s , 
17935.5 tokens/s INFO:__main__:2024-11-05 13:52:41 | Epoch: 0 | Step: 158570 | Dataset: 0-1906304 | Loss: 0.733 | 913 ms/step , 6887.41 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 13:52:50 | Epoch: 0 | Step: 158580 | Dataset: 0-1906624 | Loss: 0.621 | 914 ms/step , 6880.04 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 13:52:59 | Epoch: 0 | Step: 158590 | Dataset: 0-1906944 | Loss: 0.738 | 913 ms/step , 6885.27 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 13:53:08 | Epoch: 0 | Step: 158600 | Dataset: 0-1907264 | Loss: 0.644 | 912 ms/step , 6894.17 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 13:53:10 | Validation | Step: 158600 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:53:19 | Epoch: 0 | Step: 158610 | Dataset: 0-1907584 | Loss: 0.780 | 914 ms/step , 6884.17 GFLOP/s , 15267.2 tokens/s INFO:__main__:2024-11-05 13:53:28 | Epoch: 0 | Step: 158620 | Dataset: 0-1907904 | Loss: 0.733 | 913 ms/step , 6889.27 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 13:53:37 | Epoch: 0 | Step: 158630 | Dataset: 0-1908224 | Loss: 0.783 | 913 ms/step , 6891.65 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 13:53:46 | Epoch: 0 | Step: 158640 | Dataset: 0-1908544 | Loss: 0.713 | 915 ms/step , 6877.50 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 13:53:55 | Epoch: 0 | Step: 158650 | Dataset: 0-1908864 | Loss: 0.765 | 914 ms/step , 6881.65 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 13:54:04 | Epoch: 0 | Step: 158660 | Dataset: 0-1909184 | Loss: 0.721 | 913 ms/step , 6888.49 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 13:54:14 | Epoch: 0 | Step: 158670 | Dataset: 0-1909504 | Loss: 0.818 | 913 ms/step , 6886.68 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 13:54:23 | Epoch: 0 | Step: 158680 | Dataset: 0-1909824 | Loss: 0.693 | 913 ms/step , 6889.02 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 13:54:32 | Epoch: 0 | Step: 158690 | Dataset: 0-1910144 | Loss: 0.745 | 914 ms/step , 6879.70 GFLOP/s 
, 17920.6 tokens/s INFO:__main__:2024-11-05 13:54:41 | Epoch: 0 | Step: 158700 | Dataset: 0-1910464 | Loss: 0.639 | 913 ms/step , 6889.13 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 13:54:43 | Validation | Step: 158700 | Val_loss: 0.799 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:54:52 | Epoch: 0 | Step: 158710 | Dataset: 0-1910784 | Loss: 0.610 | 913 ms/step , 6891.63 GFLOP/s , 15276.8 tokens/s INFO:__main__:2024-11-05 13:55:01 | Epoch: 0 | Step: 158720 | Dataset: 0-1911104 | Loss: 0.718 | 914 ms/step , 6878.92 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 13:55:10 | Epoch: 0 | Step: 158730 | Dataset: 0-1911424 | Loss: 0.727 | 914 ms/step , 6877.67 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 13:55:19 | Epoch: 0 | Step: 158740 | Dataset: 0-1911744 | Loss: 0.599 | 912 ms/step , 6897.49 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 13:55:28 | Epoch: 0 | Step: 158750 | Dataset: 0-1912064 | Loss: 0.778 | 914 ms/step , 6884.58 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 13:55:37 | Epoch: 0 | Step: 158760 | Dataset: 0-1912384 | Loss: 0.809 | 913 ms/step , 6885.12 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 13:55:47 | Epoch: 0 | Step: 158770 | Dataset: 0-1912704 | Loss: 0.823 | 913 ms/step , 6892.32 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 13:55:56 | Epoch: 0 | Step: 158780 | Dataset: 0-1913024 | Loss: 0.764 | 912 ms/step , 6893.19 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 13:56:05 | Epoch: 0 | Step: 158790 | Dataset: 0-1913344 | Loss: 0.737 | 914 ms/step , 6882.27 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 13:56:14 | Epoch: 0 | Step: 158800 | Dataset: 0-1913664 | Loss: 0.702 | 913 ms/step , 6889.83 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 13:56:16 | Validation | Step: 158800 | Val_loss: 0.727 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:56:25 | Epoch: 0 | Step: 158810 | Dataset: 0-1913984 | Loss: 0.744 | 913 ms/step , 6887.21 GFLOP/s , 15283.3 tokens/s 
INFO:__main__:2024-11-05 13:56:34 | Epoch: 0 | Step: 158820 | Dataset: 0-1914304 | Loss: 0.820 | 914 ms/step , 6884.15 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 13:56:43 | Epoch: 0 | Step: 158830 | Dataset: 0-1914624 | Loss: 0.700 | 914 ms/step , 6883.34 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 13:56:52 | Epoch: 0 | Step: 158840 | Dataset: 0-1914944 | Loss: 0.639 | 913 ms/step , 6888.90 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 13:57:01 | Epoch: 0 | Step: 158850 | Dataset: 0-1915264 | Loss: 0.765 | 913 ms/step , 6886.34 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 13:57:10 | Epoch: 0 | Step: 158860 | Dataset: 0-1915584 | Loss: 0.677 | 914 ms/step , 6879.16 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 13:57:19 | Epoch: 0 | Step: 158870 | Dataset: 0-1915904 | Loss: 0.675 | 914 ms/step , 6881.51 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 13:57:29 | Epoch: 0 | Step: 158880 | Dataset: 0-1916224 | Loss: 0.654 | 913 ms/step , 6891.01 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 13:57:38 | Epoch: 0 | Step: 158890 | Dataset: 0-1916544 | Loss: 0.691 | 913 ms/step , 6887.20 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 13:57:47 | Epoch: 0 | Step: 158900 | Dataset: 0-1916864 | Loss: 0.606 | 913 ms/step , 6885.07 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 13:57:49 | Validation | Step: 158900 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:57:58 | Epoch: 0 | Step: 158910 | Dataset: 0-1917184 | Loss: 0.862 | 914 ms/step , 6878.31 GFLOP/s , 15268.4 tokens/s INFO:__main__:2024-11-05 13:58:07 | Epoch: 0 | Step: 158920 | Dataset: 0-1917504 | Loss: 0.711 | 913 ms/step , 6889.74 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 13:58:16 | Epoch: 0 | Step: 158930 | Dataset: 0-1917824 | Loss: 0.757 | 913 ms/step , 6890.98 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 13:58:25 | Epoch: 0 | Step: 158940 | Dataset: 0-1918144 | Loss: 0.693 | 914 ms/step , 6882.68 GFLOP/s , 17922.6 
tokens/s INFO:__main__:2024-11-05 13:58:34 | Epoch: 0 | Step: 158950 | Dataset: 0-1918464 | Loss: 0.689 | 913 ms/step , 6885.06 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 13:58:43 | Epoch: 0 | Step: 158960 | Dataset: 0-1918784 | Loss: 0.681 | 915 ms/step , 6872.91 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 13:58:52 | Epoch: 0 | Step: 158970 | Dataset: 0-1919104 | Loss: 0.737 | 915 ms/step , 6872.74 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 13:59:02 | Epoch: 0 | Step: 158980 | Dataset: 0-1919424 | Loss: 0.695 | 914 ms/step , 6884.24 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 13:59:11 | Epoch: 0 | Step: 158990 | Dataset: 0-1919744 | Loss: 0.720 | 914 ms/step , 6879.39 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 13:59:20 | Epoch: 0 | Step: 159000 | Dataset: 0-1920064 | Loss: 0.745 | 912 ms/step , 6897.02 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 13:59:21 | Validation | Step: 159000 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 13:59:21 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_135921_step_159000.pt` INFO:__main__:2024-11-05 13:59:32 | Epoch: 0 | Step: 159010 | Dataset: 0-1920384 | Loss: 0.740 | 914 ms/step , 6884.68 GFLOP/s , 13793.6 tokens/s INFO:__main__:2024-11-05 13:59:41 | Epoch: 0 | Step: 159020 | Dataset: 0-1920704 | Loss: 0.602 | 913 ms/step , 6891.49 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 13:59:50 | Epoch: 0 | Step: 159030 | Dataset: 0-1921024 | Loss: 0.676 | 913 ms/step , 6885.36 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 13:59:59 | Epoch: 0 | Step: 159040 | Dataset: 0-1921344 | Loss: 0.654 | 915 ms/step , 6876.89 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 14:00:08 | Epoch: 0 | Step: 159050 | Dataset: 0-1921664 | Loss: 0.709 | 913 ms/step , 6885.06 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 14:00:17 | Epoch: 0 | Step: 159060 | Dataset: 0-1921984 | Loss: 0.729 | 914 ms/step , 6884.89 GFLOP/s , 17918.7 
tokens/s INFO:__main__:2024-11-05 14:00:27 | Epoch: 0 | Step: 159070 | Dataset: 0-1922304 | Loss: 0.767 | 914 ms/step , 6878.50 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 14:00:36 | Epoch: 0 | Step: 159080 | Dataset: 0-1922624 | Loss: 0.804 | 915 ms/step , 6876.43 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 14:00:45 | Epoch: 0 | Step: 159090 | Dataset: 0-1922944 | Loss: 0.815 | 913 ms/step , 6890.45 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 14:00:54 | Epoch: 0 | Step: 159100 | Dataset: 0-1923264 | Loss: 0.726 | 914 ms/step , 6878.71 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 14:00:56 | Validation | Step: 159100 | Val_loss: 0.715 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:01:05 | Epoch: 0 | Step: 159110 | Dataset: 0-1923584 | Loss: 0.798 | 914 ms/step , 6882.24 GFLOP/s , 15265.4 tokens/s INFO:__main__:2024-11-05 14:01:14 | Epoch: 0 | Step: 159120 | Dataset: 0-1923904 | Loss: 0.757 | 912 ms/step , 6897.69 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 14:01:23 | Epoch: 0 | Step: 159130 | Dataset: 0-1924224 | Loss: 0.714 | 913 ms/step , 6889.32 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 14:01:32 | Epoch: 0 | Step: 159140 | Dataset: 0-1924544 | Loss: 0.700 | 913 ms/step , 6891.49 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 14:01:41 | Epoch: 0 | Step: 159150 | Dataset: 0-1924864 | Loss: 0.801 | 913 ms/step , 6887.24 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 14:01:50 | Epoch: 0 | Step: 159160 | Dataset: 0-1925184 | Loss: 0.722 | 914 ms/step , 6885.05 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 14:02:00 | Epoch: 0 | Step: 159170 | Dataset: 0-1925504 | Loss: 0.754 | 913 ms/step , 6885.92 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 14:02:09 | Epoch: 0 | Step: 159180 | Dataset: 0-1925824 | Loss: 0.822 | 913 ms/step , 6888.12 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 14:02:18 | Epoch: 0 | Step: 159190 | Dataset: 0-1926144 | Loss: 0.673 | 913 ms/step , 6892.02 GFLOP/s , 
17930.2 tokens/s INFO:__main__:2024-11-05 14:02:27 | Epoch: 0 | Step: 159200 | Dataset: 0-1926464 | Loss: 0.677 | 913 ms/step , 6890.29 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 14:02:29 | Validation | Step: 159200 | Val_loss: 0.742 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:02:38 | Epoch: 0 | Step: 159210 | Dataset: 0-1926784 | Loss: 0.656 | 912 ms/step , 6892.91 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-05 14:02:47 | Epoch: 0 | Step: 159220 | Dataset: 0-1927104 | Loss: 0.686 | 913 ms/step , 6887.32 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 14:02:56 | Epoch: 0 | Step: 159230 | Dataset: 0-1927424 | Loss: 0.802 | 913 ms/step , 6891.77 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 14:03:05 | Epoch: 0 | Step: 159240 | Dataset: 0-1927744 | Loss: 0.744 | 913 ms/step , 6892.59 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 14:03:14 | Epoch: 0 | Step: 159250 | Dataset: 0-1928064 | Loss: 0.744 | 913 ms/step , 6890.18 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 14:03:23 | Epoch: 0 | Step: 159260 | Dataset: 0-1928384 | Loss: 0.734 | 914 ms/step , 6880.90 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 14:03:33 | Epoch: 0 | Step: 159270 | Dataset: 0-1928704 | Loss: 0.774 | 915 ms/step , 6871.10 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 14:03:42 | Epoch: 0 | Step: 159280 | Dataset: 0-1929024 | Loss: 0.728 | 915 ms/step , 6873.59 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 14:03:51 | Epoch: 0 | Step: 159290 | Dataset: 0-1929344 | Loss: 0.743 | 914 ms/step , 6882.18 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 14:04:00 | Epoch: 0 | Step: 159300 | Dataset: 0-1929664 | Loss: 0.729 | 914 ms/step , 6881.38 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 14:04:02 | Validation | Step: 159300 | Val_loss: 0.756 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:04:11 | Epoch: 0 | Step: 159310 | Dataset: 0-1929984 | Loss: 0.757 | 914 ms/step , 6882.09 GFLOP/s , 15267.5 tokens/s 
INFO:__main__:2024-11-05 14:04:20 | Epoch: 0 | Step: 159320 | Dataset: 0-1930304 | Loss: 0.711 | 912 ms/step , 6897.30 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 14:04:29 | Epoch: 0 | Step: 159330 | Dataset: 0-1930624 | Loss: 0.690 | 913 ms/step , 6887.45 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 14:04:38 | Epoch: 0 | Step: 159340 | Dataset: 0-1930944 | Loss: 0.726 | 915 ms/step , 6873.39 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 14:04:47 | Epoch: 0 | Step: 159350 | Dataset: 0-1931264 | Loss: 0.679 | 913 ms/step , 6886.21 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 14:04:56 | Epoch: 0 | Step: 159360 | Dataset: 0-1931584 | Loss: 0.649 | 913 ms/step , 6889.15 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 14:05:06 | Epoch: 0 | Step: 159370 | Dataset: 0-1931904 | Loss: 0.719 | 913 ms/step , 6887.50 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 14:05:15 | Epoch: 0 | Step: 159380 | Dataset: 0-1932224 | Loss: 0.714 | 914 ms/step , 6881.53 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 14:05:24 | Epoch: 0 | Step: 159390 | Dataset: 0-1932544 | Loss: 0.672 | 914 ms/step , 6883.05 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 14:05:33 | Epoch: 0 | Step: 159400 | Dataset: 0-1932864 | Loss: 0.804 | 914 ms/step , 6877.76 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 14:05:35 | Validation | Step: 159400 | Val_loss: 0.698 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:05:44 | Epoch: 0 | Step: 159410 | Dataset: 0-1933184 | Loss: 0.655 | 914 ms/step , 6884.80 GFLOP/s , 15288.2 tokens/s INFO:__main__:2024-11-05 14:05:53 | Epoch: 0 | Step: 159420 | Dataset: 0-1933504 | Loss: 0.797 | 915 ms/step , 6871.74 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 14:06:02 | Epoch: 0 | Step: 159430 | Dataset: 0-1933824 | Loss: 0.743 | 913 ms/step , 6885.66 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 14:06:11 | Epoch: 0 | Step: 159440 | Dataset: 0-1934144 | Loss: 0.796 | 914 ms/step , 6879.58 GFLOP/s , 17923.9 
tokens/s INFO:__main__:2024-11-05 14:06:20 | Epoch: 0 | Step: 159450 | Dataset: 0-1934464 | Loss: 0.774 | 914 ms/step , 6884.91 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 14:06:29 | Epoch: 0 | Step: 159460 | Dataset: 0-1934784 | Loss: 0.683 | 913 ms/step , 6889.32 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 14:06:39 | Epoch: 0 | Step: 159470 | Dataset: 0-1935104 | Loss: 0.674 | 913 ms/step , 6891.15 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 14:06:48 | Epoch: 0 | Step: 159480 | Dataset: 0-1935424 | Loss: 0.698 | 914 ms/step , 6878.92 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 14:06:57 | Epoch: 0 | Step: 159490 | Dataset: 0-1935744 | Loss: 0.691 | 913 ms/step , 6888.30 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 14:07:06 | Epoch: 0 | Step: 159500 | Dataset: 0-1936064 | Loss: 0.741 | 915 ms/step , 6875.52 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 14:07:08 | Validation | Step: 159500 | Val_loss: 0.717 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:07:17 | Epoch: 0 | Step: 159510 | Dataset: 0-1936384 | Loss: 0.660 | 913 ms/step , 6886.27 GFLOP/s , 15262.6 tokens/s INFO:__main__:2024-11-05 14:07:26 | Epoch: 0 | Step: 159520 | Dataset: 0-1936704 | Loss: 0.543 | 913 ms/step , 6887.02 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 14:07:35 | Epoch: 0 | Step: 159530 | Dataset: 0-1937024 | Loss: 0.822 | 914 ms/step , 6879.28 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 14:07:44 | Epoch: 0 | Step: 159540 | Dataset: 0-1937344 | Loss: 0.830 | 914 ms/step , 6880.59 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 14:07:53 | Epoch: 0 | Step: 159550 | Dataset: 0-1937664 | Loss: 0.755 | 913 ms/step , 6888.27 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 14:08:02 | Epoch: 0 | Step: 159560 | Dataset: 0-1937984 | Loss: 0.752 | 914 ms/step , 6881.19 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 14:08:12 | Epoch: 0 | Step: 159570 | Dataset: 0-1938304 | Loss: 0.875 | 915 ms/step , 6871.33 GFLOP/s , 
17914.0 tokens/s INFO:__main__:2024-11-05 14:08:21 | Epoch: 0 | Step: 159580 | Dataset: 0-1938624 | Loss: 0.775 | 915 ms/step , 6874.20 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 14:08:30 | Epoch: 0 | Step: 159590 | Dataset: 0-1938944 | Loss: 0.837 | 914 ms/step , 6879.12 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 14:08:39 | Epoch: 0 | Step: 159600 | Dataset: 0-1939264 | Loss: 0.753 | 915 ms/step , 6876.85 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 14:08:41 | Validation | Step: 159600 | Val_loss: 0.733 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:08:50 | Epoch: 0 | Step: 159610 | Dataset: 0-1939584 | Loss: 0.696 | 912 ms/step , 6892.97 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-05 14:08:59 | Epoch: 0 | Step: 159620 | Dataset: 0-1939904 | Loss: 0.758 | 914 ms/step , 6884.62 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 14:09:08 | Epoch: 0 | Step: 159630 | Dataset: 0-1940224 | Loss: 0.777 | 915 ms/step , 6871.54 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 14:09:17 | Epoch: 0 | Step: 159640 | Dataset: 0-1940544 | Loss: 0.799 | 915 ms/step , 6872.87 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 14:09:26 | Epoch: 0 | Step: 159650 | Dataset: 0-1940864 | Loss: 0.850 | 913 ms/step , 6886.23 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 14:09:35 | Epoch: 0 | Step: 159660 | Dataset: 0-1941184 | Loss: 0.831 | 914 ms/step , 6881.48 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 14:09:45 | Epoch: 0 | Step: 159670 | Dataset: 0-1941504 | Loss: 0.783 | 915 ms/step , 6876.75 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 14:09:54 | Epoch: 0 | Step: 159680 | Dataset: 0-1941824 | Loss: 0.783 | 915 ms/step , 6872.67 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 14:10:03 | Epoch: 0 | Step: 159690 | Dataset: 0-1942144 | Loss: 0.807 | 915 ms/step , 6877.44 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 14:10:12 | Epoch: 0 | Step: 159700 | Dataset: 0-1942464 | Loss: 0.746 | 915 ms/step , 6875.98 GFLOP/s 
, 17913.9 tokens/s INFO:__main__:2024-11-05 14:10:14 | Validation | Step: 159700 | Val_loss: 0.763 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:10:23 | Epoch: 0 | Step: 159710 | Dataset: 0-1942784 | Loss: 0.730 | 914 ms/step , 6884.91 GFLOP/s , 15264.3 tokens/s INFO:__main__:2024-11-05 14:10:32 | Epoch: 0 | Step: 159720 | Dataset: 0-1943104 | Loss: 0.786 | 915 ms/step , 6876.65 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 14:10:41 | Epoch: 0 | Step: 159730 | Dataset: 0-1943424 | Loss: 0.718 | 914 ms/step , 6877.62 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 14:10:50 | Epoch: 0 | Step: 159740 | Dataset: 0-1943744 | Loss: 0.828 | 914 ms/step , 6881.25 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 14:10:59 | Epoch: 0 | Step: 159750 | Dataset: 0-1944064 | Loss: 0.715 | 913 ms/step , 6887.83 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 14:11:09 | Epoch: 0 | Step: 159760 | Dataset: 0-1944384 | Loss: 0.731 | 913 ms/step , 6888.03 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 14:11:18 | Epoch: 0 | Step: 159770 | Dataset: 0-1944704 | Loss: 0.760 | 913 ms/step , 6888.11 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 14:11:27 | Epoch: 0 | Step: 159780 | Dataset: 0-1945024 | Loss: 0.752 | 914 ms/step , 6882.18 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 14:11:36 | Epoch: 0 | Step: 159790 | Dataset: 0-1945344 | Loss: 0.758 | 913 ms/step , 6885.73 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 14:11:45 | Epoch: 0 | Step: 159800 | Dataset: 0-1945664 | Loss: 0.741 | 914 ms/step , 6883.31 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 14:11:47 | Validation | Step: 159800 | Val_loss: 0.746 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:11:56 | Epoch: 0 | Step: 159810 | Dataset: 0-1945984 | Loss: 0.732 | 912 ms/step , 6893.72 GFLOP/s , 15262.1 tokens/s INFO:__main__:2024-11-05 14:12:05 | Epoch: 0 | Step: 159820 | Dataset: 0-1946304 | Loss: 0.769 | 914 ms/step , 6881.74 GFLOP/s , 17913.3 tokens/s 
INFO:__main__:2024-11-05 14:12:14 | Epoch: 0 | Step: 159830 | Dataset: 0-1946624 | Loss: 0.761 | 914 ms/step , 6884.07 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 14:12:23 | Epoch: 0 | Step: 159840 | Dataset: 0-1946944 | Loss: 0.820 | 913 ms/step , 6888.83 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 14:12:32 | Epoch: 0 | Step: 159850 | Dataset: 0-1947264 | Loss: 0.916 | 915 ms/step , 6873.74 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 14:12:42 | Epoch: 0 | Step: 159860 | Dataset: 0-1947584 | Loss: 0.755 | 913 ms/step , 6889.63 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 14:12:51 | Epoch: 0 | Step: 159870 | Dataset: 0-1947904 | Loss: 0.814 | 914 ms/step , 6879.90 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 14:13:00 | Epoch: 0 | Step: 159880 | Dataset: 0-1948224 | Loss: 0.721 | 915 ms/step , 6874.28 GFLOP/s , 17910.2 tokens/s INFO:__main__:2024-11-05 14:13:09 | Epoch: 0 | Step: 159890 | Dataset: 0-1948544 | Loss: 0.783 | 914 ms/step , 6880.32 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 14:13:18 | Epoch: 0 | Step: 159900 | Dataset: 0-1948864 | Loss: 0.744 | 913 ms/step , 6885.25 GFLOP/s , 17908.4 tokens/s INFO:__main__:2024-11-05 14:13:20 | Validation | Step: 159900 | Val_loss: 0.703 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:13:29 | Epoch: 0 | Step: 159910 | Dataset: 0-1949184 | Loss: 0.721 | 914 ms/step , 6878.46 GFLOP/s , 15275.7 tokens/s INFO:__main__:2024-11-05 14:13:38 | Epoch: 0 | Step: 159920 | Dataset: 0-1949504 | Loss: 0.750 | 914 ms/step , 6883.58 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 14:13:47 | Epoch: 0 | Step: 159930 | Dataset: 0-1949824 | Loss: 0.816 | 915 ms/step , 6875.24 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 14:13:56 | Epoch: 0 | Step: 159940 | Dataset: 0-1950144 | Loss: 0.743 | 915 ms/step , 6872.77 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 14:14:05 | Epoch: 0 | Step: 159950 | Dataset: 0-1950464 | Loss: 0.842 | 912 ms/step , 6893.75 GFLOP/s , 17917.2 
tokens/s INFO:__main__:2024-11-05 14:14:15 | Epoch: 0 | Step: 159960 | Dataset: 0-1950784 | Loss: 0.806 | 914 ms/step , 6879.16 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 14:14:24 | Epoch: 0 | Step: 159970 | Dataset: 0-1951104 | Loss: 0.792 | 915 ms/step , 6873.65 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-05 14:14:33 | Epoch: 0 | Step: 159980 | Dataset: 0-1951424 | Loss: 0.780 | 913 ms/step , 6889.00 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 14:14:42 | Epoch: 0 | Step: 159990 | Dataset: 0-1951744 | Loss: 0.747 | 913 ms/step , 6885.38 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 14:14:51 | Epoch: 0 | Step: 160000 | Dataset: 0-1952064 | Loss: 0.784 | 915 ms/step , 6874.75 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 14:14:53 | Validation | Step: 160000 | Val_loss: 0.669 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:14:53 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_141453_step_160000.pt` INFO:__main__:2024-11-05 14:15:03 | Epoch: 0 | Step: 160010 | Dataset: 0-1952384 | Loss: 0.758 | 914 ms/step , 6880.59 GFLOP/s , 13787.7 tokens/s INFO:__main__:2024-11-05 14:15:12 | Epoch: 0 | Step: 160020 | Dataset: 0-1952704 | Loss: 0.765 | 915 ms/step , 6875.04 GFLOP/s , 17907.4 tokens/s INFO:__main__:2024-11-05 14:15:21 | Epoch: 0 | Step: 160030 | Dataset: 0-1953024 | Loss: 0.786 | 915 ms/step , 6876.44 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 14:15:30 | Epoch: 0 | Step: 160040 | Dataset: 0-1953344 | Loss: 0.740 | 916 ms/step , 6863.02 GFLOP/s , 17882.2 tokens/s INFO:__main__:2024-11-05 14:15:40 | Epoch: 0 | Step: 160050 | Dataset: 0-1953664 | Loss: 0.766 | 916 ms/step , 6866.15 GFLOP/s , 17899.4 tokens/s INFO:__main__:2024-11-05 14:15:49 | Epoch: 0 | Step: 160060 | Dataset: 0-1953984 | Loss: 0.706 | 916 ms/step , 6862.78 GFLOP/s , 17885.9 tokens/s INFO:__main__:2024-11-05 14:15:58 | Epoch: 0 | Step: 160070 | Dataset: 0-1954304 | Loss: 0.796 | 916 ms/step , 6864.60 GFLOP/s , 17906.3 
tokens/s INFO:__main__:2024-11-05 14:16:07 | Epoch: 0 | Step: 160080 | Dataset: 0-1954624 | Loss: 0.726 | 914 ms/step , 6884.63 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-05 14:16:16 | Epoch: 0 | Step: 160090 | Dataset: 0-1954944 | Loss: 0.757 | 915 ms/step , 6874.93 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 14:16:25 | Epoch: 0 | Step: 160100 | Dataset: 0-1955264 | Loss: 0.749 | 915 ms/step , 6871.87 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 14:16:27 | Validation | Step: 160100 | Val_loss: 0.706 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:16:36 | Epoch: 0 | Step: 160110 | Dataset: 0-1955584 | Loss: 0.765 | 914 ms/step , 6883.52 GFLOP/s , 15263.3 tokens/s INFO:__main__:2024-11-05 14:16:45 | Epoch: 0 | Step: 160120 | Dataset: 0-1955904 | Loss: 0.685 | 912 ms/step , 6894.72 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 14:16:54 | Epoch: 0 | Step: 160130 | Dataset: 0-1956224 | Loss: 0.762 | 915 ms/step , 6877.19 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 14:17:04 | Epoch: 0 | Step: 160140 | Dataset: 0-1956544 | Loss: 0.873 | 914 ms/step , 6883.35 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 14:17:13 | Epoch: 0 | Step: 160150 | Dataset: 0-1956864 | Loss: 0.765 | 913 ms/step , 6887.90 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 14:17:22 | Epoch: 0 | Step: 160160 | Dataset: 0-1957184 | Loss: 0.787 | 914 ms/step , 6878.50 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 14:17:31 | Epoch: 0 | Step: 160170 | Dataset: 0-1957504 | Loss: 0.742 | 914 ms/step , 6877.70 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 14:17:40 | Epoch: 0 | Step: 160180 | Dataset: 0-1957824 | Loss: 0.885 | 914 ms/step , 6881.22 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 14:17:49 | Epoch: 0 | Step: 160190 | Dataset: 0-1958144 | Loss: 0.776 | 914 ms/step , 6882.73 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 14:17:58 | Epoch: 0 | Step: 160200 | Dataset: 0-1958464 | Loss: 0.801 | 914 ms/step , 6879.98 GFLOP/s , 
17918.9 tokens/s INFO:__main__:2024-11-05 14:18:00 | Validation | Step: 160200 | Val_loss: 0.747 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:18:09 | Epoch: 0 | Step: 160210 | Dataset: 0-1958784 | Loss: 0.810 | 914 ms/step , 6883.80 GFLOP/s , 15284.9 tokens/s INFO:__main__:2024-11-05 14:18:18 | Epoch: 0 | Step: 160220 | Dataset: 0-1959104 | Loss: 0.802 | 915 ms/step , 6876.33 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-05 14:18:27 | Epoch: 0 | Step: 160230 | Dataset: 0-1959424 | Loss: 0.756 | 914 ms/step , 6878.62 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 14:18:37 | Epoch: 0 | Step: 160240 | Dataset: 0-1959744 | Loss: 0.793 | 916 ms/step , 6868.54 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 14:18:46 | Epoch: 0 | Step: 160250 | Dataset: 0-1960064 | Loss: 0.731 | 916 ms/step , 6867.48 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-05 14:18:55 | Epoch: 0 | Step: 160260 | Dataset: 0-1960384 | Loss: 0.763 | 915 ms/step , 6876.23 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 14:19:04 | Epoch: 0 | Step: 160270 | Dataset: 0-1960704 | Loss: 0.854 | 914 ms/step , 6884.69 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 14:19:13 | Epoch: 0 | Step: 160280 | Dataset: 0-1961024 | Loss: 0.686 | 913 ms/step , 6888.03 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 14:19:22 | Epoch: 0 | Step: 160290 | Dataset: 0-1961344 | Loss: 0.853 | 914 ms/step , 6879.77 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 14:19:31 | Epoch: 0 | Step: 160300 | Dataset: 0-1961664 | Loss: 0.746 | 915 ms/step , 6871.26 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 14:19:33 | Validation | Step: 160300 | Val_loss: 0.681 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:19:42 | Epoch: 0 | Step: 160310 | Dataset: 0-1961984 | Loss: 0.834 | 914 ms/step , 6884.43 GFLOP/s , 15275.3 tokens/s INFO:__main__:2024-11-05 14:19:51 | Epoch: 0 | Step: 160320 | Dataset: 0-1962304 | Loss: 0.953 | 914 ms/step , 6879.20 GFLOP/s , 17925.7 tokens/s 
INFO:__main__:2024-11-05 14:20:00 | Epoch: 0 | Step: 160330 | Dataset: 0-1962624 | Loss: 0.897 | 915 ms/step , 6873.33 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 14:20:10 | Epoch: 0 | Step: 160340 | Dataset: 0-1962944 | Loss: 0.811 | 913 ms/step , 6887.36 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 14:20:19 | Epoch: 0 | Step: 160350 | Dataset: 0-1963264 | Loss: 0.896 | 914 ms/step , 6877.89 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 14:20:28 | Epoch: 0 | Step: 160360 | Dataset: 0-1963584 | Loss: 0.920 | 913 ms/step , 6889.92 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 14:20:37 | Epoch: 0 | Step: 160370 | Dataset: 0-1963904 | Loss: 0.837 | 914 ms/step , 6881.36 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 14:20:46 | Epoch: 0 | Step: 160380 | Dataset: 0-1964224 | Loss: 0.823 | 913 ms/step , 6888.32 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 14:20:55 | Epoch: 0 | Step: 160390 | Dataset: 0-1964544 | Loss: 0.880 | 914 ms/step , 6881.17 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 14:21:04 | Epoch: 0 | Step: 160400 | Dataset: 0-1964864 | Loss: 0.820 | 915 ms/step , 6876.40 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 14:21:06 | Validation | Step: 160400 | Val_loss: 0.787 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:21:15 | Epoch: 0 | Step: 160410 | Dataset: 0-1965184 | Loss: 0.888 | 913 ms/step , 6887.28 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-05 14:21:24 | Epoch: 0 | Step: 160420 | Dataset: 0-1965504 | Loss: 0.805 | 914 ms/step , 6884.12 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 14:21:33 | Epoch: 0 | Step: 160430 | Dataset: 0-1965824 | Loss: 0.800 | 914 ms/step , 6882.48 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 14:21:43 | Epoch: 0 | Step: 160440 | Dataset: 0-1966144 | Loss: 0.818 | 914 ms/step , 6882.84 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 14:21:52 | Epoch: 0 | Step: 160450 | Dataset: 0-1966464 | Loss: 0.791 | 913 ms/step , 6887.60 GFLOP/s , 17929.5 
tokens/s INFO:__main__:2024-11-05 14:22:01 | Epoch: 0 | Step: 160460 | Dataset: 0-1966784 | Loss: 0.821 | 914 ms/step , 6878.42 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 14:22:10 | Epoch: 0 | Step: 160470 | Dataset: 0-1967104 | Loss: 0.813 | 913 ms/step , 6889.06 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 14:22:19 | Epoch: 0 | Step: 160480 | Dataset: 0-1967424 | Loss: 0.881 | 916 ms/step , 6865.51 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 14:22:28 | Epoch: 0 | Step: 160490 | Dataset: 0-1967744 | Loss: 0.919 | 915 ms/step , 6875.39 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 14:22:37 | Epoch: 0 | Step: 160500 | Dataset: 0-1968064 | Loss: 0.779 | 913 ms/step , 6885.48 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 14:22:39 | Validation | Step: 160500 | Val_loss: 0.726 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:22:48 | Epoch: 0 | Step: 160510 | Dataset: 0-1968384 | Loss: 0.898 | 914 ms/step , 6884.06 GFLOP/s , 15266.9 tokens/s INFO:__main__:2024-11-05 14:22:57 | Epoch: 0 | Step: 160520 | Dataset: 0-1968704 | Loss: 0.873 | 913 ms/step , 6888.25 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 14:23:06 | Epoch: 0 | Step: 160530 | Dataset: 0-1969024 | Loss: 0.880 | 914 ms/step , 6884.06 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 14:23:16 | Epoch: 0 | Step: 160540 | Dataset: 0-1969344 | Loss: 0.942 | 914 ms/step , 6878.53 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 14:23:25 | Epoch: 0 | Step: 160550 | Dataset: 0-1969664 | Loss: 0.789 | 914 ms/step , 6884.20 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 14:23:34 | Epoch: 0 | Step: 160560 | Dataset: 0-1969984 | Loss: 0.600 | 913 ms/step , 6890.91 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 14:23:43 | Epoch: 0 | Step: 160570 | Dataset: 0-1970304 | Loss: 0.807 | 913 ms/step , 6888.22 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 14:23:52 | Epoch: 0 | Step: 160580 | Dataset: 0-1970624 | Loss: 0.616 | 911 ms/step , 6901.29 GFLOP/s , 
17935.4 tokens/s INFO:__main__:2024-11-05 14:24:01 | Epoch: 0 | Step: 160590 | Dataset: 0-1970944 | Loss: 0.696 | 912 ms/step , 6897.25 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 14:24:10 | Epoch: 0 | Step: 160600 | Dataset: 0-1971264 | Loss: 0.951 | 913 ms/step , 6886.27 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 14:24:12 | Validation | Step: 160600 | Val_loss: 0.753 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:24:21 | Epoch: 0 | Step: 160610 | Dataset: 0-1971584 | Loss: 0.869 | 913 ms/step , 6889.08 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-05 14:24:30 | Epoch: 0 | Step: 160620 | Dataset: 0-1971904 | Loss: 0.640 | 913 ms/step , 6891.01 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 14:24:39 | Epoch: 0 | Step: 160630 | Dataset: 0-1972224 | Loss: 0.857 | 915 ms/step , 6874.97 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 14:24:49 | Epoch: 0 | Step: 160640 | Dataset: 0-1972544 | Loss: 0.830 | 913 ms/step , 6885.28 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 14:24:58 | Epoch: 0 | Step: 160650 | Dataset: 0-1972864 | Loss: 0.928 | 914 ms/step , 6879.58 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 14:25:07 | Epoch: 0 | Step: 160660 | Dataset: 0-1973184 | Loss: 0.853 | 914 ms/step , 6879.29 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 14:25:16 | Epoch: 0 | Step: 160670 | Dataset: 0-1973504 | Loss: 0.863 | 913 ms/step , 6887.71 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 14:25:25 | Epoch: 0 | Step: 160680 | Dataset: 0-1973824 | Loss: 0.739 | 913 ms/step , 6887.55 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 14:25:34 | Epoch: 0 | Step: 160690 | Dataset: 0-1974144 | Loss: 0.789 | 914 ms/step , 6881.08 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 14:25:43 | Epoch: 0 | Step: 160700 | Dataset: 0-1974464 | Loss: 0.812 | 915 ms/step , 6877.46 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 14:25:45 | Validation | Step: 160700 | Val_loss: 0.693 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 14:25:54 | Epoch: 0 | Step: 160710 | Dataset: 0-1974784 | Loss: 0.790 | 914 ms/step , 6883.97 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-05 14:26:03 | Epoch: 0 | Step: 160720 | Dataset: 0-1975104 | Loss: 0.863 | 913 ms/step , 6891.11 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 14:26:12 | Epoch: 0 | Step: 160730 | Dataset: 0-1975424 | Loss: 0.610 | 913 ms/step , 6890.61 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 14:26:21 | Epoch: 0 | Step: 160740 | Dataset: 0-1975744 | Loss: 0.853 | 913 ms/step , 6887.35 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 14:26:31 | Epoch: 0 | Step: 160750 | Dataset: 0-1976064 | Loss: 0.801 | 914 ms/step , 6878.30 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 14:26:40 | Epoch: 0 | Step: 160760 | Dataset: 0-1976384 | Loss: 0.925 | 914 ms/step , 6885.01 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 14:26:49 | Epoch: 0 | Step: 160770 | Dataset: 0-1976704 | Loss: 0.863 | 913 ms/step , 6889.41 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 14:26:58 | Epoch: 0 | Step: 160780 | Dataset: 0-1977024 | Loss: 0.965 | 913 ms/step , 6889.08 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 14:27:07 | Epoch: 0 | Step: 160790 | Dataset: 0-1977344 | Loss: 0.863 | 915 ms/step , 6877.38 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 14:27:16 | Epoch: 0 | Step: 160800 | Dataset: 0-1977664 | Loss: 0.890 | 912 ms/step , 6893.74 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 14:27:18 | Validation | Step: 160800 | Val_loss: 0.753 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:27:27 | Epoch: 0 | Step: 160810 | Dataset: 0-1977984 | Loss: 0.866 | 913 ms/step , 6888.47 GFLOP/s , 15274.3 tokens/s INFO:__main__:2024-11-05 14:27:36 | Epoch: 0 | Step: 160820 | Dataset: 0-1978304 | Loss: 0.827 | 913 ms/step , 6885.62 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 14:27:45 | Epoch: 0 | Step: 160830 | Dataset: 0-1978624 | Loss: 0.742 | 913 ms/step , 6890.73 GFLOP/s , 17933.4 
tokens/s INFO:__main__:2024-11-05 14:27:54 | Epoch: 0 | Step: 160840 | Dataset: 0-1978944 | Loss: 0.869 | 914 ms/step , 6884.11 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 14:28:04 | Epoch: 0 | Step: 160850 | Dataset: 0-1979264 | Loss: 0.878 | 913 ms/step , 6886.89 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 14:28:13 | Epoch: 0 | Step: 160860 | Dataset: 0-1979584 | Loss: 0.837 | 914 ms/step , 6882.84 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 14:28:22 | Epoch: 0 | Step: 160870 | Dataset: 0-1979904 | Loss: 0.727 | 913 ms/step , 6885.16 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 14:28:31 | Epoch: 0 | Step: 160880 | Dataset: 0-1980224 | Loss: 0.783 | 914 ms/step , 6883.07 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 14:28:40 | Epoch: 0 | Step: 160890 | Dataset: 0-1980544 | Loss: 0.804 | 913 ms/step , 6888.51 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 14:28:49 | Epoch: 0 | Step: 160900 | Dataset: 0-1980864 | Loss: 0.741 | 914 ms/step , 6883.41 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 14:28:51 | Validation | Step: 160900 | Val_loss: 0.748 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:29:00 | Epoch: 0 | Step: 160910 | Dataset: 0-1981184 | Loss: 0.929 | 914 ms/step , 6882.12 GFLOP/s , 15273.0 tokens/s INFO:__main__:2024-11-05 14:29:09 | Epoch: 0 | Step: 160920 | Dataset: 0-1981504 | Loss: 0.962 | 914 ms/step , 6878.77 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 14:29:18 | Epoch: 0 | Step: 160930 | Dataset: 0-1981824 | Loss: 0.764 | 915 ms/step , 6875.37 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 14:29:27 | Epoch: 0 | Step: 160940 | Dataset: 0-1982144 | Loss: 0.884 | 914 ms/step , 6878.40 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 14:29:37 | Epoch: 0 | Step: 160950 | Dataset: 0-1982464 | Loss: 0.925 | 912 ms/step , 6895.42 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 14:29:46 | Epoch: 0 | Step: 160960 | Dataset: 0-1982784 | Loss: 0.912 | 913 ms/step , 6886.01 GFLOP/s , 
17925.8 tokens/s INFO:__main__:2024-11-05 14:29:55 | Epoch: 0 | Step: 160970 | Dataset: 0-1983104 | Loss: 0.856 | 913 ms/step , 6885.56 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 14:30:04 | Epoch: 0 | Step: 160980 | Dataset: 0-1983424 | Loss: 0.882 | 913 ms/step , 6885.43 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 14:30:13 | Epoch: 0 | Step: 160990 | Dataset: 0-1983744 | Loss: 0.887 | 914 ms/step , 6879.16 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 14:30:22 | Epoch: 0 | Step: 161000 | Dataset: 0-1984064 | Loss: 0.742 | 913 ms/step , 6886.05 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 14:30:24 | Validation | Step: 161000 | Val_loss: 0.682 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:30:24 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_143024_step_161000.pt` INFO:__main__:2024-11-05 14:30:34 | Epoch: 0 | Step: 161010 | Dataset: 0-1984384 | Loss: 0.788 | 913 ms/step , 6886.02 GFLOP/s , 13818.4 tokens/s INFO:__main__:2024-11-05 14:30:43 | Epoch: 0 | Step: 161020 | Dataset: 0-1984704 | Loss: 0.949 | 915 ms/step , 6876.53 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 14:30:52 | Epoch: 0 | Step: 161030 | Dataset: 0-1985024 | Loss: 0.892 | 914 ms/step , 6884.17 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 14:31:02 | Epoch: 0 | Step: 161040 | Dataset: 0-1985344 | Loss: 0.676 | 912 ms/step , 6896.05 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 14:31:11 | Epoch: 0 | Step: 161050 | Dataset: 0-1985664 | Loss: 0.684 | 913 ms/step , 6891.70 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 14:31:20 | Epoch: 0 | Step: 161060 | Dataset: 0-1985984 | Loss: 0.913 | 913 ms/step , 6888.28 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 14:31:29 | Epoch: 0 | Step: 161070 | Dataset: 0-1986304 | Loss: 0.797 | 913 ms/step , 6885.99 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 14:31:38 | Epoch: 0 | Step: 161080 | Dataset: 0-1986624 | Loss: 0.812 | 914 ms/step , 6879.99 GFLOP/s , 
17930.7 tokens/s INFO:__main__:2024-11-05 14:31:47 | Epoch: 0 | Step: 161090 | Dataset: 0-1986944 | Loss: 0.817 | 914 ms/step , 6883.75 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 14:31:56 | Epoch: 0 | Step: 161100 | Dataset: 0-1987264 | Loss: 0.868 | 914 ms/step , 6882.30 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 14:31:58 | Validation | Step: 161100 | Val_loss: 0.804 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:32:07 | Epoch: 0 | Step: 161110 | Dataset: 0-1987584 | Loss: 0.924 | 914 ms/step , 6880.40 GFLOP/s , 15288.5 tokens/s INFO:__main__:2024-11-05 14:32:16 | Epoch: 0 | Step: 161120 | Dataset: 0-1987904 | Loss: 0.800 | 913 ms/step , 6887.54 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 14:32:25 | Epoch: 0 | Step: 161130 | Dataset: 0-1988224 | Loss: 0.747 | 911 ms/step , 6900.97 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 14:32:34 | Epoch: 0 | Step: 161140 | Dataset: 0-1988544 | Loss: 0.895 | 913 ms/step , 6886.96 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 14:32:44 | Epoch: 0 | Step: 161150 | Dataset: 0-1988864 | Loss: 0.792 | 913 ms/step , 6890.65 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 14:32:53 | Epoch: 0 | Step: 161160 | Dataset: 0-1989184 | Loss: 0.972 | 913 ms/step , 6891.74 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 14:33:02 | Epoch: 0 | Step: 161170 | Dataset: 0-1989504 | Loss: 0.842 | 913 ms/step , 6887.13 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 14:33:11 | Epoch: 0 | Step: 161180 | Dataset: 0-1989824 | Loss: 0.781 | 913 ms/step , 6892.53 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 14:33:20 | Epoch: 0 | Step: 161190 | Dataset: 0-1990144 | Loss: 0.855 | 915 ms/step , 6875.28 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 14:33:29 | Epoch: 0 | Step: 161200 | Dataset: 0-1990464 | Loss: 0.597 | 912 ms/step , 6893.33 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 14:33:31 | Validation | Step: 161200 | Val_loss: 0.765 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 14:33:40 | Epoch: 0 | Step: 161210 | Dataset: 0-1990784 | Loss: 0.580 | 912 ms/step , 6894.01 GFLOP/s , 15276.2 tokens/s INFO:__main__:2024-11-05 14:33:49 | Epoch: 0 | Step: 161220 | Dataset: 0-1991104 | Loss: 0.744 | 913 ms/step , 6889.76 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 14:33:58 | Epoch: 0 | Step: 161230 | Dataset: 0-1991424 | Loss: 0.902 | 914 ms/step , 6878.59 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 14:34:07 | Epoch: 0 | Step: 161240 | Dataset: 0-1991744 | Loss: 0.576 | 913 ms/step , 6891.64 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 14:34:17 | Epoch: 0 | Step: 161250 | Dataset: 0-1992064 | Loss: 0.885 | 914 ms/step , 6883.72 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 14:34:26 | Epoch: 0 | Step: 161260 | Dataset: 0-1992384 | Loss: 0.811 | 913 ms/step , 6886.66 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 14:34:35 | Epoch: 0 | Step: 161270 | Dataset: 0-1992704 | Loss: 0.742 | 913 ms/step , 6891.81 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 14:34:44 | Epoch: 0 | Step: 161280 | Dataset: 0-1993024 | Loss: 0.989 | 914 ms/step , 6880.54 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 14:34:53 | Epoch: 0 | Step: 161290 | Dataset: 0-1993344 | Loss: 0.821 | 912 ms/step , 6894.73 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 14:35:02 | Epoch: 0 | Step: 161300 | Dataset: 0-1993664 | Loss: 0.864 | 914 ms/step , 6877.88 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 14:35:04 | Validation | Step: 161300 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:35:13 | Epoch: 0 | Step: 161310 | Dataset: 0-1993984 | Loss: 0.783 | 913 ms/step , 6885.80 GFLOP/s , 15278.9 tokens/s INFO:__main__:2024-11-05 14:35:22 | Epoch: 0 | Step: 161320 | Dataset: 0-1994304 | Loss: 1.040 | 913 ms/step , 6885.63 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 14:35:31 | Epoch: 0 | Step: 161330 | Dataset: 0-1994624 | Loss: 0.813 | 914 ms/step , 6882.20 GFLOP/s , 17925.5 
tokens/s INFO:__main__:2024-11-05 14:35:40 | Epoch: 0 | Step: 161340 | Dataset: 0-1994944 | Loss: 0.874 | 914 ms/step , 6877.94 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 14:35:50 | Epoch: 0 | Step: 161350 | Dataset: 0-1995264 | Loss: 0.991 | 914 ms/step , 6884.65 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 14:35:59 | Epoch: 0 | Step: 161360 | Dataset: 0-1995584 | Loss: 0.742 | 913 ms/step , 6891.74 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 14:36:08 | Epoch: 0 | Step: 161370 | Dataset: 0-1995904 | Loss: 0.847 | 913 ms/step , 6887.83 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 14:36:17 | Epoch: 0 | Step: 161380 | Dataset: 0-1996224 | Loss: 0.857 | 913 ms/step , 6885.87 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 14:36:26 | Epoch: 0 | Step: 161390 | Dataset: 0-1996544 | Loss: 0.884 | 914 ms/step , 6881.81 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 14:36:35 | Epoch: 0 | Step: 161400 | Dataset: 0-1996864 | Loss: 0.969 | 915 ms/step , 6875.72 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 14:36:37 | Validation | Step: 161400 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:36:46 | Epoch: 0 | Step: 161410 | Dataset: 0-1997184 | Loss: 0.963 | 913 ms/step , 6886.97 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-05 14:36:55 | Epoch: 0 | Step: 161420 | Dataset: 0-1997504 | Loss: 0.809 | 915 ms/step , 6875.99 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 14:37:04 | Epoch: 0 | Step: 161430 | Dataset: 0-1997824 | Loss: 0.787 | 912 ms/step , 6893.34 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 14:37:13 | Epoch: 0 | Step: 161440 | Dataset: 0-1998144 | Loss: 0.822 | 913 ms/step , 6889.83 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 14:37:22 | Epoch: 0 | Step: 161450 | Dataset: 0-1998464 | Loss: 0.715 | 914 ms/step , 6885.00 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 14:37:32 | Epoch: 0 | Step: 161460 | Dataset: 0-1998784 | Loss: 0.743 | 914 ms/step , 6884.77 GFLOP/s , 
17941.5 tokens/s INFO:__main__:2024-11-05 14:37:41 | Epoch: 0 | Step: 161470 | Dataset: 0-1999104 | Loss: 0.859 | 912 ms/step , 6893.20 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 14:37:50 | Epoch: 0 | Step: 161480 | Dataset: 0-1999424 | Loss: 0.845 | 915 ms/step , 6877.44 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 14:37:59 | Epoch: 0 | Step: 161490 | Dataset: 0-1999744 | Loss: 0.894 | 915 ms/step , 6875.62 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 14:38:08 | Epoch: 0 | Step: 161500 | Dataset: 0-2000064 | Loss: 0.887 | 914 ms/step , 6883.35 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 14:38:10 | Validation | Step: 161500 | Val_loss: 0.743 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:38:19 | Epoch: 0 | Step: 161510 | Dataset: 0-2000384 | Loss: 1.022 | 914 ms/step , 6884.25 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 14:38:28 | Epoch: 0 | Step: 161520 | Dataset: 0-2000704 | Loss: 0.796 | 913 ms/step , 6885.95 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 14:38:37 | Epoch: 0 | Step: 161530 | Dataset: 0-2001024 | Loss: 0.959 | 913 ms/step , 6888.14 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 14:38:46 | Epoch: 0 | Step: 161540 | Dataset: 0-2001344 | Loss: 0.795 | 912 ms/step , 6898.96 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 14:38:55 | Epoch: 0 | Step: 161550 | Dataset: 0-2001664 | Loss: 0.847 | 914 ms/step , 6877.68 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 14:39:05 | Epoch: 0 | Step: 161560 | Dataset: 0-2001984 | Loss: 0.844 | 914 ms/step , 6879.23 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 14:39:14 | Epoch: 0 | Step: 161570 | Dataset: 0-2002304 | Loss: 0.918 | 914 ms/step , 6882.68 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 14:39:23 | Epoch: 0 | Step: 161580 | Dataset: 0-2002624 | Loss: 0.861 | 912 ms/step , 6894.43 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 14:39:32 | Epoch: 0 | Step: 161590 | Dataset: 0-2002944 | Loss: 0.614 | 912 ms/step , 6893.43 GFLOP/s 
, 17924.9 tokens/s INFO:__main__:2024-11-05 14:39:41 | Epoch: 0 | Step: 161600 | Dataset: 0-2003264 | Loss: 0.847 | 913 ms/step , 6891.84 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 14:39:43 | Validation | Step: 161600 | Val_loss: 0.796 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:39:52 | Epoch: 0 | Step: 161610 | Dataset: 0-2003584 | Loss: 0.473 | 914 ms/step , 6880.89 GFLOP/s , 15280.0 tokens/s INFO:__main__:2024-11-05 14:40:01 | Epoch: 0 | Step: 161620 | Dataset: 0-2003904 | Loss: 0.868 | 912 ms/step , 6895.60 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 14:40:10 | Epoch: 0 | Step: 161630 | Dataset: 0-2004224 | Loss: 0.899 | 914 ms/step , 6877.84 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 14:40:19 | Epoch: 0 | Step: 161640 | Dataset: 0-2004544 | Loss: 0.769 | 914 ms/step , 6884.13 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 14:40:28 | Epoch: 0 | Step: 161650 | Dataset: 0-2004864 | Loss: 0.734 | 914 ms/step , 6882.58 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 14:40:38 | Epoch: 0 | Step: 161660 | Dataset: 0-2005184 | Loss: 0.649 | 913 ms/step , 6888.22 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 14:40:47 | Epoch: 0 | Step: 161670 | Dataset: 0-2005504 | Loss: 0.853 | 914 ms/step , 6882.99 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 14:40:56 | Epoch: 0 | Step: 161680 | Dataset: 0-2005824 | Loss: 0.874 | 914 ms/step , 6881.24 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 14:41:05 | Epoch: 0 | Step: 161690 | Dataset: 0-2006144 | Loss: 0.720 | 912 ms/step , 6896.22 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 14:41:14 | Epoch: 0 | Step: 161700 | Dataset: 0-2006464 | Loss: 0.859 | 913 ms/step , 6886.03 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 14:41:16 | Validation | Step: 161700 | Val_loss: 0.734 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:41:25 | Epoch: 0 | Step: 161710 | Dataset: 0-2006784 | Loss: 0.847 | 914 ms/step , 6878.76 GFLOP/s , 15265.1 tokens/s 
INFO:__main__:2024-11-05 14:41:34 | Epoch: 0 | Step: 161720 | Dataset: 0-2007104 | Loss: 0.803 | 913 ms/step , 6889.54 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 14:41:43 | Epoch: 0 | Step: 161730 | Dataset: 0-2007424 | Loss: 0.648 | 912 ms/step , 6896.41 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 14:41:52 | Epoch: 0 | Step: 161740 | Dataset: 0-2007744 | Loss: 0.802 | 913 ms/step , 6885.93 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 14:42:01 | Epoch: 0 | Step: 161750 | Dataset: 0-2008064 | Loss: 0.659 | 913 ms/step , 6885.52 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 14:42:11 | Epoch: 0 | Step: 161760 | Dataset: 0-2008384 | Loss: 0.778 | 913 ms/step , 6889.63 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 14:42:20 | Epoch: 0 | Step: 161770 | Dataset: 0-2008704 | Loss: 0.562 | 912 ms/step , 6899.38 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 14:42:29 | Epoch: 0 | Step: 161780 | Dataset: 0-2009024 | Loss: 0.882 | 914 ms/step , 6881.97 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 14:42:38 | Epoch: 0 | Step: 161790 | Dataset: 0-2009344 | Loss: 0.663 | 913 ms/step , 6888.52 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 14:42:47 | Epoch: 0 | Step: 161800 | Dataset: 0-2009664 | Loss: 0.668 | 911 ms/step , 6900.40 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 14:42:49 | Validation | Step: 161800 | Val_loss: 0.743 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:42:58 | Epoch: 0 | Step: 161810 | Dataset: 0-2009984 | Loss: 0.898 | 914 ms/step , 6878.47 GFLOP/s , 15274.4 tokens/s INFO:__main__:2024-11-05 14:43:07 | Epoch: 0 | Step: 161820 | Dataset: 0-2010304 | Loss: 0.624 | 913 ms/step , 6892.48 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 14:43:16 | Epoch: 0 | Step: 161830 | Dataset: 0-2010624 | Loss: 0.777 | 912 ms/step , 6895.46 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 14:43:25 | Epoch: 0 | Step: 161840 | Dataset: 0-2010944 | Loss: 0.804 | 913 ms/step , 6890.07 GFLOP/s , 17925.6 
tokens/s INFO:__main__:2024-11-05 14:43:34 | Epoch: 0 | Step: 161850 | Dataset: 0-2011264 | Loss: 0.757 | 913 ms/step , 6886.86 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 14:43:43 | Epoch: 0 | Step: 161860 | Dataset: 0-2011584 | Loss: 0.980 | 915 ms/step , 6875.56 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 14:43:53 | Epoch: 0 | Step: 161870 | Dataset: 0-2011904 | Loss: 0.629 | 914 ms/step , 6883.44 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 14:44:02 | Epoch: 0 | Step: 161880 | Dataset: 0-2012224 | Loss: 0.803 | 914 ms/step , 6879.54 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 14:44:11 | Epoch: 0 | Step: 161890 | Dataset: 0-2012544 | Loss: 0.747 | 913 ms/step , 6887.73 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 14:44:20 | Epoch: 0 | Step: 161900 | Dataset: 0-2012864 | Loss: 0.879 | 913 ms/step , 6890.32 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 14:44:22 | Validation | Step: 161900 | Val_loss: 0.711 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:44:31 | Epoch: 0 | Step: 161910 | Dataset: 0-2013184 | Loss: 0.810 | 913 ms/step , 6889.91 GFLOP/s , 15278.3 tokens/s INFO:__main__:2024-11-05 14:44:40 | Epoch: 0 | Step: 161920 | Dataset: 0-2013504 | Loss: 0.591 | 913 ms/step , 6890.99 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 14:44:49 | Epoch: 0 | Step: 161930 | Dataset: 0-2013824 | Loss: 0.890 | 913 ms/step , 6891.36 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 14:44:58 | Epoch: 0 | Step: 161940 | Dataset: 0-2014144 | Loss: 0.812 | 913 ms/step , 6888.26 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 14:45:07 | Epoch: 0 | Step: 161950 | Dataset: 0-2014464 | Loss: 0.834 | 913 ms/step , 6886.47 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 14:45:16 | Epoch: 0 | Step: 161960 | Dataset: 0-2014784 | Loss: 0.815 | 914 ms/step , 6877.59 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 14:45:26 | Epoch: 0 | Step: 161970 | Dataset: 0-2015104 | Loss: 0.836 | 914 ms/step , 6883.54 GFLOP/s , 
17923.7 tokens/s INFO:__main__:2024-11-05 14:45:35 | Epoch: 0 | Step: 161980 | Dataset: 0-2015424 | Loss: 0.879 | 915 ms/step , 6877.20 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 14:45:44 | Epoch: 0 | Step: 161990 | Dataset: 0-2015744 | Loss: 0.315 | 911 ms/step , 6901.21 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 14:45:53 | Epoch: 0 | Step: 162000 | Dataset: 0-2016064 | Loss: 0.499 | 911 ms/step , 6902.40 GFLOP/s , 17968.2 tokens/s INFO:__main__:2024-11-05 14:45:55 | Validation | Step: 162000 | Val_loss: 0.717 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:45:55 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_144555_step_162000.pt` INFO:__main__:2024-11-05 14:46:05 | Epoch: 0 | Step: 162010 | Dataset: 0-2016384 | Loss: 0.470 | 913 ms/step , 6892.29 GFLOP/s , 13835.7 tokens/s INFO:__main__:2024-11-05 14:46:14 | Epoch: 0 | Step: 162020 | Dataset: 0-2016704 | Loss: 0.464 | 911 ms/step , 6901.76 GFLOP/s , 17972.8 tokens/s INFO:__main__:2024-11-05 14:46:23 | Epoch: 0 | Step: 162030 | Dataset: 0-2017024 | Loss: 0.292 | 910 ms/step , 6910.57 GFLOP/s , 17964.2 tokens/s INFO:__main__:2024-11-05 14:46:32 | Epoch: 0 | Step: 162040 | Dataset: 0-2017344 | Loss: 0.205 | 912 ms/step , 6895.45 GFLOP/s , 17960.6 tokens/s INFO:__main__:2024-11-05 14:46:41 | Epoch: 0 | Step: 162050 | Dataset: 0-2017664 | Loss: 0.557 | 912 ms/step , 6899.79 GFLOP/s , 17965.4 tokens/s INFO:__main__:2024-11-05 14:46:50 | Epoch: 0 | Step: 162060 | Dataset: 0-2017984 | Loss: 0.246 | 912 ms/step , 6900.02 GFLOP/s , 17963.9 tokens/s INFO:__main__:2024-11-05 14:47:00 | Epoch: 0 | Step: 162070 | Dataset: 0-2018304 | Loss: 0.340 | 911 ms/step , 6901.47 GFLOP/s , 17965.2 tokens/s INFO:__main__:2024-11-05 14:47:09 | Epoch: 0 | Step: 162080 | Dataset: 0-2018624 | Loss: 0.313 | 911 ms/step , 6900.80 GFLOP/s , 17965.7 tokens/s INFO:__main__:2024-11-05 14:47:18 | Epoch: 0 | Step: 162090 | Dataset: 0-2018944 | Loss: 0.391 | 911 ms/step , 6904.30 GFLOP/s , 
17965.2 tokens/s INFO:__main__:2024-11-05 14:47:27 | Epoch: 0 | Step: 162100 | Dataset: 0-2019264 | Loss: 0.403 | 913 ms/step , 6889.68 GFLOP/s , 17960.1 tokens/s INFO:__main__:2024-11-05 14:47:28 | Validation | Step: 162100 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:47:38 | Epoch: 0 | Step: 162110 | Dataset: 0-2019584 | Loss: 0.359 | 911 ms/step , 6900.60 GFLOP/s , 15303.4 tokens/s INFO:__main__:2024-11-05 14:47:47 | Epoch: 0 | Step: 162120 | Dataset: 0-2019904 | Loss: 0.291 | 911 ms/step , 6901.39 GFLOP/s , 17964.9 tokens/s INFO:__main__:2024-11-05 14:47:56 | Epoch: 0 | Step: 162130 | Dataset: 0-2020224 | Loss: 0.155 | 911 ms/step , 6903.83 GFLOP/s , 17966.5 tokens/s INFO:__main__:2024-11-05 14:48:05 | Epoch: 0 | Step: 162140 | Dataset: 0-2020544 | Loss: 0.247 | 910 ms/step , 6908.68 GFLOP/s , 17969.1 tokens/s INFO:__main__:2024-11-05 14:48:14 | Epoch: 0 | Step: 162150 | Dataset: 0-2020864 | Loss: 0.136 | 911 ms/step , 6907.12 GFLOP/s , 17965.1 tokens/s INFO:__main__:2024-11-05 14:48:23 | Epoch: 0 | Step: 162160 | Dataset: 0-2021184 | Loss: 0.375 | 912 ms/step , 6897.14 GFLOP/s , 17961.7 tokens/s INFO:__main__:2024-11-05 14:48:32 | Epoch: 0 | Step: 162170 | Dataset: 0-2021504 | Loss: 0.440 | 911 ms/step , 6901.33 GFLOP/s , 17967.8 tokens/s INFO:__main__:2024-11-05 14:48:41 | Epoch: 0 | Step: 162180 | Dataset: 0-2021824 | Loss: 0.284 | 911 ms/step , 6905.72 GFLOP/s , 17964.5 tokens/s INFO:__main__:2024-11-05 14:48:51 | Epoch: 0 | Step: 162190 | Dataset: 0-2022144 | Loss: 0.391 | 911 ms/step , 6904.39 GFLOP/s , 17963.8 tokens/s INFO:__main__:2024-11-05 14:49:00 | Epoch: 0 | Step: 162200 | Dataset: 0-2022464 | Loss: 0.472 | 911 ms/step , 6900.68 GFLOP/s , 17962.8 tokens/s INFO:__main__:2024-11-05 14:49:01 | Validation | Step: 162200 | Val_loss: 0.747 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:49:10 | Epoch: 0 | Step: 162210 | Dataset: 0-2022784 | Loss: 0.460 | 911 ms/step , 6901.16 GFLOP/s , 15295.2 tokens/s 
INFO:__main__:2024-11-05 14:49:20 | Epoch: 0 | Step: 162220 | Dataset: 0-2023104 | Loss: 0.434 | 912 ms/step , 6895.41 GFLOP/s , 17960.1 tokens/s INFO:__main__:2024-11-05 14:49:29 | Epoch: 0 | Step: 162230 | Dataset: 0-2023424 | Loss: 0.935 | 912 ms/step , 6896.60 GFLOP/s , 17953.7 tokens/s INFO:__main__:2024-11-05 14:49:38 | Epoch: 0 | Step: 162240 | Dataset: 0-2023744 | Loss: 0.481 | 912 ms/step , 6898.66 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-05 14:49:47 | Epoch: 0 | Step: 162250 | Dataset: 0-2024064 | Loss: 0.422 | 912 ms/step , 6897.45 GFLOP/s , 17956.5 tokens/s INFO:__main__:2024-11-05 14:49:56 | Epoch: 0 | Step: 162260 | Dataset: 0-2024384 | Loss: 0.785 | 914 ms/step , 6881.73 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 14:50:05 | Epoch: 0 | Step: 162270 | Dataset: 0-2024704 | Loss: 0.797 | 913 ms/step , 6886.40 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 14:50:14 | Epoch: 0 | Step: 162280 | Dataset: 0-2025024 | Loss: 0.802 | 915 ms/step , 6873.46 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 14:50:23 | Epoch: 0 | Step: 162290 | Dataset: 0-2025344 | Loss: 0.771 | 914 ms/step , 6880.67 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 14:50:33 | Epoch: 0 | Step: 162300 | Dataset: 0-2025664 | Loss: 0.783 | 914 ms/step , 6883.69 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 14:50:34 | Validation | Step: 162300 | Val_loss: 0.741 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:50:43 | Epoch: 0 | Step: 162310 | Dataset: 0-2025984 | Loss: 0.812 | 913 ms/step , 6886.50 GFLOP/s , 15263.8 tokens/s INFO:__main__:2024-11-05 14:50:52 | Epoch: 0 | Step: 162320 | Dataset: 0-2026304 | Loss: 0.817 | 914 ms/step , 6880.57 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 14:51:02 | Epoch: 0 | Step: 162330 | Dataset: 0-2026624 | Loss: 0.789 | 914 ms/step , 6878.26 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 14:51:11 | Epoch: 0 | Step: 162340 | Dataset: 0-2026944 | Loss: 0.754 | 914 ms/step , 6879.27 GFLOP/s , 17923.2 
tokens/s INFO:__main__:2024-11-05 14:51:20 | Epoch: 0 | Step: 162350 | Dataset: 0-2027264 | Loss: 0.768 | 913 ms/step , 6890.69 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 14:51:29 | Epoch: 0 | Step: 162360 | Dataset: 0-2027584 | Loss: 0.741 | 913 ms/step , 6887.10 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 14:51:38 | Epoch: 0 | Step: 162370 | Dataset: 0-2027904 | Loss: 0.744 | 915 ms/step , 6874.73 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 14:51:47 | Epoch: 0 | Step: 162380 | Dataset: 0-2028224 | Loss: 0.793 | 914 ms/step , 6884.32 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 14:51:56 | Epoch: 0 | Step: 162390 | Dataset: 0-2028544 | Loss: 0.768 | 912 ms/step , 6895.14 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 14:52:06 | Epoch: 0 | Step: 162400 | Dataset: 0-2028864 | Loss: 0.763 | 913 ms/step , 6888.54 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 14:52:07 | Validation | Step: 162400 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:52:16 | Epoch: 0 | Step: 162410 | Dataset: 0-2029184 | Loss: 0.700 | 914 ms/step , 6880.75 GFLOP/s , 15263.7 tokens/s INFO:__main__:2024-11-05 14:52:26 | Epoch: 0 | Step: 162420 | Dataset: 0-2029504 | Loss: 0.778 | 913 ms/step , 6887.65 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 14:52:35 | Epoch: 0 | Step: 162430 | Dataset: 0-2029824 | Loss: 0.785 | 914 ms/step , 6883.07 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 14:52:44 | Epoch: 0 | Step: 162440 | Dataset: 0-2030144 | Loss: 0.738 | 914 ms/step , 6884.67 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 14:52:53 | Epoch: 0 | Step: 162450 | Dataset: 0-2030464 | Loss: 0.816 | 913 ms/step , 6885.23 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 14:53:02 | Epoch: 0 | Step: 162460 | Dataset: 0-2030784 | Loss: 0.773 | 915 ms/step , 6876.33 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 14:53:11 | Epoch: 0 | Step: 162470 | Dataset: 0-2031104 | Loss: 0.787 | 912 ms/step , 6895.74 GFLOP/s , 
17926.6 tokens/s INFO:__main__:2024-11-05 14:53:20 | Epoch: 0 | Step: 162480 | Dataset: 0-2031424 | Loss: 0.778 | 914 ms/step , 6881.70 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 14:53:30 | Epoch: 0 | Step: 162490 | Dataset: 0-2031744 | Loss: 0.773 | 913 ms/step , 6889.49 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 14:53:39 | Epoch: 0 | Step: 162500 | Dataset: 0-2032064 | Loss: 0.748 | 912 ms/step , 6892.97 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 14:53:40 | Validation | Step: 162500 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:53:49 | Epoch: 0 | Step: 162510 | Dataset: 0-2032384 | Loss: 0.764 | 913 ms/step , 6892.59 GFLOP/s , 15268.3 tokens/s INFO:__main__:2024-11-05 14:53:59 | Epoch: 0 | Step: 162520 | Dataset: 0-2032704 | Loss: 0.800 | 917 ms/step , 6860.99 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-05 14:54:08 | Epoch: 0 | Step: 162530 | Dataset: 0-2033024 | Loss: 0.850 | 913 ms/step , 6885.96 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 14:54:17 | Epoch: 0 | Step: 162540 | Dataset: 0-2033344 | Loss: 0.732 | 915 ms/step , 6874.62 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 14:54:26 | Epoch: 0 | Step: 162550 | Dataset: 0-2033664 | Loss: 0.774 | 915 ms/step , 6873.74 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 14:54:35 | Epoch: 0 | Step: 162560 | Dataset: 0-2033984 | Loss: 0.773 | 912 ms/step , 6893.11 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 14:54:44 | Epoch: 0 | Step: 162570 | Dataset: 0-2034304 | Loss: 0.725 | 913 ms/step , 6886.47 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 14:54:53 | Epoch: 0 | Step: 162580 | Dataset: 0-2034624 | Loss: 0.837 | 915 ms/step , 6875.16 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 14:55:03 | Epoch: 0 | Step: 162590 | Dataset: 0-2034944 | Loss: 0.773 | 914 ms/step , 6881.49 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 14:55:12 | Epoch: 0 | Step: 162600 | Dataset: 0-2035264 | Loss: 0.773 | 915 ms/step , 6872.89 GFLOP/s 
, 17919.3 tokens/s INFO:__main__:2024-11-05 14:55:13 | Validation | Step: 162600 | Val_loss: 0.722 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:55:22 | Epoch: 0 | Step: 162610 | Dataset: 0-2035584 | Loss: 0.780 | 914 ms/step , 6884.25 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 14:55:32 | Epoch: 0 | Step: 162620 | Dataset: 0-2035904 | Loss: 0.732 | 914 ms/step , 6877.67 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 14:55:41 | Epoch: 0 | Step: 162630 | Dataset: 0-2036224 | Loss: 0.786 | 914 ms/step , 6879.12 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 14:55:50 | Epoch: 0 | Step: 162640 | Dataset: 0-2036544 | Loss: 0.800 | 914 ms/step , 6884.52 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 14:55:59 | Epoch: 0 | Step: 162650 | Dataset: 0-2036864 | Loss: 0.771 | 913 ms/step , 6886.11 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 14:56:08 | Epoch: 0 | Step: 162660 | Dataset: 0-2037184 | Loss: 0.767 | 914 ms/step , 6883.47 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 14:56:17 | Epoch: 0 | Step: 162670 | Dataset: 0-2037504 | Loss: 0.740 | 913 ms/step , 6885.71 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 14:56:26 | Epoch: 0 | Step: 162680 | Dataset: 0-2037824 | Loss: 0.743 | 913 ms/step , 6885.11 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 14:56:36 | Epoch: 0 | Step: 162690 | Dataset: 0-2038144 | Loss: 0.783 | 914 ms/step , 6882.71 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 14:56:45 | Epoch: 0 | Step: 162700 | Dataset: 0-2038464 | Loss: 0.851 | 914 ms/step , 6878.31 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 14:56:46 | Validation | Step: 162700 | Val_loss: 0.723 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:56:55 | Epoch: 0 | Step: 162710 | Dataset: 0-2038784 | Loss: 0.911 | 913 ms/step , 6891.38 GFLOP/s , 15263.9 tokens/s INFO:__main__:2024-11-05 14:57:05 | Epoch: 0 | Step: 162720 | Dataset: 0-2039104 | Loss: 0.713 | 913 ms/step , 6889.77 GFLOP/s , 17922.8 tokens/s 
INFO:__main__:2024-11-05 14:57:14 | Epoch: 0 | Step: 162730 | Dataset: 0-2039424 | Loss: 0.751 | 914 ms/step , 6878.30 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 14:57:23 | Epoch: 0 | Step: 162740 | Dataset: 0-2039744 | Loss: 0.758 | 914 ms/step , 6880.34 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 14:57:32 | Epoch: 0 | Step: 162750 | Dataset: 0-2040064 | Loss: 0.728 | 914 ms/step , 6884.59 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 14:57:41 | Epoch: 0 | Step: 162760 | Dataset: 0-2040384 | Loss: 0.790 | 914 ms/step , 6882.64 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 14:57:50 | Epoch: 0 | Step: 162770 | Dataset: 0-2040704 | Loss: 0.706 | 912 ms/step , 6894.96 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 14:57:59 | Epoch: 0 | Step: 162780 | Dataset: 0-2041024 | Loss: 0.766 | 913 ms/step , 6888.57 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 14:58:09 | Epoch: 0 | Step: 162790 | Dataset: 0-2041344 | Loss: 0.786 | 914 ms/step , 6880.20 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 14:58:18 | Epoch: 0 | Step: 162800 | Dataset: 0-2041664 | Loss: 0.760 | 914 ms/step , 6884.80 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 14:58:19 | Validation | Step: 162800 | Val_loss: 0.749 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 14:58:28 | Epoch: 0 | Step: 162810 | Dataset: 0-2041984 | Loss: 0.810 | 914 ms/step , 6881.39 GFLOP/s , 15265.7 tokens/s INFO:__main__:2024-11-05 14:58:38 | Epoch: 0 | Step: 162820 | Dataset: 0-2042304 | Loss: 0.760 | 914 ms/step , 6879.30 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 14:58:47 | Epoch: 0 | Step: 162830 | Dataset: 0-2042624 | Loss: 0.739 | 914 ms/step , 6878.33 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 14:58:56 | Epoch: 0 | Step: 162840 | Dataset: 0-2042944 | Loss: 0.784 | 914 ms/step , 6883.85 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 14:59:05 | Epoch: 0 | Step: 162850 | Dataset: 0-2043264 | Loss: 0.786 | 914 ms/step , 6878.37 GFLOP/s , 17911.6 
tokens/s INFO:__main__:2024-11-05 14:59:14 | Epoch: 0 | Step: 162860 | Dataset: 0-2043584 | Loss: 0.794 | 914 ms/step , 6878.48 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 14:59:23 | Epoch: 0 | Step: 162870 | Dataset: 0-2043904 | Loss: 0.740 | 914 ms/step , 6882.21 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 14:59:32 | Epoch: 0 | Step: 162880 | Dataset: 0-2044224 | Loss: 0.739 | 914 ms/step , 6884.88 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 14:59:42 | Epoch: 0 | Step: 162890 | Dataset: 0-2044544 | Loss: 0.790 | 915 ms/step , 6874.95 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 14:59:51 | Epoch: 0 | Step: 162900 | Dataset: 0-2044864 | Loss: 0.671 | 913 ms/step , 6890.87 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 14:59:52 | Validation | Step: 162900 | Val_loss: 0.760 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:00:01 | Epoch: 0 | Step: 162910 | Dataset: 0-2045184 | Loss: 0.820 | 915 ms/step , 6875.20 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-05 15:00:11 | Epoch: 0 | Step: 162920 | Dataset: 0-2045504 | Loss: 0.751 | 913 ms/step , 6890.54 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 15:00:20 | Epoch: 0 | Step: 162930 | Dataset: 0-2045824 | Loss: 0.748 | 914 ms/step , 6884.73 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 15:00:29 | Epoch: 0 | Step: 162940 | Dataset: 0-2046144 | Loss: 0.734 | 914 ms/step , 6878.83 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 15:00:38 | Epoch: 0 | Step: 162950 | Dataset: 0-2046464 | Loss: 0.757 | 913 ms/step , 6887.74 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 15:00:47 | Epoch: 0 | Step: 162960 | Dataset: 0-2046784 | Loss: 0.777 | 914 ms/step , 6877.85 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 15:00:56 | Epoch: 0 | Step: 162970 | Dataset: 0-2047104 | Loss: 0.778 | 915 ms/step , 6873.97 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-05 15:01:05 | Epoch: 0 | Step: 162980 | Dataset: 0-2047424 | Loss: 0.694 | 914 ms/step , 6883.60 GFLOP/s , 
17917.7 tokens/s INFO:__main__:2024-11-05 15:01:15 | Epoch: 0 | Step: 162990 | Dataset: 0-2047744 | Loss: 0.768 | 915 ms/step , 6875.74 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 15:01:24 | Epoch: 0 | Step: 163000 | Dataset: 0-2048064 | Loss: 0.743 | 914 ms/step , 6879.58 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 15:01:25 | Validation | Step: 163000 | Val_loss: 0.796 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:01:25 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_150125_step_163000.pt` INFO:__main__:2024-11-05 15:01:36 | Epoch: 0 | Step: 163010 | Dataset: 0-2048384 | Loss: 0.764 | 914 ms/step , 6881.53 GFLOP/s , 13794.2 tokens/s INFO:__main__:2024-11-05 15:01:45 | Epoch: 0 | Step: 163020 | Dataset: 0-2048704 | Loss: 0.813 | 914 ms/step , 6881.89 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-05 15:01:54 | Epoch: 0 | Step: 163030 | Dataset: 0-2049024 | Loss: 0.820 | 915 ms/step , 6876.37 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 15:02:03 | Epoch: 0 | Step: 163040 | Dataset: 0-2049344 | Loss: 0.754 | 914 ms/step , 6881.59 GFLOP/s , 17899.3 tokens/s INFO:__main__:2024-11-05 15:02:12 | Epoch: 0 | Step: 163050 | Dataset: 0-2049664 | Loss: 0.793 | 915 ms/step , 6875.57 GFLOP/s , 17905.6 tokens/s INFO:__main__:2024-11-05 15:02:21 | Epoch: 0 | Step: 163060 | Dataset: 0-2049984 | Loss: 0.797 | 915 ms/step , 6874.61 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 15:02:30 | Epoch: 0 | Step: 163070 | Dataset: 0-2050304 | Loss: 0.792 | 913 ms/step , 6886.96 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 15:02:40 | Epoch: 0 | Step: 163080 | Dataset: 0-2050624 | Loss: 0.766 | 915 ms/step , 6873.99 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 15:02:49 | Epoch: 0 | Step: 163090 | Dataset: 0-2050944 | Loss: 0.800 | 915 ms/step , 6874.87 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 15:02:58 | Epoch: 0 | Step: 163100 | Dataset: 0-2051264 | Loss: 0.785 | 914 ms/step , 6884.37 GFLOP/s , 
17919.5 tokens/s INFO:__main__:2024-11-05 15:03:00 | Validation | Step: 163100 | Val_loss: 0.715 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:03:09 | Epoch: 0 | Step: 163110 | Dataset: 0-2051584 | Loss: 0.811 | 914 ms/step , 6877.77 GFLOP/s , 15260.2 tokens/s INFO:__main__:2024-11-05 15:03:18 | Epoch: 0 | Step: 163120 | Dataset: 0-2051904 | Loss: 0.764 | 913 ms/step , 6886.93 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 15:03:27 | Epoch: 0 | Step: 163130 | Dataset: 0-2052224 | Loss: 0.705 | 915 ms/step , 6876.30 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 15:03:36 | Epoch: 0 | Step: 163140 | Dataset: 0-2052544 | Loss: 0.767 | 914 ms/step , 6879.87 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 15:03:45 | Epoch: 0 | Step: 163150 | Dataset: 0-2052864 | Loss: 0.793 | 913 ms/step , 6889.31 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 15:03:54 | Epoch: 0 | Step: 163160 | Dataset: 0-2053184 | Loss: 0.766 | 914 ms/step , 6884.10 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 15:04:04 | Epoch: 0 | Step: 163170 | Dataset: 0-2053504 | Loss: 0.713 | 913 ms/step , 6885.62 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 15:04:13 | Epoch: 0 | Step: 163180 | Dataset: 0-2053824 | Loss: 0.763 | 914 ms/step , 6883.37 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 15:04:22 | Epoch: 0 | Step: 163190 | Dataset: 0-2054144 | Loss: 0.739 | 913 ms/step , 6892.18 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 15:04:31 | Epoch: 0 | Step: 163200 | Dataset: 0-2054464 | Loss: 0.791 | 915 ms/step , 6876.21 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 15:04:33 | Validation | Step: 163200 | Val_loss: 0.621 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:04:42 | Epoch: 0 | Step: 163210 | Dataset: 0-2054784 | Loss: 0.728 | 914 ms/step , 6884.37 GFLOP/s , 15266.7 tokens/s INFO:__main__:2024-11-05 15:04:51 | Epoch: 0 | Step: 163220 | Dataset: 0-2055104 | Loss: 0.720 | 914 ms/step , 6882.65 GFLOP/s , 17915.3 tokens/s 
INFO:__main__:2024-11-05 15:05:00 | Epoch: 0 | Step: 163230 | Dataset: 0-2055424 | Loss: 0.799 | 916 ms/step , 6868.40 GFLOP/s , 17910.4 tokens/s INFO:__main__:2024-11-05 15:05:09 | Epoch: 0 | Step: 163240 | Dataset: 0-2055744 | Loss: 0.763 | 915 ms/step , 6876.75 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 15:05:18 | Epoch: 0 | Step: 163250 | Dataset: 0-2056064 | Loss: 0.791 | 913 ms/step , 6885.50 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 15:05:27 | Epoch: 0 | Step: 163260 | Dataset: 0-2056384 | Loss: 0.904 | 914 ms/step , 6879.76 GFLOP/s , 17908.1 tokens/s INFO:__main__:2024-11-05 15:05:37 | Epoch: 0 | Step: 163270 | Dataset: 0-2056704 | Loss: 0.732 | 915 ms/step , 6876.93 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 15:05:46 | Epoch: 0 | Step: 163280 | Dataset: 0-2057024 | Loss: 0.811 | 914 ms/step , 6883.24 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 15:05:55 | Epoch: 0 | Step: 163290 | Dataset: 0-2057344 | Loss: 0.772 | 915 ms/step , 6875.23 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-05 15:06:04 | Epoch: 0 | Step: 163300 | Dataset: 0-2057664 | Loss: 0.818 | 914 ms/step , 6880.67 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 15:06:06 | Validation | Step: 163300 | Val_loss: 0.675 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:06:15 | Epoch: 0 | Step: 163310 | Dataset: 0-2057984 | Loss: 0.742 | 913 ms/step , 6887.94 GFLOP/s , 15263.3 tokens/s INFO:__main__:2024-11-05 15:06:24 | Epoch: 0 | Step: 163320 | Dataset: 0-2058304 | Loss: 0.763 | 914 ms/step , 6884.15 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 15:06:33 | Epoch: 0 | Step: 163330 | Dataset: 0-2058624 | Loss: 0.823 | 913 ms/step , 6888.39 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 15:06:42 | Epoch: 0 | Step: 163340 | Dataset: 0-2058944 | Loss: 0.776 | 914 ms/step , 6882.92 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 15:06:51 | Epoch: 0 | Step: 163350 | Dataset: 0-2059264 | Loss: 0.802 | 914 ms/step , 6883.44 GFLOP/s , 17914.9 
tokens/s INFO:__main__:2024-11-05 15:07:00 | Epoch: 0 | Step: 163360 | Dataset: 0-2059584 | Loss: 0.736 | 914 ms/step , 6878.67 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 15:07:10 | Epoch: 0 | Step: 163370 | Dataset: 0-2059904 | Loss: 0.765 | 915 ms/step , 6873.20 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-05 15:07:19 | Epoch: 0 | Step: 163380 | Dataset: 0-2060224 | Loss: 0.805 | 913 ms/step , 6888.49 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 15:07:28 | Epoch: 0 | Step: 163390 | Dataset: 0-2060544 | Loss: 0.676 | 913 ms/step , 6887.31 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 15:07:37 | Epoch: 0 | Step: 163400 | Dataset: 0-2060864 | Loss: 0.740 | 914 ms/step , 6878.14 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 15:07:39 | Validation | Step: 163400 | Val_loss: 0.713 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:07:48 | Epoch: 0 | Step: 163410 | Dataset: 0-2061184 | Loss: 0.772 | 913 ms/step , 6886.30 GFLOP/s , 15264.0 tokens/s INFO:__main__:2024-11-05 15:07:57 | Epoch: 0 | Step: 163420 | Dataset: 0-2061504 | Loss: 0.736 | 914 ms/step , 6882.09 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 15:08:06 | Epoch: 0 | Step: 163430 | Dataset: 0-2061824 | Loss: 0.875 | 914 ms/step , 6883.36 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 15:08:15 | Epoch: 0 | Step: 163440 | Dataset: 0-2062144 | Loss: 0.866 | 916 ms/step , 6869.56 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 15:08:24 | Epoch: 0 | Step: 163450 | Dataset: 0-2062464 | Loss: 0.683 | 917 ms/step , 6861.36 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 15:08:33 | Epoch: 0 | Step: 163460 | Dataset: 0-2062784 | Loss: 0.748 | 913 ms/step , 6885.12 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 15:08:43 | Epoch: 0 | Step: 163470 | Dataset: 0-2063104 | Loss: 0.686 | 915 ms/step , 6874.88 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 15:08:52 | Epoch: 0 | Step: 163480 | Dataset: 0-2063424 | Loss: 0.690 | 912 ms/step , 6893.58 GFLOP/s , 
17913.6 tokens/s INFO:__main__:2024-11-05 15:09:01 | Epoch: 0 | Step: 163490 | Dataset: 0-2063744 | Loss: 0.730 | 914 ms/step , 6879.55 GFLOP/s , 17906.3 tokens/s INFO:__main__:2024-11-05 15:09:10 | Epoch: 0 | Step: 163500 | Dataset: 0-2064064 | Loss: 0.765 | 914 ms/step , 6882.95 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 15:09:12 | Validation | Step: 163500 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:09:21 | Epoch: 0 | Step: 163510 | Dataset: 0-2064384 | Loss: 0.775 | 915 ms/step , 6877.25 GFLOP/s , 15269.5 tokens/s INFO:__main__:2024-11-05 15:09:30 | Epoch: 0 | Step: 163520 | Dataset: 0-2064704 | Loss: 0.826 | 914 ms/step , 6878.90 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 15:09:39 | Epoch: 0 | Step: 163530 | Dataset: 0-2065024 | Loss: 0.749 | 914 ms/step , 6884.21 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 15:09:48 | Epoch: 0 | Step: 163540 | Dataset: 0-2065344 | Loss: 0.787 | 913 ms/step , 6890.51 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 15:09:57 | Epoch: 0 | Step: 163550 | Dataset: 0-2065664 | Loss: 0.689 | 913 ms/step , 6885.05 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 15:10:07 | Epoch: 0 | Step: 163560 | Dataset: 0-2065984 | Loss: 0.733 | 913 ms/step , 6891.46 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 15:10:16 | Epoch: 0 | Step: 163570 | Dataset: 0-2066304 | Loss: 0.779 | 915 ms/step , 6875.47 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 15:10:25 | Epoch: 0 | Step: 163580 | Dataset: 0-2066624 | Loss: 0.701 | 913 ms/step , 6887.76 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 15:10:34 | Epoch: 0 | Step: 163590 | Dataset: 0-2066944 | Loss: 0.783 | 914 ms/step , 6882.34 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 15:10:43 | Epoch: 0 | Step: 163600 | Dataset: 0-2067264 | Loss: 0.846 | 915 ms/step , 6873.12 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 15:10:45 | Validation | Step: 163600 | Val_loss: 0.684 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 15:10:54 | Epoch: 0 | Step: 163610 | Dataset: 0-2067584 | Loss: 0.842 | 915 ms/step , 6872.57 GFLOP/s , 15276.8 tokens/s INFO:__main__:2024-11-05 15:11:03 | Epoch: 0 | Step: 163620 | Dataset: 0-2067904 | Loss: 0.708 | 914 ms/step , 6880.70 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 15:11:12 | Epoch: 0 | Step: 163630 | Dataset: 0-2068224 | Loss: 0.810 | 914 ms/step , 6884.49 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 15:11:21 | Epoch: 0 | Step: 163640 | Dataset: 0-2068544 | Loss: 0.782 | 915 ms/step , 6875.04 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 15:11:30 | Epoch: 0 | Step: 163650 | Dataset: 0-2068864 | Loss: 0.765 | 915 ms/step , 6871.65 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 15:11:40 | Epoch: 0 | Step: 163660 | Dataset: 0-2069184 | Loss: 0.756 | 914 ms/step , 6884.64 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 15:11:49 | Epoch: 0 | Step: 163670 | Dataset: 0-2069504 | Loss: 0.726 | 914 ms/step , 6878.51 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 15:11:58 | Epoch: 0 | Step: 163680 | Dataset: 0-2069824 | Loss: 0.725 | 914 ms/step , 6884.89 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 15:12:07 | Epoch: 0 | Step: 163690 | Dataset: 0-2070144 | Loss: 0.769 | 913 ms/step , 6885.51 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 15:12:16 | Epoch: 0 | Step: 163700 | Dataset: 0-2070464 | Loss: 0.733 | 915 ms/step , 6872.11 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 15:12:18 | Validation | Step: 163700 | Val_loss: 0.771 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:12:27 | Epoch: 0 | Step: 163710 | Dataset: 0-2070784 | Loss: 0.792 | 914 ms/step , 6884.02 GFLOP/s , 15261.3 tokens/s INFO:__main__:2024-11-05 15:12:36 | Epoch: 0 | Step: 163720 | Dataset: 0-2071104 | Loss: 0.779 | 914 ms/step , 6883.62 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 15:12:45 | Epoch: 0 | Step: 163730 | Dataset: 0-2071424 | Loss: 0.805 | 914 ms/step , 6880.11 GFLOP/s , 17923.3 
tokens/s INFO:__main__:2024-11-05 15:12:54 | Epoch: 0 | Step: 163740 | Dataset: 0-2071744 | Loss: 0.774 | 914 ms/step , 6880.95 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 15:13:03 | Epoch: 0 | Step: 163750 | Dataset: 0-2072064 | Loss: 0.717 | 913 ms/step , 6886.30 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 15:13:13 | Epoch: 0 | Step: 163760 | Dataset: 0-2072384 | Loss: 0.784 | 913 ms/step , 6885.28 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 15:13:22 | Epoch: 0 | Step: 163770 | Dataset: 0-2072704 | Loss: 0.732 | 913 ms/step , 6891.38 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 15:13:31 | Epoch: 0 | Step: 163780 | Dataset: 0-2073024 | Loss: 0.750 | 913 ms/step , 6891.75 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 15:13:40 | Epoch: 0 | Step: 163790 | Dataset: 0-2073344 | Loss: 0.743 | 916 ms/step , 6866.82 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 15:13:49 | Epoch: 0 | Step: 163800 | Dataset: 0-2073664 | Loss: 0.743 | 915 ms/step , 6877.44 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 15:13:51 | Validation | Step: 163800 | Val_loss: 0.683 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:14:00 | Epoch: 0 | Step: 163810 | Dataset: 0-2073984 | Loss: 0.713 | 916 ms/step , 6869.13 GFLOP/s , 15271.0 tokens/s INFO:__main__:2024-11-05 15:14:09 | Epoch: 0 | Step: 163820 | Dataset: 0-2074304 | Loss: 0.700 | 914 ms/step , 6883.08 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 15:14:18 | Epoch: 0 | Step: 163830 | Dataset: 0-2074624 | Loss: 0.689 | 914 ms/step , 6881.33 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 15:14:27 | Epoch: 0 | Step: 163840 | Dataset: 0-2074944 | Loss: 0.714 | 915 ms/step , 6876.31 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 15:14:36 | Epoch: 0 | Step: 163850 | Dataset: 0-2075264 | Loss: 0.770 | 914 ms/step , 6880.73 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 15:14:46 | Epoch: 0 | Step: 163860 | Dataset: 0-2075584 | Loss: 0.651 | 913 ms/step , 6885.37 GFLOP/s , 
17923.5 tokens/s INFO:__main__:2024-11-05 15:14:55 | Epoch: 0 | Step: 163870 | Dataset: 0-2075904 | Loss: 0.772 | 914 ms/step , 6884.98 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 15:15:04 | Epoch: 0 | Step: 163880 | Dataset: 0-2076224 | Loss: 0.800 | 914 ms/step , 6880.14 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 15:15:13 | Epoch: 0 | Step: 163890 | Dataset: 0-2076544 | Loss: 0.782 | 913 ms/step , 6887.43 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 15:15:22 | Epoch: 0 | Step: 163900 | Dataset: 0-2076864 | Loss: 0.742 | 914 ms/step , 6880.55 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 15:15:24 | Validation | Step: 163900 | Val_loss: 0.749 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:15:33 | Epoch: 0 | Step: 163910 | Dataset: 0-2077184 | Loss: 0.765 | 913 ms/step , 6886.86 GFLOP/s , 15263.9 tokens/s INFO:__main__:2024-11-05 15:15:42 | Epoch: 0 | Step: 163920 | Dataset: 0-2077504 | Loss: 0.695 | 914 ms/step , 6884.59 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 15:15:51 | Epoch: 0 | Step: 163930 | Dataset: 0-2077824 | Loss: 0.716 | 913 ms/step , 6891.39 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 15:16:00 | Epoch: 0 | Step: 163940 | Dataset: 0-2078144 | Loss: 0.797 | 916 ms/step , 6868.84 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 15:16:09 | Epoch: 0 | Step: 163950 | Dataset: 0-2078464 | Loss: 0.806 | 913 ms/step , 6885.88 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 15:16:19 | Epoch: 0 | Step: 163960 | Dataset: 0-2078784 | Loss: 0.763 | 913 ms/step , 6889.80 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 15:16:28 | Epoch: 0 | Step: 163970 | Dataset: 0-2079104 | Loss: 0.829 | 913 ms/step , 6885.65 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 15:16:37 | Epoch: 0 | Step: 163980 | Dataset: 0-2079424 | Loss: 0.760 | 913 ms/step , 6885.68 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 15:16:46 | Epoch: 0 | Step: 163990 | Dataset: 0-2079744 | Loss: 0.759 | 914 ms/step , 6880.00 GFLOP/s 
, 17915.7 tokens/s INFO:__main__:2024-11-05 15:16:55 | Epoch: 0 | Step: 164000 | Dataset: 0-2080064 | Loss: 0.787 | 914 ms/step , 6878.70 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 15:16:57 | Validation | Step: 164000 | Val_loss: 0.761 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:16:57 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_151657_step_164000.pt` INFO:__main__:2024-11-05 15:17:07 | Epoch: 0 | Step: 164010 | Dataset: 0-2080384 | Loss: 0.734 | 915 ms/step , 6871.58 GFLOP/s , 13775.8 tokens/s INFO:__main__:2024-11-05 15:17:16 | Epoch: 0 | Step: 164020 | Dataset: 0-2080704 | Loss: 0.805 | 914 ms/step , 6877.63 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 15:17:25 | Epoch: 0 | Step: 164030 | Dataset: 0-2081024 | Loss: 0.752 | 915 ms/step , 6876.29 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 15:17:35 | Epoch: 0 | Step: 164040 | Dataset: 0-2081344 | Loss: 0.742 | 914 ms/step , 6883.50 GFLOP/s , 17863.0 tokens/s INFO:__main__:2024-11-05 15:17:44 | Epoch: 0 | Step: 164050 | Dataset: 0-2081664 | Loss: 0.802 | 915 ms/step , 6872.30 GFLOP/s , 17901.0 tokens/s INFO:__main__:2024-11-05 15:17:53 | Epoch: 0 | Step: 164060 | Dataset: 0-2081984 | Loss: 0.732 | 917 ms/step , 6859.18 GFLOP/s , 17898.5 tokens/s INFO:__main__:2024-11-05 15:18:02 | Epoch: 0 | Step: 164070 | Dataset: 0-2082304 | Loss: 0.792 | 913 ms/step , 6889.32 GFLOP/s , 17908.4 tokens/s INFO:__main__:2024-11-05 15:18:11 | Epoch: 0 | Step: 164080 | Dataset: 0-2082624 | Loss: 0.782 | 913 ms/step , 6886.26 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 15:18:20 | Epoch: 0 | Step: 164090 | Dataset: 0-2082944 | Loss: 0.705 | 914 ms/step , 6883.24 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 15:18:29 | Epoch: 0 | Step: 164100 | Dataset: 0-2083264 | Loss: 0.746 | 915 ms/step , 6872.00 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-05 15:18:31 | Validation | Step: 164100 | Val_loss: 0.739 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 15:18:40 | Epoch: 0 | Step: 164110 | Dataset: 0-2083584 | Loss: 0.713 | 914 ms/step , 6884.56 GFLOP/s , 15265.6 tokens/s INFO:__main__:2024-11-05 15:18:49 | Epoch: 0 | Step: 164120 | Dataset: 0-2083904 | Loss: 0.802 | 914 ms/step , 6878.72 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 15:18:58 | Epoch: 0 | Step: 164130 | Dataset: 0-2084224 | Loss: 0.719 | 913 ms/step , 6887.68 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 15:19:08 | Epoch: 0 | Step: 164140 | Dataset: 0-2084544 | Loss: 0.821 | 913 ms/step , 6890.74 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 15:19:17 | Epoch: 0 | Step: 164150 | Dataset: 0-2084864 | Loss: 0.732 | 914 ms/step , 6882.95 GFLOP/s , 17905.5 tokens/s INFO:__main__:2024-11-05 15:19:26 | Epoch: 0 | Step: 164160 | Dataset: 0-2085184 | Loss: 0.735 | 913 ms/step , 6888.78 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 15:19:35 | Epoch: 0 | Step: 164170 | Dataset: 0-2085504 | Loss: 0.785 | 912 ms/step , 6892.82 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 15:19:44 | Epoch: 0 | Step: 164180 | Dataset: 0-2085824 | Loss: 0.766 | 914 ms/step , 6878.99 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-05 15:19:53 | Epoch: 0 | Step: 164190 | Dataset: 0-2086144 | Loss: 0.705 | 915 ms/step , 6877.31 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-05 15:20:02 | Epoch: 0 | Step: 164200 | Dataset: 0-2086464 | Loss: 0.838 | 915 ms/step , 6874.17 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-05 15:20:04 | Validation | Step: 164200 | Val_loss: 0.709 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:20:13 | Epoch: 0 | Step: 164210 | Dataset: 0-2086784 | Loss: 0.729 | 914 ms/step , 6880.80 GFLOP/s , 15261.4 tokens/s INFO:__main__:2024-11-05 15:20:22 | Epoch: 0 | Step: 164220 | Dataset: 0-2087104 | Loss: 0.789 | 914 ms/step , 6883.67 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 15:20:31 | Epoch: 0 | Step: 164230 | Dataset: 0-2087424 | Loss: 0.785 | 915 ms/step , 6873.03 GFLOP/s , 17917.7 
tokens/s INFO:__main__:2024-11-05 15:20:41 | Epoch: 0 | Step: 164240 | Dataset: 0-2087744 | Loss: 0.785 | 913 ms/step , 6886.47 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 15:20:50 | Epoch: 0 | Step: 164250 | Dataset: 0-2088064 | Loss: 0.800 | 915 ms/step , 6874.45 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 15:20:59 | Epoch: 0 | Step: 164260 | Dataset: 0-2088384 | Loss: 0.758 | 914 ms/step , 6883.08 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 15:21:08 | Epoch: 0 | Step: 164270 | Dataset: 0-2088704 | Loss: 0.787 | 913 ms/step , 6891.12 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 15:21:17 | Epoch: 0 | Step: 164280 | Dataset: 0-2089024 | Loss: 0.774 | 915 ms/step , 6872.90 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-05 15:21:26 | Epoch: 0 | Step: 164290 | Dataset: 0-2089344 | Loss: 0.766 | 913 ms/step , 6890.22 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 15:21:35 | Epoch: 0 | Step: 164300 | Dataset: 0-2089664 | Loss: 0.764 | 914 ms/step , 6879.72 GFLOP/s , 17905.2 tokens/s INFO:__main__:2024-11-05 15:21:37 | Validation | Step: 164300 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:21:46 | Epoch: 0 | Step: 164310 | Dataset: 0-2089984 | Loss: 0.667 | 912 ms/step , 6895.68 GFLOP/s , 15262.3 tokens/s INFO:__main__:2024-11-05 15:21:55 | Epoch: 0 | Step: 164320 | Dataset: 0-2090304 | Loss: 0.741 | 914 ms/step , 6881.28 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 15:22:05 | Epoch: 0 | Step: 164330 | Dataset: 0-2090624 | Loss: 0.798 | 913 ms/step , 6886.96 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 15:22:14 | Epoch: 0 | Step: 164340 | Dataset: 0-2090944 | Loss: 0.761 | 914 ms/step , 6883.97 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 15:22:23 | Epoch: 0 | Step: 164350 | Dataset: 0-2091264 | Loss: 0.774 | 915 ms/step , 6873.81 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 15:22:32 | Epoch: 0 | Step: 164360 | Dataset: 0-2091584 | Loss: 0.711 | 914 ms/step , 6879.64 GFLOP/s , 
17921.4 tokens/s INFO:__main__:2024-11-05 15:22:41 | Epoch: 0 | Step: 164370 | Dataset: 0-2091904 | Loss: 0.754 | 913 ms/step , 6888.91 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 15:22:50 | Epoch: 0 | Step: 164380 | Dataset: 0-2092224 | Loss: 0.714 | 915 ms/step , 6876.29 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 15:22:59 | Epoch: 1 | Step: 164390 | Dataset: 0-141 | Loss: 0.755 | 914 ms/step , 6880.14 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 15:23:09 | Epoch: 1 | Step: 164400 | Dataset: 0-461 | Loss: 0.822 | 914 ms/step , 6882.19 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 15:23:10 | Validation | Step: 164400 | Val_loss: 0.823 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:23:19 | Epoch: 1 | Step: 164410 | Dataset: 0-781 | Loss: 0.785 | 914 ms/step , 6884.30 GFLOP/s , 15258.7 tokens/s INFO:__main__:2024-11-05 15:23:28 | Epoch: 1 | Step: 164420 | Dataset: 0-1101 | Loss: 0.729 | 915 ms/step , 6874.55 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 15:23:38 | Epoch: 1 | Step: 164430 | Dataset: 0-1421 | Loss: 0.767 | 914 ms/step , 6880.78 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-05 15:23:47 | Epoch: 1 | Step: 164440 | Dataset: 0-1741 | Loss: 0.811 | 915 ms/step , 6874.62 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-05 15:23:56 | Epoch: 1 | Step: 164450 | Dataset: 0-2061 | Loss: 0.770 | 913 ms/step , 6885.23 GFLOP/s , 17900.7 tokens/s INFO:__main__:2024-11-05 15:24:05 | Epoch: 1 | Step: 164460 | Dataset: 0-2381 | Loss: 0.796 | 915 ms/step , 6876.95 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-05 15:24:14 | Epoch: 1 | Step: 164470 | Dataset: 0-2701 | Loss: 0.737 | 916 ms/step , 6863.61 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 15:24:23 | Epoch: 1 | Step: 164480 | Dataset: 0-3021 | Loss: 0.716 | 914 ms/step , 6880.44 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 15:24:32 | Epoch: 1 | Step: 164490 | Dataset: 0-3341 | Loss: 0.720 | 914 ms/step , 6881.39 GFLOP/s , 17911.9 tokens/s 
INFO:__main__:2024-11-05 15:24:42 | Epoch: 1 | Step: 164500 | Dataset: 0-3661 | Loss: 0.741 | 913 ms/step , 6885.09 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 15:24:43 | Validation | Step: 164500 | Val_loss: 0.801 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:24:52 | Epoch: 1 | Step: 164510 | Dataset: 0-3981 | Loss: 0.757 | 914 ms/step , 6879.97 GFLOP/s , 15254.6 tokens/s INFO:__main__:2024-11-05 15:25:01 | Epoch: 1 | Step: 164520 | Dataset: 0-4301 | Loss: 0.753 | 915 ms/step , 6874.98 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 15:25:11 | Epoch: 1 | Step: 164530 | Dataset: 0-4621 | Loss: 0.745 | 914 ms/step , 6883.26 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 15:25:20 | Epoch: 1 | Step: 164540 | Dataset: 0-4941 | Loss: 0.723 | 915 ms/step , 6871.63 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 15:25:29 | Epoch: 1 | Step: 164550 | Dataset: 0-5261 | Loss: 0.800 | 914 ms/step , 6879.33 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 15:25:38 | Epoch: 1 | Step: 164560 | Dataset: 0-5581 | Loss: 0.739 | 914 ms/step , 6884.10 GFLOP/s , 17910.3 tokens/s INFO:__main__:2024-11-05 15:25:47 | Epoch: 1 | Step: 164570 | Dataset: 0-5901 | Loss: 0.736 | 913 ms/step , 6887.55 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 15:25:56 | Epoch: 1 | Step: 164580 | Dataset: 0-6221 | Loss: 0.724 | 913 ms/step , 6886.54 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 15:26:05 | Epoch: 1 | Step: 164590 | Dataset: 0-6541 | Loss: 0.720 | 914 ms/step , 6880.88 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 15:26:15 | Epoch: 1 | Step: 164600 | Dataset: 0-6861 | Loss: 0.732 | 915 ms/step , 6877.52 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 15:26:16 | Validation | Step: 164600 | Val_loss: 0.797 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:26:25 | Epoch: 1 | Step: 164610 | Dataset: 0-7181 | Loss: 0.820 | 913 ms/step , 6885.51 GFLOP/s , 15261.8 tokens/s INFO:__main__:2024-11-05 15:26:34 | Epoch: 1 | Step: 164620 | Dataset: 
0-7501 | Loss: 0.772 | 914 ms/step , 6881.63 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 15:26:44 | Epoch: 1 | Step: 164630 | Dataset: 0-7821 | Loss: 0.798 | 915 ms/step , 6876.60 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 15:26:53 | Epoch: 1 | Step: 164640 | Dataset: 0-8141 | Loss: 0.743 | 914 ms/step , 6881.11 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 15:27:02 | Epoch: 1 | Step: 164650 | Dataset: 0-8461 | Loss: 0.729 | 913 ms/step , 6888.51 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 15:27:11 | Epoch: 1 | Step: 164660 | Dataset: 0-8781 | Loss: 0.760 | 914 ms/step , 6880.15 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 15:27:20 | Epoch: 1 | Step: 164670 | Dataset: 0-9101 | Loss: 0.738 | 914 ms/step , 6884.34 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 15:27:29 | Epoch: 1 | Step: 164680 | Dataset: 0-9421 | Loss: 0.731 | 915 ms/step , 6871.39 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 15:27:39 | Epoch: 1 | Step: 164690 | Dataset: 0-9741 | Loss: 0.820 | 915 ms/step , 6875.18 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 15:27:48 | Epoch: 1 | Step: 164700 | Dataset: 0-10061 | Loss: 0.800 | 914 ms/step , 6878.44 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 15:27:49 | Validation | Step: 164700 | Val_loss: 0.790 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:27:58 | Epoch: 1 | Step: 164710 | Dataset: 0-10381 | Loss: 0.797 | 913 ms/step , 6885.68 GFLOP/s , 15256.9 tokens/s INFO:__main__:2024-11-05 15:28:08 | Epoch: 1 | Step: 164720 | Dataset: 0-10701 | Loss: 0.815 | 915 ms/step , 6874.36 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 15:28:17 | Epoch: 1 | Step: 164730 | Dataset: 0-11021 | Loss: 0.778 | 914 ms/step , 6877.84 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 15:28:26 | Epoch: 1 | Step: 164740 | Dataset: 0-11341 | Loss: 0.756 | 913 ms/step , 6886.50 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 15:28:35 | Epoch: 1 | Step: 164750 | Dataset: 0-11661 | Loss: 0.752 | 914 
ms/step , 6879.35 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 15:28:44 | Epoch: 1 | Step: 164760 | Dataset: 0-11981 | Loss: 0.825 | 914 ms/step , 6884.94 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 15:28:53 | Epoch: 1 | Step: 164770 | Dataset: 0-12301 | Loss: 0.763 | 914 ms/step , 6879.76 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 15:29:02 | Epoch: 1 | Step: 164780 | Dataset: 0-12621 | Loss: 0.725 | 913 ms/step , 6890.01 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 15:29:12 | Epoch: 1 | Step: 164790 | Dataset: 0-12941 | Loss: 0.659 | 913 ms/step , 6885.70 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 15:29:21 | Epoch: 1 | Step: 164800 | Dataset: 0-13261 | Loss: 0.699 | 914 ms/step , 6882.84 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 15:29:22 | Validation | Step: 164800 | Val_loss: 0.783 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:29:31 | Epoch: 1 | Step: 164810 | Dataset: 0-13581 | Loss: 0.777 | 913 ms/step , 6886.34 GFLOP/s , 15263.6 tokens/s INFO:__main__:2024-11-05 15:29:41 | Epoch: 1 | Step: 164820 | Dataset: 0-13901 | Loss: 0.793 | 914 ms/step , 6879.61 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 15:29:50 | Epoch: 1 | Step: 164830 | Dataset: 0-14221 | Loss: 0.684 | 913 ms/step , 6885.31 GFLOP/s , 17907.0 tokens/s INFO:__main__:2024-11-05 15:29:59 | Epoch: 1 | Step: 164840 | Dataset: 0-14541 | Loss: 0.744 | 914 ms/step , 6880.62 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 15:30:08 | Epoch: 1 | Step: 164850 | Dataset: 0-14861 | Loss: 0.720 | 914 ms/step , 6883.01 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 15:30:17 | Epoch: 1 | Step: 164860 | Dataset: 0-15181 | Loss: 0.718 | 915 ms/step , 6871.34 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 15:30:26 | Epoch: 1 | Step: 164870 | Dataset: 0-15501 | Loss: 0.742 | 914 ms/step , 6880.94 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 15:30:35 | Epoch: 1 | Step: 164880 | Dataset: 0-15821 | Loss: 0.723 | 915 ms/step , 6875.26 
GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 15:30:45 | Epoch: 1 | Step: 164890 | Dataset: 0-16141 | Loss: 0.710 | 913 ms/step , 6890.41 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 15:30:54 | Epoch: 1 | Step: 164900 | Dataset: 0-16461 | Loss: 0.809 | 914 ms/step , 6881.78 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 15:30:55 | Validation | Step: 164900 | Val_loss: 0.809 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:31:04 | Epoch: 1 | Step: 164910 | Dataset: 0-16781 | Loss: 0.728 | 914 ms/step , 6884.53 GFLOP/s , 15260.2 tokens/s INFO:__main__:2024-11-05 15:31:14 | Epoch: 1 | Step: 164920 | Dataset: 0-17101 | Loss: 0.775 | 915 ms/step , 6871.08 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 15:31:23 | Epoch: 1 | Step: 164930 | Dataset: 0-17421 | Loss: 0.710 | 914 ms/step , 6883.85 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 15:31:32 | Epoch: 1 | Step: 164940 | Dataset: 0-17741 | Loss: 0.736 | 913 ms/step , 6885.71 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 15:31:41 | Epoch: 1 | Step: 164950 | Dataset: 0-18061 | Loss: 0.742 | 914 ms/step , 6880.16 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 15:31:50 | Epoch: 1 | Step: 164960 | Dataset: 0-18381 | Loss: 0.748 | 914 ms/step , 6877.86 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 15:31:59 | Epoch: 1 | Step: 164970 | Dataset: 0-18701 | Loss: 0.698 | 913 ms/step , 6889.55 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 15:32:09 | Epoch: 1 | Step: 164980 | Dataset: 0-19021 | Loss: 0.752 | 916 ms/step , 6869.81 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 15:32:18 | Epoch: 1 | Step: 164990 | Dataset: 0-19341 | Loss: 0.791 | 913 ms/step , 6888.93 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 15:32:27 | Epoch: 1 | Step: 165000 | Dataset: 0-19661 | Loss: 0.767 | 915 ms/step , 6877.48 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 15:32:28 | Validation | Step: 165000 | Val_loss: 0.803 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:32:28 | 
Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_153228_step_165000.pt` INFO:__main__:2024-11-05 15:32:39 | Epoch: 1 | Step: 165010 | Dataset: 0-19981 | Loss: 0.857 | 914 ms/step , 6880.26 GFLOP/s , 13824.1 tokens/s INFO:__main__:2024-11-05 15:32:48 | Epoch: 1 | Step: 165020 | Dataset: 0-20301 | Loss: 0.744 | 916 ms/step , 6867.13 GFLOP/s , 17899.5 tokens/s INFO:__main__:2024-11-05 15:32:57 | Epoch: 1 | Step: 165030 | Dataset: 0-20621 | Loss: 0.774 | 915 ms/step , 6871.63 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 15:33:06 | Epoch: 1 | Step: 165040 | Dataset: 0-20941 | Loss: 0.791 | 915 ms/step , 6870.62 GFLOP/s , 17887.1 tokens/s INFO:__main__:2024-11-05 15:33:15 | Epoch: 1 | Step: 165050 | Dataset: 0-21261 | Loss: 0.752 | 914 ms/step , 6884.18 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 15:33:24 | Epoch: 1 | Step: 165060 | Dataset: 0-21581 | Loss: 0.660 | 913 ms/step , 6888.63 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 15:33:34 | Epoch: 1 | Step: 165070 | Dataset: 0-21901 | Loss: 0.724 | 914 ms/step , 6880.25 GFLOP/s , 17907.1 tokens/s INFO:__main__:2024-11-05 15:33:43 | Epoch: 1 | Step: 165080 | Dataset: 0-22221 | Loss: 0.750 | 914 ms/step , 6879.75 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 15:33:52 | Epoch: 1 | Step: 165090 | Dataset: 0-22541 | Loss: 0.788 | 914 ms/step , 6880.84 GFLOP/s , 17903.0 tokens/s INFO:__main__:2024-11-05 15:34:01 | Epoch: 1 | Step: 165100 | Dataset: 0-22861 | Loss: 0.778 | 915 ms/step , 6876.47 GFLOP/s , 17903.4 tokens/s INFO:__main__:2024-11-05 15:34:03 | Validation | Step: 165100 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:34:12 | Epoch: 1 | Step: 165110 | Dataset: 0-23181 | Loss: 0.705 | 914 ms/step , 6883.81 GFLOP/s , 15245.2 tokens/s INFO:__main__:2024-11-05 15:34:21 | Epoch: 1 | Step: 165120 | Dataset: 0-23501 | Loss: 0.723 | 916 ms/step , 6868.58 GFLOP/s , 17899.5 tokens/s INFO:__main__:2024-11-05 15:34:30 | Epoch: 1 | Step: 165130 
| Dataset: 0-23821 | Loss: 0.742 | 916 ms/step , 6862.96 GFLOP/s , 17893.9 tokens/s INFO:__main__:2024-11-05 15:34:39 | Epoch: 1 | Step: 165140 | Dataset: 0-24141 | Loss: 0.708 | 915 ms/step , 6871.40 GFLOP/s , 17904.4 tokens/s INFO:__main__:2024-11-05 15:34:48 | Epoch: 1 | Step: 165150 | Dataset: 0-24461 | Loss: 0.747 | 916 ms/step , 6867.24 GFLOP/s , 17903.5 tokens/s INFO:__main__:2024-11-05 15:34:57 | Epoch: 1 | Step: 165160 | Dataset: 0-24781 | Loss: 0.757 | 914 ms/step , 6884.75 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 15:35:07 | Epoch: 1 | Step: 165170 | Dataset: 0-25101 | Loss: 0.797 | 915 ms/step , 6876.14 GFLOP/s , 17907.0 tokens/s INFO:__main__:2024-11-05 15:35:16 | Epoch: 1 | Step: 165180 | Dataset: 0-25421 | Loss: 0.693 | 915 ms/step , 6877.05 GFLOP/s , 17903.2 tokens/s INFO:__main__:2024-11-05 15:35:25 | Epoch: 1 | Step: 165190 | Dataset: 0-25741 | Loss: 0.817 | 916 ms/step , 6868.85 GFLOP/s , 17906.4 tokens/s INFO:__main__:2024-11-05 15:35:34 | Epoch: 1 | Step: 165200 | Dataset: 0-26061 | Loss: 0.696 | 915 ms/step , 6876.27 GFLOP/s , 17905.5 tokens/s INFO:__main__:2024-11-05 15:35:36 | Validation | Step: 165200 | Val_loss: 0.790 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:35:45 | Epoch: 1 | Step: 165210 | Dataset: 0-26381 | Loss: 0.707 | 914 ms/step , 6879.29 GFLOP/s , 15254.0 tokens/s INFO:__main__:2024-11-05 15:35:54 | Epoch: 1 | Step: 165220 | Dataset: 0-26701 | Loss: 0.800 | 916 ms/step , 6863.33 GFLOP/s , 17904.5 tokens/s INFO:__main__:2024-11-05 15:36:03 | Epoch: 1 | Step: 165230 | Dataset: 0-27021 | Loss: 0.720 | 915 ms/step , 6874.11 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 15:36:12 | Epoch: 1 | Step: 165240 | Dataset: 0-27341 | Loss: 0.728 | 916 ms/step , 6869.60 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 15:36:21 | Epoch: 1 | Step: 165250 | Dataset: 0-27661 | Loss: 0.706 | 914 ms/step , 6878.13 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 15:36:31 | Epoch: 1 | Step: 165260 | Dataset: 0-27981 | 
Loss: 0.658 | 915 ms/step , 6877.47 GFLOP/s , 17905.6 tokens/s INFO:__main__:2024-11-05 15:36:40 | Epoch: 1 | Step: 165270 | Dataset: 0-28301 | Loss: 0.724 | 914 ms/step , 6879.07 GFLOP/s , 17903.0 tokens/s INFO:__main__:2024-11-05 15:36:49 | Epoch: 1 | Step: 165280 | Dataset: 0-28621 | Loss: 0.803 | 913 ms/step , 6890.32 GFLOP/s , 17905.7 tokens/s INFO:__main__:2024-11-05 15:36:58 | Epoch: 1 | Step: 165290 | Dataset: 0-28941 | Loss: 0.741 | 914 ms/step , 6884.32 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 15:37:07 | Epoch: 1 | Step: 165300 | Dataset: 0-29261 | Loss: 0.763 | 914 ms/step , 6877.73 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 15:37:09 | Validation | Step: 165300 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:37:18 | Epoch: 1 | Step: 165310 | Dataset: 0-29581 | Loss: 0.779 | 916 ms/step , 6867.89 GFLOP/s , 15249.3 tokens/s INFO:__main__:2024-11-05 15:37:27 | Epoch: 1 | Step: 165320 | Dataset: 0-29901 | Loss: 0.749 | 916 ms/step , 6866.40 GFLOP/s , 17907.0 tokens/s INFO:__main__:2024-11-05 15:37:36 | Epoch: 1 | Step: 165330 | Dataset: 0-30221 | Loss: 0.749 | 915 ms/step , 6873.70 GFLOP/s , 17902.3 tokens/s INFO:__main__:2024-11-05 15:37:45 | Epoch: 1 | Step: 165340 | Dataset: 0-30541 | Loss: 0.756 | 915 ms/step , 6876.69 GFLOP/s , 17904.1 tokens/s INFO:__main__:2024-11-05 15:37:55 | Epoch: 1 | Step: 165350 | Dataset: 0-30861 | Loss: 0.763 | 914 ms/step , 6878.30 GFLOP/s , 17899.9 tokens/s INFO:__main__:2024-11-05 15:38:04 | Epoch: 1 | Step: 165360 | Dataset: 0-31181 | Loss: 0.795 | 914 ms/step , 6877.83 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 15:38:13 | Epoch: 1 | Step: 165370 | Dataset: 0-31501 | Loss: 0.770 | 917 ms/step , 6861.43 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 15:38:22 | Epoch: 1 | Step: 165380 | Dataset: 0-31821 | Loss: 0.736 | 914 ms/step , 6883.69 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-05 15:38:31 | Epoch: 1 | Step: 165390 | Dataset: 0-32141 | Loss: 0.731 | 915 
ms/step , 6871.97 GFLOP/s , 17899.5 tokens/s INFO:__main__:2024-11-05 15:38:40 | Epoch: 1 | Step: 165400 | Dataset: 0-32461 | Loss: 0.780 | 914 ms/step , 6880.94 GFLOP/s , 17906.3 tokens/s INFO:__main__:2024-11-05 15:38:42 | Validation | Step: 165400 | Val_loss: 0.811 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:38:51 | Epoch: 1 | Step: 165410 | Dataset: 0-32781 | Loss: 0.716 | 915 ms/step , 6873.44 GFLOP/s , 15250.9 tokens/s INFO:__main__:2024-11-05 15:39:00 | Epoch: 1 | Step: 165420 | Dataset: 0-33101 | Loss: 0.716 | 915 ms/step , 6874.01 GFLOP/s , 17905.2 tokens/s INFO:__main__:2024-11-05 15:39:09 | Epoch: 1 | Step: 165430 | Dataset: 0-33421 | Loss: 0.708 | 915 ms/step , 6872.20 GFLOP/s , 17904.9 tokens/s INFO:__main__:2024-11-05 15:39:18 | Epoch: 1 | Step: 165440 | Dataset: 0-33741 | Loss: 0.730 | 914 ms/step , 6884.80 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 15:39:28 | Epoch: 1 | Step: 165450 | Dataset: 0-34061 | Loss: 0.698 | 915 ms/step , 6875.49 GFLOP/s , 17909.0 tokens/s INFO:__main__:2024-11-05 15:39:37 | Epoch: 1 | Step: 165460 | Dataset: 0-34381 | Loss: 0.835 | 914 ms/step , 6878.06 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 15:39:46 | Epoch: 1 | Step: 165470 | Dataset: 0-34701 | Loss: 0.756 | 913 ms/step , 6887.47 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 15:39:55 | Epoch: 1 | Step: 165480 | Dataset: 0-35021 | Loss: 0.739 | 914 ms/step , 6880.58 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 15:40:04 | Epoch: 1 | Step: 165490 | Dataset: 0-35341 | Loss: 0.770 | 915 ms/step , 6874.07 GFLOP/s , 17901.5 tokens/s INFO:__main__:2024-11-05 15:40:13 | Epoch: 1 | Step: 165500 | Dataset: 0-35661 | Loss: 0.751 | 914 ms/step , 6878.49 GFLOP/s , 17907.5 tokens/s INFO:__main__:2024-11-05 15:40:15 | Validation | Step: 165500 | Val_loss: 0.815 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:40:24 | Epoch: 1 | Step: 165510 | Dataset: 0-35981 | Loss: 0.754 | 913 ms/step , 6885.16 GFLOP/s , 15257.1 tokens/s 
INFO:__main__:2024-11-05 15:40:33 | Epoch: 1 | Step: 165520 | Dataset: 0-36301 | Loss: 0.773 | 913 ms/step , 6885.95 GFLOP/s , 17912.4 tokens/s INFO:__main__:2024-11-05 15:40:42 | Epoch: 1 | Step: 165530 | Dataset: 0-36621 | Loss: 0.763 | 914 ms/step , 6880.13 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 15:40:52 | Epoch: 1 | Step: 165540 | Dataset: 0-36941 | Loss: 0.766 | 915 ms/step , 6872.94 GFLOP/s , 17901.2 tokens/s INFO:__main__:2024-11-05 15:41:01 | Epoch: 1 | Step: 165550 | Dataset: 0-37261 | Loss: 0.718 | 914 ms/step , 6878.27 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 15:41:10 | Epoch: 1 | Step: 165560 | Dataset: 0-37581 | Loss: 0.817 | 914 ms/step , 6877.98 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-05 15:41:19 | Epoch: 1 | Step: 165570 | Dataset: 0-37901 | Loss: 0.744 | 914 ms/step , 6883.61 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 15:41:28 | Epoch: 1 | Step: 165580 | Dataset: 0-38221 | Loss: 0.712 | 913 ms/step , 6885.63 GFLOP/s , 17901.6 tokens/s INFO:__main__:2024-11-05 15:41:37 | Epoch: 1 | Step: 165590 | Dataset: 0-38541 | Loss: 0.733 | 914 ms/step , 6881.70 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 15:41:46 | Epoch: 1 | Step: 165600 | Dataset: 0-38861 | Loss: 0.754 | 914 ms/step , 6880.07 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 15:41:48 | Validation | Step: 165600 | Val_loss: 0.778 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:41:57 | Epoch: 1 | Step: 165610 | Dataset: 0-39181 | Loss: 0.774 | 914 ms/step , 6884.55 GFLOP/s , 15254.1 tokens/s INFO:__main__:2024-11-05 15:42:06 | Epoch: 1 | Step: 165620 | Dataset: 0-39501 | Loss: 0.739 | 915 ms/step , 6874.00 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 15:42:15 | Epoch: 1 | Step: 165630 | Dataset: 0-39821 | Loss: 0.654 | 913 ms/step , 6885.93 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 15:42:25 | Epoch: 1 | Step: 165640 | Dataset: 0-40141 | Loss: 0.785 | 915 ms/step , 6875.63 GFLOP/s , 17910.0 tokens/s 
INFO:__main__:2024-11-05 15:42:34 | Epoch: 1 | Step: 165650 | Dataset: 0-40461 | Loss: 0.728 | 914 ms/step , 6878.04 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-05 15:42:43 | Epoch: 1 | Step: 165660 | Dataset: 0-40781 | Loss: 0.763 | 914 ms/step , 6884.71 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 15:42:52 | Epoch: 1 | Step: 165670 | Dataset: 0-41101 | Loss: 0.738 | 915 ms/step , 6873.25 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 15:43:01 | Epoch: 1 | Step: 165680 | Dataset: 0-41421 | Loss: 0.775 | 914 ms/step , 6879.48 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 15:43:10 | Epoch: 1 | Step: 165690 | Dataset: 0-41741 | Loss: 0.763 | 916 ms/step , 6868.43 GFLOP/s , 17904.9 tokens/s INFO:__main__:2024-11-05 15:43:19 | Epoch: 1 | Step: 165700 | Dataset: 0-42061 | Loss: 0.765 | 913 ms/step , 6885.12 GFLOP/s , 17908.4 tokens/s INFO:__main__:2024-11-05 15:43:21 | Validation | Step: 165700 | Val_loss: 0.740 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:43:30 | Epoch: 1 | Step: 165710 | Dataset: 0-42381 | Loss: 0.798 | 914 ms/step , 6884.34 GFLOP/s , 15253.3 tokens/s INFO:__main__:2024-11-05 15:43:39 | Epoch: 1 | Step: 165720 | Dataset: 0-42701 | Loss: 0.736 | 915 ms/step , 6873.52 GFLOP/s , 17907.4 tokens/s INFO:__main__:2024-11-05 15:43:49 | Epoch: 1 | Step: 165730 | Dataset: 0-43021 | Loss: 0.795 | 915 ms/step , 6876.64 GFLOP/s , 17906.4 tokens/s INFO:__main__:2024-11-05 15:43:58 | Epoch: 1 | Step: 165740 | Dataset: 0-43341 | Loss: 0.808 | 914 ms/step , 6880.61 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 15:44:07 | Epoch: 1 | Step: 165750 | Dataset: 0-43661 | Loss: 0.757 | 915 ms/step , 6875.74 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 15:44:16 | Epoch: 1 | Step: 165760 | Dataset: 0-43981 | Loss: 0.780 | 914 ms/step , 6881.25 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 15:44:25 | Epoch: 1 | Step: 165770 | Dataset: 0-44301 | Loss: 0.792 | 913 ms/step , 6886.75 GFLOP/s , 17916.1 tokens/s 
INFO:__main__:2024-11-05 15:44:34 | Epoch: 1 | Step: 165780 | Dataset: 0-44621 | Loss: 0.755 | 914 ms/step , 6881.97 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 15:44:43 | Epoch: 1 | Step: 165790 | Dataset: 0-44941 | Loss: 0.741 | 913 ms/step , 6887.62 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-05 15:44:53 | Epoch: 1 | Step: 165800 | Dataset: 0-45261 | Loss: 0.818 | 915 ms/step , 6871.86 GFLOP/s , 17907.4 tokens/s INFO:__main__:2024-11-05 15:44:54 | Validation | Step: 165800 | Val_loss: 0.703 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:45:03 | Epoch: 1 | Step: 165810 | Dataset: 0-45581 | Loss: 0.784 | 914 ms/step , 6880.48 GFLOP/s , 15254.9 tokens/s INFO:__main__:2024-11-05 15:45:12 | Epoch: 1 | Step: 165820 | Dataset: 0-45901 | Loss: 0.747 | 916 ms/step , 6868.62 GFLOP/s , 17903.4 tokens/s INFO:__main__:2024-11-05 15:45:22 | Epoch: 1 | Step: 165830 | Dataset: 0-46221 | Loss: 0.788 | 914 ms/step , 6883.80 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 15:45:31 | Epoch: 1 | Step: 165840 | Dataset: 0-46541 | Loss: 0.776 | 915 ms/step , 6871.22 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-05 15:45:40 | Epoch: 1 | Step: 165850 | Dataset: 0-46861 | Loss: 0.665 | 916 ms/step , 6869.42 GFLOP/s , 17901.0 tokens/s INFO:__main__:2024-11-05 15:45:49 | Epoch: 1 | Step: 165860 | Dataset: 0-47181 | Loss: 0.777 | 915 ms/step , 6874.38 GFLOP/s , 17900.2 tokens/s INFO:__main__:2024-11-05 15:45:58 | Epoch: 1 | Step: 165870 | Dataset: 0-47501 | Loss: 0.720 | 916 ms/step , 6868.19 GFLOP/s , 17905.7 tokens/s INFO:__main__:2024-11-05 15:46:07 | Epoch: 1 | Step: 165880 | Dataset: 0-47821 | Loss: 0.740 | 915 ms/step , 6875.42 GFLOP/s , 17906.7 tokens/s INFO:__main__:2024-11-05 15:46:16 | Epoch: 1 | Step: 165890 | Dataset: 0-48141 | Loss: 0.748 | 915 ms/step , 6875.10 GFLOP/s , 17906.3 tokens/s INFO:__main__:2024-11-05 15:46:26 | Epoch: 1 | Step: 165900 | Dataset: 0-48461 | Loss: 0.784 | 916 ms/step , 6866.97 GFLOP/s , 17902.8 tokens/s 
INFO:__main__:2024-11-05 15:46:27 | Validation | Step: 165900 | Val_loss: 0.728 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:46:36 | Epoch: 1 | Step: 165910 | Dataset: 0-48781 | Loss: 0.683 | 914 ms/step , 6881.34 GFLOP/s , 15253.7 tokens/s INFO:__main__:2024-11-05 15:46:46 | Epoch: 1 | Step: 165920 | Dataset: 0-49101 | Loss: 0.785 | 914 ms/step , 6879.16 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 15:46:55 | Epoch: 1 | Step: 165930 | Dataset: 0-49421 | Loss: 0.767 | 913 ms/step , 6885.79 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 15:47:04 | Epoch: 1 | Step: 165940 | Dataset: 0-49741 | Loss: 0.723 | 915 ms/step , 6874.69 GFLOP/s , 17905.9 tokens/s INFO:__main__:2024-11-05 15:47:13 | Epoch: 1 | Step: 165950 | Dataset: 0-50061 | Loss: 0.723 | 913 ms/step , 6886.80 GFLOP/s , 17907.0 tokens/s INFO:__main__:2024-11-05 15:47:22 | Epoch: 1 | Step: 165960 | Dataset: 0-50381 | Loss: 0.776 | 913 ms/step , 6891.29 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 15:47:31 | Epoch: 1 | Step: 165970 | Dataset: 0-50701 | Loss: 0.747 | 916 ms/step , 6869.82 GFLOP/s , 17903.9 tokens/s INFO:__main__:2024-11-05 15:47:40 | Epoch: 1 | Step: 165980 | Dataset: 0-51021 | Loss: 0.773 | 913 ms/step , 6891.07 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 15:47:50 | Epoch: 1 | Step: 165990 | Dataset: 0-51341 | Loss: 0.760 | 914 ms/step , 6882.54 GFLOP/s , 17907.6 tokens/s INFO:__main__:2024-11-05 15:47:59 | Epoch: 1 | Step: 166000 | Dataset: 0-51661 | Loss: 0.815 | 914 ms/step , 6884.59 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 15:48:00 | Validation | Step: 166000 | Val_loss: 0.699 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:48:00 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_154800_step_166000.pt` INFO:__main__:2024-11-05 15:48:11 | Epoch: 1 | Step: 166010 | Dataset: 0-51981 | Loss: 0.721 | 914 ms/step , 6878.26 GFLOP/s , 13798.3 tokens/s INFO:__main__:2024-11-05 15:48:20 | Epoch: 1 | Step: 166020 | 
Dataset: 0-52301 | Loss: 0.820 | 916 ms/step , 6867.96 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-05 15:48:29 | Epoch: 1 | Step: 166030 | Dataset: 0-52621 | Loss: 0.834 | 915 ms/step , 6875.54 GFLOP/s , 17904.5 tokens/s INFO:__main__:2024-11-05 15:48:38 | Epoch: 1 | Step: 166040 | Dataset: 0-52941 | Loss: 0.712 | 914 ms/step , 6883.95 GFLOP/s , 17870.7 tokens/s INFO:__main__:2024-11-05 15:48:47 | Epoch: 1 | Step: 166050 | Dataset: 0-53261 | Loss: 0.723 | 915 ms/step , 6870.44 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 15:48:56 | Epoch: 1 | Step: 166060 | Dataset: 0-53581 | Loss: 0.739 | 914 ms/step , 6880.68 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 15:49:05 | Epoch: 1 | Step: 166070 | Dataset: 0-53901 | Loss: 0.746 | 913 ms/step , 6885.72 GFLOP/s , 17905.9 tokens/s INFO:__main__:2024-11-05 15:49:15 | Epoch: 1 | Step: 166080 | Dataset: 0-54221 | Loss: 0.771 | 915 ms/step , 6872.72 GFLOP/s , 17902.1 tokens/s INFO:__main__:2024-11-05 15:49:24 | Epoch: 1 | Step: 166090 | Dataset: 0-54541 | Loss: 0.816 | 915 ms/step , 6872.19 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 15:49:33 | Epoch: 1 | Step: 166100 | Dataset: 0-54861 | Loss: 0.693 | 916 ms/step , 6867.20 GFLOP/s , 17910.2 tokens/s INFO:__main__:2024-11-05 15:49:35 | Validation | Step: 166100 | Val_loss: 0.725 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:49:44 | Epoch: 1 | Step: 166110 | Dataset: 0-55181 | Loss: 0.715 | 913 ms/step , 6885.14 GFLOP/s , 15255.7 tokens/s INFO:__main__:2024-11-05 15:49:53 | Epoch: 1 | Step: 166120 | Dataset: 0-55501 | Loss: 0.726 | 917 ms/step , 6858.94 GFLOP/s , 17903.9 tokens/s INFO:__main__:2024-11-05 15:50:02 | Epoch: 1 | Step: 166130 | Dataset: 0-55821 | Loss: 0.783 | 916 ms/step , 6868.98 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 15:50:11 | Epoch: 1 | Step: 166140 | Dataset: 0-56141 | Loss: 0.751 | 916 ms/step , 6869.73 GFLOP/s , 17903.4 tokens/s INFO:__main__:2024-11-05 15:50:20 | Epoch: 1 | Step: 166150 | Dataset: 0-56461 | 
Loss: 0.773 | 913 ms/step , 6887.37 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 15:50:29 | Epoch: 1 | Step: 166160 | Dataset: 0-56781 | Loss: 0.679 | 914 ms/step , 6879.89 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-05 15:50:39 | Epoch: 1 | Step: 166170 | Dataset: 0-57101 | Loss: 0.771 | 914 ms/step , 6882.96 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 15:50:48 | Epoch: 1 | Step: 166180 | Dataset: 0-57421 | Loss: 0.712 | 914 ms/step , 6879.44 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 15:50:57 | Epoch: 1 | Step: 166190 | Dataset: 0-57741 | Loss: 0.720 | 914 ms/step , 6884.13 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 15:51:06 | Epoch: 1 | Step: 166200 | Dataset: 0-58061 | Loss: 0.771 | 913 ms/step , 6886.64 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 15:51:08 | Validation | Step: 166200 | Val_loss: 0.766 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:51:17 | Epoch: 1 | Step: 166210 | Dataset: 0-58381 | Loss: 0.737 | 914 ms/step , 6878.90 GFLOP/s , 15256.9 tokens/s INFO:__main__:2024-11-05 15:51:26 | Epoch: 1 | Step: 166220 | Dataset: 0-58701 | Loss: 0.708 | 914 ms/step , 6881.94 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-05 15:51:35 | Epoch: 1 | Step: 166230 | Dataset: 0-59021 | Loss: 0.780 | 913 ms/step , 6888.00 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 15:51:44 | Epoch: 1 | Step: 166240 | Dataset: 0-59341 | Loss: 0.805 | 915 ms/step , 6876.83 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 15:51:53 | Epoch: 1 | Step: 166250 | Dataset: 0-59661 | Loss: 0.762 | 914 ms/step , 6883.88 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-05 15:52:02 | Epoch: 1 | Step: 166260 | Dataset: 0-59981 | Loss: 0.712 | 913 ms/step , 6885.94 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 15:52:12 | Epoch: 1 | Step: 166270 | Dataset: 0-60301 | Loss: 0.776 | 913 ms/step , 6887.09 GFLOP/s , 17903.1 tokens/s INFO:__main__:2024-11-05 15:52:21 | Epoch: 1 | Step: 166280 | Dataset: 0-60621 | Loss: 0.751 | 914 
ms/step , 6883.33 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-05 15:52:30 | Epoch: 1 | Step: 166290 | Dataset: 0-60941 | Loss: 0.748 | 915 ms/step , 6873.53 GFLOP/s , 17903.6 tokens/s INFO:__main__:2024-11-05 15:52:39 | Epoch: 1 | Step: 166300 | Dataset: 0-61261 | Loss: 0.709 | 914 ms/step , 6878.89 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 15:52:41 | Validation | Step: 166300 | Val_loss: 0.759 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:52:50 | Epoch: 1 | Step: 166310 | Dataset: 0-61581 | Loss: 0.736 | 916 ms/step , 6866.21 GFLOP/s , 15257.9 tokens/s INFO:__main__:2024-11-05 15:52:59 | Epoch: 1 | Step: 166320 | Dataset: 0-61901 | Loss: 0.770 | 914 ms/step , 6880.72 GFLOP/s , 17903.0 tokens/s INFO:__main__:2024-11-05 15:53:08 | Epoch: 1 | Step: 166330 | Dataset: 0-62221 | Loss: 0.746 | 914 ms/step , 6882.57 GFLOP/s , 17907.4 tokens/s INFO:__main__:2024-11-05 15:53:17 | Epoch: 1 | Step: 166340 | Dataset: 0-62541 | Loss: 0.729 | 915 ms/step , 6873.12 GFLOP/s , 17905.6 tokens/s INFO:__main__:2024-11-05 15:53:26 | Epoch: 1 | Step: 166350 | Dataset: 0-62861 | Loss: 0.687 | 914 ms/step , 6878.01 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-05 15:53:36 | Epoch: 1 | Step: 166360 | Dataset: 0-63181 | Loss: 0.679 | 914 ms/step , 6882.34 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 15:53:45 | Epoch: 1 | Step: 166370 | Dataset: 0-63501 | Loss: 0.741 | 916 ms/step , 6866.79 GFLOP/s , 17903.4 tokens/s INFO:__main__:2024-11-05 15:53:54 | Epoch: 1 | Step: 166380 | Dataset: 0-63821 | Loss: 0.785 | 914 ms/step , 6884.40 GFLOP/s , 17910.2 tokens/s INFO:__main__:2024-11-05 15:54:03 | Epoch: 1 | Step: 166390 | Dataset: 0-64141 | Loss: 0.698 | 914 ms/step , 6884.65 GFLOP/s , 17905.2 tokens/s INFO:__main__:2024-11-05 15:54:12 | Epoch: 1 | Step: 166400 | Dataset: 0-64461 | Loss: 0.755 | 913 ms/step , 6886.65 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 15:54:14 | Validation | Step: 166400 | Val_loss: 0.722 | Best_val_loss: 0.3570 
INFO:__main__:2024-11-05 15:54:23 | Epoch: 1 | Step: 166410 | Dataset: 0-64781 | Loss: 0.810 | 914 ms/step , 6880.12 GFLOP/s , 15255.9 tokens/s INFO:__main__:2024-11-05 15:54:32 | Epoch: 1 | Step: 166420 | Dataset: 0-65101 | Loss: 0.748 | 914 ms/step , 6883.33 GFLOP/s , 17903.8 tokens/s INFO:__main__:2024-11-05 15:54:41 | Epoch: 1 | Step: 166430 | Dataset: 0-65421 | Loss: 0.798 | 915 ms/step , 6877.04 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-05 15:54:50 | Epoch: 1 | Step: 166440 | Dataset: 0-65741 | Loss: 0.723 | 915 ms/step , 6873.88 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 15:55:00 | Epoch: 1 | Step: 166450 | Dataset: 0-66061 | Loss: 0.731 | 914 ms/step , 6883.81 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-05 15:55:09 | Epoch: 1 | Step: 166460 | Dataset: 0-66381 | Loss: 0.673 | 913 ms/step , 6887.84 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 15:55:18 | Epoch: 1 | Step: 166470 | Dataset: 0-66701 | Loss: 0.802 | 914 ms/step , 6880.96 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-05 15:55:27 | Epoch: 1 | Step: 166480 | Dataset: 0-67021 | Loss: 0.708 | 916 ms/step , 6867.81 GFLOP/s , 17899.2 tokens/s INFO:__main__:2024-11-05 15:55:36 | Epoch: 1 | Step: 166490 | Dataset: 0-67341 | Loss: 0.714 | 914 ms/step , 6878.96 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-05 15:55:45 | Epoch: 1 | Step: 166500 | Dataset: 0-67661 | Loss: 0.761 | 915 ms/step , 6873.96 GFLOP/s , 17908.1 tokens/s INFO:__main__:2024-11-05 15:55:47 | Validation | Step: 166500 | Val_loss: 0.760 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:55:56 | Epoch: 1 | Step: 166510 | Dataset: 0-67981 | Loss: 0.754 | 915 ms/step , 6875.59 GFLOP/s , 15259.4 tokens/s INFO:__main__:2024-11-05 15:56:05 | Epoch: 1 | Step: 166520 | Dataset: 0-68301 | Loss: 0.751 | 915 ms/step , 6871.19 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-05 15:56:14 | Epoch: 1 | Step: 166530 | Dataset: 0-68621 | Loss: 0.684 | 914 ms/step , 6879.39 GFLOP/s , 17905.0 tokens/s 
INFO:__main__:2024-11-05 15:56:23 | Epoch: 1 | Step: 166540 | Dataset: 0-68941 | Loss: 0.633 | 914 ms/step , 6883.10 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 15:56:33 | Epoch: 1 | Step: 166550 | Dataset: 0-69261 | Loss: 0.675 | 914 ms/step , 6880.59 GFLOP/s , 17899.9 tokens/s INFO:__main__:2024-11-05 15:56:42 | Epoch: 1 | Step: 166560 | Dataset: 0-69581 | Loss: 0.700 | 914 ms/step , 6878.73 GFLOP/s , 17902.1 tokens/s INFO:__main__:2024-11-05 15:56:51 | Epoch: 1 | Step: 166570 | Dataset: 0-69901 | Loss: 0.787 | 914 ms/step , 6879.09 GFLOP/s , 17895.1 tokens/s INFO:__main__:2024-11-05 15:57:00 | Epoch: 1 | Step: 166580 | Dataset: 0-70221 | Loss: 0.794 | 916 ms/step , 6868.82 GFLOP/s , 17903.7 tokens/s INFO:__main__:2024-11-05 15:57:09 | Epoch: 1 | Step: 166590 | Dataset: 0-70541 | Loss: 0.691 | 915 ms/step , 6871.31 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-05 15:57:18 | Epoch: 1 | Step: 166600 | Dataset: 0-70861 | Loss: 0.752 | 915 ms/step , 6873.48 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-05 15:57:20 | Validation | Step: 166600 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:57:29 | Epoch: 1 | Step: 166610 | Dataset: 0-71181 | Loss: 0.757 | 916 ms/step , 6866.85 GFLOP/s , 15252.4 tokens/s INFO:__main__:2024-11-05 15:57:38 | Epoch: 1 | Step: 166620 | Dataset: 0-71501 | Loss: 0.758 | 914 ms/step , 6877.89 GFLOP/s , 17899.7 tokens/s INFO:__main__:2024-11-05 15:57:47 | Epoch: 1 | Step: 166630 | Dataset: 0-71821 | Loss: 0.715 | 915 ms/step , 6872.53 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 15:57:57 | Epoch: 1 | Step: 166640 | Dataset: 0-72141 | Loss: 0.681 | 914 ms/step , 6878.36 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-05 15:58:06 | Epoch: 1 | Step: 166650 | Dataset: 0-72461 | Loss: 0.711 | 914 ms/step , 6881.63 GFLOP/s , 17902.3 tokens/s INFO:__main__:2024-11-05 15:58:15 | Epoch: 1 | Step: 166660 | Dataset: 0-72781 | Loss: 0.705 | 915 ms/step , 6873.67 GFLOP/s , 17902.2 tokens/s 
INFO:__main__:2024-11-05 15:58:24 | Epoch: 1 | Step: 166670 | Dataset: 0-73101 | Loss: 0.786 | 914 ms/step , 6878.22 GFLOP/s , 17904.7 tokens/s INFO:__main__:2024-11-05 15:58:33 | Epoch: 1 | Step: 166680 | Dataset: 0-73421 | Loss: 0.793 | 915 ms/step , 6877.34 GFLOP/s , 17904.2 tokens/s INFO:__main__:2024-11-05 15:58:42 | Epoch: 1 | Step: 166690 | Dataset: 0-73741 | Loss: 0.726 | 913 ms/step , 6889.19 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 15:58:51 | Epoch: 1 | Step: 166700 | Dataset: 0-74061 | Loss: 0.803 | 915 ms/step , 6876.45 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 15:58:53 | Validation | Step: 166700 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 15:59:02 | Epoch: 1 | Step: 166710 | Dataset: 0-74381 | Loss: 0.761 | 914 ms/step , 6880.15 GFLOP/s , 15271.6 tokens/s INFO:__main__:2024-11-05 15:59:11 | Epoch: 1 | Step: 166720 | Dataset: 0-74701 | Loss: 0.732 | 914 ms/step , 6884.28 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 15:59:20 | Epoch: 1 | Step: 166730 | Dataset: 0-75021 | Loss: 0.804 | 914 ms/step , 6882.80 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 15:59:30 | Epoch: 1 | Step: 166740 | Dataset: 0-75341 | Loss: 0.507 | 912 ms/step , 6894.27 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 15:59:39 | Epoch: 1 | Step: 166750 | Dataset: 0-75661 | Loss: 0.836 | 915 ms/step , 6876.29 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 15:59:48 | Epoch: 1 | Step: 166760 | Dataset: 0-75981 | Loss: 0.762 | 912 ms/step , 6894.63 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 15:59:57 | Epoch: 1 | Step: 166770 | Dataset: 0-76301 | Loss: 0.825 | 913 ms/step , 6890.86 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 16:00:06 | Epoch: 1 | Step: 166780 | Dataset: 0-76621 | Loss: 0.559 | 913 ms/step , 6891.39 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 16:00:15 | Epoch: 1 | Step: 166790 | Dataset: 0-76941 | Loss: 0.689 | 913 ms/step , 6890.64 GFLOP/s , 17917.1 tokens/s 
INFO:__main__:2024-11-05 16:00:24 | Epoch: 1 | Step: 166800 | Dataset: 0-77261 | Loss: 0.844 | 912 ms/step , 6897.46 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 16:00:26 | Validation | Step: 166800 | Val_loss: 0.727 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:00:35 | Epoch: 1 | Step: 166810 | Dataset: 0-77581 | Loss: 0.885 | 917 ms/step , 6862.38 GFLOP/s , 15258.0 tokens/s INFO:__main__:2024-11-05 16:00:44 | Epoch: 1 | Step: 166820 | Dataset: 0-77901 | Loss: 0.811 | 914 ms/step , 6881.79 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 16:00:53 | Epoch: 1 | Step: 166830 | Dataset: 0-78221 | Loss: 0.796 | 915 ms/step , 6875.27 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 16:01:03 | Epoch: 1 | Step: 166840 | Dataset: 0-78541 | Loss: 0.781 | 915 ms/step , 6871.73 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 16:01:12 | Epoch: 1 | Step: 166850 | Dataset: 0-78861 | Loss: 0.859 | 915 ms/step , 6873.52 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 16:01:21 | Epoch: 1 | Step: 166860 | Dataset: 0-79181 | Loss: 0.793 | 914 ms/step , 6882.58 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 16:01:30 | Epoch: 1 | Step: 166870 | Dataset: 0-79501 | Loss: 0.732 | 914 ms/step , 6884.49 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 16:01:39 | Epoch: 1 | Step: 166880 | Dataset: 0-79821 | Loss: 0.735 | 913 ms/step , 6885.15 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 16:01:48 | Epoch: 1 | Step: 166890 | Dataset: 0-80141 | Loss: 0.680 | 913 ms/step , 6890.50 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 16:01:57 | Epoch: 1 | Step: 166900 | Dataset: 0-80461 | Loss: 0.518 | 913 ms/step , 6885.12 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 16:01:59 | Validation | Step: 166900 | Val_loss: 0.735 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:02:08 | Epoch: 1 | Step: 166910 | Dataset: 0-80781 | Loss: 0.749 | 913 ms/step , 6889.98 GFLOP/s , 15264.9 tokens/s INFO:__main__:2024-11-05 16:02:17 | Epoch: 1 | Step: 166920 | 
Dataset: 0-81101 | Loss: 0.680 | 912 ms/step , 6892.64 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 16:02:26 | Epoch: 1 | Step: 166930 | Dataset: 0-81421 | Loss: 0.798 | 912 ms/step , 6893.48 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 16:02:36 | Epoch: 1 | Step: 166940 | Dataset: 0-81741 | Loss: 0.674 | 914 ms/step , 6880.96 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 16:02:45 | Epoch: 1 | Step: 166950 | Dataset: 0-82061 | Loss: 0.810 | 914 ms/step , 6881.93 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 16:02:54 | Epoch: 1 | Step: 166960 | Dataset: 0-82381 | Loss: 0.774 | 914 ms/step , 6881.03 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 16:03:03 | Epoch: 1 | Step: 166970 | Dataset: 0-82701 | Loss: 0.632 | 913 ms/step , 6891.34 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 16:03:12 | Epoch: 1 | Step: 166980 | Dataset: 0-83021 | Loss: 0.716 | 913 ms/step , 6885.23 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 16:03:21 | Epoch: 1 | Step: 166990 | Dataset: 0-83341 | Loss: 0.778 | 916 ms/step , 6867.90 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 16:03:30 | Epoch: 1 | Step: 167000 | Dataset: 0-83661 | Loss: 0.783 | 913 ms/step , 6888.25 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 16:03:32 | Validation | Step: 167000 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:03:32 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_160332_step_167000.pt` INFO:__main__:2024-11-05 16:03:42 | Epoch: 1 | Step: 167010 | Dataset: 0-83981 | Loss: 0.816 | 914 ms/step , 6882.88 GFLOP/s , 13785.0 tokens/s INFO:__main__:2024-11-05 16:03:51 | Epoch: 1 | Step: 167020 | Dataset: 0-84301 | Loss: 0.772 | 915 ms/step , 6875.81 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 16:04:01 | Epoch: 1 | Step: 167030 | Dataset: 0-84621 | Loss: 0.869 | 914 ms/step , 6878.13 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 16:04:10 | Epoch: 1 | Step: 167040 | Dataset: 0-84941 | 
Loss: 0.674 | 913 ms/step , 6889.02 GFLOP/s , 17890.8 tokens/s INFO:__main__:2024-11-05 16:04:19 | Epoch: 1 | Step: 167050 | Dataset: 0-85261 | Loss: 0.699 | 912 ms/step , 6893.46 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 16:04:28 | Epoch: 1 | Step: 167060 | Dataset: 0-85581 | Loss: 0.797 | 916 ms/step , 6865.75 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 16:04:37 | Epoch: 1 | Step: 167070 | Dataset: 0-85901 | Loss: 0.669 | 913 ms/step , 6888.69 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 16:04:46 | Epoch: 1 | Step: 167080 | Dataset: 0-86221 | Loss: 0.813 | 914 ms/step , 6882.41 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 16:04:55 | Epoch: 1 | Step: 167090 | Dataset: 0-86541 | Loss: 0.813 | 915 ms/step , 6876.25 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 16:05:05 | Epoch: 1 | Step: 167100 | Dataset: 0-86861 | Loss: 0.794 | 914 ms/step , 6880.82 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 16:05:06 | Validation | Step: 167100 | Val_loss: 0.711 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:05:15 | Epoch: 1 | Step: 167110 | Dataset: 0-87181 | Loss: 0.845 | 913 ms/step , 6885.46 GFLOP/s , 15262.2 tokens/s INFO:__main__:2024-11-05 16:05:24 | Epoch: 1 | Step: 167120 | Dataset: 0-87501 | Loss: 0.775 | 915 ms/step , 6877.05 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 16:05:34 | Epoch: 1 | Step: 167130 | Dataset: 0-87821 | Loss: 0.589 | 913 ms/step , 6890.11 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 16:05:43 | Epoch: 1 | Step: 167140 | Dataset: 0-88141 | Loss: 0.580 | 913 ms/step , 6892.06 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 16:05:52 | Epoch: 1 | Step: 167150 | Dataset: 0-88461 | Loss: 0.704 | 913 ms/step , 6888.82 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 16:06:01 | Epoch: 1 | Step: 167160 | Dataset: 0-88781 | Loss: 0.731 | 914 ms/step , 6881.94 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 16:06:10 | Epoch: 1 | Step: 167170 | Dataset: 0-89101 | Loss: 0.767 | 913 
ms/step , 6885.10 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 16:06:19 | Epoch: 1 | Step: 167180 | Dataset: 0-89421 | Loss: 0.710 | 912 ms/step , 6893.40 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 16:06:28 | Epoch: 1 | Step: 167190 | Dataset: 0-89741 | Loss: 0.631 | 913 ms/step , 6890.74 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 16:06:38 | Epoch: 1 | Step: 167200 | Dataset: 0-90061 | Loss: 0.775 | 913 ms/step , 6891.71 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 16:06:39 | Validation | Step: 167200 | Val_loss: 0.700 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:06:48 | Epoch: 1 | Step: 167210 | Dataset: 0-90381 | Loss: 0.778 | 913 ms/step , 6886.37 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-05 16:06:57 | Epoch: 1 | Step: 167220 | Dataset: 0-90701 | Loss: 0.600 | 914 ms/step , 6882.69 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 16:07:07 | Epoch: 1 | Step: 167230 | Dataset: 0-91021 | Loss: 0.748 | 913 ms/step , 6890.30 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 16:07:16 | Epoch: 1 | Step: 167240 | Dataset: 0-91341 | Loss: 0.809 | 913 ms/step , 6892.10 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 16:07:25 | Epoch: 1 | Step: 167250 | Dataset: 0-91661 | Loss: 0.535 | 912 ms/step , 6892.65 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 16:07:34 | Epoch: 1 | Step: 167260 | Dataset: 0-91981 | Loss: 0.777 | 914 ms/step , 6881.32 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 16:07:43 | Epoch: 1 | Step: 167270 | Dataset: 0-92301 | Loss: 0.785 | 914 ms/step , 6880.45 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 16:07:52 | Epoch: 1 | Step: 167280 | Dataset: 0-92621 | Loss: 0.780 | 916 ms/step , 6869.71 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 16:08:01 | Epoch: 1 | Step: 167290 | Dataset: 0-92941 | Loss: 0.744 | 913 ms/step , 6888.24 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 16:08:11 | Epoch: 1 | Step: 167300 | Dataset: 0-93261 | Loss: 0.848 | 914 ms/step , 6884.62 
GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 16:08:12 | Validation | Step: 167300 | Val_loss: 0.731 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:08:21 | Epoch: 1 | Step: 167310 | Dataset: 0-93581 | Loss: 0.787 | 913 ms/step , 6885.28 GFLOP/s , 15267.8 tokens/s INFO:__main__:2024-11-05 16:08:30 | Epoch: 1 | Step: 167320 | Dataset: 0-93901 | Loss: 0.710 | 912 ms/step , 6895.13 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 16:08:40 | Epoch: 1 | Step: 167330 | Dataset: 0-94221 | Loss: 0.626 | 913 ms/step , 6891.10 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 16:08:49 | Epoch: 1 | Step: 167340 | Dataset: 0-94541 | Loss: 0.712 | 914 ms/step , 6882.99 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 16:08:58 | Epoch: 1 | Step: 167350 | Dataset: 0-94861 | Loss: 0.844 | 914 ms/step , 6883.80 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 16:09:07 | Epoch: 1 | Step: 167360 | Dataset: 0-95181 | Loss: 0.775 | 913 ms/step , 6885.53 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 16:09:16 | Epoch: 1 | Step: 167370 | Dataset: 0-95501 | Loss: 0.707 | 914 ms/step , 6885.04 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 16:09:25 | Epoch: 1 | Step: 167380 | Dataset: 0-95821 | Loss: 0.824 | 913 ms/step , 6887.59 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 16:09:34 | Epoch: 1 | Step: 167390 | Dataset: 0-96141 | Loss: 0.924 | 913 ms/step , 6887.24 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 16:09:44 | Epoch: 1 | Step: 167400 | Dataset: 0-96461 | Loss: 0.742 | 912 ms/step , 6898.59 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 16:09:45 | Validation | Step: 167400 | Val_loss: 0.815 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:09:54 | Epoch: 1 | Step: 167410 | Dataset: 0-96781 | Loss: 0.645 | 912 ms/step , 6897.47 GFLOP/s , 15268.4 tokens/s INFO:__main__:2024-11-05 16:10:03 | Epoch: 1 | Step: 167420 | Dataset: 0-97101 | Loss: 0.723 | 914 ms/step , 6882.96 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 16:10:13 | 
Epoch: 1 | Step: 167430 | Dataset: 0-97421 | Loss: 0.824 | 913 ms/step , 6886.04 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 16:10:22 | Epoch: 1 | Step: 167440 | Dataset: 0-97741 | Loss: 0.743 | 914 ms/step , 6881.41 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 16:10:31 | Epoch: 1 | Step: 167450 | Dataset: 0-98061 | Loss: 0.707 | 916 ms/step , 6869.12 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 16:10:40 | Epoch: 1 | Step: 167460 | Dataset: 0-98381 | Loss: 0.629 | 913 ms/step , 6886.74 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 16:10:49 | Epoch: 1 | Step: 167470 | Dataset: 0-98701 | Loss: 0.749 | 916 ms/step , 6869.14 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 16:10:58 | Epoch: 1 | Step: 167480 | Dataset: 0-99021 | Loss: 0.779 | 914 ms/step , 6884.67 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 16:11:07 | Epoch: 1 | Step: 167490 | Dataset: 0-99341 | Loss: 0.814 | 915 ms/step , 6876.50 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 16:11:17 | Epoch: 1 | Step: 167500 | Dataset: 0-99661 | Loss: 0.650 | 912 ms/step , 6892.82 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 16:11:18 | Validation | Step: 167500 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:11:27 | Epoch: 1 | Step: 167510 | Dataset: 0-99981 | Loss: 0.678 | 914 ms/step , 6884.10 GFLOP/s , 15269.8 tokens/s INFO:__main__:2024-11-05 16:11:36 | Epoch: 1 | Step: 167520 | Dataset: 0-100301 | Loss: 0.654 | 913 ms/step , 6885.90 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 16:11:46 | Epoch: 1 | Step: 167530 | Dataset: 0-100621 | Loss: 0.812 | 914 ms/step , 6884.20 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 16:11:55 | Epoch: 1 | Step: 167540 | Dataset: 0-100941 | Loss: 0.673 | 912 ms/step , 6892.91 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 16:12:04 | Epoch: 1 | Step: 167550 | Dataset: 0-101261 | Loss: 0.757 | 913 ms/step , 6886.64 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 16:12:13 | Epoch: 1 | Step: 
167560 | Dataset: 0-101581 | Loss: 0.595 | 913 ms/step , 6886.55 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 16:12:22 | Epoch: 1 | Step: 167570 | Dataset: 0-101901 | Loss: 0.724 | 914 ms/step , 6882.31 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 16:12:31 | Epoch: 1 | Step: 167580 | Dataset: 0-102221 | Loss: 0.769 | 913 ms/step , 6888.25 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 16:12:40 | Epoch: 1 | Step: 167590 | Dataset: 0-102541 | Loss: 0.696 | 913 ms/step , 6888.39 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 16:12:50 | Epoch: 1 | Step: 167600 | Dataset: 0-102861 | Loss: 0.684 | 914 ms/step , 6879.10 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 16:12:51 | Validation | Step: 167600 | Val_loss: 0.824 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:13:00 | Epoch: 1 | Step: 167610 | Dataset: 0-103181 | Loss: 0.619 | 913 ms/step , 6888.46 GFLOP/s , 15289.1 tokens/s INFO:__main__:2024-11-05 16:13:09 | Epoch: 1 | Step: 167620 | Dataset: 0-103501 | Loss: 0.784 | 915 ms/step , 6874.39 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 16:13:19 | Epoch: 1 | Step: 167630 | Dataset: 0-103821 | Loss: 0.618 | 912 ms/step , 6894.93 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 16:13:28 | Epoch: 1 | Step: 167640 | Dataset: 0-104141 | Loss: 0.685 | 913 ms/step , 6885.96 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 16:13:37 | Epoch: 1 | Step: 167650 | Dataset: 0-104461 | Loss: 0.651 | 912 ms/step , 6893.64 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 16:13:46 | Epoch: 1 | Step: 167660 | Dataset: 0-104781 | Loss: 0.671 | 915 ms/step , 6875.02 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 16:13:55 | Epoch: 1 | Step: 167670 | Dataset: 0-105101 | Loss: 0.716 | 914 ms/step , 6878.72 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 16:14:04 | Epoch: 1 | Step: 167680 | Dataset: 0-105421 | Loss: 0.704 | 913 ms/step , 6888.69 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 16:14:13 | Epoch: 1 | Step: 167690 | 
Dataset: 0-105741 | Loss: 0.693 | 913 ms/step , 6888.99 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 16:14:22 | Epoch: 1 | Step: 167700 | Dataset: 0-106061 | Loss: 0.746 | 913 ms/step , 6891.70 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 16:14:24 | Validation | Step: 167700 | Val_loss: 0.798 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:14:33 | Epoch: 1 | Step: 167710 | Dataset: 0-106381 | Loss: 0.717 | 914 ms/step , 6884.09 GFLOP/s , 15279.0 tokens/s INFO:__main__:2024-11-05 16:14:42 | Epoch: 1 | Step: 167720 | Dataset: 0-106701 | Loss: 0.697 | 914 ms/step , 6878.36 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 16:14:51 | Epoch: 1 | Step: 167730 | Dataset: 0-107021 | Loss: 0.726 | 914 ms/step , 6883.23 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 16:15:01 | Epoch: 1 | Step: 167740 | Dataset: 0-107341 | Loss: 0.737 | 913 ms/step , 6889.87 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 16:15:10 | Epoch: 1 | Step: 167750 | Dataset: 0-107661 | Loss: 0.676 | 913 ms/step , 6891.11 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 16:15:19 | Epoch: 1 | Step: 167760 | Dataset: 0-107981 | Loss: 0.610 | 912 ms/step , 6894.29 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 16:15:28 | Epoch: 1 | Step: 167770 | Dataset: 0-108301 | Loss: 0.726 | 913 ms/step , 6886.92 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 16:15:37 | Epoch: 1 | Step: 167780 | Dataset: 0-108621 | Loss: 0.639 | 912 ms/step , 6893.92 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 16:15:46 | Epoch: 1 | Step: 167790 | Dataset: 0-108941 | Loss: 0.687 | 913 ms/step , 6890.44 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 16:15:55 | Epoch: 1 | Step: 167800 | Dataset: 0-109261 | Loss: 0.701 | 912 ms/step , 6893.58 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 16:15:57 | Validation | Step: 167800 | Val_loss: 0.817 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:16:06 | Epoch: 1 | Step: 167810 | Dataset: 0-109581 | Loss: 0.801 | 914 ms/step , 
6882.99 GFLOP/s , 15278.1 tokens/s INFO:__main__:2024-11-05 16:16:15 | Epoch: 1 | Step: 167820 | Dataset: 0-109901 | Loss: 0.780 | 915 ms/step , 6873.32 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 16:16:24 | Epoch: 1 | Step: 167830 | Dataset: 0-110221 | Loss: 0.738 | 916 ms/step , 6865.69 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 16:16:34 | Epoch: 1 | Step: 167840 | Dataset: 0-110541 | Loss: 0.737 | 913 ms/step , 6888.95 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 16:16:43 | Epoch: 1 | Step: 167850 | Dataset: 0-110861 | Loss: 0.719 | 914 ms/step , 6882.51 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 16:16:52 | Epoch: 1 | Step: 167860 | Dataset: 0-111181 | Loss: 0.717 | 913 ms/step , 6886.15 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 16:17:01 | Epoch: 1 | Step: 167870 | Dataset: 0-111501 | Loss: 0.768 | 913 ms/step , 6886.66 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 16:17:10 | Epoch: 1 | Step: 167880 | Dataset: 0-111821 | Loss: 0.600 | 913 ms/step , 6890.10 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 16:17:19 | Epoch: 1 | Step: 167890 | Dataset: 0-112141 | Loss: 0.721 | 915 ms/step , 6877.28 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 16:17:28 | Epoch: 1 | Step: 167900 | Dataset: 0-112461 | Loss: 0.696 | 913 ms/step , 6885.61 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 16:17:30 | Validation | Step: 167900 | Val_loss: 0.754 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:17:39 | Epoch: 1 | Step: 167910 | Dataset: 0-112781 | Loss: 0.697 | 914 ms/step , 6883.01 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 16:17:48 | Epoch: 1 | Step: 167920 | Dataset: 0-113101 | Loss: 0.639 | 913 ms/step , 6890.26 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 16:17:57 | Epoch: 1 | Step: 167930 | Dataset: 0-113421 | Loss: 0.777 | 914 ms/step , 6883.01 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 16:18:07 | Epoch: 1 | Step: 167940 | Dataset: 0-113741 | Loss: 0.708 | 914 ms/step , 6878.07 
GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 16:18:16 | Epoch: 1 | Step: 167950 | Dataset: 0-114061 | Loss: 0.743 | 912 ms/step , 6894.10 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 16:18:25 | Epoch: 1 | Step: 167960 | Dataset: 0-114381 | Loss: 0.759 | 914 ms/step , 6881.96 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 16:18:34 | Epoch: 1 | Step: 167970 | Dataset: 0-114701 | Loss: 0.776 | 912 ms/step , 6898.53 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 16:18:43 | Epoch: 1 | Step: 167980 | Dataset: 0-115021 | Loss: 0.706 | 913 ms/step , 6889.15 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 16:18:52 | Epoch: 1 | Step: 167990 | Dataset: 0-115341 | Loss: 0.744 | 913 ms/step , 6891.63 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 16:19:01 | Epoch: 1 | Step: 168000 | Dataset: 0-115661 | Loss: 0.665 | 914 ms/step , 6884.03 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 16:19:03 | Validation | Step: 168000 | Val_loss: 0.828 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:19:03 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_161903_step_168000.pt` INFO:__main__:2024-11-05 16:19:13 | Epoch: 1 | Step: 168010 | Dataset: 0-115981 | Loss: 0.713 | 914 ms/step , 6882.69 GFLOP/s , 13811.0 tokens/s INFO:__main__:2024-11-05 16:19:22 | Epoch: 1 | Step: 168020 | Dataset: 0-116301 | Loss: 0.674 | 913 ms/step , 6889.14 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 16:19:32 | Epoch: 1 | Step: 168030 | Dataset: 0-116621 | Loss: 0.626 | 914 ms/step , 6880.63 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 16:19:41 | Epoch: 1 | Step: 168040 | Dataset: 0-116941 | Loss: 0.602 | 913 ms/step , 6887.20 GFLOP/s , 17880.8 tokens/s INFO:__main__:2024-11-05 16:19:50 | Epoch: 1 | Step: 168050 | Dataset: 0-117261 | Loss: 0.738 | 913 ms/step , 6890.13 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 16:19:59 | Epoch: 1 | Step: 168060 | Dataset: 0-117581 | Loss: 0.753 | 915 ms/step , 6876.16 GFLOP/s , 
17923.5 tokens/s INFO:__main__:2024-11-05 16:20:08 | Epoch: 1 | Step: 168070 | Dataset: 0-117901 | Loss: 0.501 | 912 ms/step , 6893.42 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 16:20:17 | Epoch: 1 | Step: 168080 | Dataset: 0-118221 | Loss: 0.691 | 914 ms/step , 6882.66 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 16:20:26 | Epoch: 1 | Step: 168090 | Dataset: 0-118541 | Loss: 0.705 | 914 ms/step , 6881.55 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 16:20:36 | Epoch: 1 | Step: 168100 | Dataset: 0-118861 | Loss: 0.658 | 914 ms/step , 6881.60 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 16:20:37 | Validation | Step: 168100 | Val_loss: 0.867 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:20:46 | Epoch: 1 | Step: 168110 | Dataset: 0-119181 | Loss: 0.645 | 912 ms/step , 6896.43 GFLOP/s , 15275.3 tokens/s INFO:__main__:2024-11-05 16:20:55 | Epoch: 1 | Step: 168120 | Dataset: 0-119501 | Loss: 0.755 | 915 ms/step , 6872.70 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 16:21:05 | Epoch: 1 | Step: 168130 | Dataset: 0-119821 | Loss: 0.763 | 912 ms/step , 6892.94 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 16:21:14 | Epoch: 1 | Step: 168140 | Dataset: 0-120141 | Loss: 0.610 | 912 ms/step , 6894.28 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 16:21:23 | Epoch: 1 | Step: 168150 | Dataset: 0-120461 | Loss: 0.801 | 914 ms/step , 6883.42 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 16:21:32 | Epoch: 1 | Step: 168160 | Dataset: 0-120781 | Loss: 0.828 | 913 ms/step , 6887.06 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 16:21:41 | Epoch: 1 | Step: 168170 | Dataset: 0-121101 | Loss: 0.692 | 913 ms/step , 6889.83 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 16:21:50 | Epoch: 1 | Step: 168180 | Dataset: 0-121421 | Loss: 0.576 | 912 ms/step , 6898.24 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 16:21:59 | Epoch: 1 | Step: 168190 | Dataset: 0-121741 | Loss: 0.812 | 914 ms/step , 6884.85 GFLOP/s , 17933.8 
tokens/s INFO:__main__:2024-11-05 16:22:08 | Epoch: 1 | Step: 168200 | Dataset: 0-122061 | Loss: 0.663 | 913 ms/step , 6891.51 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 16:22:10 | Validation | Step: 168200 | Val_loss: 0.890 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:22:19 | Epoch: 1 | Step: 168210 | Dataset: 0-122381 | Loss: 0.720 | 912 ms/step , 6893.55 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 16:22:28 | Epoch: 1 | Step: 168220 | Dataset: 0-122701 | Loss: 0.720 | 912 ms/step , 6893.12 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 16:22:37 | Epoch: 1 | Step: 168230 | Dataset: 0-123021 | Loss: 0.665 | 913 ms/step , 6885.41 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 16:22:47 | Epoch: 1 | Step: 168240 | Dataset: 0-123341 | Loss: 0.755 | 914 ms/step , 6880.10 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 16:22:56 | Epoch: 1 | Step: 168250 | Dataset: 0-123661 | Loss: 0.782 | 914 ms/step , 6883.95 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 16:23:05 | Epoch: 1 | Step: 168260 | Dataset: 0-123981 | Loss: 0.716 | 913 ms/step , 6887.59 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 16:23:14 | Epoch: 1 | Step: 168270 | Dataset: 0-124301 | Loss: 0.845 | 914 ms/step , 6882.80 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 16:23:23 | Epoch: 1 | Step: 168280 | Dataset: 0-124621 | Loss: 0.634 | 913 ms/step , 6889.42 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 16:23:32 | Epoch: 1 | Step: 168290 | Dataset: 0-124941 | Loss: 0.686 | 914 ms/step , 6881.34 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 16:23:41 | Epoch: 1 | Step: 168300 | Dataset: 0-125261 | Loss: 0.598 | 912 ms/step , 6893.00 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 16:23:43 | Validation | Step: 168300 | Val_loss: 0.846 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:23:52 | Epoch: 1 | Step: 168310 | Dataset: 0-125581 | Loss: 0.672 | 913 ms/step , 6890.34 GFLOP/s , 15283.7 tokens/s INFO:__main__:2024-11-05 16:24:01 | Epoch: 
1 | Step: 168320 | Dataset: 0-125901 | Loss: 0.645 | 912 ms/step , 6896.90 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 16:24:10 | Epoch: 1 | Step: 168330 | Dataset: 0-126221 | Loss: 0.747 | 913 ms/step , 6891.37 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 16:24:20 | Epoch: 1 | Step: 168340 | Dataset: 0-126541 | Loss: 0.667 | 913 ms/step , 6887.70 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 16:24:29 | Epoch: 1 | Step: 168350 | Dataset: 0-126861 | Loss: 0.769 | 913 ms/step , 6887.10 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 16:24:38 | Epoch: 1 | Step: 168360 | Dataset: 0-127181 | Loss: 0.797 | 913 ms/step , 6885.36 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 16:24:47 | Epoch: 1 | Step: 168370 | Dataset: 0-127501 | Loss: 0.716 | 913 ms/step , 6887.21 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 16:24:56 | Epoch: 1 | Step: 168380 | Dataset: 0-127821 | Loss: 0.522 | 912 ms/step , 6895.70 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 16:25:05 | Epoch: 1 | Step: 168390 | Dataset: 0-128141 | Loss: 0.758 | 914 ms/step , 6879.42 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 16:25:14 | Epoch: 1 | Step: 168400 | Dataset: 0-128461 | Loss: 0.765 | 912 ms/step , 6899.88 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 16:25:16 | Validation | Step: 168400 | Val_loss: 0.847 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:25:25 | Epoch: 1 | Step: 168410 | Dataset: 0-128781 | Loss: 0.753 | 912 ms/step , 6892.64 GFLOP/s , 15276.3 tokens/s INFO:__main__:2024-11-05 16:25:34 | Epoch: 1 | Step: 168420 | Dataset: 0-129101 | Loss: 0.781 | 912 ms/step , 6893.33 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 16:25:43 | Epoch: 1 | Step: 168430 | Dataset: 0-129421 | Loss: 0.863 | 913 ms/step , 6886.27 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 16:25:52 | Epoch: 1 | Step: 168440 | Dataset: 0-129741 | Loss: 0.713 | 913 ms/step , 6890.17 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 16:26:02 | Epoch: 1 | Step: 
168450 | Dataset: 0-130061 | Loss: 0.737 | 913 ms/step , 6890.17 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 16:26:11 | Epoch: 1 | Step: 168460 | Dataset: 0-130381 | Loss: 0.786 | 913 ms/step , 6890.13 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 16:26:20 | Epoch: 1 | Step: 168470 | Dataset: 0-130701 | Loss: 0.720 | 914 ms/step , 6877.58 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 16:26:29 | Epoch: 1 | Step: 168480 | Dataset: 0-131021 | Loss: 0.630 | 914 ms/step , 6878.76 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 16:26:38 | Epoch: 1 | Step: 168490 | Dataset: 0-131341 | Loss: 0.597 | 913 ms/step , 6892.18 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 16:26:47 | Epoch: 1 | Step: 168500 | Dataset: 0-131661 | Loss: 0.649 | 914 ms/step , 6883.86 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 16:26:49 | Validation | Step: 168500 | Val_loss: 0.790 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:26:58 | Epoch: 1 | Step: 168510 | Dataset: 0-131981 | Loss: 0.713 | 912 ms/step , 6892.95 GFLOP/s , 15284.1 tokens/s INFO:__main__:2024-11-05 16:27:07 | Epoch: 1 | Step: 168520 | Dataset: 0-132301 | Loss: 0.669 | 913 ms/step , 6887.22 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 16:27:16 | Epoch: 1 | Step: 168530 | Dataset: 0-132621 | Loss: 0.597 | 913 ms/step , 6891.26 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 16:27:25 | Epoch: 1 | Step: 168540 | Dataset: 0-132941 | Loss: 0.793 | 912 ms/step , 6895.47 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 16:27:35 | Epoch: 1 | Step: 168550 | Dataset: 0-133261 | Loss: 0.703 | 913 ms/step , 6892.46 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 16:27:44 | Epoch: 1 | Step: 168560 | Dataset: 0-133581 | Loss: 0.685 | 913 ms/step , 6887.18 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 16:27:53 | Epoch: 1 | Step: 168570 | Dataset: 0-133901 | Loss: 0.363 | 912 ms/step , 6894.46 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 16:28:02 | Epoch: 1 | Step: 168580 | 
Dataset: 0-134221 | Loss: 0.312 | 912 ms/step , 6897.73 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-05 16:28:11 | Epoch: 1 | Step: 168590 | Dataset: 0-134541 | Loss: 0.376 | 913 ms/step , 6887.90 GFLOP/s , 17955.3 tokens/s INFO:__main__:2024-11-05 16:28:20 | Epoch: 1 | Step: 168600 | Dataset: 0-134861 | Loss: 0.375 | 912 ms/step , 6898.12 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-05 16:28:22 | Validation | Step: 168600 | Val_loss: 0.802 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:28:31 | Epoch: 1 | Step: 168610 | Dataset: 0-135181 | Loss: 0.401 | 911 ms/step , 6903.39 GFLOP/s , 15288.9 tokens/s INFO:__main__:2024-11-05 16:28:40 | Epoch: 1 | Step: 168620 | Dataset: 0-135501 | Loss: 0.319 | 913 ms/step , 6890.27 GFLOP/s , 17950.3 tokens/s INFO:__main__:2024-11-05 16:28:49 | Epoch: 1 | Step: 168630 | Dataset: 0-135821 | Loss: 0.344 | 912 ms/step , 6892.96 GFLOP/s , 17955.3 tokens/s INFO:__main__:2024-11-05 16:28:58 | Epoch: 1 | Step: 168640 | Dataset: 0-136141 | Loss: 0.347 | 912 ms/step , 6900.01 GFLOP/s , 17956.1 tokens/s INFO:__main__:2024-11-05 16:29:07 | Epoch: 1 | Step: 168650 | Dataset: 0-136461 | Loss: 0.304 | 913 ms/step , 6887.41 GFLOP/s , 17957.7 tokens/s INFO:__main__:2024-11-05 16:29:17 | Epoch: 1 | Step: 168660 | Dataset: 0-136781 | Loss: 0.342 | 912 ms/step , 6894.64 GFLOP/s , 17954.8 tokens/s INFO:__main__:2024-11-05 16:29:26 | Epoch: 1 | Step: 168670 | Dataset: 0-137101 | Loss: 0.355 | 912 ms/step , 6896.83 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-05 16:29:35 | Epoch: 1 | Step: 168680 | Dataset: 0-137421 | Loss: 0.355 | 913 ms/step , 6890.36 GFLOP/s , 17951.8 tokens/s INFO:__main__:2024-11-05 16:29:44 | Epoch: 1 | Step: 168690 | Dataset: 0-137741 | Loss: 0.376 | 912 ms/step , 6895.44 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 16:29:53 | Epoch: 1 | Step: 168700 | Dataset: 0-138061 | Loss: 0.384 | 912 ms/step , 6895.04 GFLOP/s , 17952.9 tokens/s INFO:__main__:2024-11-05 16:29:55 | Validation | Step: 168700 | 
Val_loss: 0.718 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:30:04 | Epoch: 1 | Step: 168710 | Dataset: 0-138381 | Loss: 0.328 | 912 ms/step , 6896.43 GFLOP/s , 15292.9 tokens/s INFO:__main__:2024-11-05 16:30:13 | Epoch: 1 | Step: 168720 | Dataset: 0-138701 | Loss: 0.303 | 912 ms/step , 6899.21 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 16:30:22 | Epoch: 1 | Step: 168730 | Dataset: 0-139021 | Loss: 0.381 | 912 ms/step , 6897.41 GFLOP/s , 17957.0 tokens/s INFO:__main__:2024-11-05 16:30:31 | Epoch: 1 | Step: 168740 | Dataset: 0-139341 | Loss: 0.281 | 913 ms/step , 6885.80 GFLOP/s , 17949.4 tokens/s INFO:__main__:2024-11-05 16:30:40 | Epoch: 1 | Step: 168750 | Dataset: 0-139661 | Loss: 0.342 | 912 ms/step , 6894.20 GFLOP/s , 17952.7 tokens/s INFO:__main__:2024-11-05 16:30:49 | Epoch: 1 | Step: 168760 | Dataset: 0-139981 | Loss: 0.284 | 913 ms/step , 6889.85 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 16:30:59 | Epoch: 1 | Step: 168770 | Dataset: 0-140301 | Loss: 0.343 | 911 ms/step , 6901.71 GFLOP/s , 17960.2 tokens/s INFO:__main__:2024-11-05 16:31:08 | Epoch: 1 | Step: 168780 | Dataset: 0-140621 | Loss: 0.330 | 911 ms/step , 6902.27 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-05 16:31:17 | Epoch: 1 | Step: 168790 | Dataset: 0-140941 | Loss: 0.292 | 912 ms/step , 6897.63 GFLOP/s , 17951.2 tokens/s INFO:__main__:2024-11-05 16:31:26 | Epoch: 1 | Step: 168800 | Dataset: 0-141261 | Loss: 0.326 | 913 ms/step , 6891.89 GFLOP/s , 17954.6 tokens/s INFO:__main__:2024-11-05 16:31:28 | Validation | Step: 168800 | Val_loss: 0.825 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:31:37 | Epoch: 1 | Step: 168810 | Dataset: 0-141581 | Loss: 0.310 | 911 ms/step , 6901.82 GFLOP/s , 15285.6 tokens/s INFO:__main__:2024-11-05 16:31:46 | Epoch: 1 | Step: 168820 | Dataset: 0-141901 | Loss: 0.343 | 911 ms/step , 6900.77 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-05 16:31:55 | Epoch: 1 | Step: 168830 | Dataset: 0-142221 | Loss: 0.301 | 913 ms/step , 
6891.90 GFLOP/s , 17953.8 tokens/s INFO:__main__:2024-11-05 16:32:04 | Epoch: 1 | Step: 168840 | Dataset: 0-142541 | Loss: 0.298 | 913 ms/step , 6890.99 GFLOP/s , 17949.4 tokens/s INFO:__main__:2024-11-05 16:32:13 | Epoch: 1 | Step: 168850 | Dataset: 0-142861 | Loss: 0.337 | 913 ms/step , 6887.09 GFLOP/s , 17952.3 tokens/s INFO:__main__:2024-11-05 16:32:22 | Epoch: 1 | Step: 168860 | Dataset: 0-143181 | Loss: 0.304 | 912 ms/step , 6893.01 GFLOP/s , 17962.0 tokens/s INFO:__main__:2024-11-05 16:32:31 | Epoch: 1 | Step: 168870 | Dataset: 0-143501 | Loss: 0.313 | 912 ms/step , 6895.19 GFLOP/s , 17955.8 tokens/s INFO:__main__:2024-11-05 16:32:41 | Epoch: 1 | Step: 168880 | Dataset: 0-143821 | Loss: 0.336 | 913 ms/step , 6892.25 GFLOP/s , 17953.8 tokens/s INFO:__main__:2024-11-05 16:32:50 | Epoch: 1 | Step: 168890 | Dataset: 0-144141 | Loss: 0.295 | 912 ms/step , 6899.81 GFLOP/s , 17956.4 tokens/s INFO:__main__:2024-11-05 16:32:59 | Epoch: 1 | Step: 168900 | Dataset: 0-144461 | Loss: 0.347 | 912 ms/step , 6892.82 GFLOP/s , 17961.2 tokens/s INFO:__main__:2024-11-05 16:33:00 | Validation | Step: 168900 | Val_loss: 0.810 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:33:09 | Epoch: 1 | Step: 168910 | Dataset: 0-144781 | Loss: 0.316 | 913 ms/step , 6892.28 GFLOP/s , 15291.6 tokens/s INFO:__main__:2024-11-05 16:33:19 | Epoch: 1 | Step: 168920 | Dataset: 0-145101 | Loss: 0.298 | 911 ms/step , 6902.55 GFLOP/s , 17962.9 tokens/s INFO:__main__:2024-11-05 16:33:28 | Epoch: 1 | Step: 168930 | Dataset: 0-145421 | Loss: 0.292 | 912 ms/step , 6899.36 GFLOP/s , 17954.5 tokens/s INFO:__main__:2024-11-05 16:33:37 | Epoch: 1 | Step: 168940 | Dataset: 0-145741 | Loss: 0.308 | 912 ms/step , 6897.07 GFLOP/s , 17954.0 tokens/s INFO:__main__:2024-11-05 16:33:46 | Epoch: 1 | Step: 168950 | Dataset: 0-146061 | Loss: 0.376 | 912 ms/step , 6893.50 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 16:33:55 | Epoch: 1 | Step: 168960 | Dataset: 0-146381 | Loss: 0.299 | 912 ms/step , 6893.75 
GFLOP/s , 17951.2 tokens/s INFO:__main__:2024-11-05 16:34:04 | Epoch: 1 | Step: 168970 | Dataset: 0-146701 | Loss: 0.326 | 912 ms/step , 6896.50 GFLOP/s , 17956.8 tokens/s INFO:__main__:2024-11-05 16:34:13 | Epoch: 1 | Step: 168980 | Dataset: 0-147021 | Loss: 0.317 | 913 ms/step , 6886.64 GFLOP/s , 17958.6 tokens/s INFO:__main__:2024-11-05 16:34:22 | Epoch: 1 | Step: 168990 | Dataset: 0-147341 | Loss: 0.316 | 913 ms/step , 6888.96 GFLOP/s , 17957.8 tokens/s INFO:__main__:2024-11-05 16:34:32 | Epoch: 1 | Step: 169000 | Dataset: 0-147661 | Loss: 0.326 | 911 ms/step , 6903.21 GFLOP/s , 17955.2 tokens/s INFO:__main__:2024-11-05 16:34:33 | Validation | Step: 169000 | Val_loss: 0.763 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:34:33 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_163433_step_169000.pt` INFO:__main__:2024-11-05 16:34:43 | Epoch: 1 | Step: 169010 | Dataset: 0-147981 | Loss: 0.344 | 913 ms/step , 6887.68 GFLOP/s , 13829.7 tokens/s INFO:__main__:2024-11-05 16:34:53 | Epoch: 1 | Step: 169020 | Dataset: 0-148301 | Loss: 0.284 | 912 ms/step , 6895.63 GFLOP/s , 17959.4 tokens/s INFO:__main__:2024-11-05 16:35:02 | Epoch: 1 | Step: 169030 | Dataset: 0-148621 | Loss: 0.309 | 912 ms/step , 6897.13 GFLOP/s , 17961.4 tokens/s INFO:__main__:2024-11-05 16:35:11 | Epoch: 1 | Step: 169040 | Dataset: 0-148941 | Loss: 0.301 | 911 ms/step , 6904.71 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-05 16:35:20 | Epoch: 1 | Step: 169050 | Dataset: 0-149261 | Loss: 0.338 | 912 ms/step , 6897.15 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-05 16:35:29 | Epoch: 1 | Step: 169060 | Dataset: 0-149581 | Loss: 0.319 | 912 ms/step , 6900.01 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-05 16:35:38 | Epoch: 1 | Step: 169070 | Dataset: 0-149901 | Loss: 0.298 | 911 ms/step , 6901.73 GFLOP/s , 17957.9 tokens/s INFO:__main__:2024-11-05 16:35:47 | Epoch: 1 | Step: 169080 | Dataset: 0-150221 | Loss: 0.305 | 912 ms/step , 6898.10 GFLOP/s , 
17953.6 tokens/s INFO:__main__:2024-11-05 16:35:56 | Epoch: 1 | Step: 169090 | Dataset: 0-150541 | Loss: 0.311 | 913 ms/step , 6885.74 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-05 16:36:06 | Epoch: 1 | Step: 169100 | Dataset: 0-150861 | Loss: 0.302 | 913 ms/step , 6891.80 GFLOP/s , 17960.2 tokens/s INFO:__main__:2024-11-05 16:36:07 | Validation | Step: 169100 | Val_loss: 0.869 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:36:16 | Epoch: 1 | Step: 169110 | Dataset: 0-151181 | Loss: 0.313 | 912 ms/step , 6900.13 GFLOP/s , 15292.1 tokens/s INFO:__main__:2024-11-05 16:36:25 | Epoch: 1 | Step: 169120 | Dataset: 0-151501 | Loss: 0.323 | 912 ms/step , 6899.77 GFLOP/s , 17960.7 tokens/s INFO:__main__:2024-11-05 16:36:35 | Epoch: 1 | Step: 169130 | Dataset: 0-151821 | Loss: 0.267 | 913 ms/step , 6888.21 GFLOP/s , 17958.3 tokens/s INFO:__main__:2024-11-05 16:36:44 | Epoch: 1 | Step: 169140 | Dataset: 0-152141 | Loss: 0.288 | 911 ms/step , 6904.28 GFLOP/s , 17962.2 tokens/s INFO:__main__:2024-11-05 16:36:53 | Epoch: 1 | Step: 169150 | Dataset: 0-152461 | Loss: 0.310 | 912 ms/step , 6899.97 GFLOP/s , 17964.0 tokens/s INFO:__main__:2024-11-05 16:37:02 | Epoch: 1 | Step: 169160 | Dataset: 0-152781 | Loss: 0.324 | 911 ms/step , 6900.62 GFLOP/s , 17962.5 tokens/s INFO:__main__:2024-11-05 16:37:11 | Epoch: 1 | Step: 169170 | Dataset: 0-153101 | Loss: 0.268 | 912 ms/step , 6894.56 GFLOP/s , 17954.4 tokens/s INFO:__main__:2024-11-05 16:37:20 | Epoch: 1 | Step: 169180 | Dataset: 0-153421 | Loss: 0.327 | 911 ms/step , 6902.43 GFLOP/s , 17950.8 tokens/s INFO:__main__:2024-11-05 16:37:29 | Epoch: 1 | Step: 169190 | Dataset: 0-153741 | Loss: 0.256 | 912 ms/step , 6896.39 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-05 16:37:38 | Epoch: 1 | Step: 169200 | Dataset: 0-154061 | Loss: 0.315 | 913 ms/step , 6889.41 GFLOP/s , 17953.2 tokens/s INFO:__main__:2024-11-05 16:37:40 | Validation | Step: 169200 | Val_loss: 0.734 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:37:49 
| Epoch: 1 | Step: 169210 | Dataset: 0-154381 | Loss: 0.326 | 913 ms/step , 6890.08 GFLOP/s , 15290.5 tokens/s INFO:__main__:2024-11-05 16:37:58 | Epoch: 1 | Step: 169220 | Dataset: 0-154701 | Loss: 0.331 | 912 ms/step , 6898.68 GFLOP/s , 17955.5 tokens/s INFO:__main__:2024-11-05 16:38:07 | Epoch: 1 | Step: 169230 | Dataset: 0-155021 | Loss: 0.305 | 911 ms/step , 6901.45 GFLOP/s , 17950.7 tokens/s INFO:__main__:2024-11-05 16:38:16 | Epoch: 1 | Step: 169240 | Dataset: 0-155341 | Loss: 0.299 | 913 ms/step , 6891.40 GFLOP/s , 17956.4 tokens/s INFO:__main__:2024-11-05 16:38:26 | Epoch: 1 | Step: 169250 | Dataset: 0-155661 | Loss: 0.290 | 912 ms/step , 6898.72 GFLOP/s , 17955.2 tokens/s INFO:__main__:2024-11-05 16:38:35 | Epoch: 1 | Step: 169260 | Dataset: 0-155981 | Loss: 0.355 | 911 ms/step , 6900.83 GFLOP/s , 17954.6 tokens/s INFO:__main__:2024-11-05 16:38:44 | Epoch: 1 | Step: 169270 | Dataset: 0-156301 | Loss: 0.266 | 913 ms/step , 6891.77 GFLOP/s , 17953.4 tokens/s INFO:__main__:2024-11-05 16:38:53 | Epoch: 1 | Step: 169280 | Dataset: 0-156621 | Loss: 0.281 | 912 ms/step , 6899.05 GFLOP/s , 17959.2 tokens/s INFO:__main__:2024-11-05 16:39:02 | Epoch: 1 | Step: 169290 | Dataset: 0-156941 | Loss: 0.285 | 911 ms/step , 6905.45 GFLOP/s , 17958.0 tokens/s INFO:__main__:2024-11-05 16:39:11 | Epoch: 1 | Step: 169300 | Dataset: 0-157261 | Loss: 0.313 | 912 ms/step , 6893.46 GFLOP/s , 17952.6 tokens/s INFO:__main__:2024-11-05 16:39:13 | Validation | Step: 169300 | Val_loss: 0.813 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:39:22 | Epoch: 1 | Step: 169310 | Dataset: 0-157581 | Loss: 0.355 | 912 ms/step , 6899.24 GFLOP/s , 15290.1 tokens/s INFO:__main__:2024-11-05 16:39:31 | Epoch: 1 | Step: 169320 | Dataset: 0-157901 | Loss: 0.277 | 911 ms/step , 6903.95 GFLOP/s , 17957.6 tokens/s INFO:__main__:2024-11-05 16:39:40 | Epoch: 1 | Step: 169330 | Dataset: 0-158221 | Loss: 0.315 | 912 ms/step , 6897.56 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 16:39:49 | Epoch: 1 
| Step: 169340 | Dataset: 0-158541 | Loss: 0.269 | 912 ms/step , 6894.66 GFLOP/s , 17957.1 tokens/s INFO:__main__:2024-11-05 16:39:58 | Epoch: 1 | Step: 169350 | Dataset: 0-158861 | Loss: 0.278 | 911 ms/step , 6903.61 GFLOP/s , 17955.6 tokens/s INFO:__main__:2024-11-05 16:40:08 | Epoch: 1 | Step: 169360 | Dataset: 0-159181 | Loss: 0.272 | 913 ms/step , 6888.58 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-05 16:40:17 | Epoch: 1 | Step: 169370 | Dataset: 0-159501 | Loss: 0.262 | 913 ms/step , 6892.17 GFLOP/s , 17954.9 tokens/s INFO:__main__:2024-11-05 16:40:26 | Epoch: 1 | Step: 169380 | Dataset: 0-159821 | Loss: 0.286 | 911 ms/step , 6901.04 GFLOP/s , 17960.1 tokens/s INFO:__main__:2024-11-05 16:40:35 | Epoch: 1 | Step: 169390 | Dataset: 0-160141 | Loss: 0.298 | 912 ms/step , 6895.12 GFLOP/s , 17950.5 tokens/s INFO:__main__:2024-11-05 16:40:44 | Epoch: 1 | Step: 169400 | Dataset: 0-160461 | Loss: 0.305 | 912 ms/step , 6898.49 GFLOP/s , 17955.1 tokens/s INFO:__main__:2024-11-05 16:40:46 | Validation | Step: 169400 | Val_loss: 0.849 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:40:55 | Epoch: 1 | Step: 169410 | Dataset: 0-160781 | Loss: 0.288 | 911 ms/step , 6903.93 GFLOP/s , 15289.3 tokens/s INFO:__main__:2024-11-05 16:41:04 | Epoch: 1 | Step: 169420 | Dataset: 0-161101 | Loss: 0.311 | 912 ms/step , 6894.18 GFLOP/s , 17954.6 tokens/s INFO:__main__:2024-11-05 16:41:13 | Epoch: 1 | Step: 169430 | Dataset: 0-161421 | Loss: 0.318 | 913 ms/step , 6892.20 GFLOP/s , 17961.5 tokens/s INFO:__main__:2024-11-05 16:41:22 | Epoch: 1 | Step: 169440 | Dataset: 0-161741 | Loss: 0.337 | 911 ms/step , 6903.71 GFLOP/s , 17955.9 tokens/s INFO:__main__:2024-11-05 16:41:31 | Epoch: 1 | Step: 169450 | Dataset: 0-162061 | Loss: 0.318 | 912 ms/step , 6897.81 GFLOP/s , 17958.3 tokens/s INFO:__main__:2024-11-05 16:41:40 | Epoch: 1 | Step: 169460 | Dataset: 0-162381 | Loss: 0.313 | 912 ms/step , 6894.61 GFLOP/s , 17951.4 tokens/s INFO:__main__:2024-11-05 16:41:50 | Epoch: 1 | Step: 
169470 | Dataset: 0-162701 | Loss: 0.291 | 913 ms/step , 6890.46 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-05 16:41:59 | Epoch: 1 | Step: 169480 | Dataset: 0-163021 | Loss: 0.294 | 912 ms/step , 6895.31 GFLOP/s , 17954.1 tokens/s INFO:__main__:2024-11-05 16:42:08 | Epoch: 1 | Step: 169490 | Dataset: 0-163341 | Loss: 0.274 | 912 ms/step , 6896.01 GFLOP/s , 17957.6 tokens/s INFO:__main__:2024-11-05 16:42:17 | Epoch: 1 | Step: 169500 | Dataset: 0-163661 | Loss: 0.274 | 912 ms/step , 6896.56 GFLOP/s , 17956.2 tokens/s INFO:__main__:2024-11-05 16:42:18 | Validation | Step: 169500 | Val_loss: 0.825 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:42:28 | Epoch: 1 | Step: 169510 | Dataset: 0-163981 | Loss: 0.323 | 911 ms/step , 6902.34 GFLOP/s , 15287.8 tokens/s INFO:__main__:2024-11-05 16:42:37 | Epoch: 1 | Step: 169520 | Dataset: 0-164301 | Loss: 0.297 | 912 ms/step , 6898.89 GFLOP/s , 17960.9 tokens/s INFO:__main__:2024-11-05 16:42:46 | Epoch: 1 | Step: 169530 | Dataset: 0-164621 | Loss: 0.318 | 913 ms/step , 6891.26 GFLOP/s , 17962.7 tokens/s INFO:__main__:2024-11-05 16:42:55 | Epoch: 1 | Step: 169540 | Dataset: 0-164941 | Loss: 0.300 | 911 ms/step , 6906.51 GFLOP/s , 17965.0 tokens/s INFO:__main__:2024-11-05 16:43:04 | Epoch: 1 | Step: 169550 | Dataset: 0-165261 | Loss: 0.266 | 912 ms/step , 6899.21 GFLOP/s , 17964.2 tokens/s INFO:__main__:2024-11-05 16:43:13 | Epoch: 1 | Step: 169560 | Dataset: 0-165581 | Loss: 0.328 | 911 ms/step , 6907.42 GFLOP/s , 17957.7 tokens/s INFO:__main__:2024-11-05 16:43:22 | Epoch: 1 | Step: 169570 | Dataset: 0-165901 | Loss: 0.276 | 911 ms/step , 6904.30 GFLOP/s , 17965.0 tokens/s INFO:__main__:2024-11-05 16:43:31 | Epoch: 1 | Step: 169580 | Dataset: 0-166221 | Loss: 0.250 | 911 ms/step , 6901.61 GFLOP/s , 17961.2 tokens/s INFO:__main__:2024-11-05 16:43:41 | Epoch: 1 | Step: 169590 | Dataset: 0-166541 | Loss: 0.290 | 912 ms/step , 6895.50 GFLOP/s , 17957.3 tokens/s INFO:__main__:2024-11-05 16:43:50 | Epoch: 1 | Step: 169600 | 
Dataset: 0-166861 | Loss: 0.294 | 912 ms/step , 6893.51 GFLOP/s , 17961.0 tokens/s INFO:__main__:2024-11-05 16:43:51 | Validation | Step: 169600 | Val_loss: 0.850 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:44:00 | Epoch: 1 | Step: 169610 | Dataset: 0-167181 | Loss: 0.279 | 912 ms/step , 6895.69 GFLOP/s , 15287.4 tokens/s INFO:__main__:2024-11-05 16:44:10 | Epoch: 1 | Step: 169620 | Dataset: 0-167501 | Loss: 0.283 | 913 ms/step , 6888.88 GFLOP/s , 17953.5 tokens/s INFO:__main__:2024-11-05 16:44:19 | Epoch: 1 | Step: 169630 | Dataset: 0-167821 | Loss: 0.318 | 912 ms/step , 6898.22 GFLOP/s , 17955.5 tokens/s INFO:__main__:2024-11-05 16:44:28 | Epoch: 1 | Step: 169640 | Dataset: 0-168141 | Loss: 0.367 | 912 ms/step , 6894.93 GFLOP/s , 17955.5 tokens/s INFO:__main__:2024-11-05 16:44:37 | Epoch: 1 | Step: 169650 | Dataset: 0-168461 | Loss: 0.327 | 914 ms/step , 6884.29 GFLOP/s , 17956.6 tokens/s INFO:__main__:2024-11-05 16:44:46 | Epoch: 1 | Step: 169660 | Dataset: 0-168781 | Loss: 0.311 | 912 ms/step , 6899.24 GFLOP/s , 17958.4 tokens/s INFO:__main__:2024-11-05 16:44:55 | Epoch: 1 | Step: 169670 | Dataset: 0-169101 | Loss: 0.274 | 912 ms/step , 6896.25 GFLOP/s , 17957.3 tokens/s INFO:__main__:2024-11-05 16:45:04 | Epoch: 1 | Step: 169680 | Dataset: 0-169421 | Loss: 0.269 | 911 ms/step , 6903.27 GFLOP/s , 17958.6 tokens/s INFO:__main__:2024-11-05 16:45:13 | Epoch: 1 | Step: 169690 | Dataset: 0-169741 | Loss: 0.261 | 912 ms/step , 6898.69 GFLOP/s , 17963.3 tokens/s INFO:__main__:2024-11-05 16:45:23 | Epoch: 1 | Step: 169700 | Dataset: 0-170061 | Loss: 0.281 | 913 ms/step , 6888.38 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-05 16:45:24 | Validation | Step: 169700 | Val_loss: 0.875 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:45:33 | Epoch: 1 | Step: 169710 | Dataset: 0-170381 | Loss: 0.308 | 911 ms/step , 6900.18 GFLOP/s , 15303.0 tokens/s INFO:__main__:2024-11-05 16:45:42 | Epoch: 1 | Step: 169720 | Dataset: 0-170701 | Loss: 0.273 | 911 ms/step , 
6904.74 GFLOP/s , 17960.2 tokens/s INFO:__main__:2024-11-05 16:45:52 | Epoch: 1 | Step: 169730 | Dataset: 0-171021 | Loss: 0.704 | 914 ms/step , 6878.78 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 16:46:01 | Epoch: 1 | Step: 169740 | Dataset: 0-171341 | Loss: 0.824 | 914 ms/step , 6883.19 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 16:46:10 | Epoch: 1 | Step: 169750 | Dataset: 0-171661 | Loss: 0.592 | 912 ms/step , 6894.31 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 16:46:19 | Epoch: 1 | Step: 169760 | Dataset: 0-171981 | Loss: 0.753 | 914 ms/step , 6883.53 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 16:46:28 | Epoch: 1 | Step: 169770 | Dataset: 0-172301 | Loss: 0.594 | 912 ms/step , 6896.18 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 16:46:37 | Epoch: 1 | Step: 169780 | Dataset: 0-172621 | Loss: 0.753 | 914 ms/step , 6879.58 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 16:46:46 | Epoch: 1 | Step: 169790 | Dataset: 0-172941 | Loss: 0.742 | 913 ms/step , 6885.97 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 16:46:55 | Epoch: 1 | Step: 169800 | Dataset: 0-173261 | Loss: 0.599 | 912 ms/step , 6893.12 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 16:46:57 | Validation | Step: 169800 | Val_loss: 0.674 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:47:06 | Epoch: 1 | Step: 169810 | Dataset: 0-173581 | Loss: 0.791 | 914 ms/step , 6882.68 GFLOP/s , 15267.5 tokens/s INFO:__main__:2024-11-05 16:47:15 | Epoch: 1 | Step: 169820 | Dataset: 0-173901 | Loss: 0.722 | 915 ms/step , 6875.81 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 16:47:24 | Epoch: 1 | Step: 169830 | Dataset: 0-174221 | Loss: 0.764 | 913 ms/step , 6886.47 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 16:47:34 | Epoch: 1 | Step: 169840 | Dataset: 0-174541 | Loss: 0.816 | 914 ms/step , 6882.71 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 16:47:43 | Epoch: 1 | Step: 169850 | Dataset: 0-174861 | Loss: 0.797 | 914 ms/step , 6883.70 
GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 16:47:52 | Epoch: 1 | Step: 169860 | Dataset: 0-175181 | Loss: 0.813 | 914 ms/step , 6880.87 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 16:48:01 | Epoch: 1 | Step: 169870 | Dataset: 0-175501 | Loss: 0.749 | 913 ms/step , 6886.63 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 16:48:10 | Epoch: 1 | Step: 169880 | Dataset: 0-175821 | Loss: 0.860 | 915 ms/step , 6875.18 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 16:48:19 | Epoch: 1 | Step: 169890 | Dataset: 0-176141 | Loss: 0.787 | 914 ms/step , 6882.68 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 16:48:28 | Epoch: 1 | Step: 169900 | Dataset: 0-176461 | Loss: 0.673 | 913 ms/step , 6887.78 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 16:48:30 | Validation | Step: 169900 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:48:39 | Epoch: 1 | Step: 169910 | Dataset: 0-176781 | Loss: 0.779 | 913 ms/step , 6887.03 GFLOP/s , 15281.1 tokens/s INFO:__main__:2024-11-05 16:48:48 | Epoch: 1 | Step: 169920 | Dataset: 0-177101 | Loss: 0.726 | 913 ms/step , 6888.61 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 16:48:57 | Epoch: 1 | Step: 169930 | Dataset: 0-177421 | Loss: 0.800 | 913 ms/step , 6889.73 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 16:49:07 | Epoch: 1 | Step: 169940 | Dataset: 0-177741 | Loss: 0.650 | 914 ms/step , 6881.04 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 16:49:16 | Epoch: 1 | Step: 169950 | Dataset: 0-178061 | Loss: 0.806 | 913 ms/step , 6886.83 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 16:49:25 | Epoch: 1 | Step: 169960 | Dataset: 0-178381 | Loss: 0.537 | 912 ms/step , 6893.60 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 16:49:34 | Epoch: 1 | Step: 169970 | Dataset: 0-178701 | Loss: 0.688 | 912 ms/step , 6892.87 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 16:49:43 | Epoch: 1 | Step: 169980 | Dataset: 0-179021 | Loss: 0.802 | 913 ms/step , 6887.14 GFLOP/s , 
17931.2 tokens/s INFO:__main__:2024-11-05 16:49:52 | Epoch: 1 | Step: 169990 | Dataset: 0-179341 | Loss: 0.765 | 913 ms/step , 6886.36 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 16:50:01 | Epoch: 1 | Step: 170000 | Dataset: 0-179661 | Loss: 0.699 | 913 ms/step , 6887.08 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 16:50:03 | Validation | Step: 170000 | Val_loss: 0.724 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:50:03 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_165003_step_170000.pt` INFO:__main__:2024-11-05 16:50:13 | Epoch: 1 | Step: 170010 | Dataset: 0-179981 | Loss: 0.557 | 912 ms/step , 6893.01 GFLOP/s , 13798.9 tokens/s INFO:__main__:2024-11-05 16:50:22 | Epoch: 1 | Step: 170020 | Dataset: 0-180301 | Loss: 0.762 | 914 ms/step , 6883.42 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 16:50:32 | Epoch: 1 | Step: 170030 | Dataset: 0-180621 | Loss: 0.851 | 912 ms/step , 6896.47 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 16:50:41 | Epoch: 1 | Step: 170040 | Dataset: 0-180941 | Loss: 0.739 | 913 ms/step , 6891.51 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-05 16:50:50 | Epoch: 1 | Step: 170050 | Dataset: 0-181261 | Loss: 0.682 | 915 ms/step , 6874.10 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 16:50:59 | Epoch: 1 | Step: 170060 | Dataset: 0-181581 | Loss: 0.676 | 912 ms/step , 6893.06 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 16:51:08 | Epoch: 1 | Step: 170070 | Dataset: 0-181901 | Loss: 0.772 | 914 ms/step , 6879.46 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 16:51:17 | Epoch: 1 | Step: 170080 | Dataset: 0-182221 | Loss: 0.714 | 915 ms/step , 6877.31 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 16:51:26 | Epoch: 1 | Step: 170090 | Dataset: 0-182541 | Loss: 0.802 | 915 ms/step , 6876.46 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 16:51:36 | Epoch: 1 | Step: 170100 | Dataset: 0-182861 | Loss: 0.770 | 914 ms/step , 6881.18 GFLOP/s , 17939.9 
tokens/s INFO:__main__:2024-11-05 16:51:37 | Validation | Step: 170100 | Val_loss: 0.770 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:51:46 | Epoch: 1 | Step: 170110 | Dataset: 0-183181 | Loss: 0.756 | 913 ms/step , 6891.31 GFLOP/s , 15287.4 tokens/s INFO:__main__:2024-11-05 16:51:55 | Epoch: 1 | Step: 170120 | Dataset: 0-183501 | Loss: 0.665 | 913 ms/step , 6891.64 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 16:52:05 | Epoch: 1 | Step: 170130 | Dataset: 0-183821 | Loss: 0.753 | 913 ms/step , 6887.30 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 16:52:14 | Epoch: 1 | Step: 170140 | Dataset: 0-184141 | Loss: 0.744 | 912 ms/step , 6896.16 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 16:52:23 | Epoch: 1 | Step: 170150 | Dataset: 0-184461 | Loss: 0.625 | 913 ms/step , 6892.41 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 16:52:32 | Epoch: 1 | Step: 170160 | Dataset: 0-184781 | Loss: 0.763 | 913 ms/step , 6885.68 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 16:52:41 | Epoch: 1 | Step: 170170 | Dataset: 0-185101 | Loss: 0.798 | 914 ms/step , 6884.23 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 16:52:50 | Epoch: 1 | Step: 170180 | Dataset: 0-185421 | Loss: 0.668 | 913 ms/step , 6889.90 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 16:52:59 | Epoch: 1 | Step: 170190 | Dataset: 0-185741 | Loss: 0.817 | 914 ms/step , 6882.25 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 16:53:09 | Epoch: 1 | Step: 170200 | Dataset: 0-186061 | Loss: 0.764 | 913 ms/step , 6886.05 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 16:53:10 | Validation | Step: 170200 | Val_loss: 0.731 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:53:19 | Epoch: 1 | Step: 170210 | Dataset: 0-186381 | Loss: 0.673 | 914 ms/step , 6883.26 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-05 16:53:28 | Epoch: 1 | Step: 170220 | Dataset: 0-186701 | Loss: 0.794 | 914 ms/step , 6878.82 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 16:53:38 | Epoch: 
1 | Step: 170230 | Dataset: 0-187021 | Loss: 0.781 | 913 ms/step , 6886.36 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 16:53:47 | Epoch: 1 | Step: 170240 | Dataset: 0-187341 | Loss: 0.673 | 912 ms/step , 6892.68 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 16:53:56 | Epoch: 1 | Step: 170250 | Dataset: 0-187661 | Loss: 0.766 | 913 ms/step , 6888.99 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 16:54:05 | Epoch: 1 | Step: 170260 | Dataset: 0-187981 | Loss: 0.811 | 913 ms/step , 6885.95 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 16:54:14 | Epoch: 1 | Step: 170270 | Dataset: 0-188301 | Loss: 0.663 | 913 ms/step , 6885.09 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 16:54:23 | Epoch: 1 | Step: 170280 | Dataset: 0-188621 | Loss: 0.774 | 914 ms/step , 6882.36 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 16:54:32 | Epoch: 1 | Step: 170290 | Dataset: 0-188941 | Loss: 0.766 | 914 ms/step , 6884.41 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 16:54:41 | Epoch: 1 | Step: 170300 | Dataset: 0-189261 | Loss: 0.641 | 912 ms/step , 6896.18 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 16:54:43 | Validation | Step: 170300 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:54:52 | Epoch: 1 | Step: 170310 | Dataset: 0-189581 | Loss: 0.721 | 913 ms/step , 6888.68 GFLOP/s , 15280.2 tokens/s INFO:__main__:2024-11-05 16:55:01 | Epoch: 1 | Step: 170320 | Dataset: 0-189901 | Loss: 0.752 | 913 ms/step , 6885.32 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 16:55:10 | Epoch: 1 | Step: 170330 | Dataset: 0-190221 | Loss: 0.800 | 916 ms/step , 6862.82 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 16:55:20 | Epoch: 1 | Step: 170340 | Dataset: 0-190541 | Loss: 0.795 | 913 ms/step , 6886.87 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 16:55:29 | Epoch: 1 | Step: 170350 | Dataset: 0-190861 | Loss: 0.731 | 912 ms/step , 6894.25 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 16:55:38 | Epoch: 1 | Step: 
170360 | Dataset: 0-191181 | Loss: 0.806 | 913 ms/step , 6890.69 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 16:55:47 | Epoch: 1 | Step: 170370 | Dataset: 0-191501 | Loss: 0.747 | 913 ms/step , 6885.44 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 16:55:56 | Epoch: 1 | Step: 170380 | Dataset: 0-191821 | Loss: 0.863 | 913 ms/step , 6887.85 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 16:56:05 | Epoch: 1 | Step: 170390 | Dataset: 0-192141 | Loss: 0.744 | 915 ms/step , 6871.53 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 16:56:14 | Epoch: 1 | Step: 170400 | Dataset: 0-192461 | Loss: 0.780 | 914 ms/step , 6881.83 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 16:56:16 | Validation | Step: 170400 | Val_loss: 0.746 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:56:25 | Epoch: 1 | Step: 170410 | Dataset: 0-192781 | Loss: 0.889 | 916 ms/step , 6867.01 GFLOP/s , 15268.3 tokens/s INFO:__main__:2024-11-05 16:56:34 | Epoch: 1 | Step: 170420 | Dataset: 0-193101 | Loss: 0.704 | 914 ms/step , 6884.49 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 16:56:43 | Epoch: 1 | Step: 170430 | Dataset: 0-193421 | Loss: 0.759 | 914 ms/step , 6884.34 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 16:56:53 | Epoch: 1 | Step: 170440 | Dataset: 0-193741 | Loss: 0.776 | 912 ms/step , 6893.98 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 16:57:02 | Epoch: 1 | Step: 170450 | Dataset: 0-194061 | Loss: 0.807 | 913 ms/step , 6886.12 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 16:57:11 | Epoch: 1 | Step: 170460 | Dataset: 0-194381 | Loss: 0.677 | 913 ms/step , 6885.21 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 16:57:20 | Epoch: 1 | Step: 170470 | Dataset: 0-194701 | Loss: 0.812 | 914 ms/step , 6884.36 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 16:57:29 | Epoch: 1 | Step: 170480 | Dataset: 0-195021 | Loss: 0.713 | 914 ms/step , 6877.58 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 16:57:38 | Epoch: 1 | Step: 170490 | 
Dataset: 0-195341 | Loss: 0.816 | 914 ms/step , 6880.78 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 16:57:47 | Epoch: 1 | Step: 170500 | Dataset: 0-195661 | Loss: 0.725 | 914 ms/step , 6882.33 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 16:57:49 | Validation | Step: 170500 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:57:58 | Epoch: 1 | Step: 170510 | Dataset: 0-195981 | Loss: 0.786 | 913 ms/step , 6887.90 GFLOP/s , 15275.3 tokens/s INFO:__main__:2024-11-05 16:58:07 | Epoch: 1 | Step: 170520 | Dataset: 0-196301 | Loss: 0.483 | 913 ms/step , 6887.44 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 16:58:16 | Epoch: 1 | Step: 170530 | Dataset: 0-196621 | Loss: 0.624 | 913 ms/step , 6891.77 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 16:58:26 | Epoch: 1 | Step: 170540 | Dataset: 0-196941 | Loss: 0.797 | 913 ms/step , 6886.47 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 16:58:35 | Epoch: 1 | Step: 170550 | Dataset: 0-197261 | Loss: 0.762 | 915 ms/step , 6874.35 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 16:58:44 | Epoch: 1 | Step: 170560 | Dataset: 0-197581 | Loss: 0.713 | 914 ms/step , 6881.81 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 16:58:53 | Epoch: 1 | Step: 170570 | Dataset: 0-197901 | Loss: 0.703 | 912 ms/step , 6893.95 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 16:59:02 | Epoch: 1 | Step: 170580 | Dataset: 0-198221 | Loss: 0.813 | 913 ms/step , 6885.95 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 16:59:11 | Epoch: 1 | Step: 170590 | Dataset: 0-198541 | Loss: 0.742 | 914 ms/step , 6879.77 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 16:59:20 | Epoch: 1 | Step: 170600 | Dataset: 0-198861 | Loss: 0.784 | 913 ms/step , 6890.63 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 16:59:22 | Validation | Step: 170600 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 16:59:31 | Epoch: 1 | Step: 170610 | Dataset: 0-199181 | Loss: 0.778 | 915 ms/step , 
6876.65 GFLOP/s , 15270.6 tokens/s INFO:__main__:2024-11-05 16:59:40 | Epoch: 1 | Step: 170620 | Dataset: 0-199501 | Loss: 0.766 | 915 ms/step , 6877.19 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 16:59:49 | Epoch: 1 | Step: 170630 | Dataset: 0-199821 | Loss: 0.887 | 914 ms/step , 6879.77 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 16:59:59 | Epoch: 1 | Step: 170640 | Dataset: 0-200141 | Loss: 0.810 | 913 ms/step , 6887.63 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 17:00:08 | Epoch: 1 | Step: 170650 | Dataset: 0-200461 | Loss: 0.693 | 914 ms/step , 6881.38 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 17:00:17 | Epoch: 1 | Step: 170660 | Dataset: 0-200781 | Loss: 0.694 | 913 ms/step , 6892.54 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 17:00:26 | Epoch: 1 | Step: 170670 | Dataset: 0-201101 | Loss: 0.670 | 914 ms/step , 6882.08 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 17:00:35 | Epoch: 1 | Step: 170680 | Dataset: 0-201421 | Loss: 0.736 | 913 ms/step , 6891.32 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 17:00:44 | Epoch: 1 | Step: 170690 | Dataset: 0-201741 | Loss: 0.872 | 914 ms/step , 6878.21 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 17:00:53 | Epoch: 1 | Step: 170700 | Dataset: 0-202061 | Loss: 0.668 | 914 ms/step , 6882.85 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 17:00:55 | Validation | Step: 170700 | Val_loss: 0.725 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:01:04 | Epoch: 1 | Step: 170710 | Dataset: 0-202381 | Loss: 0.766 | 912 ms/step , 6898.04 GFLOP/s , 15265.4 tokens/s INFO:__main__:2024-11-05 17:01:13 | Epoch: 1 | Step: 170720 | Dataset: 0-202701 | Loss: 0.758 | 914 ms/step , 6877.66 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 17:01:22 | Epoch: 1 | Step: 170730 | Dataset: 0-203021 | Loss: 0.664 | 913 ms/step , 6892.04 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 17:01:32 | Epoch: 1 | Step: 170740 | Dataset: 0-203341 | Loss: 0.694 | 914 ms/step , 6884.93 
GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 17:01:41 | Epoch: 1 | Step: 170750 | Dataset: 0-203661 | Loss: 0.790 | 914 ms/step , 6880.64 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 17:01:50 | Epoch: 1 | Step: 170760 | Dataset: 0-203981 | Loss: 0.788 | 913 ms/step , 6885.37 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 17:01:59 | Epoch: 1 | Step: 170770 | Dataset: 0-204301 | Loss: 0.707 | 915 ms/step , 6876.70 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 17:02:08 | Epoch: 1 | Step: 170780 | Dataset: 0-204621 | Loss: 0.659 | 912 ms/step , 6892.97 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 17:02:17 | Epoch: 1 | Step: 170790 | Dataset: 0-204941 | Loss: 0.784 | 913 ms/step , 6886.47 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 17:02:26 | Epoch: 1 | Step: 170800 | Dataset: 0-205261 | Loss: 0.885 | 915 ms/step , 6872.34 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 17:02:28 | Validation | Step: 170800 | Val_loss: 0.765 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:02:37 | Epoch: 1 | Step: 170810 | Dataset: 0-205581 | Loss: 0.831 | 914 ms/step , 6884.31 GFLOP/s , 15258.3 tokens/s INFO:__main__:2024-11-05 17:02:46 | Epoch: 1 | Step: 170820 | Dataset: 0-205901 | Loss: 0.848 | 915 ms/step , 6873.50 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 17:02:55 | Epoch: 1 | Step: 170830 | Dataset: 0-206221 | Loss: 0.579 | 913 ms/step , 6887.11 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 17:03:05 | Epoch: 1 | Step: 170840 | Dataset: 0-206541 | Loss: 0.718 | 913 ms/step , 6889.11 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 17:03:14 | Epoch: 1 | Step: 170850 | Dataset: 0-206861 | Loss: 0.843 | 913 ms/step , 6886.80 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 17:03:23 | Epoch: 1 | Step: 170860 | Dataset: 0-207181 | Loss: 0.701 | 914 ms/step , 6882.59 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 17:03:32 | Epoch: 1 | Step: 170870 | Dataset: 0-207501 | Loss: 0.768 | 915 ms/step , 6871.80 GFLOP/s , 
17926.4 tokens/s INFO:__main__:2024-11-05 17:03:41 | Epoch: 1 | Step: 170880 | Dataset: 0-207821 | Loss: 0.745 | 913 ms/step , 6885.69 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 17:03:50 | Epoch: 1 | Step: 170890 | Dataset: 0-208141 | Loss: 0.783 | 913 ms/step , 6885.92 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 17:03:59 | Epoch: 1 | Step: 170900 | Dataset: 0-208461 | Loss: 0.586 | 913 ms/step , 6886.75 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 17:04:01 | Validation | Step: 170900 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:04:10 | Epoch: 1 | Step: 170910 | Dataset: 0-208781 | Loss: 0.824 | 915 ms/step , 6872.59 GFLOP/s , 15261.1 tokens/s INFO:__main__:2024-11-05 17:04:19 | Epoch: 1 | Step: 170920 | Dataset: 0-209101 | Loss: 0.736 | 912 ms/step , 6896.06 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 17:04:28 | Epoch: 1 | Step: 170930 | Dataset: 0-209421 | Loss: 0.678 | 913 ms/step , 6888.43 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 17:04:38 | Epoch: 1 | Step: 170940 | Dataset: 0-209741 | Loss: 0.824 | 914 ms/step , 6883.84 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 17:04:47 | Epoch: 1 | Step: 170950 | Dataset: 0-210061 | Loss: 0.622 | 912 ms/step , 6895.38 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 17:04:56 | Epoch: 1 | Step: 170960 | Dataset: 0-210381 | Loss: 0.756 | 913 ms/step , 6891.87 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 17:05:05 | Epoch: 1 | Step: 170970 | Dataset: 0-210701 | Loss: 0.812 | 914 ms/step , 6883.97 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 17:05:14 | Epoch: 1 | Step: 170980 | Dataset: 0-211021 | Loss: 0.708 | 913 ms/step , 6886.81 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 17:05:23 | Epoch: 1 | Step: 170990 | Dataset: 0-211341 | Loss: 0.646 | 913 ms/step , 6890.58 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 17:05:32 | Epoch: 1 | Step: 171000 | Dataset: 0-211661 | Loss: 0.745 | 914 ms/step , 6883.87 GFLOP/s , 17919.1 
tokens/s INFO:__main__:2024-11-05 17:05:34 | Validation | Step: 171000 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:05:34 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_170534_step_171000.pt` INFO:__main__:2024-11-05 17:05:44 | Epoch: 1 | Step: 171010 | Dataset: 0-211981 | Loss: 0.873 | 914 ms/step , 6884.79 GFLOP/s , 13807.5 tokens/s INFO:__main__:2024-11-05 17:05:53 | Epoch: 1 | Step: 171020 | Dataset: 0-212301 | Loss: 0.652 | 913 ms/step , 6886.18 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 17:06:03 | Epoch: 1 | Step: 171030 | Dataset: 0-212621 | Loss: 0.685 | 913 ms/step , 6888.88 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 17:06:12 | Epoch: 1 | Step: 171040 | Dataset: 0-212941 | Loss: 0.782 | 913 ms/step , 6890.35 GFLOP/s , 17900.3 tokens/s INFO:__main__:2024-11-05 17:06:21 | Epoch: 1 | Step: 171050 | Dataset: 0-213261 | Loss: 0.661 | 913 ms/step , 6889.75 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 17:06:30 | Epoch: 1 | Step: 171060 | Dataset: 0-213581 | Loss: 0.683 | 913 ms/step , 6890.35 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 17:06:39 | Epoch: 1 | Step: 171070 | Dataset: 0-213901 | Loss: 0.801 | 915 ms/step , 6876.64 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 17:06:48 | Epoch: 1 | Step: 171080 | Dataset: 0-214221 | Loss: 0.794 | 913 ms/step , 6888.47 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 17:06:57 | Epoch: 1 | Step: 171090 | Dataset: 0-214541 | Loss: 0.687 | 913 ms/step , 6886.57 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 17:07:07 | Epoch: 1 | Step: 171100 | Dataset: 0-214861 | Loss: 0.787 | 915 ms/step , 6872.94 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 17:07:08 | Validation | Step: 171100 | Val_loss: 0.770 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:07:17 | Epoch: 1 | Step: 171110 | Dataset: 0-215181 | Loss: 0.741 | 913 ms/step , 6886.63 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-05 17:07:26 | Epoch: 
1 | Step: 171120 | Dataset: 0-215501 | Loss: 0.442 | 912 ms/step , 6898.76 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 17:07:36 | Epoch: 1 | Step: 171130 | Dataset: 0-215821 | Loss: 0.793 | 913 ms/step , 6887.30 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 17:07:45 | Epoch: 1 | Step: 171140 | Dataset: 0-216141 | Loss: 0.756 | 914 ms/step , 6882.07 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 17:07:54 | Epoch: 1 | Step: 171150 | Dataset: 0-216461 | Loss: 0.374 | 911 ms/step , 6904.25 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 17:08:03 | Epoch: 1 | Step: 171160 | Dataset: 0-216781 | Loss: 0.476 | 912 ms/step , 6897.64 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 17:08:12 | Epoch: 1 | Step: 171170 | Dataset: 0-217101 | Loss: 0.797 | 913 ms/step , 6891.50 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 17:08:21 | Epoch: 1 | Step: 171180 | Dataset: 0-217421 | Loss: 0.769 | 912 ms/step , 6894.30 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 17:08:30 | Epoch: 1 | Step: 171190 | Dataset: 0-217741 | Loss: 0.802 | 914 ms/step , 6884.77 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 17:08:39 | Epoch: 1 | Step: 171200 | Dataset: 0-218061 | Loss: 0.725 | 914 ms/step , 6882.18 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 17:08:41 | Validation | Step: 171200 | Val_loss: 0.767 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:08:50 | Epoch: 1 | Step: 171210 | Dataset: 0-218381 | Loss: 0.879 | 914 ms/step , 6884.18 GFLOP/s , 15267.3 tokens/s INFO:__main__:2024-11-05 17:08:59 | Epoch: 1 | Step: 171220 | Dataset: 0-218701 | Loss: 0.799 | 913 ms/step , 6885.99 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 17:09:08 | Epoch: 1 | Step: 171230 | Dataset: 0-219021 | Loss: 0.906 | 913 ms/step , 6890.73 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 17:09:18 | Epoch: 1 | Step: 171240 | Dataset: 0-219341 | Loss: 0.727 | 913 ms/step , 6891.33 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 17:09:27 | Epoch: 1 | Step: 
171250 | Dataset: 0-219661 | Loss: 0.767 | 913 ms/step , 6887.05 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 17:09:36 | Epoch: 1 | Step: 171260 | Dataset: 0-219981 | Loss: 0.847 | 914 ms/step , 6882.33 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 17:09:45 | Epoch: 1 | Step: 171270 | Dataset: 0-220301 | Loss: 0.781 | 913 ms/step , 6890.90 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 17:09:54 | Epoch: 1 | Step: 171280 | Dataset: 0-220621 | Loss: 0.573 | 912 ms/step , 6895.57 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 17:10:03 | Epoch: 1 | Step: 171290 | Dataset: 0-220941 | Loss: 0.824 | 914 ms/step , 6882.14 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 17:10:12 | Epoch: 1 | Step: 171300 | Dataset: 0-221261 | Loss: 0.764 | 913 ms/step , 6885.79 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 17:10:14 | Validation | Step: 171300 | Val_loss: 0.756 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:10:23 | Epoch: 1 | Step: 171310 | Dataset: 0-221581 | Loss: 0.683 | 912 ms/step , 6895.90 GFLOP/s , 15293.7 tokens/s INFO:__main__:2024-11-05 17:10:32 | Epoch: 1 | Step: 171320 | Dataset: 0-221901 | Loss: 0.765 | 913 ms/step , 6887.61 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 17:10:41 | Epoch: 1 | Step: 171330 | Dataset: 0-222221 | Loss: 0.750 | 914 ms/step , 6880.20 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 17:10:51 | Epoch: 1 | Step: 171340 | Dataset: 0-222541 | Loss: 0.786 | 914 ms/step , 6883.51 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 17:11:00 | Epoch: 1 | Step: 171350 | Dataset: 0-222861 | Loss: 0.796 | 913 ms/step , 6890.23 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 17:11:09 | Epoch: 1 | Step: 171360 | Dataset: 0-223181 | Loss: 0.832 | 914 ms/step , 6881.69 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 17:11:18 | Epoch: 1 | Step: 171370 | Dataset: 0-223501 | Loss: 0.766 | 914 ms/step , 6881.82 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 17:11:27 | Epoch: 1 | Step: 171380 | 
Dataset: 0-223821 | Loss: 0.789 | 913 ms/step , 6885.37 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 17:11:36 | Epoch: 1 | Step: 171390 | Dataset: 0-224141 | Loss: 0.754 | 915 ms/step , 6874.61 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 17:11:45 | Epoch: 1 | Step: 171400 | Dataset: 0-224461 | Loss: 0.752 | 913 ms/step , 6888.60 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 17:11:47 | Validation | Step: 171400 | Val_loss: 0.766 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:11:56 | Epoch: 1 | Step: 171410 | Dataset: 0-224781 | Loss: 0.774 | 913 ms/step , 6889.74 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-05 17:12:05 | Epoch: 1 | Step: 171420 | Dataset: 0-225101 | Loss: 0.808 | 913 ms/step , 6887.23 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 17:12:14 | Epoch: 1 | Step: 171430 | Dataset: 0-225421 | Loss: 0.727 | 913 ms/step , 6885.93 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 17:12:24 | Epoch: 1 | Step: 171440 | Dataset: 0-225741 | Loss: 0.626 | 913 ms/step , 6890.70 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 17:12:33 | Epoch: 1 | Step: 171450 | Dataset: 0-226061 | Loss: 0.765 | 913 ms/step , 6890.01 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 17:12:42 | Epoch: 1 | Step: 171460 | Dataset: 0-226381 | Loss: 0.760 | 913 ms/step , 6890.16 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 17:12:51 | Epoch: 1 | Step: 171470 | Dataset: 0-226701 | Loss: 0.741 | 914 ms/step , 6878.81 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 17:13:00 | Epoch: 1 | Step: 171480 | Dataset: 0-227021 | Loss: 0.725 | 914 ms/step , 6882.48 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 17:13:09 | Epoch: 1 | Step: 171490 | Dataset: 0-227341 | Loss: 0.761 | 913 ms/step , 6886.80 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 17:13:18 | Epoch: 1 | Step: 171500 | Dataset: 0-227661 | Loss: 0.729 | 913 ms/step , 6887.00 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 17:13:20 | Validation | Step: 171500 | 
Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:13:29 | Epoch: 1 | Step: 171510 | Dataset: 0-227981 | Loss: 0.844 | 913 ms/step , 6886.92 GFLOP/s , 15278.8 tokens/s INFO:__main__:2024-11-05 17:13:38 | Epoch: 1 | Step: 171520 | Dataset: 0-228301 | Loss: 0.664 | 912 ms/step , 6893.84 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 17:13:47 | Epoch: 1 | Step: 171530 | Dataset: 0-228621 | Loss: 0.875 | 914 ms/step , 6883.00 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 17:13:56 | Epoch: 1 | Step: 171540 | Dataset: 0-228941 | Loss: 0.824 | 914 ms/step , 6883.52 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 17:14:06 | Epoch: 1 | Step: 171550 | Dataset: 0-229261 | Loss: 0.785 | 914 ms/step , 6882.83 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 17:14:15 | Epoch: 1 | Step: 171560 | Dataset: 0-229581 | Loss: 0.815 | 914 ms/step , 6882.12 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 17:14:24 | Epoch: 1 | Step: 171570 | Dataset: 0-229901 | Loss: 0.811 | 914 ms/step , 6883.89 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 17:14:33 | Epoch: 1 | Step: 171580 | Dataset: 0-230221 | Loss: 0.698 | 912 ms/step , 6893.87 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 17:14:42 | Epoch: 1 | Step: 171590 | Dataset: 0-230541 | Loss: 0.780 | 914 ms/step , 6881.60 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 17:14:51 | Epoch: 1 | Step: 171600 | Dataset: 0-230861 | Loss: 0.830 | 914 ms/step , 6884.53 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 17:14:53 | Validation | Step: 171600 | Val_loss: 0.726 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:15:02 | Epoch: 1 | Step: 171610 | Dataset: 0-231181 | Loss: 0.798 | 913 ms/step , 6887.52 GFLOP/s , 15280.8 tokens/s INFO:__main__:2024-11-05 17:15:11 | Epoch: 1 | Step: 171620 | Dataset: 0-231501 | Loss: 0.853 | 914 ms/step , 6882.61 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 17:15:20 | Epoch: 1 | Step: 171630 | Dataset: 0-231821 | Loss: 0.766 | 912 ms/step , 
6896.62 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 17:15:29 | Epoch: 1 | Step: 171640 | Dataset: 0-232141 | Loss: 0.773 | 913 ms/step , 6885.52 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 17:15:39 | Epoch: 1 | Step: 171650 | Dataset: 0-232461 | Loss: 0.740 | 914 ms/step , 6883.27 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 17:15:48 | Epoch: 1 | Step: 171660 | Dataset: 0-232781 | Loss: 0.813 | 913 ms/step , 6890.54 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 17:15:57 | Epoch: 1 | Step: 171670 | Dataset: 0-233101 | Loss: 0.765 | 914 ms/step , 6883.34 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 17:16:06 | Epoch: 1 | Step: 171680 | Dataset: 0-233421 | Loss: 0.854 | 913 ms/step , 6885.85 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 17:16:15 | Epoch: 1 | Step: 171690 | Dataset: 0-233741 | Loss: 0.917 | 914 ms/step , 6884.37 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 17:16:24 | Epoch: 1 | Step: 171700 | Dataset: 0-234061 | Loss: 0.694 | 914 ms/step , 6884.68 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 17:16:26 | Validation | Step: 171700 | Val_loss: 0.771 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:16:35 | Epoch: 1 | Step: 171710 | Dataset: 0-234381 | Loss: 0.771 | 914 ms/step , 6883.69 GFLOP/s , 15268.0 tokens/s INFO:__main__:2024-11-05 17:16:44 | Epoch: 1 | Step: 171720 | Dataset: 0-234701 | Loss: 0.803 | 913 ms/step , 6888.08 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 17:16:53 | Epoch: 1 | Step: 171730 | Dataset: 0-235021 | Loss: 0.787 | 913 ms/step , 6890.87 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 17:17:02 | Epoch: 1 | Step: 171740 | Dataset: 0-235341 | Loss: 0.862 | 913 ms/step , 6891.78 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 17:17:12 | Epoch: 1 | Step: 171750 | Dataset: 0-235661 | Loss: 0.671 | 913 ms/step , 6890.71 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 17:17:21 | Epoch: 1 | Step: 171760 | Dataset: 0-235981 | Loss: 0.794 | 913 ms/step , 6885.93 
GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 17:17:30 | Epoch: 1 | Step: 171770 | Dataset: 0-236301 | Loss: 0.670 | 913 ms/step , 6889.64 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 17:17:39 | Epoch: 1 | Step: 171780 | Dataset: 0-236621 | Loss: 0.698 | 913 ms/step , 6888.80 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 17:17:48 | Epoch: 1 | Step: 171790 | Dataset: 0-236941 | Loss: 0.738 | 912 ms/step , 6896.31 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 17:17:57 | Epoch: 1 | Step: 171800 | Dataset: 0-237261 | Loss: 0.783 | 913 ms/step , 6889.96 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 17:17:59 | Validation | Step: 171800 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:18:08 | Epoch: 1 | Step: 171810 | Dataset: 0-237581 | Loss: 0.710 | 913 ms/step , 6891.98 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-05 17:18:17 | Epoch: 1 | Step: 171820 | Dataset: 0-237901 | Loss: 0.701 | 912 ms/step , 6897.74 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 17:18:26 | Epoch: 1 | Step: 171830 | Dataset: 0-238221 | Loss: 0.805 | 914 ms/step , 6883.70 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 17:18:35 | Epoch: 1 | Step: 171840 | Dataset: 0-238541 | Loss: 0.794 | 913 ms/step , 6885.46 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 17:18:44 | Epoch: 1 | Step: 171850 | Dataset: 0-238861 | Loss: 0.769 | 913 ms/step , 6887.21 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 17:18:54 | Epoch: 1 | Step: 171860 | Dataset: 0-239181 | Loss: 0.890 | 914 ms/step , 6881.42 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 17:19:03 | Epoch: 1 | Step: 171870 | Dataset: 0-239501 | Loss: 0.734 | 914 ms/step , 6884.40 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 17:19:12 | Epoch: 1 | Step: 171880 | Dataset: 0-239821 | Loss: 0.682 | 913 ms/step , 6887.48 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 17:19:21 | Epoch: 1 | Step: 171890 | Dataset: 0-240141 | Loss: 0.714 | 912 ms/step , 6894.02 GFLOP/s , 
17939.6 tokens/s INFO:__main__:2024-11-05 17:19:30 | Epoch: 1 | Step: 171900 | Dataset: 0-240461 | Loss: 0.684 | 913 ms/step , 6889.14 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 17:19:32 | Validation | Step: 171900 | Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:19:41 | Epoch: 1 | Step: 171910 | Dataset: 0-240781 | Loss: 0.615 | 913 ms/step , 6890.89 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-05 17:19:50 | Epoch: 1 | Step: 171920 | Dataset: 0-241101 | Loss: 0.743 | 912 ms/step , 6893.52 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 17:19:59 | Epoch: 1 | Step: 171930 | Dataset: 0-241421 | Loss: 0.738 | 913 ms/step , 6885.65 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 17:20:08 | Epoch: 1 | Step: 171940 | Dataset: 0-241741 | Loss: 0.601 | 913 ms/step , 6892.48 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 17:20:17 | Epoch: 1 | Step: 171950 | Dataset: 0-242061 | Loss: 0.753 | 914 ms/step , 6882.29 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 17:20:27 | Epoch: 1 | Step: 171960 | Dataset: 0-242381 | Loss: 0.759 | 912 ms/step , 6896.44 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 17:20:36 | Epoch: 1 | Step: 171970 | Dataset: 0-242701 | Loss: 0.741 | 914 ms/step , 6882.05 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 17:20:45 | Epoch: 1 | Step: 171980 | Dataset: 0-243021 | Loss: 0.799 | 913 ms/step , 6887.57 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 17:20:54 | Epoch: 1 | Step: 171990 | Dataset: 0-243341 | Loss: 0.749 | 913 ms/step , 6889.97 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 17:21:03 | Epoch: 1 | Step: 172000 | Dataset: 0-243661 | Loss: 0.725 | 913 ms/step , 6888.96 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 17:21:05 | Validation | Step: 172000 | Val_loss: 0.776 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:21:05 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_172105_step_172000.pt` INFO:__main__:2024-11-05 17:21:15 | 
Epoch: 1 | Step: 172010 | Dataset: 0-243981 | Loss: 0.518 | 913 ms/step , 6889.73 GFLOP/s , 13794.0 tokens/s INFO:__main__:2024-11-05 17:21:24 | Epoch: 1 | Step: 172020 | Dataset: 0-244301 | Loss: 0.271 | 912 ms/step , 6898.68 GFLOP/s , 17965.6 tokens/s INFO:__main__:2024-11-05 17:21:33 | Epoch: 1 | Step: 172030 | Dataset: 0-244621 | Loss: 0.279 | 911 ms/step , 6905.91 GFLOP/s , 17963.1 tokens/s INFO:__main__:2024-11-05 17:21:42 | Epoch: 1 | Step: 172040 | Dataset: 0-244941 | Loss: 0.270 | 912 ms/step , 6893.43 GFLOP/s , 17950.4 tokens/s INFO:__main__:2024-11-05 17:21:51 | Epoch: 1 | Step: 172050 | Dataset: 0-245261 | Loss: 0.325 | 912 ms/step , 6897.20 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 17:22:01 | Epoch: 1 | Step: 172060 | Dataset: 0-245581 | Loss: 0.339 | 912 ms/step , 6893.26 GFLOP/s , 17947.2 tokens/s INFO:__main__:2024-11-05 17:22:10 | Epoch: 1 | Step: 172070 | Dataset: 0-245901 | Loss: 0.277 | 913 ms/step , 6885.97 GFLOP/s , 17948.3 tokens/s INFO:__main__:2024-11-05 17:22:19 | Epoch: 1 | Step: 172080 | Dataset: 0-246221 | Loss: 0.307 | 912 ms/step , 6898.61 GFLOP/s , 17953.8 tokens/s INFO:__main__:2024-11-05 17:22:28 | Epoch: 1 | Step: 172090 | Dataset: 0-246541 | Loss: 0.339 | 912 ms/step , 6897.41 GFLOP/s , 17954.0 tokens/s INFO:__main__:2024-11-05 17:22:37 | Epoch: 1 | Step: 172100 | Dataset: 0-246861 | Loss: 0.306 | 913 ms/step , 6890.95 GFLOP/s , 17953.5 tokens/s INFO:__main__:2024-11-05 17:22:39 | Validation | Step: 172100 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:22:48 | Epoch: 1 | Step: 172110 | Dataset: 0-247181 | Loss: 0.334 | 912 ms/step , 6893.84 GFLOP/s , 15301.3 tokens/s INFO:__main__:2024-11-05 17:22:57 | Epoch: 1 | Step: 172120 | Dataset: 0-247501 | Loss: 0.305 | 912 ms/step , 6893.65 GFLOP/s , 17950.9 tokens/s INFO:__main__:2024-11-05 17:23:06 | Epoch: 1 | Step: 172130 | Dataset: 0-247821 | Loss: 0.308 | 911 ms/step , 6901.58 GFLOP/s , 17956.5 tokens/s INFO:__main__:2024-11-05 17:23:15 | Epoch: 1 | 
Step: 172140 | Dataset: 0-248141 | Loss: 0.276 | 912 ms/step , 6895.81 GFLOP/s , 17954.0 tokens/s INFO:__main__:2024-11-05 17:23:24 | Epoch: 1 | Step: 172150 | Dataset: 0-248461 | Loss: 0.282 | 911 ms/step , 6904.78 GFLOP/s , 17959.4 tokens/s INFO:__main__:2024-11-05 17:23:33 | Epoch: 1 | Step: 172160 | Dataset: 0-248781 | Loss: 0.319 | 913 ms/step , 6887.32 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 17:23:43 | Epoch: 1 | Step: 172170 | Dataset: 0-249101 | Loss: 0.236 | 912 ms/step , 6899.84 GFLOP/s , 17957.9 tokens/s INFO:__main__:2024-11-05 17:23:52 | Epoch: 1 | Step: 172180 | Dataset: 0-249421 | Loss: 0.296 | 911 ms/step , 6900.90 GFLOP/s , 17955.0 tokens/s INFO:__main__:2024-11-05 17:24:01 | Epoch: 1 | Step: 172190 | Dataset: 0-249741 | Loss: 0.294 | 911 ms/step , 6900.22 GFLOP/s , 17954.5 tokens/s INFO:__main__:2024-11-05 17:24:10 | Epoch: 1 | Step: 172200 | Dataset: 0-250061 | Loss: 0.331 | 912 ms/step , 6896.45 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-05 17:24:11 | Validation | Step: 172200 | Val_loss: 0.799 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:24:21 | Epoch: 1 | Step: 172210 | Dataset: 0-250381 | Loss: 0.293 | 912 ms/step , 6897.17 GFLOP/s , 15287.9 tokens/s INFO:__main__:2024-11-05 17:24:30 | Epoch: 1 | Step: 172220 | Dataset: 0-250701 | Loss: 0.294 | 913 ms/step , 6891.90 GFLOP/s , 17955.9 tokens/s INFO:__main__:2024-11-05 17:24:39 | Epoch: 1 | Step: 172230 | Dataset: 0-251021 | Loss: 0.253 | 913 ms/step , 6890.31 GFLOP/s , 17956.9 tokens/s INFO:__main__:2024-11-05 17:24:48 | Epoch: 1 | Step: 172240 | Dataset: 0-251341 | Loss: 0.240 | 912 ms/step , 6896.04 GFLOP/s , 17955.8 tokens/s INFO:__main__:2024-11-05 17:24:57 | Epoch: 1 | Step: 172250 | Dataset: 0-251661 | Loss: 0.309 | 911 ms/step , 6901.96 GFLOP/s , 17957.8 tokens/s INFO:__main__:2024-11-05 17:25:06 | Epoch: 1 | Step: 172260 | Dataset: 0-251981 | Loss: 0.292 | 911 ms/step , 6901.65 GFLOP/s , 17959.1 tokens/s INFO:__main__:2024-11-05 17:25:15 | Epoch: 1 | Step: 
172270 | Dataset: 0-252301 | Loss: 0.336 | 912 ms/step , 6894.19 GFLOP/s , 17954.8 tokens/s INFO:__main__:2024-11-05 17:25:24 | Epoch: 1 | Step: 172280 | Dataset: 0-252621 | Loss: 0.285 | 912 ms/step , 6897.19 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 17:25:34 | Epoch: 1 | Step: 172290 | Dataset: 0-252941 | Loss: 0.297 | 912 ms/step , 6893.90 GFLOP/s , 17955.0 tokens/s INFO:__main__:2024-11-05 17:25:43 | Epoch: 1 | Step: 172300 | Dataset: 0-253261 | Loss: 0.243 | 912 ms/step , 6895.68 GFLOP/s , 17955.7 tokens/s INFO:__main__:2024-11-05 17:25:44 | Validation | Step: 172300 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:25:53 | Epoch: 1 | Step: 172310 | Dataset: 0-253581 | Loss: 0.743 | 912 ms/step , 6897.66 GFLOP/s , 15301.6 tokens/s INFO:__main__:2024-11-05 17:26:03 | Epoch: 1 | Step: 172320 | Dataset: 0-253901 | Loss: 0.695 | 912 ms/step , 6893.68 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 17:26:12 | Epoch: 1 | Step: 172330 | Dataset: 0-254221 | Loss: 0.695 | 912 ms/step , 6893.39 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 17:26:21 | Epoch: 1 | Step: 172340 | Dataset: 0-254541 | Loss: 0.736 | 912 ms/step , 6893.26 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 17:26:30 | Epoch: 1 | Step: 172350 | Dataset: 0-254861 | Loss: 0.667 | 913 ms/step , 6892.08 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 17:26:39 | Epoch: 1 | Step: 172360 | Dataset: 0-255181 | Loss: 0.647 | 914 ms/step , 6884.89 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 17:26:48 | Epoch: 1 | Step: 172370 | Dataset: 0-255501 | Loss: 0.699 | 912 ms/step , 6898.98 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 17:26:57 | Epoch: 1 | Step: 172380 | Dataset: 0-255821 | Loss: 0.610 | 913 ms/step , 6892.06 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 17:27:07 | Epoch: 1 | Step: 172390 | Dataset: 0-256141 | Loss: 0.727 | 912 ms/step , 6893.18 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 17:27:16 | Epoch: 1 | Step: 172400 | 
Dataset: 0-256461 | Loss: 0.634 | 913 ms/step , 6890.01 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 17:27:17 | Validation | Step: 172400 | Val_loss: 0.766 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:27:26 | Epoch: 1 | Step: 172410 | Dataset: 0-256781 | Loss: 0.662 | 913 ms/step , 6891.52 GFLOP/s , 15282.2 tokens/s INFO:__main__:2024-11-05 17:27:35 | Epoch: 1 | Step: 172420 | Dataset: 0-257101 | Loss: 0.679 | 913 ms/step , 6888.52 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-05 17:27:45 | Epoch: 1 | Step: 172430 | Dataset: 0-257421 | Loss: 0.667 | 913 ms/step , 6892.22 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 17:27:54 | Epoch: 1 | Step: 172440 | Dataset: 0-257741 | Loss: 0.702 | 913 ms/step , 6887.61 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 17:28:03 | Epoch: 1 | Step: 172450 | Dataset: 0-258061 | Loss: 0.572 | 912 ms/step , 6893.69 GFLOP/s , 17954.7 tokens/s INFO:__main__:2024-11-05 17:28:12 | Epoch: 1 | Step: 172460 | Dataset: 0-258381 | Loss: 0.657 | 913 ms/step , 6891.83 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 17:28:21 | Epoch: 1 | Step: 172470 | Dataset: 0-258701 | Loss: 0.635 | 912 ms/step , 6899.13 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-05 17:28:30 | Epoch: 1 | Step: 172480 | Dataset: 0-259021 | Loss: 0.688 | 912 ms/step , 6900.03 GFLOP/s , 17950.6 tokens/s INFO:__main__:2024-11-05 17:28:39 | Epoch: 1 | Step: 172490 | Dataset: 0-259341 | Loss: 0.648 | 912 ms/step , 6897.32 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 17:28:49 | Epoch: 1 | Step: 172500 | Dataset: 0-259661 | Loss: 0.645 | 913 ms/step , 6885.54 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 17:28:50 | Validation | Step: 172500 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:28:59 | Epoch: 1 | Step: 172510 | Dataset: 0-259981 | Loss: 0.626 | 913 ms/step , 6890.56 GFLOP/s , 15277.7 tokens/s INFO:__main__:2024-11-05 17:29:08 | Epoch: 1 | Step: 172520 | Dataset: 0-260301 | Loss: 0.617 | 912 ms/step , 
6895.17 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-05 17:29:17 | Epoch: 1 | Step: 172530 | Dataset: 0-260621 | Loss: 0.692 | 912 ms/step , 6893.75 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-05 17:29:27 | Epoch: 1 | Step: 172540 | Dataset: 0-260941 | Loss: 0.645 | 911 ms/step , 6902.76 GFLOP/s , 17949.4 tokens/s INFO:__main__:2024-11-05 17:29:36 | Epoch: 1 | Step: 172550 | Dataset: 0-261261 | Loss: 0.679 | 913 ms/step , 6887.68 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 17:29:45 | Epoch: 1 | Step: 172560 | Dataset: 0-261581 | Loss: 0.595 | 913 ms/step , 6887.67 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 17:29:54 | Epoch: 1 | Step: 172570 | Dataset: 0-261901 | Loss: 0.608 | 912 ms/step , 6896.24 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 17:30:03 | Epoch: 1 | Step: 172580 | Dataset: 0-262221 | Loss: 0.705 | 913 ms/step , 6890.31 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 17:30:12 | Epoch: 1 | Step: 172590 | Dataset: 0-262541 | Loss: 0.607 | 914 ms/step , 6878.12 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 17:30:21 | Epoch: 1 | Step: 172600 | Dataset: 0-262861 | Loss: 0.679 | 912 ms/step , 6893.03 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 17:30:23 | Validation | Step: 172600 | Val_loss: 0.706 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:30:32 | Epoch: 1 | Step: 172610 | Dataset: 0-263181 | Loss: 0.636 | 913 ms/step , 6886.57 GFLOP/s , 15282.9 tokens/s INFO:__main__:2024-11-05 17:30:41 | Epoch: 1 | Step: 172620 | Dataset: 0-263501 | Loss: 0.590 | 912 ms/step , 6895.62 GFLOP/s , 17954.7 tokens/s INFO:__main__:2024-11-05 17:30:50 | Epoch: 1 | Step: 172630 | Dataset: 0-263821 | Loss: 0.631 | 913 ms/step , 6888.68 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 17:31:00 | Epoch: 1 | Step: 172640 | Dataset: 0-264141 | Loss: 0.627 | 912 ms/step , 6894.23 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-05 17:31:09 | Epoch: 1 | Step: 172650 | Dataset: 0-264461 | Loss: 0.688 | 912 ms/step , 6894.29 
GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 17:31:18 | Epoch: 1 | Step: 172660 | Dataset: 0-264781 | Loss: 0.607 | 913 ms/step , 6890.98 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 17:31:27 | Epoch: 1 | Step: 172670 | Dataset: 0-265101 | Loss: 0.645 | 915 ms/step , 6875.99 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 17:31:36 | Epoch: 1 | Step: 172680 | Dataset: 0-265421 | Loss: 0.575 | 912 ms/step , 6899.03 GFLOP/s , 17952.4 tokens/s INFO:__main__:2024-11-05 17:31:45 | Epoch: 1 | Step: 172690 | Dataset: 0-265741 | Loss: 0.700 | 913 ms/step , 6892.14 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 17:31:54 | Epoch: 1 | Step: 172700 | Dataset: 0-266061 | Loss: 0.629 | 913 ms/step , 6889.84 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 17:31:56 | Validation | Step: 172700 | Val_loss: 0.709 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:32:05 | Epoch: 1 | Step: 172710 | Dataset: 0-266381 | Loss: 0.613 | 912 ms/step , 6894.23 GFLOP/s , 15291.9 tokens/s INFO:__main__:2024-11-05 17:32:14 | Epoch: 1 | Step: 172720 | Dataset: 0-266701 | Loss: 0.585 | 912 ms/step , 6898.47 GFLOP/s , 17953.5 tokens/s INFO:__main__:2024-11-05 17:32:23 | Epoch: 1 | Step: 172730 | Dataset: 0-267021 | Loss: 0.583 | 913 ms/step , 6887.75 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 17:32:32 | Epoch: 1 | Step: 172740 | Dataset: 0-267341 | Loss: 0.635 | 912 ms/step , 6894.06 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 17:32:42 | Epoch: 1 | Step: 172750 | Dataset: 0-267661 | Loss: 0.682 | 911 ms/step , 6901.33 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-05 17:32:51 | Epoch: 1 | Step: 172760 | Dataset: 0-267981 | Loss: 0.624 | 912 ms/step , 6896.02 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 17:33:00 | Epoch: 1 | Step: 172770 | Dataset: 0-268301 | Loss: 0.680 | 911 ms/step , 6902.39 GFLOP/s , 17949.8 tokens/s INFO:__main__:2024-11-05 17:33:09 | Epoch: 1 | Step: 172780 | Dataset: 0-268621 | Loss: 0.576 | 912 ms/step , 6898.89 GFLOP/s , 
17953.2 tokens/s INFO:__main__:2024-11-05 17:33:18 | Epoch: 1 | Step: 172790 | Dataset: 0-268941 | Loss: 0.597 | 913 ms/step , 6891.18 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-05 17:33:27 | Epoch: 1 | Step: 172800 | Dataset: 0-269261 | Loss: 0.608 | 912 ms/step , 6893.75 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 17:33:29 | Validation | Step: 172800 | Val_loss: 0.784 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:33:38 | Epoch: 1 | Step: 172810 | Dataset: 0-269581 | Loss: 0.653 | 912 ms/step , 6893.42 GFLOP/s , 15286.1 tokens/s INFO:__main__:2024-11-05 17:33:47 | Epoch: 1 | Step: 172820 | Dataset: 0-269901 | Loss: 0.597 | 912 ms/step , 6894.48 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 17:33:56 | Epoch: 1 | Step: 172830 | Dataset: 0-270221 | Loss: 0.606 | 913 ms/step , 6889.52 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 17:34:05 | Epoch: 1 | Step: 172840 | Dataset: 0-270541 | Loss: 0.608 | 912 ms/step , 6895.43 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 17:34:14 | Epoch: 1 | Step: 172850 | Dataset: 0-270861 | Loss: 0.625 | 913 ms/step , 6886.33 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 17:34:24 | Epoch: 1 | Step: 172860 | Dataset: 0-271181 | Loss: 0.652 | 914 ms/step , 6884.61 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 17:34:33 | Epoch: 1 | Step: 172870 | Dataset: 0-271501 | Loss: 0.676 | 913 ms/step , 6888.85 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 17:34:42 | Epoch: 1 | Step: 172880 | Dataset: 0-271821 | Loss: 0.705 | 913 ms/step , 6887.18 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 17:34:51 | Epoch: 1 | Step: 172890 | Dataset: 0-272141 | Loss: 0.705 | 913 ms/step , 6888.95 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 17:35:00 | Epoch: 1 | Step: 172900 | Dataset: 0-272461 | Loss: 0.677 | 913 ms/step , 6886.44 GFLOP/s , 17950.6 tokens/s INFO:__main__:2024-11-05 17:35:02 | Validation | Step: 172900 | Val_loss: 0.761 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:35:11 
| Epoch: 1 | Step: 172910 | Dataset: 0-272781 | Loss: 0.564 | 912 ms/step , 6897.90 GFLOP/s , 15286.2 tokens/s INFO:__main__:2024-11-05 17:35:20 | Epoch: 1 | Step: 172920 | Dataset: 0-273101 | Loss: 0.555 | 912 ms/step , 6899.21 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 17:35:29 | Epoch: 1 | Step: 172930 | Dataset: 0-273421 | Loss: 0.511 | 912 ms/step , 6895.76 GFLOP/s , 17950.1 tokens/s INFO:__main__:2024-11-05 17:35:38 | Epoch: 1 | Step: 172940 | Dataset: 0-273741 | Loss: 0.629 | 913 ms/step , 6885.59 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 17:35:47 | Epoch: 1 | Step: 172950 | Dataset: 0-274061 | Loss: 0.662 | 913 ms/step , 6889.51 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 17:35:56 | Epoch: 1 | Step: 172960 | Dataset: 0-274381 | Loss: 0.671 | 913 ms/step , 6890.89 GFLOP/s , 17953.7 tokens/s INFO:__main__:2024-11-05 17:36:06 | Epoch: 1 | Step: 172970 | Dataset: 0-274701 | Loss: 0.643 | 913 ms/step , 6889.82 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 17:36:15 | Epoch: 1 | Step: 172980 | Dataset: 0-275021 | Loss: 0.628 | 913 ms/step , 6887.57 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-05 17:36:24 | Epoch: 1 | Step: 172990 | Dataset: 0-275341 | Loss: 0.629 | 912 ms/step , 6894.91 GFLOP/s , 17953.9 tokens/s INFO:__main__:2024-11-05 17:36:33 | Epoch: 1 | Step: 173000 | Dataset: 0-275661 | Loss: 0.603 | 912 ms/step , 6899.02 GFLOP/s , 17947.9 tokens/s INFO:__main__:2024-11-05 17:36:34 | Validation | Step: 173000 | Val_loss: 0.731 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:36:34 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_173634_step_173000.pt` INFO:__main__:2024-11-05 17:36:45 | Epoch: 1 | Step: 173010 | Dataset: 0-275981 | Loss: 0.586 | 912 ms/step , 6894.99 GFLOP/s , 13807.3 tokens/s INFO:__main__:2024-11-05 17:36:54 | Epoch: 1 | Step: 173020 | Dataset: 0-276301 | Loss: 0.727 | 913 ms/step , 6891.21 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 17:37:03 | Epoch: 1 
| Step: 173030 | Dataset: 0-276621 | Loss: 0.570 | 912 ms/step , 6898.35 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 17:37:12 | Epoch: 1 | Step: 173040 | Dataset: 0-276941 | Loss: 0.633 | 913 ms/step , 6890.74 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 17:37:21 | Epoch: 1 | Step: 173050 | Dataset: 0-277261 | Loss: 0.671 | 914 ms/step , 6883.16 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 17:37:30 | Epoch: 1 | Step: 173060 | Dataset: 0-277581 | Loss: 0.551 | 912 ms/step , 6896.04 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 17:37:40 | Epoch: 1 | Step: 173070 | Dataset: 0-277901 | Loss: 0.534 | 912 ms/step , 6898.89 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-05 17:37:49 | Epoch: 1 | Step: 173080 | Dataset: 0-278221 | Loss: 0.661 | 912 ms/step , 6893.43 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 17:37:58 | Epoch: 1 | Step: 173090 | Dataset: 0-278541 | Loss: 0.541 | 913 ms/step , 6889.39 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 17:38:07 | Epoch: 1 | Step: 173100 | Dataset: 0-278861 | Loss: 0.681 | 914 ms/step , 6878.21 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 17:38:09 | Validation | Step: 173100 | Val_loss: 0.814 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:38:18 | Epoch: 1 | Step: 173110 | Dataset: 0-279181 | Loss: 0.681 | 912 ms/step , 6894.05 GFLOP/s , 15278.5 tokens/s INFO:__main__:2024-11-05 17:38:27 | Epoch: 1 | Step: 173120 | Dataset: 0-279501 | Loss: 0.639 | 914 ms/step , 6883.13 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 17:38:36 | Epoch: 1 | Step: 173130 | Dataset: 0-279821 | Loss: 0.534 | 912 ms/step , 6892.85 GFLOP/s , 17949.7 tokens/s INFO:__main__:2024-11-05 17:38:45 | Epoch: 1 | Step: 173140 | Dataset: 0-280141 | Loss: 0.619 | 912 ms/step , 6899.53 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 17:38:54 | Epoch: 1 | Step: 173150 | Dataset: 0-280461 | Loss: 0.635 | 912 ms/step , 6895.60 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 17:39:03 | Epoch: 1 | Step: 
173160 | Dataset: 0-280781 | Loss: 0.652 | 912 ms/step , 6896.47 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 17:39:12 | Epoch: 1 | Step: 173170 | Dataset: 0-281101 | Loss: 0.587 | 912 ms/step , 6894.01 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 17:39:22 | Epoch: 1 | Step: 173180 | Dataset: 0-281421 | Loss: 0.669 | 913 ms/step , 6889.66 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 17:39:31 | Epoch: 1 | Step: 173190 | Dataset: 0-281741 | Loss: 0.717 | 912 ms/step , 6896.21 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-05 17:39:40 | Epoch: 1 | Step: 173200 | Dataset: 0-282061 | Loss: 0.703 | 913 ms/step , 6889.62 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 17:39:41 | Validation | Step: 173200 | Val_loss: 0.782 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:39:51 | Epoch: 1 | Step: 173210 | Dataset: 0-282381 | Loss: 0.611 | 913 ms/step , 6888.99 GFLOP/s , 15281.3 tokens/s INFO:__main__:2024-11-05 17:40:00 | Epoch: 1 | Step: 173220 | Dataset: 0-282701 | Loss: 0.655 | 913 ms/step , 6888.87 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 17:40:09 | Epoch: 1 | Step: 173230 | Dataset: 0-283021 | Loss: 0.686 | 912 ms/step , 6897.99 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 17:40:18 | Epoch: 1 | Step: 173240 | Dataset: 0-283341 | Loss: 0.570 | 913 ms/step , 6892.40 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 17:40:27 | Epoch: 1 | Step: 173250 | Dataset: 0-283661 | Loss: 0.656 | 912 ms/step , 6894.06 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-05 17:40:36 | Epoch: 1 | Step: 173260 | Dataset: 0-283981 | Loss: 0.588 | 911 ms/step , 6900.60 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 17:40:45 | Epoch: 1 | Step: 173270 | Dataset: 0-284301 | Loss: 0.643 | 913 ms/step , 6891.72 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 17:40:54 | Epoch: 1 | Step: 173280 | Dataset: 0-284621 | Loss: 0.602 | 913 ms/step , 6891.67 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 17:41:04 | Epoch: 1 | Step: 173290 | 
Dataset: 0-284941 | Loss: 0.624 | 915 ms/step , 6873.12 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 17:41:13 | Epoch: 1 | Step: 173300 | Dataset: 0-285261 | Loss: 0.613 | 912 ms/step , 6893.27 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 17:41:14 | Validation | Step: 173300 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:41:23 | Epoch: 1 | Step: 173310 | Dataset: 0-285581 | Loss: 0.632 | 913 ms/step , 6891.42 GFLOP/s , 15279.7 tokens/s INFO:__main__:2024-11-05 17:41:33 | Epoch: 1 | Step: 173320 | Dataset: 0-285901 | Loss: 0.635 | 913 ms/step , 6889.75 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 17:41:42 | Epoch: 1 | Step: 173330 | Dataset: 0-286221 | Loss: 0.550 | 912 ms/step , 6893.78 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 17:41:51 | Epoch: 1 | Step: 173340 | Dataset: 0-286541 | Loss: 0.621 | 914 ms/step , 6879.05 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 17:42:00 | Epoch: 1 | Step: 173350 | Dataset: 0-286861 | Loss: 0.670 | 912 ms/step , 6894.98 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 17:42:09 | Epoch: 1 | Step: 173360 | Dataset: 0-287181 | Loss: 0.544 | 912 ms/step , 6895.27 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 17:42:18 | Epoch: 1 | Step: 173370 | Dataset: 0-287501 | Loss: 0.694 | 912 ms/step , 6898.75 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-05 17:42:27 | Epoch: 1 | Step: 173380 | Dataset: 0-287821 | Loss: 0.567 | 912 ms/step , 6896.85 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 17:42:37 | Epoch: 1 | Step: 173390 | Dataset: 0-288141 | Loss: 0.664 | 912 ms/step , 6893.24 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 17:42:46 | Epoch: 1 | Step: 173400 | Dataset: 0-288461 | Loss: 0.588 | 911 ms/step , 6903.54 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 17:42:47 | Validation | Step: 173400 | Val_loss: 0.737 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:42:56 | Epoch: 1 | Step: 173410 | Dataset: 0-288781 | Loss: 0.638 | 912 ms/step , 
6893.55 GFLOP/s , 15283.0 tokens/s INFO:__main__:2024-11-05 17:43:06 | Epoch: 1 | Step: 173420 | Dataset: 0-289101 | Loss: 0.641 | 913 ms/step , 6885.51 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 17:43:15 | Epoch: 1 | Step: 173430 | Dataset: 0-289421 | Loss: 0.579 | 913 ms/step , 6887.72 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 17:43:24 | Epoch: 1 | Step: 173440 | Dataset: 0-289741 | Loss: 0.641 | 913 ms/step , 6889.07 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 17:43:33 | Epoch: 1 | Step: 173450 | Dataset: 0-290061 | Loss: 0.721 | 914 ms/step , 6883.13 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 17:43:42 | Epoch: 1 | Step: 173460 | Dataset: 0-290381 | Loss: 0.639 | 912 ms/step , 6897.66 GFLOP/s , 17949.9 tokens/s INFO:__main__:2024-11-05 17:43:51 | Epoch: 1 | Step: 173470 | Dataset: 0-290701 | Loss: 0.500 | 912 ms/step , 6894.45 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 17:44:00 | Epoch: 1 | Step: 173480 | Dataset: 0-291021 | Loss: 0.600 | 914 ms/step , 6882.02 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 17:44:09 | Epoch: 1 | Step: 173490 | Dataset: 0-291341 | Loss: 0.530 | 913 ms/step , 6887.75 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 17:44:19 | Epoch: 1 | Step: 173500 | Dataset: 0-291661 | Loss: 0.602 | 912 ms/step , 6894.54 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 17:44:20 | Validation | Step: 173500 | Val_loss: 0.785 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:44:29 | Epoch: 1 | Step: 173510 | Dataset: 0-291981 | Loss: 0.602 | 912 ms/step , 6896.66 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-05 17:44:38 | Epoch: 1 | Step: 173520 | Dataset: 0-292301 | Loss: 0.582 | 912 ms/step , 6894.96 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 17:44:48 | Epoch: 1 | Step: 173530 | Dataset: 0-292621 | Loss: 0.640 | 912 ms/step , 6897.62 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 17:44:57 | Epoch: 1 | Step: 173540 | Dataset: 0-292941 | Loss: 0.621 | 913 ms/step , 6887.63 
GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 17:45:06 | Epoch: 1 | Step: 173550 | Dataset: 0-293261 | Loss: 0.581 | 912 ms/step , 6893.70 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 17:45:15 | Epoch: 1 | Step: 173560 | Dataset: 0-293581 | Loss: 0.587 | 913 ms/step , 6886.46 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 17:45:24 | Epoch: 1 | Step: 173570 | Dataset: 0-293901 | Loss: 0.624 | 914 ms/step , 6884.97 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 17:45:33 | Epoch: 1 | Step: 173580 | Dataset: 0-294221 | Loss: 0.586 | 913 ms/step , 6888.15 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 17:45:42 | Epoch: 1 | Step: 173590 | Dataset: 0-294541 | Loss: 0.715 | 913 ms/step , 6891.99 GFLOP/s , 17951.4 tokens/s INFO:__main__:2024-11-05 17:45:51 | Epoch: 1 | Step: 173600 | Dataset: 0-294861 | Loss: 0.750 | 913 ms/step , 6892.38 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-05 17:45:53 | Validation | Step: 173600 | Val_loss: 0.752 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:46:02 | Epoch: 1 | Step: 173610 | Dataset: 0-295181 | Loss: 0.590 | 912 ms/step , 6896.57 GFLOP/s , 15274.9 tokens/s INFO:__main__:2024-11-05 17:46:11 | Epoch: 1 | Step: 173620 | Dataset: 0-295501 | Loss: 0.571 | 912 ms/step , 6892.94 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 17:46:20 | Epoch: 1 | Step: 173630 | Dataset: 0-295821 | Loss: 0.570 | 912 ms/step , 6897.56 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 17:46:30 | Epoch: 1 | Step: 173640 | Dataset: 0-296141 | Loss: 0.590 | 914 ms/step , 6884.86 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 17:46:39 | Epoch: 1 | Step: 173650 | Dataset: 0-296461 | Loss: 0.730 | 915 ms/step , 6870.98 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 17:46:48 | Epoch: 1 | Step: 173660 | Dataset: 0-296781 | Loss: 0.606 | 914 ms/step , 6882.18 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 17:46:57 | Epoch: 1 | Step: 173670 | Dataset: 0-297101 | Loss: 0.580 | 913 ms/step , 6885.98 GFLOP/s , 
17936.4 tokens/s INFO:__main__:2024-11-05 17:47:06 | Epoch: 1 | Step: 173680 | Dataset: 0-297421 | Loss: 0.732 | 913 ms/step , 6890.19 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 17:47:15 | Epoch: 1 | Step: 173690 | Dataset: 0-297741 | Loss: 0.668 | 913 ms/step , 6885.34 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 17:47:24 | Epoch: 1 | Step: 173700 | Dataset: 0-298061 | Loss: 0.559 | 913 ms/step , 6887.77 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 17:47:26 | Validation | Step: 173700 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:47:35 | Epoch: 1 | Step: 173710 | Dataset: 0-298381 | Loss: 0.730 | 912 ms/step , 6893.08 GFLOP/s , 15278.4 tokens/s INFO:__main__:2024-11-05 17:47:44 | Epoch: 1 | Step: 173720 | Dataset: 0-298701 | Loss: 0.521 | 912 ms/step , 6894.47 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 17:47:53 | Epoch: 1 | Step: 173730 | Dataset: 0-299021 | Loss: 0.631 | 913 ms/step , 6888.83 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 17:48:03 | Epoch: 1 | Step: 173740 | Dataset: 0-299341 | Loss: 0.513 | 912 ms/step , 6899.39 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 17:48:12 | Epoch: 1 | Step: 173750 | Dataset: 0-299661 | Loss: 0.578 | 912 ms/step , 6894.47 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 17:48:21 | Epoch: 1 | Step: 173760 | Dataset: 0-299981 | Loss: 0.634 | 912 ms/step , 6895.29 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-05 17:48:30 | Epoch: 1 | Step: 173770 | Dataset: 0-300301 | Loss: 0.577 | 912 ms/step , 6895.77 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 17:48:39 | Epoch: 1 | Step: 173780 | Dataset: 0-300621 | Loss: 0.656 | 913 ms/step , 6891.58 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 17:48:48 | Epoch: 1 | Step: 173790 | Dataset: 0-300941 | Loss: 0.627 | 912 ms/step , 6899.27 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 17:48:57 | Epoch: 1 | Step: 173800 | Dataset: 0-301261 | Loss: 0.657 | 912 ms/step , 6895.14 GFLOP/s , 17945.4 
tokens/s INFO:__main__:2024-11-05 17:48:59 | Validation | Step: 173800 | Val_loss: 0.771 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:49:08 | Epoch: 1 | Step: 173810 | Dataset: 0-301581 | Loss: 0.616 | 911 ms/step , 6900.74 GFLOP/s , 15277.6 tokens/s INFO:__main__:2024-11-05 17:49:17 | Epoch: 1 | Step: 173820 | Dataset: 0-301901 | Loss: 0.635 | 913 ms/step , 6886.82 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 17:49:26 | Epoch: 1 | Step: 173830 | Dataset: 0-302221 | Loss: 0.594 | 912 ms/step , 6897.24 GFLOP/s , 17946.2 tokens/s INFO:__main__:2024-11-05 17:49:35 | Epoch: 1 | Step: 173840 | Dataset: 0-302541 | Loss: 0.599 | 911 ms/step , 6900.88 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 17:49:45 | Epoch: 1 | Step: 173850 | Dataset: 0-302861 | Loss: 0.549 | 912 ms/step , 6896.88 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 17:49:54 | Epoch: 1 | Step: 173860 | Dataset: 0-303181 | Loss: 0.578 | 913 ms/step , 6891.95 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 17:50:03 | Epoch: 1 | Step: 173870 | Dataset: 0-303501 | Loss: 0.590 | 913 ms/step , 6889.95 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 17:50:12 | Epoch: 1 | Step: 173880 | Dataset: 0-303821 | Loss: 0.726 | 914 ms/step , 6881.11 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-05 17:50:21 | Epoch: 1 | Step: 173890 | Dataset: 0-304141 | Loss: 0.633 | 911 ms/step , 6904.83 GFLOP/s , 17949.6 tokens/s INFO:__main__:2024-11-05 17:50:30 | Epoch: 1 | Step: 173900 | Dataset: 0-304461 | Loss: 0.598 | 913 ms/step , 6890.15 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 17:50:32 | Validation | Step: 173900 | Val_loss: 0.726 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:50:41 | Epoch: 1 | Step: 173910 | Dataset: 0-304781 | Loss: 0.548 | 912 ms/step , 6895.55 GFLOP/s , 15292.0 tokens/s INFO:__main__:2024-11-05 17:50:50 | Epoch: 1 | Step: 173920 | Dataset: 0-305101 | Loss: 0.542 | 913 ms/step , 6887.14 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 17:50:59 | Epoch: 
1 | Step: 173930 | Dataset: 0-305421 | Loss: 0.532 | 913 ms/step , 6885.63 GFLOP/s , 17952.0 tokens/s INFO:__main__:2024-11-05 17:51:08 | Epoch: 1 | Step: 173940 | Dataset: 0-305741 | Loss: 0.548 | 913 ms/step , 6885.62 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-05 17:51:17 | Epoch: 1 | Step: 173950 | Dataset: 0-306061 | Loss: 0.690 | 911 ms/step , 6902.57 GFLOP/s , 17948.9 tokens/s INFO:__main__:2024-11-05 17:51:27 | Epoch: 1 | Step: 173960 | Dataset: 0-306381 | Loss: 0.651 | 913 ms/step , 6892.53 GFLOP/s , 17950.2 tokens/s INFO:__main__:2024-11-05 17:51:36 | Epoch: 1 | Step: 173970 | Dataset: 0-306701 | Loss: 0.595 | 913 ms/step , 6891.22 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 17:51:45 | Epoch: 1 | Step: 173980 | Dataset: 0-307021 | Loss: 0.704 | 912 ms/step , 6896.91 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 17:51:54 | Epoch: 1 | Step: 173990 | Dataset: 0-307341 | Loss: 0.590 | 912 ms/step , 6894.95 GFLOP/s , 17945.7 tokens/s INFO:__main__:2024-11-05 17:52:03 | Epoch: 1 | Step: 174000 | Dataset: 0-307661 | Loss: 0.651 | 913 ms/step , 6888.96 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 17:52:05 | Validation | Step: 174000 | Val_loss: 0.776 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:52:05 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_175205_step_174000.pt` INFO:__main__:2024-11-05 17:52:15 | Epoch: 1 | Step: 174010 | Dataset: 0-307981 | Loss: 0.610 | 913 ms/step , 6886.31 GFLOP/s , 13778.2 tokens/s INFO:__main__:2024-11-05 17:52:24 | Epoch: 1 | Step: 174020 | Dataset: 0-308301 | Loss: 0.623 | 912 ms/step , 6897.04 GFLOP/s , 17944.4 tokens/s INFO:__main__:2024-11-05 17:52:33 | Epoch: 1 | Step: 174030 | Dataset: 0-308621 | Loss: 0.643 | 913 ms/step , 6889.42 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 17:52:42 | Epoch: 1 | Step: 174040 | Dataset: 0-308941 | Loss: 0.625 | 914 ms/step , 6879.03 GFLOP/s , 17950.4 tokens/s INFO:__main__:2024-11-05 17:52:51 | Epoch: 1 | Step: 
174050 | Dataset: 0-309261 | Loss: 0.600 | 913 ms/step , 6890.06 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 17:53:01 | Epoch: 1 | Step: 174060 | Dataset: 0-309581 | Loss: 0.688 | 913 ms/step , 6890.36 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 17:53:10 | Epoch: 1 | Step: 174070 | Dataset: 0-309901 | Loss: 0.632 | 913 ms/step , 6886.89 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 17:53:19 | Epoch: 1 | Step: 174080 | Dataset: 0-310221 | Loss: 0.577 | 912 ms/step , 6893.63 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 17:53:28 | Epoch: 1 | Step: 174090 | Dataset: 0-310541 | Loss: 0.588 | 913 ms/step , 6889.12 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 17:53:37 | Epoch: 1 | Step: 174100 | Dataset: 0-310861 | Loss: 0.632 | 913 ms/step , 6887.21 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 17:53:39 | Validation | Step: 174100 | Val_loss: 0.726 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:53:48 | Epoch: 1 | Step: 174110 | Dataset: 0-311181 | Loss: 0.557 | 912 ms/step , 6894.43 GFLOP/s , 15286.6 tokens/s INFO:__main__:2024-11-05 17:53:57 | Epoch: 1 | Step: 174120 | Dataset: 0-311501 | Loss: 0.588 | 912 ms/step , 6897.74 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 17:54:06 | Epoch: 1 | Step: 174130 | Dataset: 0-311821 | Loss: 0.552 | 912 ms/step , 6892.82 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 17:54:15 | Epoch: 1 | Step: 174140 | Dataset: 0-312141 | Loss: 0.705 | 913 ms/step , 6887.99 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 17:54:24 | Epoch: 1 | Step: 174150 | Dataset: 0-312461 | Loss: 0.544 | 914 ms/step , 6882.25 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 17:54:34 | Epoch: 1 | Step: 174160 | Dataset: 0-312781 | Loss: 0.599 | 914 ms/step , 6883.77 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 17:54:43 | Epoch: 1 | Step: 174170 | Dataset: 0-313101 | Loss: 0.532 | 912 ms/step , 6899.41 GFLOP/s , 17948.6 tokens/s INFO:__main__:2024-11-05 17:54:52 | Epoch: 1 | Step: 174180 | 
Dataset: 0-313421 | Loss: 0.561 | 913 ms/step , 6892.07 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 17:55:01 | Epoch: 1 | Step: 174190 | Dataset: 0-313741 | Loss: 0.617 | 911 ms/step , 6905.88 GFLOP/s , 17953.2 tokens/s INFO:__main__:2024-11-05 17:55:10 | Epoch: 1 | Step: 174200 | Dataset: 0-314061 | Loss: 0.626 | 915 ms/step , 6876.25 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 17:55:12 | Validation | Step: 174200 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:55:21 | Epoch: 1 | Step: 174210 | Dataset: 0-314381 | Loss: 0.609 | 914 ms/step , 6884.31 GFLOP/s , 15294.7 tokens/s INFO:__main__:2024-11-05 17:55:30 | Epoch: 1 | Step: 174220 | Dataset: 0-314701 | Loss: 0.601 | 912 ms/step , 6894.78 GFLOP/s , 17953.8 tokens/s INFO:__main__:2024-11-05 17:55:39 | Epoch: 1 | Step: 174230 | Dataset: 0-315021 | Loss: 0.575 | 912 ms/step , 6894.07 GFLOP/s , 17951.2 tokens/s INFO:__main__:2024-11-05 17:55:48 | Epoch: 1 | Step: 174240 | Dataset: 0-315341 | Loss: 0.630 | 912 ms/step , 6898.59 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 17:55:57 | Epoch: 1 | Step: 174250 | Dataset: 0-315661 | Loss: 0.565 | 913 ms/step , 6891.10 GFLOP/s , 17944.9 tokens/s INFO:__main__:2024-11-05 17:56:06 | Epoch: 1 | Step: 174260 | Dataset: 0-315981 | Loss: 0.637 | 914 ms/step , 6883.47 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 17:56:16 | Epoch: 1 | Step: 174270 | Dataset: 0-316301 | Loss: 0.646 | 914 ms/step , 6884.45 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 17:56:25 | Epoch: 1 | Step: 174280 | Dataset: 0-316621 | Loss: 0.594 | 912 ms/step , 6893.89 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 17:56:34 | Epoch: 1 | Step: 174290 | Dataset: 0-316941 | Loss: 0.574 | 912 ms/step , 6892.99 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 17:56:43 | Epoch: 1 | Step: 174300 | Dataset: 0-317261 | Loss: 0.659 | 913 ms/step , 6885.15 GFLOP/s , 17952.3 tokens/s INFO:__main__:2024-11-05 17:56:44 | Validation | Step: 174300 | 
Val_loss: 0.745 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:56:54 | Epoch: 1 | Step: 174310 | Dataset: 0-317581 | Loss: 0.602 | 913 ms/step , 6891.16 GFLOP/s , 15293.2 tokens/s INFO:__main__:2024-11-05 17:57:03 | Epoch: 1 | Step: 174320 | Dataset: 0-317901 | Loss: 0.576 | 911 ms/step , 6902.85 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-05 17:57:12 | Epoch: 1 | Step: 174330 | Dataset: 0-318221 | Loss: 0.584 | 912 ms/step , 6892.79 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-05 17:57:21 | Epoch: 1 | Step: 174340 | Dataset: 0-318541 | Loss: 0.591 | 911 ms/step , 6901.37 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 17:57:30 | Epoch: 1 | Step: 174350 | Dataset: 0-318861 | Loss: 0.604 | 913 ms/step , 6885.68 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 17:57:39 | Epoch: 1 | Step: 174360 | Dataset: 0-319181 | Loss: 0.581 | 912 ms/step , 6899.37 GFLOP/s , 17951.9 tokens/s INFO:__main__:2024-11-05 17:57:48 | Epoch: 1 | Step: 174370 | Dataset: 0-319501 | Loss: 0.541 | 912 ms/step , 6893.18 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 17:57:58 | Epoch: 1 | Step: 174380 | Dataset: 0-319821 | Loss: 0.613 | 912 ms/step , 6894.74 GFLOP/s , 17952.6 tokens/s INFO:__main__:2024-11-05 17:58:07 | Epoch: 1 | Step: 174390 | Dataset: 0-320141 | Loss: 0.572 | 913 ms/step , 6888.64 GFLOP/s , 17951.8 tokens/s INFO:__main__:2024-11-05 17:58:16 | Epoch: 1 | Step: 174400 | Dataset: 0-320461 | Loss: 0.652 | 912 ms/step , 6898.71 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 17:58:17 | Validation | Step: 174400 | Val_loss: 0.667 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:58:26 | Epoch: 1 | Step: 174410 | Dataset: 0-320781 | Loss: 0.605 | 913 ms/step , 6892.12 GFLOP/s , 15285.5 tokens/s INFO:__main__:2024-11-05 17:58:36 | Epoch: 1 | Step: 174420 | Dataset: 0-321101 | Loss: 0.599 | 914 ms/step , 6883.21 GFLOP/s , 17949.1 tokens/s INFO:__main__:2024-11-05 17:58:45 | Epoch: 1 | Step: 174430 | Dataset: 0-321421 | Loss: 0.591 | 912 ms/step , 
6895.77 GFLOP/s , 17951.5 tokens/s INFO:__main__:2024-11-05 17:58:54 | Epoch: 1 | Step: 174440 | Dataset: 0-321741 | Loss: 0.584 | 913 ms/step , 6887.71 GFLOP/s , 17944.8 tokens/s INFO:__main__:2024-11-05 17:59:03 | Epoch: 1 | Step: 174450 | Dataset: 0-322061 | Loss: 0.626 | 913 ms/step , 6890.69 GFLOP/s , 17951.6 tokens/s INFO:__main__:2024-11-05 17:59:12 | Epoch: 1 | Step: 174460 | Dataset: 0-322381 | Loss: 0.602 | 912 ms/step , 6893.96 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 17:59:21 | Epoch: 1 | Step: 174470 | Dataset: 0-322701 | Loss: 0.611 | 913 ms/step , 6892.04 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 17:59:30 | Epoch: 1 | Step: 174480 | Dataset: 0-323021 | Loss: 0.622 | 912 ms/step , 6892.91 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 17:59:40 | Epoch: 1 | Step: 174490 | Dataset: 0-323341 | Loss: 0.652 | 913 ms/step , 6892.17 GFLOP/s , 17947.4 tokens/s INFO:__main__:2024-11-05 17:59:49 | Epoch: 1 | Step: 174500 | Dataset: 0-323661 | Loss: 0.523 | 913 ms/step , 6885.90 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 17:59:50 | Validation | Step: 174500 | Val_loss: 0.732 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 17:59:59 | Epoch: 1 | Step: 174510 | Dataset: 0-323981 | Loss: 0.502 | 911 ms/step , 6902.27 GFLOP/s , 15289.2 tokens/s INFO:__main__:2024-11-05 18:00:09 | Epoch: 1 | Step: 174520 | Dataset: 0-324301 | Loss: 0.691 | 912 ms/step , 6894.60 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 18:00:18 | Epoch: 1 | Step: 174530 | Dataset: 0-324621 | Loss: 0.632 | 912 ms/step , 6893.20 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 18:00:27 | Epoch: 1 | Step: 174540 | Dataset: 0-324941 | Loss: 0.588 | 913 ms/step , 6889.24 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 18:00:36 | Epoch: 1 | Step: 174550 | Dataset: 0-325261 | Loss: 0.639 | 912 ms/step , 6893.93 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-05 18:00:45 | Epoch: 1 | Step: 174560 | Dataset: 0-325581 | Loss: 0.623 | 913 ms/step , 6891.10 
GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 18:00:54 | Epoch: 1 | Step: 174570 | Dataset: 0-325901 | Loss: 0.527 | 912 ms/step , 6898.84 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-05 18:01:03 | Epoch: 1 | Step: 174580 | Dataset: 0-326221 | Loss: 0.592 | 912 ms/step , 6895.01 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 18:01:12 | Epoch: 1 | Step: 174590 | Dataset: 0-326541 | Loss: 0.607 | 914 ms/step , 6884.52 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-05 18:01:22 | Epoch: 1 | Step: 174600 | Dataset: 0-326861 | Loss: 0.594 | 912 ms/step , 6894.83 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 18:01:23 | Validation | Step: 174600 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:01:32 | Epoch: 1 | Step: 174610 | Dataset: 0-327181 | Loss: 0.624 | 912 ms/step , 6896.55 GFLOP/s , 15278.5 tokens/s INFO:__main__:2024-11-05 18:01:41 | Epoch: 1 | Step: 174620 | Dataset: 0-327501 | Loss: 0.683 | 913 ms/step , 6891.79 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 18:01:51 | Epoch: 1 | Step: 174630 | Dataset: 0-327821 | Loss: 0.706 | 913 ms/step , 6887.50 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 18:02:00 | Epoch: 1 | Step: 174640 | Dataset: 0-328141 | Loss: 0.622 | 912 ms/step , 6894.16 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 18:02:09 | Epoch: 1 | Step: 174650 | Dataset: 0-328461 | Loss: 0.563 | 912 ms/step , 6893.87 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 18:02:18 | Epoch: 1 | Step: 174660 | Dataset: 0-328781 | Loss: 0.634 | 914 ms/step , 6880.73 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 18:02:27 | Epoch: 1 | Step: 174670 | Dataset: 0-329101 | Loss: 0.588 | 912 ms/step , 6895.77 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-05 18:02:36 | Epoch: 1 | Step: 174680 | Dataset: 0-329421 | Loss: 0.654 | 912 ms/step , 6896.80 GFLOP/s , 17953.7 tokens/s INFO:__main__:2024-11-05 18:02:45 | Epoch: 1 | Step: 174690 | Dataset: 0-329741 | Loss: 0.666 | 913 ms/step , 6885.66 GFLOP/s , 
17948.3 tokens/s INFO:__main__:2024-11-05 18:02:54 | Epoch: 1 | Step: 174700 | Dataset: 0-330061 | Loss: 0.606 | 912 ms/step , 6899.90 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 18:02:56 | Validation | Step: 174700 | Val_loss: 0.730 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:03:05 | Epoch: 1 | Step: 174710 | Dataset: 0-330381 | Loss: 0.585 | 913 ms/step , 6890.06 GFLOP/s , 15287.0 tokens/s INFO:__main__:2024-11-05 18:03:14 | Epoch: 1 | Step: 174720 | Dataset: 0-330701 | Loss: 0.567 | 912 ms/step , 6893.46 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 18:03:23 | Epoch: 1 | Step: 174730 | Dataset: 0-331021 | Loss: 0.601 | 912 ms/step , 6893.50 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 18:03:33 | Epoch: 1 | Step: 174740 | Dataset: 0-331341 | Loss: 0.587 | 912 ms/step , 6893.13 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 18:03:42 | Epoch: 1 | Step: 174750 | Dataset: 0-331661 | Loss: 0.655 | 912 ms/step , 6895.81 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 18:03:51 | Epoch: 1 | Step: 174760 | Dataset: 0-331981 | Loss: 0.616 | 913 ms/step , 6886.43 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 18:04:00 | Epoch: 1 | Step: 174770 | Dataset: 0-332301 | Loss: 0.627 | 913 ms/step , 6888.51 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 18:04:09 | Epoch: 1 | Step: 174780 | Dataset: 0-332621 | Loss: 0.616 | 911 ms/step , 6902.66 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-05 18:04:18 | Epoch: 1 | Step: 174790 | Dataset: 0-332941 | Loss: 0.609 | 913 ms/step , 6891.72 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-05 18:04:27 | Epoch: 1 | Step: 174800 | Dataset: 0-333261 | Loss: 0.575 | 911 ms/step , 6900.38 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 18:04:29 | Validation | Step: 174800 | Val_loss: 0.688 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:04:38 | Epoch: 1 | Step: 174810 | Dataset: 0-333581 | Loss: 0.624 | 912 ms/step , 6897.66 GFLOP/s , 15280.2 tokens/s INFO:__main__:2024-11-05 18:04:47 
| Epoch: 1 | Step: 174820 | Dataset: 0-333901 | Loss: 0.664 | 914 ms/step , 6880.53 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 18:04:56 | Epoch: 1 | Step: 174830 | Dataset: 0-334221 | Loss: 0.646 | 912 ms/step , 6897.27 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-05 18:05:05 | Epoch: 1 | Step: 174840 | Dataset: 0-334541 | Loss: 0.584 | 912 ms/step , 6898.37 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 18:05:15 | Epoch: 1 | Step: 174850 | Dataset: 0-334861 | Loss: 0.622 | 913 ms/step , 6887.96 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 18:05:24 | Epoch: 1 | Step: 174860 | Dataset: 0-335181 | Loss: 0.554 | 913 ms/step , 6891.69 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 18:05:33 | Epoch: 1 | Step: 174870 | Dataset: 0-335501 | Loss: 0.674 | 913 ms/step , 6891.16 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 18:05:42 | Epoch: 1 | Step: 174880 | Dataset: 0-335821 | Loss: 0.596 | 914 ms/step , 6878.44 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 18:05:51 | Epoch: 1 | Step: 174890 | Dataset: 0-336141 | Loss: 0.589 | 912 ms/step , 6896.43 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-05 18:06:00 | Epoch: 1 | Step: 174900 | Dataset: 0-336461 | Loss: 0.613 | 912 ms/step , 6893.11 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 18:06:02 | Validation | Step: 174900 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:06:11 | Epoch: 1 | Step: 174910 | Dataset: 0-336781 | Loss: 0.574 | 913 ms/step , 6891.06 GFLOP/s , 15297.2 tokens/s INFO:__main__:2024-11-05 18:06:20 | Epoch: 1 | Step: 174920 | Dataset: 0-337101 | Loss: 0.601 | 913 ms/step , 6891.34 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 18:06:29 | Epoch: 1 | Step: 174930 | Dataset: 0-337421 | Loss: 0.556 | 912 ms/step , 6895.89 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 18:06:38 | Epoch: 1 | Step: 174940 | Dataset: 0-337741 | Loss: 0.541 | 913 ms/step , 6891.75 GFLOP/s , 17946.7 tokens/s INFO:__main__:2024-11-05 18:06:48 | Epoch: 1 
| Step: 174950 | Dataset: 0-338061 | Loss: 0.615 | 913 ms/step , 6891.82 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 18:06:57 | Epoch: 1 | Step: 174960 | Dataset: 0-338381 | Loss: 0.542 | 913 ms/step , 6890.82 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 18:07:06 | Epoch: 1 | Step: 174970 | Dataset: 0-338701 | Loss: 0.516 | 912 ms/step , 6896.97 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-05 18:07:15 | Epoch: 1 | Step: 174980 | Dataset: 0-339021 | Loss: 0.573 | 912 ms/step , 6899.71 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 18:07:24 | Epoch: 1 | Step: 174990 | Dataset: 0-339341 | Loss: 0.578 | 913 ms/step , 6890.04 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 18:07:33 | Epoch: 1 | Step: 175000 | Dataset: 0-339661 | Loss: 0.595 | 913 ms/step , 6887.69 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 18:07:35 | Validation | Step: 175000 | Val_loss: 0.811 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:07:35 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_180735_step_175000.pt` INFO:__main__:2024-11-05 18:07:45 | Epoch: 1 | Step: 175010 | Dataset: 0-339981 | Loss: 0.591 | 912 ms/step , 6897.06 GFLOP/s , 13848.5 tokens/s INFO:__main__:2024-11-05 18:07:54 | Epoch: 1 | Step: 175020 | Dataset: 0-340301 | Loss: 0.675 | 913 ms/step , 6890.84 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 18:08:03 | Epoch: 1 | Step: 175030 | Dataset: 0-340621 | Loss: 0.585 | 913 ms/step , 6888.03 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 18:08:12 | Epoch: 1 | Step: 175040 | Dataset: 0-340941 | Loss: 0.586 | 914 ms/step , 6882.29 GFLOP/s , 17887.6 tokens/s INFO:__main__:2024-11-05 18:08:22 | Epoch: 1 | Step: 175050 | Dataset: 0-341261 | Loss: 0.622 | 913 ms/step , 6886.17 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-05 18:08:31 | Epoch: 1 | Step: 175060 | Dataset: 0-341581 | Loss: 0.650 | 913 ms/step , 6892.55 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 18:08:40 | Epoch: 1 | Step: 
175070 | Dataset: 0-341901 | Loss: 0.524 | 912 ms/step , 6894.15 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 18:08:49 | Epoch: 1 | Step: 175080 | Dataset: 0-342221 | Loss: 0.635 | 913 ms/step , 6887.48 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 18:08:58 | Epoch: 1 | Step: 175090 | Dataset: 0-342541 | Loss: 0.579 | 913 ms/step , 6890.60 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 18:09:07 | Epoch: 1 | Step: 175100 | Dataset: 0-342861 | Loss: 0.667 | 915 ms/step , 6873.79 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 18:09:09 | Validation | Step: 175100 | Val_loss: 0.787 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:09:18 | Epoch: 1 | Step: 175110 | Dataset: 0-343181 | Loss: 0.612 | 912 ms/step , 6893.09 GFLOP/s , 15280.5 tokens/s INFO:__main__:2024-11-05 18:09:27 | Epoch: 1 | Step: 175120 | Dataset: 0-343501 | Loss: 0.614 | 912 ms/step , 6894.40 GFLOP/s , 17946.9 tokens/s INFO:__main__:2024-11-05 18:09:36 | Epoch: 1 | Step: 175130 | Dataset: 0-343821 | Loss: 0.543 | 911 ms/step , 6903.10 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 18:09:45 | Epoch: 1 | Step: 175140 | Dataset: 0-344141 | Loss: 0.521 | 913 ms/step , 6889.94 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 18:09:54 | Epoch: 1 | Step: 175150 | Dataset: 0-344461 | Loss: 0.690 | 913 ms/step , 6889.48 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 18:10:04 | Epoch: 1 | Step: 175160 | Dataset: 0-344781 | Loss: 0.717 | 913 ms/step , 6889.77 GFLOP/s , 17948.5 tokens/s INFO:__main__:2024-11-05 18:10:13 | Epoch: 1 | Step: 175170 | Dataset: 0-345101 | Loss: 0.629 | 913 ms/step , 6889.11 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 18:10:22 | Epoch: 1 | Step: 175180 | Dataset: 0-345421 | Loss: 0.679 | 913 ms/step , 6887.39 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 18:10:31 | Epoch: 1 | Step: 175190 | Dataset: 0-345741 | Loss: 0.616 | 912 ms/step , 6894.17 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-05 18:10:40 | Epoch: 1 | Step: 175200 | 
Dataset: 0-346061 | Loss: 0.655 | 913 ms/step , 6888.16 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 18:10:42 | Validation | Step: 175200 | Val_loss: 0.794 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:10:51 | Epoch: 1 | Step: 175210 | Dataset: 0-346381 | Loss: 0.610 | 914 ms/step , 6882.93 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-05 18:11:00 | Epoch: 1 | Step: 175220 | Dataset: 0-346701 | Loss: 0.551 | 913 ms/step , 6890.74 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 18:11:09 | Epoch: 1 | Step: 175230 | Dataset: 0-347021 | Loss: 0.698 | 912 ms/step , 6897.12 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-05 18:11:18 | Epoch: 1 | Step: 175240 | Dataset: 0-347341 | Loss: 0.595 | 913 ms/step , 6886.23 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 18:11:27 | Epoch: 1 | Step: 175250 | Dataset: 0-347661 | Loss: 0.498 | 913 ms/step , 6887.79 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-05 18:11:37 | Epoch: 1 | Step: 175260 | Dataset: 0-347981 | Loss: 0.591 | 913 ms/step , 6891.02 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-05 18:11:46 | Epoch: 1 | Step: 175270 | Dataset: 0-348301 | Loss: 0.616 | 913 ms/step , 6887.67 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 18:11:55 | Epoch: 1 | Step: 175280 | Dataset: 0-348621 | Loss: 0.617 | 912 ms/step , 6894.31 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 18:12:04 | Epoch: 1 | Step: 175290 | Dataset: 0-348941 | Loss: 0.613 | 913 ms/step , 6891.58 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 18:12:13 | Epoch: 1 | Step: 175300 | Dataset: 0-349261 | Loss: 0.586 | 913 ms/step , 6885.29 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 18:12:15 | Validation | Step: 175300 | Val_loss: 0.742 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:12:24 | Epoch: 1 | Step: 175310 | Dataset: 0-349581 | Loss: 0.591 | 912 ms/step , 6894.70 GFLOP/s , 15279.9 tokens/s INFO:__main__:2024-11-05 18:12:33 | Epoch: 1 | Step: 175320 | Dataset: 0-349901 | Loss: 0.601 | 912 ms/step , 
6893.87 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 18:12:42 | Epoch: 1 | Step: 175330 | Dataset: 0-350221 | Loss: 0.635 | 914 ms/step , 6879.67 GFLOP/s , 17945.3 tokens/s INFO:__main__:2024-11-05 18:12:51 | Epoch: 1 | Step: 175340 | Dataset: 0-350541 | Loss: 0.610 | 912 ms/step , 6892.78 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 18:13:00 | Epoch: 1 | Step: 175350 | Dataset: 0-350861 | Loss: 0.628 | 913 ms/step , 6887.98 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 18:13:09 | Epoch: 1 | Step: 175360 | Dataset: 0-351181 | Loss: 0.586 | 913 ms/step , 6889.56 GFLOP/s , 17944.0 tokens/s INFO:__main__:2024-11-05 18:13:19 | Epoch: 1 | Step: 175370 | Dataset: 0-351501 | Loss: 0.544 | 911 ms/step , 6903.59 GFLOP/s , 17949.2 tokens/s INFO:__main__:2024-11-05 18:13:28 | Epoch: 1 | Step: 175380 | Dataset: 0-351821 | Loss: 0.603 | 912 ms/step , 6893.70 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 18:13:37 | Epoch: 1 | Step: 175390 | Dataset: 0-352141 | Loss: 0.667 | 912 ms/step , 6895.90 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-05 18:13:46 | Epoch: 1 | Step: 175400 | Dataset: 0-352461 | Loss: 0.596 | 913 ms/step , 6889.88 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 18:13:48 | Validation | Step: 175400 | Val_loss: 0.709 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:13:57 | Epoch: 1 | Step: 175410 | Dataset: 0-352781 | Loss: 0.654 | 913 ms/step , 6886.26 GFLOP/s , 15279.4 tokens/s INFO:__main__:2024-11-05 18:14:06 | Epoch: 1 | Step: 175420 | Dataset: 0-353101 | Loss: 0.604 | 912 ms/step , 6898.72 GFLOP/s , 17947.0 tokens/s INFO:__main__:2024-11-05 18:14:15 | Epoch: 1 | Step: 175430 | Dataset: 0-353421 | Loss: 0.572 | 912 ms/step , 6895.24 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 18:14:24 | Epoch: 1 | Step: 175440 | Dataset: 0-353741 | Loss: 0.602 | 913 ms/step , 6885.90 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 18:14:33 | Epoch: 1 | Step: 175450 | Dataset: 0-354061 | Loss: 0.733 | 913 ms/step , 6889.03 
GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 18:14:42 | Epoch: 1 | Step: 175460 | Dataset: 0-354381 | Loss: 0.594 | 913 ms/step , 6888.65 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-05 18:14:51 | Epoch: 1 | Step: 175470 | Dataset: 0-354701 | Loss: 0.500 | 913 ms/step , 6885.25 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 18:15:01 | Epoch: 1 | Step: 175480 | Dataset: 0-355021 | Loss: 0.557 | 913 ms/step , 6886.71 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 18:15:10 | Epoch: 1 | Step: 175490 | Dataset: 0-355341 | Loss: 0.604 | 914 ms/step , 6880.58 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-05 18:15:19 | Epoch: 1 | Step: 175500 | Dataset: 0-355661 | Loss: 0.562 | 912 ms/step , 6893.29 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 18:15:20 | Validation | Step: 175500 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:15:30 | Epoch: 1 | Step: 175510 | Dataset: 0-355981 | Loss: 0.654 | 912 ms/step , 6896.06 GFLOP/s , 15282.9 tokens/s INFO:__main__:2024-11-05 18:15:39 | Epoch: 1 | Step: 175520 | Dataset: 0-356301 | Loss: 0.550 | 912 ms/step , 6898.69 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-05 18:15:48 | Epoch: 1 | Step: 175530 | Dataset: 0-356621 | Loss: 0.578 | 913 ms/step , 6889.91 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 18:15:57 | Epoch: 1 | Step: 175540 | Dataset: 0-356941 | Loss: 0.523 | 912 ms/step , 6898.09 GFLOP/s , 17943.9 tokens/s INFO:__main__:2024-11-05 18:16:06 | Epoch: 1 | Step: 175550 | Dataset: 0-357261 | Loss: 0.579 | 912 ms/step , 6899.51 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-05 18:16:15 | Epoch: 1 | Step: 175560 | Dataset: 0-357581 | Loss: 0.648 | 913 ms/step , 6889.52 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 18:16:24 | Epoch: 1 | Step: 175570 | Dataset: 0-357901 | Loss: 0.619 | 912 ms/step , 6897.58 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 18:16:34 | Epoch: 1 | Step: 175580 | Dataset: 0-358221 | Loss: 0.776 | 916 ms/step , 6867.72 GFLOP/s , 
17922.4 tokens/s INFO:__main__:2024-11-05 18:16:43 | Epoch: 1 | Step: 175590 | Dataset: 0-358541 | Loss: 0.337 | 912 ms/step , 6896.09 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 18:16:52 | Epoch: 1 | Step: 175600 | Dataset: 0-358861 | Loss: 0.492 | 911 ms/step , 6902.03 GFLOP/s , 17958.3 tokens/s INFO:__main__:2024-11-05 18:16:53 | Validation | Step: 175600 | Val_loss: 0.722 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:17:02 | Epoch: 1 | Step: 175610 | Dataset: 0-359181 | Loss: 0.345 | 913 ms/step , 6885.27 GFLOP/s , 15289.4 tokens/s INFO:__main__:2024-11-05 18:17:12 | Epoch: 1 | Step: 175620 | Dataset: 0-359501 | Loss: 0.398 | 912 ms/step , 6897.88 GFLOP/s , 17947.3 tokens/s INFO:__main__:2024-11-05 18:17:21 | Epoch: 1 | Step: 175630 | Dataset: 0-359821 | Loss: 0.451 | 912 ms/step , 6892.86 GFLOP/s , 17949.5 tokens/s INFO:__main__:2024-11-05 18:17:30 | Epoch: 1 | Step: 175640 | Dataset: 0-360141 | Loss: 0.441 | 913 ms/step , 6890.74 GFLOP/s , 17952.6 tokens/s INFO:__main__:2024-11-05 18:17:39 | Epoch: 1 | Step: 175650 | Dataset: 0-360461 | Loss: 0.303 | 912 ms/step , 6893.39 GFLOP/s , 17954.3 tokens/s INFO:__main__:2024-11-05 18:17:48 | Epoch: 1 | Step: 175660 | Dataset: 0-360781 | Loss: 0.262 | 912 ms/step , 6896.83 GFLOP/s , 17949.4 tokens/s INFO:__main__:2024-11-05 18:17:57 | Epoch: 1 | Step: 175670 | Dataset: 0-361101 | Loss: 0.464 | 912 ms/step , 6896.28 GFLOP/s , 17955.2 tokens/s INFO:__main__:2024-11-05 18:18:06 | Epoch: 1 | Step: 175680 | Dataset: 0-361421 | Loss: 0.394 | 912 ms/step , 6899.34 GFLOP/s , 17949.3 tokens/s INFO:__main__:2024-11-05 18:18:15 | Epoch: 1 | Step: 175690 | Dataset: 0-361741 | Loss: 0.428 | 913 ms/step , 6888.29 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-05 18:18:25 | Epoch: 1 | Step: 175700 | Dataset: 0-362061 | Loss: 0.487 | 912 ms/step , 6896.21 GFLOP/s , 17951.6 tokens/s INFO:__main__:2024-11-05 18:18:26 | Validation | Step: 175700 | Val_loss: 0.778 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:18:35 
| Epoch: 1 | Step: 175710 | Dataset: 0-362381 | Loss: 0.308 | 912 ms/step , 6898.06 GFLOP/s , 15290.1 tokens/s INFO:__main__:2024-11-05 18:18:44 | Epoch: 1 | Step: 175720 | Dataset: 0-362701 | Loss: 0.360 | 912 ms/step , 6899.84 GFLOP/s , 17956.5 tokens/s INFO:__main__:2024-11-05 18:18:54 | Epoch: 1 | Step: 175730 | Dataset: 0-363021 | Loss: 0.328 | 912 ms/step , 6898.06 GFLOP/s , 17950.7 tokens/s INFO:__main__:2024-11-05 18:19:03 | Epoch: 1 | Step: 175740 | Dataset: 0-363341 | Loss: 0.397 | 912 ms/step , 6894.39 GFLOP/s , 17956.5 tokens/s INFO:__main__:2024-11-05 18:19:12 | Epoch: 1 | Step: 175750 | Dataset: 0-363661 | Loss: 0.468 | 912 ms/step , 6898.14 GFLOP/s , 17958.1 tokens/s INFO:__main__:2024-11-05 18:19:21 | Epoch: 1 | Step: 175760 | Dataset: 0-363981 | Loss: 0.516 | 911 ms/step , 6900.62 GFLOP/s , 17959.6 tokens/s INFO:__main__:2024-11-05 18:19:30 | Epoch: 1 | Step: 175770 | Dataset: 0-364301 | Loss: 0.373 | 912 ms/step , 6899.27 GFLOP/s , 17959.9 tokens/s INFO:__main__:2024-11-05 18:19:39 | Epoch: 1 | Step: 175780 | Dataset: 0-364621 | Loss: 0.440 | 911 ms/step , 6902.24 GFLOP/s , 17953.2 tokens/s INFO:__main__:2024-11-05 18:19:48 | Epoch: 1 | Step: 175790 | Dataset: 0-364941 | Loss: 0.332 | 912 ms/step , 6898.88 GFLOP/s , 17951.1 tokens/s INFO:__main__:2024-11-05 18:19:57 | Epoch: 1 | Step: 175800 | Dataset: 0-365261 | Loss: 0.369 | 914 ms/step , 6884.85 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 18:19:59 | Validation | Step: 175800 | Val_loss: 0.765 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:20:08 | Epoch: 1 | Step: 175810 | Dataset: 0-365581 | Loss: 0.359 | 912 ms/step , 6899.80 GFLOP/s , 15291.3 tokens/s INFO:__main__:2024-11-05 18:20:17 | Epoch: 1 | Step: 175820 | Dataset: 0-365901 | Loss: 0.743 | 914 ms/step , 6884.83 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-05 18:20:26 | Epoch: 1 | Step: 175830 | Dataset: 0-366221 | Loss: 0.793 | 915 ms/step , 6875.98 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 18:20:36 | Epoch: 1 
| Step: 175840 | Dataset: 0-366541 | Loss: 0.835 | 914 ms/step , 6879.48 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 18:20:45 | Epoch: 1 | Step: 175850 | Dataset: 0-366861 | Loss: 0.856 | 912 ms/step , 6895.51 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 18:20:54 | Epoch: 1 | Step: 175860 | Dataset: 0-367181 | Loss: 0.848 | 915 ms/step , 6875.02 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 18:21:03 | Epoch: 1 | Step: 175870 | Dataset: 0-367501 | Loss: 0.725 | 914 ms/step , 6878.30 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 18:21:12 | Epoch: 1 | Step: 175880 | Dataset: 0-367821 | Loss: 0.783 | 913 ms/step , 6891.76 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 18:21:21 | Epoch: 1 | Step: 175890 | Dataset: 0-368141 | Loss: 0.854 | 913 ms/step , 6889.82 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 18:21:30 | Epoch: 1 | Step: 175900 | Dataset: 0-368461 | Loss: 0.668 | 913 ms/step , 6886.40 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 18:21:32 | Validation | Step: 175900 | Val_loss: 0.754 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:21:41 | Epoch: 1 | Step: 175910 | Dataset: 0-368781 | Loss: 0.879 | 914 ms/step , 6884.51 GFLOP/s , 15272.5 tokens/s INFO:__main__:2024-11-05 18:21:50 | Epoch: 1 | Step: 175920 | Dataset: 0-369101 | Loss: 0.640 | 912 ms/step , 6897.73 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 18:21:59 | Epoch: 1 | Step: 175930 | Dataset: 0-369421 | Loss: 0.764 | 913 ms/step , 6890.47 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 18:22:09 | Epoch: 1 | Step: 175940 | Dataset: 0-369741 | Loss: 0.761 | 913 ms/step , 6885.93 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 18:22:18 | Epoch: 1 | Step: 175950 | Dataset: 0-370061 | Loss: 0.728 | 912 ms/step , 6893.06 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 18:22:27 | Epoch: 1 | Step: 175960 | Dataset: 0-370381 | Loss: 0.660 | 914 ms/step , 6882.02 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 18:22:36 | Epoch: 1 | Step: 
175970 | Dataset: 0-370701 | Loss: 0.786 | 913 ms/step , 6892.52 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 18:22:45 | Epoch: 1 | Step: 175980 | Dataset: 0-371021 | Loss: 0.896 | 915 ms/step , 6876.22 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 18:22:54 | Epoch: 1 | Step: 175990 | Dataset: 0-371341 | Loss: 0.729 | 913 ms/step , 6890.06 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 18:23:03 | Epoch: 1 | Step: 176000 | Dataset: 0-371661 | Loss: 0.768 | 912 ms/step , 6896.79 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 18:23:05 | Validation | Step: 176000 | Val_loss: 0.769 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:23:05 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_182305_step_176000.pt` INFO:__main__:2024-11-05 18:23:15 | Epoch: 1 | Step: 176010 | Dataset: 0-371981 | Loss: 0.784 | 913 ms/step , 6891.96 GFLOP/s , 13777.2 tokens/s INFO:__main__:2024-11-05 18:23:24 | Epoch: 1 | Step: 176020 | Dataset: 0-372301 | Loss: 0.700 | 915 ms/step , 6870.05 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-05 18:23:34 | Epoch: 1 | Step: 176030 | Dataset: 0-372621 | Loss: 0.581 | 915 ms/step , 6870.26 GFLOP/s , 17905.7 tokens/s INFO:__main__:2024-11-05 18:23:43 | Epoch: 1 | Step: 176040 | Dataset: 0-372941 | Loss: 0.801 | 915 ms/step , 6872.20 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 18:23:52 | Epoch: 1 | Step: 176050 | Dataset: 0-373261 | Loss: 0.782 | 915 ms/step , 6873.33 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-05 18:24:01 | Epoch: 1 | Step: 176060 | Dataset: 0-373581 | Loss: 0.729 | 915 ms/step , 6873.01 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 18:24:10 | Epoch: 1 | Step: 176070 | Dataset: 0-373901 | Loss: 0.740 | 913 ms/step , 6885.81 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 18:24:19 | Epoch: 1 | Step: 176080 | Dataset: 0-374221 | Loss: 0.697 | 914 ms/step , 6884.29 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 18:24:28 | Epoch: 1 | Step: 176090 | 
Dataset: 0-374541 | Loss: 0.681 | 913 ms/step , 6889.13 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 18:24:38 | Epoch: 1 | Step: 176100 | Dataset: 0-374861 | Loss: 0.775 | 913 ms/step , 6891.19 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 18:24:39 | Validation | Step: 176100 | Val_loss: 0.736 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:24:48 | Epoch: 1 | Step: 176110 | Dataset: 0-375181 | Loss: 0.749 | 914 ms/step , 6881.58 GFLOP/s , 15254.8 tokens/s INFO:__main__:2024-11-05 18:24:57 | Epoch: 1 | Step: 176120 | Dataset: 0-375501 | Loss: 0.787 | 913 ms/step , 6891.70 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 18:25:07 | Epoch: 1 | Step: 176130 | Dataset: 0-375821 | Loss: 0.729 | 913 ms/step , 6887.07 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 18:25:16 | Epoch: 1 | Step: 176140 | Dataset: 0-376141 | Loss: 0.691 | 914 ms/step , 6882.69 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 18:25:25 | Epoch: 1 | Step: 176150 | Dataset: 0-376461 | Loss: 0.701 | 914 ms/step , 6879.68 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 18:25:34 | Epoch: 1 | Step: 176160 | Dataset: 0-376781 | Loss: 0.758 | 914 ms/step , 6884.74 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 18:25:43 | Epoch: 1 | Step: 176170 | Dataset: 0-377101 | Loss: 0.675 | 914 ms/step , 6880.46 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 18:25:52 | Epoch: 1 | Step: 176180 | Dataset: 0-377421 | Loss: 0.798 | 914 ms/step , 6882.78 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 18:26:01 | Epoch: 1 | Step: 176190 | Dataset: 0-377741 | Loss: 0.749 | 915 ms/step , 6871.74 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 18:26:11 | Epoch: 1 | Step: 176200 | Dataset: 0-378061 | Loss: 0.631 | 913 ms/step , 6888.87 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 18:26:12 | Validation | Step: 176200 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:26:21 | Epoch: 1 | Step: 176210 | Dataset: 0-378381 | Loss: 0.640 | 915 ms/step , 
6875.33 GFLOP/s , 15254.8 tokens/s INFO:__main__:2024-11-05 18:26:30 | Epoch: 1 | Step: 176220 | Dataset: 0-378701 | Loss: 0.669 | 914 ms/step , 6877.72 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 18:26:40 | Epoch: 1 | Step: 176230 | Dataset: 0-379021 | Loss: 0.742 | 914 ms/step , 6883.46 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 18:26:49 | Epoch: 1 | Step: 176240 | Dataset: 0-379341 | Loss: 0.745 | 913 ms/step , 6890.86 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 18:26:58 | Epoch: 1 | Step: 176250 | Dataset: 0-379661 | Loss: 0.705 | 914 ms/step , 6883.73 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 18:27:07 | Epoch: 1 | Step: 176260 | Dataset: 0-379981 | Loss: 0.659 | 913 ms/step , 6885.08 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 18:27:16 | Epoch: 1 | Step: 176270 | Dataset: 0-380301 | Loss: 0.717 | 916 ms/step , 6865.83 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 18:27:25 | Epoch: 1 | Step: 176280 | Dataset: 0-380621 | Loss: 0.776 | 914 ms/step , 6877.86 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 18:27:34 | Epoch: 1 | Step: 176290 | Dataset: 0-380941 | Loss: 0.700 | 913 ms/step , 6886.89 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 18:27:44 | Epoch: 1 | Step: 176300 | Dataset: 0-381261 | Loss: 0.710 | 913 ms/step , 6888.68 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 18:27:45 | Validation | Step: 176300 | Val_loss: 0.702 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:27:54 | Epoch: 1 | Step: 176310 | Dataset: 0-381581 | Loss: 0.784 | 914 ms/step , 6881.91 GFLOP/s , 15267.6 tokens/s INFO:__main__:2024-11-05 18:28:03 | Epoch: 1 | Step: 176320 | Dataset: 0-381901 | Loss: 0.676 | 914 ms/step , 6880.51 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 18:28:13 | Epoch: 1 | Step: 176330 | Dataset: 0-382221 | Loss: 0.703 | 913 ms/step , 6888.37 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 18:28:22 | Epoch: 1 | Step: 176340 | Dataset: 0-382541 | Loss: 0.675 | 913 ms/step , 6886.62 
GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 18:28:31 | Epoch: 1 | Step: 176350 | Dataset: 0-382861 | Loss: 0.762 | 914 ms/step , 6882.40 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 18:28:40 | Epoch: 1 | Step: 176360 | Dataset: 0-383181 | Loss: 0.713 | 914 ms/step , 6883.50 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 18:28:49 | Epoch: 1 | Step: 176370 | Dataset: 0-383501 | Loss: 0.745 | 914 ms/step , 6878.69 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 18:28:58 | Epoch: 1 | Step: 176380 | Dataset: 0-383821 | Loss: 0.684 | 914 ms/step , 6878.57 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 18:29:07 | Epoch: 1 | Step: 176390 | Dataset: 0-384141 | Loss: 0.671 | 914 ms/step , 6884.75 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 18:29:17 | Epoch: 1 | Step: 176400 | Dataset: 0-384461 | Loss: 0.797 | 913 ms/step , 6889.52 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 18:29:18 | Validation | Step: 176400 | Val_loss: 0.708 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:29:27 | Epoch: 1 | Step: 176410 | Dataset: 0-384781 | Loss: 0.694 | 913 ms/step , 6889.01 GFLOP/s , 15255.3 tokens/s INFO:__main__:2024-11-05 18:29:36 | Epoch: 1 | Step: 176420 | Dataset: 0-385101 | Loss: 0.763 | 914 ms/step , 6880.98 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 18:29:46 | Epoch: 1 | Step: 176430 | Dataset: 0-385421 | Loss: 0.639 | 914 ms/step , 6882.35 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 18:29:55 | Epoch: 1 | Step: 176440 | Dataset: 0-385741 | Loss: 0.671 | 913 ms/step , 6887.77 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 18:30:04 | Epoch: 1 | Step: 176450 | Dataset: 0-386061 | Loss: 0.668 | 913 ms/step , 6890.19 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 18:30:13 | Epoch: 1 | Step: 176460 | Dataset: 0-386381 | Loss: 0.638 | 913 ms/step , 6892.25 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 18:30:22 | Epoch: 1 | Step: 176470 | Dataset: 0-386701 | Loss: 0.805 | 916 ms/step , 6868.52 GFLOP/s , 
17917.5 tokens/s INFO:__main__:2024-11-05 18:30:31 | Epoch: 1 | Step: 176480 | Dataset: 0-387021 | Loss: 0.712 | 914 ms/step , 6883.69 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 18:30:40 | Epoch: 1 | Step: 176490 | Dataset: 0-387341 | Loss: 0.821 | 915 ms/step , 6874.60 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 18:30:50 | Epoch: 1 | Step: 176500 | Dataset: 0-387661 | Loss: 0.557 | 912 ms/step , 6896.41 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 18:30:51 | Validation | Step: 176500 | Val_loss: 0.734 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:31:00 | Epoch: 1 | Step: 176510 | Dataset: 0-387981 | Loss: 0.766 | 913 ms/step , 6890.70 GFLOP/s , 15266.2 tokens/s INFO:__main__:2024-11-05 18:31:10 | Epoch: 1 | Step: 176520 | Dataset: 0-388301 | Loss: 0.672 | 913 ms/step , 6889.77 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 18:31:19 | Epoch: 1 | Step: 176530 | Dataset: 0-388621 | Loss: 0.798 | 913 ms/step , 6885.22 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 18:31:28 | Epoch: 1 | Step: 176540 | Dataset: 0-388941 | Loss: 0.781 | 912 ms/step , 6894.86 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 18:31:37 | Epoch: 1 | Step: 176550 | Dataset: 0-389261 | Loss: 0.791 | 914 ms/step , 6880.54 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 18:31:46 | Epoch: 1 | Step: 176560 | Dataset: 0-389581 | Loss: 0.713 | 913 ms/step , 6886.57 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 18:31:55 | Epoch: 1 | Step: 176570 | Dataset: 0-389901 | Loss: 0.857 | 913 ms/step , 6885.70 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 18:32:04 | Epoch: 1 | Step: 176580 | Dataset: 0-390221 | Loss: 0.669 | 913 ms/step , 6890.57 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 18:32:13 | Epoch: 1 | Step: 176590 | Dataset: 0-390541 | Loss: 0.777 | 912 ms/step , 6898.73 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 18:32:23 | Epoch: 1 | Step: 176600 | Dataset: 0-390861 | Loss: 0.686 | 914 ms/step , 6880.56 GFLOP/s , 17932.9 
tokens/s INFO:__main__:2024-11-05 18:32:24 | Validation | Step: 176600 | Val_loss: 0.715 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:32:33 | Epoch: 1 | Step: 176610 | Dataset: 0-391181 | Loss: 0.747 | 914 ms/step , 6881.54 GFLOP/s , 15266.0 tokens/s INFO:__main__:2024-11-05 18:32:42 | Epoch: 1 | Step: 176620 | Dataset: 0-391501 | Loss: 0.747 | 914 ms/step , 6880.42 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 18:32:52 | Epoch: 1 | Step: 176630 | Dataset: 0-391821 | Loss: 0.707 | 914 ms/step , 6880.54 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 18:33:01 | Epoch: 1 | Step: 176640 | Dataset: 0-392141 | Loss: 0.719 | 914 ms/step , 6880.82 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 18:33:10 | Epoch: 1 | Step: 176650 | Dataset: 0-392461 | Loss: 0.713 | 915 ms/step , 6872.21 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 18:33:19 | Epoch: 1 | Step: 176660 | Dataset: 0-392781 | Loss: 0.805 | 913 ms/step , 6885.34 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 18:33:28 | Epoch: 1 | Step: 176670 | Dataset: 0-393101 | Loss: 0.842 | 913 ms/step , 6886.12 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 18:33:37 | Epoch: 1 | Step: 176680 | Dataset: 0-393421 | Loss: 0.718 | 913 ms/step , 6886.91 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 18:33:46 | Epoch: 1 | Step: 176690 | Dataset: 0-393741 | Loss: 0.706 | 913 ms/step , 6890.95 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 18:33:56 | Epoch: 1 | Step: 176700 | Dataset: 0-394061 | Loss: 0.728 | 914 ms/step , 6882.33 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 18:33:57 | Validation | Step: 176700 | Val_loss: 0.712 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:34:06 | Epoch: 1 | Step: 176710 | Dataset: 0-394381 | Loss: 0.655 | 914 ms/step , 6877.53 GFLOP/s , 15267.5 tokens/s INFO:__main__:2024-11-05 18:34:16 | Epoch: 1 | Step: 176720 | Dataset: 0-394701 | Loss: 0.794 | 915 ms/step , 6877.31 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 18:34:25 | Epoch: 
1 | Step: 176730 | Dataset: 0-395021 | Loss: 0.677 | 912 ms/step , 6893.60 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 18:34:34 | Epoch: 1 | Step: 176740 | Dataset: 0-395341 | Loss: 0.732 | 914 ms/step , 6879.44 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 18:34:43 | Epoch: 1 | Step: 176750 | Dataset: 0-395661 | Loss: 0.602 | 912 ms/step , 6893.87 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 18:34:52 | Epoch: 1 | Step: 176760 | Dataset: 0-395981 | Loss: 0.738 | 915 ms/step , 6873.98 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 18:35:01 | Epoch: 1 | Step: 176770 | Dataset: 0-396301 | Loss: 0.693 | 913 ms/step , 6886.77 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 18:35:10 | Epoch: 1 | Step: 176780 | Dataset: 0-396621 | Loss: 0.742 | 916 ms/step , 6865.51 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 18:35:20 | Epoch: 1 | Step: 176790 | Dataset: 0-396941 | Loss: 0.654 | 914 ms/step , 6884.02 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 18:35:29 | Epoch: 1 | Step: 176800 | Dataset: 0-397261 | Loss: 0.743 | 912 ms/step , 6893.61 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 18:35:30 | Validation | Step: 176800 | Val_loss: 0.703 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:35:39 | Epoch: 1 | Step: 176810 | Dataset: 0-397581 | Loss: 0.774 | 914 ms/step , 6882.85 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-05 18:35:49 | Epoch: 1 | Step: 176820 | Dataset: 0-397901 | Loss: 0.670 | 913 ms/step , 6887.50 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 18:35:58 | Epoch: 1 | Step: 176830 | Dataset: 0-398221 | Loss: 0.679 | 913 ms/step , 6888.21 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 18:36:07 | Epoch: 1 | Step: 176840 | Dataset: 0-398541 | Loss: 0.681 | 913 ms/step , 6886.09 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 18:36:16 | Epoch: 1 | Step: 176850 | Dataset: 0-398861 | Loss: 0.800 | 912 ms/step , 6895.40 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 18:36:25 | Epoch: 1 | Step: 
176860 | Dataset: 0-399181 | Loss: 0.772 | 914 ms/step , 6880.39 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 18:36:34 | Epoch: 1 | Step: 176870 | Dataset: 0-399501 | Loss: 0.752 | 915 ms/step , 6874.11 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 18:36:43 | Epoch: 1 | Step: 176880 | Dataset: 0-399821 | Loss: 0.693 | 915 ms/step , 6873.15 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 18:36:53 | Epoch: 1 | Step: 176890 | Dataset: 0-400141 | Loss: 0.820 | 914 ms/step , 6877.75 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 18:37:02 | Epoch: 1 | Step: 176900 | Dataset: 0-400461 | Loss: 0.766 | 914 ms/step , 6880.47 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 18:37:03 | Validation | Step: 176900 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:37:12 | Epoch: 1 | Step: 176910 | Dataset: 0-400781 | Loss: 0.695 | 913 ms/step , 6887.32 GFLOP/s , 15265.4 tokens/s INFO:__main__:2024-11-05 18:37:22 | Epoch: 1 | Step: 176920 | Dataset: 0-401101 | Loss: 0.707 | 914 ms/step , 6878.41 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 18:37:31 | Epoch: 1 | Step: 176930 | Dataset: 0-401421 | Loss: 0.665 | 913 ms/step , 6885.63 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 18:37:40 | Epoch: 1 | Step: 176940 | Dataset: 0-401741 | Loss: 0.754 | 915 ms/step , 6872.33 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 18:37:49 | Epoch: 1 | Step: 176950 | Dataset: 0-402061 | Loss: 0.662 | 914 ms/step , 6882.43 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 18:37:58 | Epoch: 1 | Step: 176960 | Dataset: 0-402381 | Loss: 0.758 | 913 ms/step , 6890.42 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 18:38:07 | Epoch: 1 | Step: 176970 | Dataset: 0-402701 | Loss: 0.701 | 913 ms/step , 6887.25 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 18:38:16 | Epoch: 1 | Step: 176980 | Dataset: 0-403021 | Loss: 0.675 | 914 ms/step , 6884.55 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 18:38:26 | Epoch: 1 | Step: 176990 | 
Dataset: 0-403341 | Loss: 0.693 | 915 ms/step , 6875.71 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 18:38:35 | Epoch: 1 | Step: 177000 | Dataset: 0-403661 | Loss: 0.761 | 914 ms/step , 6878.85 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 18:38:36 | Validation | Step: 177000 | Val_loss: 0.667 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:38:36 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_183836_step_177000.pt` INFO:__main__:2024-11-05 18:38:47 | Epoch: 1 | Step: 177010 | Dataset: 0-403981 | Loss: 0.656 | 913 ms/step , 6890.66 GFLOP/s , 13822.3 tokens/s INFO:__main__:2024-11-05 18:38:56 | Epoch: 1 | Step: 177020 | Dataset: 0-404301 | Loss: 0.760 | 915 ms/step , 6871.93 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-05 18:39:05 | Epoch: 1 | Step: 177030 | Dataset: 0-404621 | Loss: 0.686 | 914 ms/step , 6877.72 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 18:39:14 | Epoch: 1 | Step: 177040 | Dataset: 0-404941 | Loss: 0.797 | 914 ms/step , 6880.83 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 18:39:23 | Epoch: 1 | Step: 177050 | Dataset: 0-405261 | Loss: 0.758 | 913 ms/step , 6886.10 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 18:39:32 | Epoch: 1 | Step: 177060 | Dataset: 0-405581 | Loss: 0.677 | 913 ms/step , 6885.57 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-05 18:39:41 | Epoch: 1 | Step: 177070 | Dataset: 0-405901 | Loss: 0.739 | 916 ms/step , 6867.79 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 18:39:51 | Epoch: 1 | Step: 177080 | Dataset: 0-406221 | Loss: 0.772 | 914 ms/step , 6884.81 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 18:40:00 | Epoch: 1 | Step: 177090 | Dataset: 0-406541 | Loss: 0.709 | 915 ms/step , 6874.36 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 18:40:09 | Epoch: 1 | Step: 177100 | Dataset: 0-406861 | Loss: 0.686 | 914 ms/step , 6884.98 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 18:40:10 | Validation | Step: 177100 | Val_loss: 
0.697 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:40:20 | Epoch: 1 | Step: 177110 | Dataset: 0-407181 | Loss: 0.691 | 914 ms/step , 6881.65 GFLOP/s , 15265.6 tokens/s INFO:__main__:2024-11-05 18:40:29 | Epoch: 1 | Step: 177120 | Dataset: 0-407501 | Loss: 0.754 | 913 ms/step , 6885.26 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 18:40:38 | Epoch: 1 | Step: 177130 | Dataset: 0-407821 | Loss: 0.671 | 913 ms/step , 6892.14 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 18:40:47 | Epoch: 1 | Step: 177140 | Dataset: 0-408141 | Loss: 0.676 | 913 ms/step , 6891.31 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 18:40:56 | Epoch: 1 | Step: 177150 | Dataset: 0-408461 | Loss: 0.632 | 914 ms/step , 6881.16 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 18:41:05 | Epoch: 1 | Step: 177160 | Dataset: 0-408781 | Loss: 0.820 | 914 ms/step , 6879.10 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 18:41:14 | Epoch: 1 | Step: 177170 | Dataset: 0-409101 | Loss: 0.781 | 915 ms/step , 6874.67 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 18:41:24 | Epoch: 1 | Step: 177180 | Dataset: 0-409421 | Loss: 0.747 | 914 ms/step , 6881.41 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 18:41:33 | Epoch: 1 | Step: 177190 | Dataset: 0-409741 | Loss: 0.742 | 915 ms/step , 6872.35 GFLOP/s , 17906.5 tokens/s INFO:__main__:2024-11-05 18:41:42 | Epoch: 1 | Step: 177200 | Dataset: 0-410061 | Loss: 0.665 | 914 ms/step , 6883.44 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 18:41:43 | Validation | Step: 177200 | Val_loss: 0.706 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:41:53 | Epoch: 1 | Step: 177210 | Dataset: 0-410381 | Loss: 0.682 | 914 ms/step , 6881.72 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-05 18:42:02 | Epoch: 1 | Step: 177220 | Dataset: 0-410701 | Loss: 0.696 | 915 ms/step , 6873.57 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 18:42:11 | Epoch: 1 | Step: 177230 | Dataset: 0-411021 | Loss: 0.721 | 914 ms/step , 6882.55 GFLOP/s 
, 17922.8 tokens/s INFO:__main__:2024-11-05 18:42:20 | Epoch: 1 | Step: 177240 | Dataset: 0-411341 | Loss: 0.740 | 913 ms/step , 6886.86 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 18:42:29 | Epoch: 1 | Step: 177250 | Dataset: 0-411661 | Loss: 0.706 | 917 ms/step , 6855.90 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-05 18:42:38 | Epoch: 1 | Step: 177260 | Dataset: 0-411981 | Loss: 0.792 | 914 ms/step , 6881.69 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 18:42:47 | Epoch: 1 | Step: 177270 | Dataset: 0-412301 | Loss: 0.725 | 915 ms/step , 6877.12 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-05 18:42:57 | Epoch: 1 | Step: 177280 | Dataset: 0-412621 | Loss: 0.765 | 914 ms/step , 6880.41 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 18:43:06 | Epoch: 1 | Step: 177290 | Dataset: 0-412941 | Loss: 0.744 | 914 ms/step , 6880.06 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-05 18:43:15 | Epoch: 1 | Step: 177300 | Dataset: 0-413261 | Loss: 0.651 | 914 ms/step , 6884.58 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 18:43:16 | Validation | Step: 177300 | Val_loss: 0.702 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:43:26 | Epoch: 1 | Step: 177310 | Dataset: 0-413581 | Loss: 0.799 | 913 ms/step , 6886.51 GFLOP/s , 15271.4 tokens/s INFO:__main__:2024-11-05 18:43:35 | Epoch: 1 | Step: 177320 | Dataset: 0-413901 | Loss: 0.668 | 915 ms/step , 6874.96 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 18:43:44 | Epoch: 1 | Step: 177330 | Dataset: 0-414221 | Loss: 0.799 | 915 ms/step , 6874.04 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 18:43:53 | Epoch: 1 | Step: 177340 | Dataset: 0-414541 | Loss: 0.774 | 914 ms/step , 6878.58 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 18:44:02 | Epoch: 1 | Step: 177350 | Dataset: 0-414861 | Loss: 0.683 | 913 ms/step , 6885.39 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 18:44:11 | Epoch: 1 | Step: 177360 | Dataset: 0-415181 | Loss: 0.687 | 913 ms/step , 6889.65 GFLOP/s , 17918.0 
tokens/s INFO:__main__:2024-11-05 18:44:20 | Epoch: 1 | Step: 177370 | Dataset: 0-415501 | Loss: 0.674 | 914 ms/step , 6883.41 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-05 18:44:30 | Epoch: 1 | Step: 177380 | Dataset: 0-415821 | Loss: 0.704 | 914 ms/step , 6883.58 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 18:44:39 | Epoch: 1 | Step: 177390 | Dataset: 0-416141 | Loss: 0.745 | 914 ms/step , 6880.08 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 18:44:48 | Epoch: 1 | Step: 177400 | Dataset: 0-416461 | Loss: 0.716 | 913 ms/step , 6885.67 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 18:44:50 | Validation | Step: 177400 | Val_loss: 0.714 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:44:59 | Epoch: 1 | Step: 177410 | Dataset: 0-416781 | Loss: 0.690 | 914 ms/step , 6883.24 GFLOP/s , 15261.0 tokens/s INFO:__main__:2024-11-05 18:45:08 | Epoch: 1 | Step: 177420 | Dataset: 0-417101 | Loss: 0.775 | 913 ms/step , 6886.96 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 18:45:17 | Epoch: 1 | Step: 177430 | Dataset: 0-417421 | Loss: 0.616 | 913 ms/step , 6886.76 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 18:45:26 | Epoch: 1 | Step: 177440 | Dataset: 0-417741 | Loss: 0.810 | 915 ms/step , 6871.15 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 18:45:35 | Epoch: 1 | Step: 177450 | Dataset: 0-418061 | Loss: 0.650 | 913 ms/step , 6885.38 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 18:45:44 | Epoch: 1 | Step: 177460 | Dataset: 0-418381 | Loss: 0.671 | 914 ms/step , 6884.00 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 18:45:54 | Epoch: 1 | Step: 177470 | Dataset: 0-418701 | Loss: 0.764 | 914 ms/step , 6882.30 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 18:46:03 | Epoch: 1 | Step: 177480 | Dataset: 0-419021 | Loss: 0.755 | 914 ms/step , 6881.98 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 18:46:12 | Epoch: 1 | Step: 177490 | Dataset: 0-419341 | Loss: 0.731 | 914 ms/step , 6878.61 GFLOP/s , 17917.3 tokens/s 
INFO:__main__:2024-11-05 18:46:21 | Epoch: 1 | Step: 177500 | Dataset: 0-419661 | Loss: 0.667 | 914 ms/step , 6882.41 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 18:46:23 | Validation | Step: 177500 | Val_loss: 0.743 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:46:32 | Epoch: 1 | Step: 177510 | Dataset: 0-419981 | Loss: 0.637 | 912 ms/step , 6892.63 GFLOP/s , 15263.2 tokens/s INFO:__main__:2024-11-05 18:46:41 | Epoch: 1 | Step: 177520 | Dataset: 0-420301 | Loss: 0.766 | 913 ms/step , 6887.75 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 18:46:50 | Epoch: 1 | Step: 177530 | Dataset: 0-420621 | Loss: 0.763 | 915 ms/step , 6876.38 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 18:46:59 | Epoch: 1 | Step: 177540 | Dataset: 0-420941 | Loss: 0.655 | 914 ms/step , 6883.54 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 18:47:08 | Epoch: 1 | Step: 177550 | Dataset: 0-421261 | Loss: 0.671 | 914 ms/step , 6884.71 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 18:47:17 | Epoch: 1 | Step: 177560 | Dataset: 0-421581 | Loss: 0.668 | 913 ms/step , 6891.14 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 18:47:27 | Epoch: 1 | Step: 177570 | Dataset: 0-421901 | Loss: 0.747 | 913 ms/step , 6887.70 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-05 18:47:36 | Epoch: 1 | Step: 177580 | Dataset: 0-422221 | Loss: 0.732 | 914 ms/step , 6881.99 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 18:47:45 | Epoch: 1 | Step: 177590 | Dataset: 0-422541 | Loss: 0.747 | 914 ms/step , 6884.51 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 18:47:54 | Epoch: 1 | Step: 177600 | Dataset: 0-422861 | Loss: 0.696 | 913 ms/step , 6886.74 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 18:47:56 | Validation | Step: 177600 | Val_loss: 0.661 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:48:05 | Epoch: 1 | Step: 177610 | Dataset: 0-423181 | Loss: 0.727 | 914 ms/step , 6884.33 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-05 18:48:14 | Epoch: 1 | 
Step: 177620 | Dataset: 0-423501 | Loss: 0.710 | 914 ms/step , 6878.83 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 18:48:23 | Epoch: 1 | Step: 177630 | Dataset: 0-423821 | Loss: 0.758 | 914 ms/step , 6880.28 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 18:48:32 | Epoch: 1 | Step: 177640 | Dataset: 0-424141 | Loss: 0.755 | 915 ms/step , 6871.08 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 18:48:41 | Epoch: 1 | Step: 177650 | Dataset: 0-424461 | Loss: 0.743 | 914 ms/step , 6880.60 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 18:48:50 | Epoch: 1 | Step: 177660 | Dataset: 0-424781 | Loss: 0.609 | 914 ms/step , 6881.59 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 18:49:00 | Epoch: 1 | Step: 177670 | Dataset: 0-425101 | Loss: 0.766 | 913 ms/step , 6885.94 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 18:49:09 | Epoch: 1 | Step: 177680 | Dataset: 0-425421 | Loss: 0.721 | 914 ms/step , 6884.12 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 18:49:18 | Epoch: 1 | Step: 177690 | Dataset: 0-425741 | Loss: 0.658 | 915 ms/step , 6875.48 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 18:49:27 | Epoch: 1 | Step: 177700 | Dataset: 0-426061 | Loss: 0.690 | 915 ms/step , 6870.50 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-05 18:49:29 | Validation | Step: 177700 | Val_loss: 0.726 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:49:38 | Epoch: 1 | Step: 177710 | Dataset: 0-426381 | Loss: 0.732 | 914 ms/step , 6880.67 GFLOP/s , 15256.0 tokens/s INFO:__main__:2024-11-05 18:49:47 | Epoch: 1 | Step: 177720 | Dataset: 0-426701 | Loss: 0.732 | 914 ms/step , 6878.79 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-05 18:49:56 | Epoch: 1 | Step: 177730 | Dataset: 0-427021 | Loss: 0.747 | 913 ms/step , 6887.21 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 18:50:05 | Epoch: 1 | Step: 177740 | Dataset: 0-427341 | Loss: 0.895 | 914 ms/step , 6882.20 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 18:50:14 | Epoch: 1 | Step: 
177750 | Dataset: 0-427661 | Loss: 0.750 | 914 ms/step , 6880.95 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 18:50:23 | Epoch: 1 | Step: 177760 | Dataset: 0-427981 | Loss: 0.717 | 914 ms/step , 6878.63 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 18:50:33 | Epoch: 1 | Step: 177770 | Dataset: 0-428301 | Loss: 0.706 | 913 ms/step , 6887.95 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 18:50:42 | Epoch: 1 | Step: 177780 | Dataset: 0-428621 | Loss: 0.728 | 913 ms/step , 6887.56 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 18:50:51 | Epoch: 1 | Step: 177790 | Dataset: 0-428941 | Loss: 0.692 | 913 ms/step , 6887.52 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 18:51:00 | Epoch: 1 | Step: 177800 | Dataset: 0-429261 | Loss: 0.746 | 914 ms/step , 6882.98 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 18:51:02 | Validation | Step: 177800 | Val_loss: 0.674 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:51:11 | Epoch: 1 | Step: 177810 | Dataset: 0-429581 | Loss: 0.714 | 913 ms/step , 6885.64 GFLOP/s , 15254.8 tokens/s INFO:__main__:2024-11-05 18:51:20 | Epoch: 1 | Step: 177820 | Dataset: 0-429901 | Loss: 0.679 | 913 ms/step , 6885.81 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 18:51:29 | Epoch: 1 | Step: 177830 | Dataset: 0-430221 | Loss: 0.685 | 915 ms/step , 6876.07 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 18:51:38 | Epoch: 1 | Step: 177840 | Dataset: 0-430541 | Loss: 0.715 | 913 ms/step , 6886.99 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 18:51:47 | Epoch: 1 | Step: 177850 | Dataset: 0-430861 | Loss: 0.716 | 915 ms/step , 6870.73 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-05 18:51:56 | Epoch: 1 | Step: 177860 | Dataset: 0-431181 | Loss: 0.653 | 913 ms/step , 6886.68 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 18:52:06 | Epoch: 1 | Step: 177870 | Dataset: 0-431501 | Loss: 0.760 | 915 ms/step , 6872.55 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 18:52:15 | Epoch: 1 | Step: 177880 | 
Dataset: 0-431821 | Loss: 0.714 | 913 ms/step , 6889.68 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 18:52:24 | Epoch: 1 | Step: 177890 | Dataset: 0-432141 | Loss: 0.694 | 914 ms/step , 6883.15 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 18:52:33 | Epoch: 1 | Step: 177900 | Dataset: 0-432461 | Loss: 0.721 | 914 ms/step , 6882.45 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 18:52:35 | Validation | Step: 177900 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:52:44 | Epoch: 1 | Step: 177910 | Dataset: 0-432781 | Loss: 0.804 | 914 ms/step , 6883.55 GFLOP/s , 15284.5 tokens/s INFO:__main__:2024-11-05 18:52:53 | Epoch: 1 | Step: 177920 | Dataset: 0-433101 | Loss: 0.667 | 915 ms/step , 6876.01 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 18:53:02 | Epoch: 1 | Step: 177930 | Dataset: 0-433421 | Loss: 0.691 | 913 ms/step , 6886.04 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 18:53:11 | Epoch: 1 | Step: 177940 | Dataset: 0-433741 | Loss: 0.739 | 913 ms/step , 6888.69 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 18:53:20 | Epoch: 1 | Step: 177950 | Dataset: 0-434061 | Loss: 0.723 | 913 ms/step , 6886.73 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 18:53:29 | Epoch: 1 | Step: 177960 | Dataset: 0-434381 | Loss: 0.724 | 914 ms/step , 6882.41 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 18:53:39 | Epoch: 1 | Step: 177970 | Dataset: 0-434701 | Loss: 0.769 | 914 ms/step , 6884.69 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 18:53:48 | Epoch: 1 | Step: 177980 | Dataset: 0-435021 | Loss: 0.620 | 915 ms/step , 6872.20 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 18:53:57 | Epoch: 1 | Step: 177990 | Dataset: 0-435341 | Loss: 0.663 | 913 ms/step , 6888.89 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 18:54:06 | Epoch: 1 | Step: 178000 | Dataset: 0-435661 | Loss: 0.716 | 913 ms/step , 6889.59 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 18:54:08 | Validation | Step: 178000 | 
Val_loss: 0.749 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:54:08 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_185408_step_178000.pt` INFO:__main__:2024-11-05 18:54:18 | Epoch: 1 | Step: 178010 | Dataset: 0-435981 | Loss: 0.699 | 914 ms/step , 6878.33 GFLOP/s , 13730.6 tokens/s INFO:__main__:2024-11-05 18:54:27 | Epoch: 1 | Step: 178020 | Dataset: 0-436301 | Loss: 0.745 | 914 ms/step , 6882.09 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 18:54:36 | Epoch: 1 | Step: 178030 | Dataset: 0-436621 | Loss: 0.687 | 913 ms/step , 6888.65 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 18:54:45 | Epoch: 1 | Step: 178040 | Dataset: 0-436941 | Loss: 0.745 | 914 ms/step , 6882.61 GFLOP/s , 17868.4 tokens/s INFO:__main__:2024-11-05 18:54:55 | Epoch: 1 | Step: 178050 | Dataset: 0-437261 | Loss: 0.734 | 914 ms/step , 6878.10 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 18:55:04 | Epoch: 1 | Step: 178060 | Dataset: 0-437581 | Loss: 0.709 | 915 ms/step , 6877.39 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 18:55:13 | Epoch: 1 | Step: 178070 | Dataset: 0-437901 | Loss: 0.734 | 916 ms/step , 6867.70 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 18:55:22 | Epoch: 1 | Step: 178080 | Dataset: 0-438221 | Loss: 0.648 | 915 ms/step , 6876.24 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 18:55:31 | Epoch: 1 | Step: 178090 | Dataset: 0-438541 | Loss: 0.708 | 914 ms/step , 6881.54 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 18:55:40 | Epoch: 1 | Step: 178100 | Dataset: 0-438861 | Loss: 0.699 | 914 ms/step , 6879.37 GFLOP/s , 17905.0 tokens/s INFO:__main__:2024-11-05 18:55:42 | Validation | Step: 178100 | Val_loss: 0.753 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:55:51 | Epoch: 1 | Step: 178110 | Dataset: 0-439181 | Loss: 0.672 | 913 ms/step , 6888.22 GFLOP/s , 15268.7 tokens/s INFO:__main__:2024-11-05 18:56:00 | Epoch: 1 | Step: 178120 | Dataset: 0-439501 | Loss: 0.703 | 915 ms/step , 
6876.28 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 18:56:09 | Epoch: 1 | Step: 178130 | Dataset: 0-439821 | Loss: 0.733 | 915 ms/step , 6874.38 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 18:56:18 | Epoch: 1 | Step: 178140 | Dataset: 0-440141 | Loss: 0.658 | 915 ms/step , 6875.94 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 18:56:28 | Epoch: 1 | Step: 178150 | Dataset: 0-440461 | Loss: 0.737 | 915 ms/step , 6877.39 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 18:56:37 | Epoch: 1 | Step: 178160 | Dataset: 0-440781 | Loss: 0.785 | 914 ms/step , 6883.68 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 18:56:46 | Epoch: 1 | Step: 178170 | Dataset: 0-441101 | Loss: 0.759 | 914 ms/step , 6881.82 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 18:56:55 | Epoch: 1 | Step: 178180 | Dataset: 0-441421 | Loss: 0.723 | 913 ms/step , 6891.49 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 18:57:04 | Epoch: 1 | Step: 178190 | Dataset: 0-441741 | Loss: 0.725 | 915 ms/step , 6872.18 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 18:57:13 | Epoch: 1 | Step: 178200 | Dataset: 0-442061 | Loss: 0.709 | 913 ms/step , 6889.93 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 18:57:15 | Validation | Step: 178200 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:57:24 | Epoch: 1 | Step: 178210 | Dataset: 0-442381 | Loss: 0.741 | 913 ms/step , 6888.35 GFLOP/s , 15256.6 tokens/s INFO:__main__:2024-11-05 18:57:33 | Epoch: 1 | Step: 178220 | Dataset: 0-442701 | Loss: 0.822 | 914 ms/step , 6879.73 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 18:57:42 | Epoch: 1 | Step: 178230 | Dataset: 0-443021 | Loss: 0.733 | 914 ms/step , 6877.59 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 18:57:51 | Epoch: 1 | Step: 178240 | Dataset: 0-443341 | Loss: 0.703 | 913 ms/step , 6888.80 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 18:58:01 | Epoch: 1 | Step: 178250 | Dataset: 0-443661 | Loss: 0.557 | 915 ms/step , 6875.47 
GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 18:58:10 | Epoch: 1 | Step: 178260 | Dataset: 0-443981 | Loss: 0.626 | 913 ms/step , 6887.20 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 18:58:19 | Epoch: 1 | Step: 178270 | Dataset: 0-444301 | Loss: 0.682 | 913 ms/step , 6889.73 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 18:58:28 | Epoch: 1 | Step: 178280 | Dataset: 0-444621 | Loss: 0.732 | 913 ms/step , 6890.56 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 18:58:37 | Epoch: 1 | Step: 178290 | Dataset: 0-444941 | Loss: 0.738 | 916 ms/step , 6868.77 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 18:58:46 | Epoch: 1 | Step: 178300 | Dataset: 0-445261 | Loss: 0.808 | 915 ms/step , 6873.88 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 18:58:48 | Validation | Step: 178300 | Val_loss: 0.657 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 18:58:57 | Epoch: 1 | Step: 178310 | Dataset: 0-445581 | Loss: 0.717 | 914 ms/step , 6884.80 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-05 18:59:06 | Epoch: 1 | Step: 178320 | Dataset: 0-445901 | Loss: 0.774 | 915 ms/step , 6872.09 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 18:59:15 | Epoch: 1 | Step: 178330 | Dataset: 0-446221 | Loss: 0.701 | 913 ms/step , 6888.47 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 18:59:24 | Epoch: 1 | Step: 178340 | Dataset: 0-446541 | Loss: 0.679 | 914 ms/step , 6880.04 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 18:59:34 | Epoch: 1 | Step: 178350 | Dataset: 0-446861 | Loss: 0.657 | 913 ms/step , 6888.35 GFLOP/s , 17906.5 tokens/s INFO:__main__:2024-11-05 18:59:43 | Epoch: 1 | Step: 178360 | Dataset: 0-447181 | Loss: 0.707 | 917 ms/step , 6861.61 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 18:59:52 | Epoch: 1 | Step: 178370 | Dataset: 0-447501 | Loss: 0.682 | 913 ms/step , 6889.09 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 19:00:01 | Epoch: 1 | Step: 178380 | Dataset: 0-447821 | Loss: 0.562 | 913 ms/step , 6892.14 GFLOP/s , 
17913.0 tokens/s INFO:__main__:2024-11-05 19:00:10 | Epoch: 1 | Step: 178390 | Dataset: 0-448141 | Loss: 0.629 | 914 ms/step , 6877.96 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-05 19:00:19 | Epoch: 1 | Step: 178400 | Dataset: 0-448461 | Loss: 0.588 | 913 ms/step , 6891.13 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 19:00:21 | Validation | Step: 178400 | Val_loss: 0.696 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:00:30 | Epoch: 1 | Step: 178410 | Dataset: 0-448781 | Loss: 0.708 | 914 ms/step , 6877.79 GFLOP/s , 15258.1 tokens/s INFO:__main__:2024-11-05 19:00:39 | Epoch: 1 | Step: 178420 | Dataset: 0-449101 | Loss: 0.741 | 915 ms/step , 6875.01 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 19:00:48 | Epoch: 1 | Step: 178430 | Dataset: 0-449421 | Loss: 0.692 | 915 ms/step , 6876.10 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 19:00:58 | Epoch: 1 | Step: 178440 | Dataset: 0-449741 | Loss: 0.706 | 914 ms/step , 6884.74 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 19:01:07 | Epoch: 1 | Step: 178450 | Dataset: 0-450061 | Loss: 0.732 | 915 ms/step , 6876.03 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-05 19:01:16 | Epoch: 1 | Step: 178460 | Dataset: 0-450381 | Loss: 0.701 | 915 ms/step , 6874.48 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 19:01:25 | Epoch: 1 | Step: 178470 | Dataset: 0-450701 | Loss: 0.725 | 913 ms/step , 6891.21 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 19:01:34 | Epoch: 1 | Step: 178480 | Dataset: 0-451021 | Loss: 0.621 | 914 ms/step , 6885.05 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 19:01:43 | Epoch: 1 | Step: 178490 | Dataset: 0-451341 | Loss: 0.677 | 913 ms/step , 6891.31 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 19:01:52 | Epoch: 1 | Step: 178500 | Dataset: 0-451661 | Loss: 0.762 | 916 ms/step , 6865.79 GFLOP/s , 17899.9 tokens/s INFO:__main__:2024-11-05 19:01:54 | Validation | Step: 178500 | Val_loss: 0.683 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:02:03 
| Epoch: 1 | Step: 178510 | Dataset: 0-451981 | Loss: 0.785 | 915 ms/step , 6876.42 GFLOP/s , 15258.5 tokens/s INFO:__main__:2024-11-05 19:02:12 | Epoch: 1 | Step: 178520 | Dataset: 0-452301 | Loss: 0.681 | 913 ms/step , 6887.71 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 19:02:21 | Epoch: 1 | Step: 178530 | Dataset: 0-452621 | Loss: 0.712 | 913 ms/step , 6887.60 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 19:02:31 | Epoch: 1 | Step: 178540 | Dataset: 0-452941 | Loss: 0.683 | 915 ms/step , 6875.36 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 19:02:40 | Epoch: 1 | Step: 178550 | Dataset: 0-453261 | Loss: 0.721 | 914 ms/step , 6879.38 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-05 19:02:49 | Epoch: 1 | Step: 178560 | Dataset: 0-453581 | Loss: 0.719 | 914 ms/step , 6879.90 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 19:02:58 | Epoch: 1 | Step: 178570 | Dataset: 0-453901 | Loss: 0.642 | 913 ms/step , 6887.96 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 19:03:07 | Epoch: 1 | Step: 178580 | Dataset: 0-454221 | Loss: 0.806 | 915 ms/step , 6871.37 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 19:03:16 | Epoch: 1 | Step: 178590 | Dataset: 0-454541 | Loss: 0.657 | 912 ms/step , 6893.51 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 19:03:25 | Epoch: 1 | Step: 178600 | Dataset: 0-454861 | Loss: 0.660 | 914 ms/step , 6883.53 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 19:03:27 | Validation | Step: 178600 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:03:36 | Epoch: 1 | Step: 178610 | Dataset: 0-455181 | Loss: 0.794 | 914 ms/step , 6884.26 GFLOP/s , 15267.7 tokens/s INFO:__main__:2024-11-05 19:03:45 | Epoch: 1 | Step: 178620 | Dataset: 0-455501 | Loss: 0.678 | 912 ms/step , 6893.86 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 19:03:54 | Epoch: 1 | Step: 178630 | Dataset: 0-455821 | Loss: 0.797 | 913 ms/step , 6886.10 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 19:04:04 | Epoch: 1 
| Step: 178640 | Dataset: 0-456141 | Loss: 0.805 | 914 ms/step , 6882.84 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 19:04:13 | Epoch: 1 | Step: 178650 | Dataset: 0-456461 | Loss: 0.804 | 914 ms/step , 6878.10 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 19:04:22 | Epoch: 1 | Step: 178660 | Dataset: 0-456781 | Loss: 0.699 | 914 ms/step , 6880.40 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 19:04:31 | Epoch: 1 | Step: 178670 | Dataset: 0-457101 | Loss: 0.733 | 915 ms/step , 6874.18 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 19:04:40 | Epoch: 1 | Step: 178680 | Dataset: 0-457421 | Loss: 0.859 | 913 ms/step , 6889.97 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 19:04:49 | Epoch: 1 | Step: 178690 | Dataset: 0-457741 | Loss: 0.722 | 913 ms/step , 6886.62 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 19:04:58 | Epoch: 1 | Step: 178700 | Dataset: 0-458061 | Loss: 0.717 | 912 ms/step , 6895.32 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 19:05:00 | Validation | Step: 178700 | Val_loss: 0.643 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:05:09 | Epoch: 1 | Step: 178710 | Dataset: 0-458381 | Loss: 0.792 | 912 ms/step , 6895.26 GFLOP/s , 15277.5 tokens/s INFO:__main__:2024-11-05 19:05:18 | Epoch: 1 | Step: 178720 | Dataset: 0-458701 | Loss: 0.767 | 912 ms/step , 6893.96 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 19:05:27 | Epoch: 1 | Step: 178730 | Dataset: 0-459021 | Loss: 0.672 | 912 ms/step , 6894.30 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 19:05:37 | Epoch: 1 | Step: 178740 | Dataset: 0-459341 | Loss: 0.662 | 913 ms/step , 6892.35 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 19:05:46 | Epoch: 1 | Step: 178750 | Dataset: 0-459661 | Loss: 0.670 | 913 ms/step , 6889.79 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 19:05:55 | Epoch: 1 | Step: 178760 | Dataset: 0-459981 | Loss: 0.646 | 912 ms/step , 6894.45 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 19:06:04 | Epoch: 1 | Step: 
178770 | Dataset: 0-460301 | Loss: 0.666 | 913 ms/step , 6887.70 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 19:06:13 | Epoch: 1 | Step: 178780 | Dataset: 0-460621 | Loss: 0.774 | 914 ms/step , 6878.85 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 19:06:22 | Epoch: 1 | Step: 178790 | Dataset: 0-460941 | Loss: 0.719 | 913 ms/step , 6892.11 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 19:06:31 | Epoch: 1 | Step: 178800 | Dataset: 0-461261 | Loss: 0.666 | 913 ms/step , 6889.84 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 19:06:33 | Validation | Step: 178800 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:06:42 | Epoch: 1 | Step: 178810 | Dataset: 0-461581 | Loss: 0.803 | 913 ms/step , 6887.41 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 19:06:51 | Epoch: 1 | Step: 178820 | Dataset: 0-461901 | Loss: 0.687 | 913 ms/step , 6888.31 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 19:07:00 | Epoch: 1 | Step: 178830 | Dataset: 0-462221 | Loss: 0.820 | 915 ms/step , 6875.71 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 19:07:10 | Epoch: 1 | Step: 178840 | Dataset: 0-462541 | Loss: 0.812 | 912 ms/step , 6895.59 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 19:07:19 | Epoch: 1 | Step: 178850 | Dataset: 0-462861 | Loss: 0.792 | 913 ms/step , 6890.05 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 19:07:28 | Epoch: 1 | Step: 178860 | Dataset: 0-463181 | Loss: 0.817 | 913 ms/step , 6888.27 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 19:07:37 | Epoch: 1 | Step: 178870 | Dataset: 0-463501 | Loss: 0.764 | 914 ms/step , 6883.56 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 19:07:46 | Epoch: 1 | Step: 178880 | Dataset: 0-463821 | Loss: 0.611 | 914 ms/step , 6882.38 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 19:07:55 | Epoch: 1 | Step: 178890 | Dataset: 0-464141 | Loss: 0.686 | 915 ms/step , 6875.59 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 19:08:04 | Epoch: 1 | Step: 178900 | 
Dataset: 0-464461 | Loss: 0.648 | 913 ms/step , 6890.31 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 19:08:06 | Validation | Step: 178900 | Val_loss: 0.742 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:08:15 | Epoch: 1 | Step: 178910 | Dataset: 0-464781 | Loss: 0.611 | 912 ms/step , 6894.33 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 19:08:24 | Epoch: 1 | Step: 178920 | Dataset: 0-465101 | Loss: 0.603 | 914 ms/step , 6883.56 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 19:08:33 | Epoch: 1 | Step: 178930 | Dataset: 0-465421 | Loss: 0.770 | 912 ms/step , 6893.33 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 19:08:43 | Epoch: 1 | Step: 178940 | Dataset: 0-465741 | Loss: 0.689 | 913 ms/step , 6887.72 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 19:08:52 | Epoch: 1 | Step: 178950 | Dataset: 0-466061 | Loss: 0.608 | 913 ms/step , 6889.93 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 19:09:01 | Epoch: 1 | Step: 178960 | Dataset: 0-466381 | Loss: 0.640 | 913 ms/step , 6891.48 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 19:09:10 | Epoch: 1 | Step: 178970 | Dataset: 0-466701 | Loss: 0.684 | 912 ms/step , 6896.88 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 19:09:19 | Epoch: 1 | Step: 178980 | Dataset: 0-467021 | Loss: 0.895 | 913 ms/step , 6885.72 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 19:09:28 | Epoch: 1 | Step: 178990 | Dataset: 0-467341 | Loss: 0.737 | 914 ms/step , 6884.47 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 19:09:37 | Epoch: 1 | Step: 179000 | Dataset: 0-467661 | Loss: 0.730 | 914 ms/step , 6883.10 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 19:09:39 | Validation | Step: 179000 | Val_loss: 0.761 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:09:39 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_190939_step_179000.pt` INFO:__main__:2024-11-05 19:09:49 | Epoch: 1 | Step: 179010 | Dataset: 0-467981 | Loss: 0.762 | 914 ms/step , 
6883.77 GFLOP/s , 13803.4 tokens/s INFO:__main__:2024-11-05 19:09:58 | Epoch: 1 | Step: 179020 | Dataset: 0-468301 | Loss: 0.705 | 913 ms/step , 6889.46 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 19:10:07 | Epoch: 1 | Step: 179030 | Dataset: 0-468621 | Loss: 0.841 | 913 ms/step , 6891.18 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 19:10:17 | Epoch: 1 | Step: 179040 | Dataset: 0-468941 | Loss: 0.700 | 914 ms/step , 6879.72 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-05 19:10:26 | Epoch: 1 | Step: 179050 | Dataset: 0-469261 | Loss: 0.572 | 912 ms/step , 6899.41 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 19:10:35 | Epoch: 1 | Step: 179060 | Dataset: 0-469581 | Loss: 0.708 | 914 ms/step , 6883.44 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 19:10:44 | Epoch: 1 | Step: 179070 | Dataset: 0-469901 | Loss: 0.753 | 914 ms/step , 6881.59 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 19:10:53 | Epoch: 1 | Step: 179080 | Dataset: 0-470221 | Loss: 0.683 | 914 ms/step , 6880.65 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 19:11:02 | Epoch: 1 | Step: 179090 | Dataset: 0-470541 | Loss: 0.755 | 914 ms/step , 6884.06 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 19:11:11 | Epoch: 1 | Step: 179100 | Dataset: 0-470861 | Loss: 0.629 | 912 ms/step , 6893.06 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 19:11:13 | Validation | Step: 179100 | Val_loss: 0.809 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:11:22 | Epoch: 1 | Step: 179110 | Dataset: 0-471181 | Loss: 0.832 | 914 ms/step , 6883.23 GFLOP/s , 15268.3 tokens/s INFO:__main__:2024-11-05 19:11:31 | Epoch: 1 | Step: 179120 | Dataset: 0-471501 | Loss: 0.770 | 914 ms/step , 6882.08 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 19:11:40 | Epoch: 1 | Step: 179130 | Dataset: 0-471821 | Loss: 0.754 | 912 ms/step , 6894.63 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 19:11:50 | Epoch: 1 | Step: 179140 | Dataset: 0-472141 | Loss: 0.743 | 913 ms/step , 6885.56 
GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 19:11:59 | Epoch: 1 | Step: 179150 | Dataset: 0-472461 | Loss: 0.650 | 912 ms/step , 6894.87 GFLOP/s , 17944.7 tokens/s INFO:__main__:2024-11-05 19:12:08 | Epoch: 1 | Step: 179160 | Dataset: 0-472781 | Loss: 0.770 | 914 ms/step , 6884.71 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 19:12:17 | Epoch: 1 | Step: 179170 | Dataset: 0-473101 | Loss: 0.772 | 913 ms/step , 6887.40 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 19:12:26 | Epoch: 1 | Step: 179180 | Dataset: 0-473421 | Loss: 0.633 | 913 ms/step , 6887.45 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 19:12:35 | Epoch: 1 | Step: 179190 | Dataset: 0-473741 | Loss: 0.816 | 914 ms/step , 6878.13 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 19:12:44 | Epoch: 1 | Step: 179200 | Dataset: 0-474061 | Loss: 0.729 | 914 ms/step , 6881.45 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 19:12:46 | Validation | Step: 179200 | Val_loss: 0.720 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:12:55 | Epoch: 1 | Step: 179210 | Dataset: 0-474381 | Loss: 0.746 | 915 ms/step , 6871.44 GFLOP/s , 15266.4 tokens/s INFO:__main__:2024-11-05 19:13:04 | Epoch: 1 | Step: 179220 | Dataset: 0-474701 | Loss: 0.787 | 915 ms/step , 6876.57 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 19:13:13 | Epoch: 1 | Step: 179230 | Dataset: 0-475021 | Loss: 0.635 | 914 ms/step , 6884.81 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 19:13:23 | Epoch: 1 | Step: 179240 | Dataset: 0-475341 | Loss: 0.738 | 913 ms/step , 6890.67 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 19:13:32 | Epoch: 1 | Step: 179250 | Dataset: 0-475661 | Loss: 0.687 | 914 ms/step , 6880.49 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 19:13:41 | Epoch: 1 | Step: 179260 | Dataset: 0-475981 | Loss: 0.683 | 912 ms/step , 6892.60 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 19:13:50 | Epoch: 1 | Step: 179270 | Dataset: 0-476301 | Loss: 0.631 | 914 ms/step , 6882.13 GFLOP/s , 
17925.5 tokens/s INFO:__main__:2024-11-05 19:13:59 | Epoch: 1 | Step: 179280 | Dataset: 0-476621 | Loss: 0.755 | 914 ms/step , 6880.68 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 19:14:08 | Epoch: 1 | Step: 179290 | Dataset: 0-476941 | Loss: 0.653 | 913 ms/step , 6892.35 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 19:14:17 | Epoch: 1 | Step: 179300 | Dataset: 0-477261 | Loss: 0.576 | 914 ms/step , 6884.46 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 19:14:19 | Validation | Step: 179300 | Val_loss: 0.742 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:14:28 | Epoch: 1 | Step: 179310 | Dataset: 0-477581 | Loss: 0.709 | 913 ms/step , 6885.58 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-05 19:14:37 | Epoch: 1 | Step: 179320 | Dataset: 0-477901 | Loss: 0.831 | 912 ms/step , 6899.58 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 19:14:46 | Epoch: 1 | Step: 179330 | Dataset: 0-478221 | Loss: 0.665 | 914 ms/step , 6880.38 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 19:14:56 | Epoch: 1 | Step: 179340 | Dataset: 0-478541 | Loss: 0.667 | 914 ms/step , 6884.95 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 19:15:05 | Epoch: 1 | Step: 179350 | Dataset: 0-478861 | Loss: 0.782 | 912 ms/step , 6895.73 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 19:15:14 | Epoch: 1 | Step: 179360 | Dataset: 0-479181 | Loss: 0.738 | 912 ms/step , 6895.23 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 19:15:23 | Epoch: 1 | Step: 179370 | Dataset: 0-479501 | Loss: 0.719 | 912 ms/step , 6894.33 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 19:15:32 | Epoch: 1 | Step: 179380 | Dataset: 0-479821 | Loss: 0.683 | 914 ms/step , 6882.65 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 19:15:41 | Epoch: 1 | Step: 179390 | Dataset: 0-480141 | Loss: 0.879 | 914 ms/step , 6878.02 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 19:15:50 | Epoch: 1 | Step: 179400 | Dataset: 0-480461 | Loss: 0.671 | 913 ms/step , 6892.08 GFLOP/s , 17931.5 
tokens/s INFO:__main__:2024-11-05 19:15:52 | Validation | Step: 179400 | Val_loss: 0.760 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:16:01 | Epoch: 1 | Step: 179410 | Dataset: 0-480781 | Loss: 0.728 | 913 ms/step , 6888.67 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-05 19:16:10 | Epoch: 1 | Step: 179420 | Dataset: 0-481101 | Loss: 0.671 | 912 ms/step , 6892.63 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 19:16:19 | Epoch: 1 | Step: 179430 | Dataset: 0-481421 | Loss: 0.785 | 913 ms/step , 6890.75 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 19:16:29 | Epoch: 1 | Step: 179440 | Dataset: 0-481741 | Loss: 0.849 | 915 ms/step , 6877.40 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 19:16:38 | Epoch: 1 | Step: 179450 | Dataset: 0-482061 | Loss: 0.798 | 914 ms/step , 6883.11 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 19:16:47 | Epoch: 1 | Step: 179460 | Dataset: 0-482381 | Loss: 0.824 | 916 ms/step , 6866.35 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 19:16:56 | Epoch: 1 | Step: 179470 | Dataset: 0-482701 | Loss: 0.690 | 915 ms/step , 6876.25 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 19:17:05 | Epoch: 1 | Step: 179480 | Dataset: 0-483021 | Loss: 0.668 | 914 ms/step , 6883.26 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 19:17:14 | Epoch: 1 | Step: 179490 | Dataset: 0-483341 | Loss: 0.776 | 913 ms/step , 6885.30 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 19:17:23 | Epoch: 1 | Step: 179500 | Dataset: 0-483661 | Loss: 0.764 | 915 ms/step , 6876.55 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 19:17:25 | Validation | Step: 179500 | Val_loss: 0.671 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:17:34 | Epoch: 1 | Step: 179510 | Dataset: 0-483981 | Loss: 0.800 | 913 ms/step , 6890.37 GFLOP/s , 15265.8 tokens/s INFO:__main__:2024-11-05 19:17:43 | Epoch: 1 | Step: 179520 | Dataset: 0-484301 | Loss: 0.636 | 913 ms/step , 6887.43 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 19:17:52 | Epoch: 
1 | Step: 179530 | Dataset: 0-484621 | Loss: 0.657 | 913 ms/step , 6889.19 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 19:18:02 | Epoch: 1 | Step: 179540 | Dataset: 0-484941 | Loss: 0.810 | 914 ms/step , 6881.60 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 19:18:11 | Epoch: 1 | Step: 179550 | Dataset: 0-485261 | Loss: 0.728 | 914 ms/step , 6879.60 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 19:18:20 | Epoch: 1 | Step: 179560 | Dataset: 0-485581 | Loss: 0.814 | 913 ms/step , 6888.25 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 19:18:29 | Epoch: 1 | Step: 179570 | Dataset: 0-485901 | Loss: 0.707 | 913 ms/step , 6887.20 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 19:18:38 | Epoch: 1 | Step: 179580 | Dataset: 0-486221 | Loss: 0.647 | 912 ms/step , 6897.16 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 19:18:47 | Epoch: 1 | Step: 179590 | Dataset: 0-486541 | Loss: 0.643 | 915 ms/step , 6870.99 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-05 19:18:56 | Epoch: 1 | Step: 179600 | Dataset: 0-486861 | Loss: 0.725 | 915 ms/step , 6871.54 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 19:18:58 | Validation | Step: 179600 | Val_loss: 0.722 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:19:07 | Epoch: 1 | Step: 179610 | Dataset: 0-487181 | Loss: 0.688 | 913 ms/step , 6888.84 GFLOP/s , 15258.2 tokens/s INFO:__main__:2024-11-05 19:19:16 | Epoch: 1 | Step: 179620 | Dataset: 0-487501 | Loss: 0.712 | 914 ms/step , 6879.92 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 19:19:25 | Epoch: 1 | Step: 179630 | Dataset: 0-487821 | Loss: 0.649 | 914 ms/step , 6881.59 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 19:19:35 | Epoch: 1 | Step: 179640 | Dataset: 0-488141 | Loss: 0.752 | 913 ms/step , 6892.33 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 19:19:44 | Epoch: 1 | Step: 179650 | Dataset: 0-488461 | Loss: 0.728 | 915 ms/step , 6877.07 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 19:19:53 | Epoch: 1 | Step: 
179660 | Dataset: 0-488781 | Loss: 0.723 | 914 ms/step , 6881.41 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 19:20:02 | Epoch: 1 | Step: 179670 | Dataset: 0-489101 | Loss: 0.619 | 913 ms/step , 6889.79 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 19:20:11 | Epoch: 1 | Step: 179680 | Dataset: 0-489421 | Loss: 0.704 | 913 ms/step , 6891.18 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 19:20:20 | Epoch: 1 | Step: 179690 | Dataset: 0-489741 | Loss: 0.815 | 914 ms/step , 6878.98 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 19:20:29 | Epoch: 1 | Step: 179700 | Dataset: 0-490061 | Loss: 0.760 | 913 ms/step , 6887.13 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 19:20:31 | Validation | Step: 179700 | Val_loss: 0.779 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:20:40 | Epoch: 1 | Step: 179710 | Dataset: 0-490381 | Loss: 0.757 | 914 ms/step , 6884.15 GFLOP/s , 15267.2 tokens/s INFO:__main__:2024-11-05 19:20:49 | Epoch: 1 | Step: 179720 | Dataset: 0-490701 | Loss: 0.714 | 914 ms/step , 6883.66 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 19:20:58 | Epoch: 1 | Step: 179730 | Dataset: 0-491021 | Loss: 0.748 | 913 ms/step , 6888.59 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 19:21:08 | Epoch: 1 | Step: 179740 | Dataset: 0-491341 | Loss: 0.697 | 913 ms/step , 6886.49 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 19:21:17 | Epoch: 1 | Step: 179750 | Dataset: 0-491661 | Loss: 0.629 | 914 ms/step , 6884.03 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 19:21:26 | Epoch: 1 | Step: 179760 | Dataset: 0-491981 | Loss: 0.727 | 914 ms/step , 6877.68 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 19:21:35 | Epoch: 1 | Step: 179770 | Dataset: 0-492301 | Loss: 0.745 | 913 ms/step , 6885.28 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 19:21:44 | Epoch: 1 | Step: 179780 | Dataset: 0-492621 | Loss: 0.732 | 914 ms/step , 6881.22 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 19:21:53 | Epoch: 1 | Step: 179790 | 
Dataset: 0-492941 | Loss: 0.780 | 913 ms/step , 6888.89 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 19:22:02 | Epoch: 1 | Step: 179800 | Dataset: 0-493261 | Loss: 0.742 | 914 ms/step , 6879.03 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-05 19:22:04 | Validation | Step: 179800 | Val_loss: 0.810 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:22:13 | Epoch: 1 | Step: 179810 | Dataset: 0-493581 | Loss: 0.720 | 914 ms/step , 6878.76 GFLOP/s , 15266.9 tokens/s INFO:__main__:2024-11-05 19:22:22 | Epoch: 1 | Step: 179820 | Dataset: 0-493901 | Loss: 0.755 | 913 ms/step , 6889.15 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 19:22:31 | Epoch: 1 | Step: 179830 | Dataset: 0-494221 | Loss: 0.734 | 913 ms/step , 6887.10 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 19:22:41 | Epoch: 1 | Step: 179840 | Dataset: 0-494541 | Loss: 0.727 | 913 ms/step , 6885.29 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 19:22:50 | Epoch: 1 | Step: 179850 | Dataset: 0-494861 | Loss: 0.694 | 914 ms/step , 6879.05 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-05 19:22:59 | Epoch: 1 | Step: 179860 | Dataset: 0-495181 | Loss: 0.733 | 914 ms/step , 6882.48 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 19:23:08 | Epoch: 1 | Step: 179870 | Dataset: 0-495501 | Loss: 0.746 | 915 ms/step , 6877.21 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-05 19:23:17 | Epoch: 1 | Step: 179880 | Dataset: 0-495821 | Loss: 0.712 | 915 ms/step , 6871.14 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 19:23:26 | Epoch: 1 | Step: 179890 | Dataset: 0-496141 | Loss: 0.719 | 914 ms/step , 6882.78 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 19:23:35 | Epoch: 1 | Step: 179900 | Dataset: 0-496461 | Loss: 0.762 | 914 ms/step , 6880.16 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 19:23:37 | Validation | Step: 179900 | Val_loss: 0.721 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:23:46 | Epoch: 1 | Step: 179910 | Dataset: 0-496781 | Loss: 0.722 | 913 ms/step , 
6886.09 GFLOP/s , 15265.6 tokens/s INFO:__main__:2024-11-05 19:23:55 | Epoch: 1 | Step: 179920 | Dataset: 0-497101 | Loss: 0.796 | 914 ms/step , 6884.79 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 19:24:05 | Epoch: 1 | Step: 179930 | Dataset: 0-497421 | Loss: 0.675 | 914 ms/step , 6880.21 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 19:24:14 | Epoch: 1 | Step: 179940 | Dataset: 0-497741 | Loss: 0.685 | 914 ms/step , 6884.46 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 19:24:23 | Epoch: 1 | Step: 179950 | Dataset: 0-498061 | Loss: 0.706 | 913 ms/step , 6889.12 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 19:24:32 | Epoch: 1 | Step: 179960 | Dataset: 0-498381 | Loss: 0.712 | 913 ms/step , 6889.79 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 19:24:41 | Epoch: 1 | Step: 179970 | Dataset: 0-498701 | Loss: 0.741 | 914 ms/step , 6881.44 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 19:24:50 | Epoch: 1 | Step: 179980 | Dataset: 0-499021 | Loss: 0.690 | 913 ms/step , 6887.14 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 19:24:59 | Epoch: 1 | Step: 179990 | Dataset: 0-499341 | Loss: 0.718 | 914 ms/step , 6880.43 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 19:25:09 | Epoch: 1 | Step: 180000 | Dataset: 0-499661 | Loss: 0.721 | 914 ms/step , 6882.38 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 19:25:10 | Validation | Step: 180000 | Val_loss: 0.792 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:25:10 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_192510_step_180000.pt` INFO:__main__:2024-11-05 19:25:20 | Epoch: 1 | Step: 180010 | Dataset: 0-499981 | Loss: 0.775 | 915 ms/step , 6876.84 GFLOP/s , 13795.4 tokens/s INFO:__main__:2024-11-05 19:25:30 | Epoch: 1 | Step: 180020 | Dataset: 0-500301 | Loss: 0.699 | 914 ms/step , 6884.38 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 19:25:39 | Epoch: 1 | Step: 180030 | Dataset: 0-500621 | Loss: 0.776 | 915 ms/step , 6872.60 
GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 19:25:48 | Epoch: 1 | Step: 180040 | Dataset: 0-500941 | Loss: 0.745 | 917 ms/step , 6860.44 GFLOP/s , 17891.8 tokens/s INFO:__main__:2024-11-05 19:25:57 | Epoch: 1 | Step: 180050 | Dataset: 0-501261 | Loss: 0.683 | 914 ms/step , 6882.91 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 19:26:06 | Epoch: 1 | Step: 180060 | Dataset: 0-501581 | Loss: 0.863 | 915 ms/step , 6876.57 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 19:26:15 | Epoch: 1 | Step: 180070 | Dataset: 0-501901 | Loss: 0.652 | 913 ms/step , 6890.08 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 19:26:24 | Epoch: 1 | Step: 180080 | Dataset: 0-502221 | Loss: 0.693 | 915 ms/step , 6870.42 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-05 19:26:34 | Epoch: 1 | Step: 180090 | Dataset: 0-502541 | Loss: 0.748 | 914 ms/step , 6881.21 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-05 19:26:43 | Epoch: 1 | Step: 180100 | Dataset: 0-502861 | Loss: 0.717 | 913 ms/step , 6890.33 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 19:26:44 | Validation | Step: 180100 | Val_loss: 0.728 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:26:53 | Epoch: 1 | Step: 180110 | Dataset: 0-503181 | Loss: 0.670 | 914 ms/step , 6878.71 GFLOP/s , 15258.1 tokens/s INFO:__main__:2024-11-05 19:27:03 | Epoch: 1 | Step: 180120 | Dataset: 0-503501 | Loss: 0.748 | 914 ms/step , 6881.62 GFLOP/s , 17905.5 tokens/s INFO:__main__:2024-11-05 19:27:12 | Epoch: 1 | Step: 180130 | Dataset: 0-503821 | Loss: 0.759 | 914 ms/step , 6882.97 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 19:27:21 | Epoch: 1 | Step: 180140 | Dataset: 0-504141 | Loss: 0.763 | 914 ms/step , 6881.74 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 19:27:30 | Epoch: 1 | Step: 180150 | Dataset: 0-504461 | Loss: 0.750 | 914 ms/step , 6880.58 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 19:27:39 | Epoch: 1 | Step: 180160 | Dataset: 0-504781 | Loss: 0.718 | 914 ms/step , 6879.92 GFLOP/s , 
17913.4 tokens/s INFO:__main__:2024-11-05 19:27:48 | Epoch: 1 | Step: 180170 | Dataset: 0-505101 | Loss: 0.830 | 915 ms/step , 6874.19 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 19:27:57 | Epoch: 1 | Step: 180180 | Dataset: 0-505421 | Loss: 0.696 | 914 ms/step , 6884.55 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 19:28:07 | Epoch: 1 | Step: 180190 | Dataset: 0-505741 | Loss: 0.750 | 914 ms/step , 6880.10 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 19:28:16 | Epoch: 1 | Step: 180200 | Dataset: 0-506061 | Loss: 0.720 | 913 ms/step , 6886.09 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 19:28:17 | Validation | Step: 180200 | Val_loss: 0.731 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:28:26 | Epoch: 1 | Step: 180210 | Dataset: 0-506381 | Loss: 0.691 | 915 ms/step , 6876.57 GFLOP/s , 15255.1 tokens/s INFO:__main__:2024-11-05 19:28:36 | Epoch: 1 | Step: 180220 | Dataset: 0-506701 | Loss: 0.706 | 913 ms/step , 6889.29 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 19:28:45 | Epoch: 1 | Step: 180230 | Dataset: 0-507021 | Loss: 0.656 | 913 ms/step , 6885.97 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 19:28:54 | Epoch: 1 | Step: 180240 | Dataset: 0-507341 | Loss: 0.745 | 915 ms/step , 6876.95 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 19:29:03 | Epoch: 1 | Step: 180250 | Dataset: 0-507661 | Loss: 0.669 | 915 ms/step , 6877.10 GFLOP/s , 17900.3 tokens/s INFO:__main__:2024-11-05 19:29:12 | Epoch: 1 | Step: 180260 | Dataset: 0-507981 | Loss: 0.764 | 914 ms/step , 6879.46 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 19:29:21 | Epoch: 1 | Step: 180270 | Dataset: 0-508301 | Loss: 0.718 | 914 ms/step , 6884.57 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-05 19:29:31 | Epoch: 1 | Step: 180280 | Dataset: 0-508621 | Loss: 0.724 | 915 ms/step , 6876.45 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 19:29:40 | Epoch: 1 | Step: 180290 | Dataset: 0-508941 | Loss: 0.682 | 912 ms/step , 6894.46 GFLOP/s , 17924.0 
tokens/s INFO:__main__:2024-11-05 19:29:49 | Epoch: 1 | Step: 180300 | Dataset: 0-509261 | Loss: 0.653 | 913 ms/step , 6888.59 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 19:29:50 | Validation | Step: 180300 | Val_loss: 0.772 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:30:00 | Epoch: 1 | Step: 180310 | Dataset: 0-509581 | Loss: 0.673 | 914 ms/step , 6880.09 GFLOP/s , 15257.9 tokens/s INFO:__main__:2024-11-05 19:30:09 | Epoch: 1 | Step: 180320 | Dataset: 0-509901 | Loss: 0.681 | 914 ms/step , 6878.73 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 19:30:18 | Epoch: 1 | Step: 180330 | Dataset: 0-510221 | Loss: 0.774 | 916 ms/step , 6867.82 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 19:30:27 | Epoch: 1 | Step: 180340 | Dataset: 0-510541 | Loss: 0.620 | 915 ms/step , 6876.84 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 19:30:36 | Epoch: 1 | Step: 180350 | Dataset: 0-510861 | Loss: 0.693 | 915 ms/step , 6873.84 GFLOP/s , 17907.1 tokens/s INFO:__main__:2024-11-05 19:30:45 | Epoch: 1 | Step: 180360 | Dataset: 0-511181 | Loss: 0.825 | 914 ms/step , 6879.69 GFLOP/s , 17908.8 tokens/s INFO:__main__:2024-11-05 19:30:54 | Epoch: 1 | Step: 180370 | Dataset: 0-511501 | Loss: 0.753 | 914 ms/step , 6879.30 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 19:31:04 | Epoch: 1 | Step: 180380 | Dataset: 0-511821 | Loss: 0.739 | 914 ms/step , 6882.50 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 19:31:13 | Epoch: 1 | Step: 180390 | Dataset: 0-512141 | Loss: 0.776 | 913 ms/step , 6888.41 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 19:31:22 | Epoch: 1 | Step: 180400 | Dataset: 0-512461 | Loss: 0.749 | 916 ms/step , 6867.90 GFLOP/s , 17902.0 tokens/s INFO:__main__:2024-11-05 19:31:23 | Validation | Step: 180400 | Val_loss: 0.725 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:31:33 | Epoch: 1 | Step: 180410 | Dataset: 0-512781 | Loss: 0.678 | 913 ms/step , 6885.76 GFLOP/s , 15261.3 tokens/s INFO:__main__:2024-11-05 19:31:42 | Epoch: 
1 | Step: 180420 | Dataset: 0-513101 | Loss: 0.777 | 913 ms/step , 6890.66 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-05 19:31:51 | Epoch: 1 | Step: 180430 | Dataset: 0-513421 | Loss: 0.696 | 915 ms/step , 6872.59 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-05 19:32:00 | Epoch: 1 | Step: 180440 | Dataset: 0-513741 | Loss: 0.682 | 914 ms/step , 6883.60 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 19:32:09 | Epoch: 1 | Step: 180450 | Dataset: 0-514061 | Loss: 0.665 | 914 ms/step , 6879.86 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 19:32:18 | Epoch: 1 | Step: 180460 | Dataset: 0-514381 | Loss: 0.688 | 913 ms/step , 6888.06 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 19:32:27 | Epoch: 1 | Step: 180470 | Dataset: 0-514701 | Loss: 0.688 | 915 ms/step , 6870.52 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 19:32:37 | Epoch: 1 | Step: 180480 | Dataset: 0-515021 | Loss: 0.723 | 914 ms/step , 6879.36 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 19:32:46 | Epoch: 1 | Step: 180490 | Dataset: 0-515341 | Loss: 0.708 | 914 ms/step , 6883.44 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 19:32:55 | Epoch: 1 | Step: 180500 | Dataset: 0-515661 | Loss: 0.709 | 914 ms/step , 6879.85 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 19:32:56 | Validation | Step: 180500 | Val_loss: 0.742 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:33:06 | Epoch: 1 | Step: 180510 | Dataset: 0-515981 | Loss: 0.696 | 914 ms/step , 6884.50 GFLOP/s , 15261.4 tokens/s INFO:__main__:2024-11-05 19:33:15 | Epoch: 1 | Step: 180520 | Dataset: 0-516301 | Loss: 0.705 | 914 ms/step , 6883.47 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 19:33:24 | Epoch: 1 | Step: 180530 | Dataset: 0-516621 | Loss: 0.824 | 914 ms/step , 6882.35 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 19:33:33 | Epoch: 1 | Step: 180540 | Dataset: 0-516941 | Loss: 0.790 | 916 ms/step , 6865.75 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 19:33:42 | Epoch: 1 | Step: 
180550 | Dataset: 0-517261 | Loss: 0.706 | 915 ms/step , 6872.92 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 19:33:51 | Epoch: 1 | Step: 180560 | Dataset: 0-517581 | Loss: 0.685 | 912 ms/step , 6893.16 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 19:34:00 | Epoch: 1 | Step: 180570 | Dataset: 0-517901 | Loss: 0.706 | 915 ms/step , 6874.99 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 19:34:10 | Epoch: 1 | Step: 180580 | Dataset: 0-518221 | Loss: 0.772 | 913 ms/step , 6885.58 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 19:34:19 | Epoch: 1 | Step: 180590 | Dataset: 0-518541 | Loss: 0.740 | 915 ms/step , 6873.69 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 19:34:28 | Epoch: 1 | Step: 180600 | Dataset: 0-518861 | Loss: 0.718 | 912 ms/step , 6897.99 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 19:34:30 | Validation | Step: 180600 | Val_loss: 0.705 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:34:39 | Epoch: 1 | Step: 180610 | Dataset: 0-519181 | Loss: 0.727 | 913 ms/step , 6885.76 GFLOP/s , 15261.6 tokens/s INFO:__main__:2024-11-05 19:34:48 | Epoch: 1 | Step: 180620 | Dataset: 0-519501 | Loss: 0.712 | 915 ms/step , 6872.53 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 19:34:57 | Epoch: 1 | Step: 180630 | Dataset: 0-519821 | Loss: 0.659 | 913 ms/step , 6890.08 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 19:35:06 | Epoch: 1 | Step: 180640 | Dataset: 0-520141 | Loss: 0.771 | 913 ms/step , 6885.53 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 19:35:15 | Epoch: 1 | Step: 180650 | Dataset: 0-520461 | Loss: 0.738 | 914 ms/step , 6880.48 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 19:35:24 | Epoch: 1 | Step: 180660 | Dataset: 0-520781 | Loss: 0.658 | 914 ms/step , 6880.59 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 19:35:34 | Epoch: 1 | Step: 180670 | Dataset: 0-521101 | Loss: 0.699 | 913 ms/step , 6888.85 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 19:35:43 | Epoch: 1 | Step: 180680 | 
Dataset: 0-521421 | Loss: 0.686 | 914 ms/step , 6881.43 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 19:35:52 | Epoch: 1 | Step: 180690 | Dataset: 0-521741 | Loss: 0.802 | 913 ms/step , 6886.84 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 19:36:01 | Epoch: 1 | Step: 180700 | Dataset: 0-522061 | Loss: 0.701 | 915 ms/step , 6872.58 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 19:36:03 | Validation | Step: 180700 | Val_loss: 0.770 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:36:12 | Epoch: 1 | Step: 180710 | Dataset: 0-522381 | Loss: 0.690 | 913 ms/step , 6889.29 GFLOP/s , 15243.5 tokens/s INFO:__main__:2024-11-05 19:36:21 | Epoch: 1 | Step: 180720 | Dataset: 0-522701 | Loss: 0.721 | 913 ms/step , 6887.41 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 19:36:30 | Epoch: 1 | Step: 180730 | Dataset: 0-523021 | Loss: 0.761 | 914 ms/step , 6878.60 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 19:36:39 | Epoch: 1 | Step: 180740 | Dataset: 0-523341 | Loss: 0.670 | 913 ms/step , 6889.66 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 19:36:48 | Epoch: 1 | Step: 180750 | Dataset: 0-523661 | Loss: 0.769 | 914 ms/step , 6880.84 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 19:36:57 | Epoch: 1 | Step: 180760 | Dataset: 0-523981 | Loss: 0.709 | 913 ms/step , 6886.00 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 19:37:07 | Epoch: 1 | Step: 180770 | Dataset: 0-524301 | Loss: 0.667 | 915 ms/step , 6876.91 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 19:37:16 | Epoch: 1 | Step: 180780 | Dataset: 0-524621 | Loss: 0.666 | 912 ms/step , 6893.50 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 19:37:25 | Epoch: 1 | Step: 180790 | Dataset: 0-524941 | Loss: 0.794 | 914 ms/step , 6881.93 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 19:37:34 | Epoch: 1 | Step: 180800 | Dataset: 0-525261 | Loss: 0.757 | 915 ms/step , 6877.48 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 19:37:36 | Validation | Step: 180800 | 
Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:37:45 | Epoch: 1 | Step: 180810 | Dataset: 0-525581 | Loss: 0.591 | 913 ms/step , 6886.32 GFLOP/s , 15261.9 tokens/s INFO:__main__:2024-11-05 19:37:54 | Epoch: 1 | Step: 180820 | Dataset: 0-525901 | Loss: 0.658 | 914 ms/step , 6881.70 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 19:38:03 | Epoch: 1 | Step: 180830 | Dataset: 0-526221 | Loss: 0.626 | 914 ms/step , 6880.42 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 19:38:12 | Epoch: 1 | Step: 180840 | Dataset: 0-526541 | Loss: 0.667 | 914 ms/step , 6883.68 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 19:38:21 | Epoch: 1 | Step: 180850 | Dataset: 0-526861 | Loss: 0.677 | 913 ms/step , 6890.30 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 19:38:30 | Epoch: 1 | Step: 180860 | Dataset: 0-527181 | Loss: 0.691 | 915 ms/step , 6875.04 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 19:38:40 | Epoch: 1 | Step: 180870 | Dataset: 0-527501 | Loss: 0.724 | 915 ms/step , 6875.82 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 19:38:49 | Epoch: 1 | Step: 180880 | Dataset: 0-527821 | Loss: 0.737 | 915 ms/step , 6876.82 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 19:38:58 | Epoch: 1 | Step: 180890 | Dataset: 0-528141 | Loss: 0.720 | 913 ms/step , 6886.03 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 19:39:07 | Epoch: 1 | Step: 180900 | Dataset: 0-528461 | Loss: 0.693 | 913 ms/step , 6887.48 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 19:39:09 | Validation | Step: 180900 | Val_loss: 0.759 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:39:18 | Epoch: 1 | Step: 180910 | Dataset: 0-528781 | Loss: 0.745 | 912 ms/step , 6894.21 GFLOP/s , 15263.7 tokens/s INFO:__main__:2024-11-05 19:39:27 | Epoch: 1 | Step: 180920 | Dataset: 0-529101 | Loss: 0.761 | 915 ms/step , 6877.08 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 19:39:36 | Epoch: 1 | Step: 180930 | Dataset: 0-529421 | Loss: 0.662 | 913 ms/step , 
6888.39 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-05 19:39:45 | Epoch: 1 | Step: 180940 | Dataset: 0-529741 | Loss: 0.645 | 912 ms/step , 6895.11 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 19:39:54 | Epoch: 1 | Step: 180950 | Dataset: 0-530061 | Loss: 0.670 | 912 ms/step , 6894.26 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-05 19:40:03 | Epoch: 1 | Step: 180960 | Dataset: 0-530381 | Loss: 0.660 | 913 ms/step , 6889.13 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 19:40:13 | Epoch: 1 | Step: 180970 | Dataset: 0-530701 | Loss: 0.771 | 916 ms/step , 6868.62 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-05 19:40:22 | Epoch: 1 | Step: 180980 | Dataset: 0-531021 | Loss: 0.738 | 914 ms/step , 6881.21 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 19:40:31 | Epoch: 1 | Step: 180990 | Dataset: 0-531341 | Loss: 0.614 | 913 ms/step , 6886.12 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-05 19:40:40 | Epoch: 1 | Step: 181000 | Dataset: 0-531661 | Loss: 0.760 | 914 ms/step , 6882.38 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 19:40:42 | Validation | Step: 181000 | Val_loss: 0.724 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:40:42 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_194042_step_181000.pt` INFO:__main__:2024-11-05 19:40:52 | Epoch: 1 | Step: 181010 | Dataset: 0-531981 | Loss: 0.641 | 913 ms/step , 6892.59 GFLOP/s , 13802.5 tokens/s INFO:__main__:2024-11-05 19:41:01 | Epoch: 1 | Step: 181020 | Dataset: 0-532301 | Loss: 0.563 | 914 ms/step , 6878.60 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-05 19:41:10 | Epoch: 1 | Step: 181030 | Dataset: 0-532621 | Loss: 0.693 | 913 ms/step , 6889.62 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 19:41:19 | Epoch: 1 | Step: 181040 | Dataset: 0-532941 | Loss: 0.632 | 913 ms/step , 6888.84 GFLOP/s , 17900.3 tokens/s INFO:__main__:2024-11-05 19:41:28 | Epoch: 1 | Step: 181050 | Dataset: 0-533261 | Loss: 0.759 | 917 ms/step , 6862.28 
GFLOP/s , 17899.9 tokens/s INFO:__main__:2024-11-05 19:41:38 | Epoch: 1 | Step: 181060 | Dataset: 0-533581 | Loss: 0.681 | 915 ms/step , 6877.04 GFLOP/s , 17896.7 tokens/s INFO:__main__:2024-11-05 19:41:47 | Epoch: 1 | Step: 181070 | Dataset: 0-533901 | Loss: 0.689 | 916 ms/step , 6862.93 GFLOP/s , 17904.1 tokens/s INFO:__main__:2024-11-05 19:41:56 | Epoch: 1 | Step: 181080 | Dataset: 0-534221 | Loss: 0.761 | 914 ms/step , 6883.90 GFLOP/s , 17901.9 tokens/s INFO:__main__:2024-11-05 19:42:05 | Epoch: 1 | Step: 181090 | Dataset: 0-534541 | Loss: 0.830 | 916 ms/step , 6869.15 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-05 19:42:14 | Epoch: 1 | Step: 181100 | Dataset: 0-534861 | Loss: 0.647 | 914 ms/step , 6880.32 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 19:42:16 | Validation | Step: 181100 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:42:25 | Epoch: 1 | Step: 181110 | Dataset: 0-535181 | Loss: 0.680 | 912 ms/step , 6893.22 GFLOP/s , 15262.9 tokens/s INFO:__main__:2024-11-05 19:42:34 | Epoch: 1 | Step: 181120 | Dataset: 0-535501 | Loss: 0.656 | 915 ms/step , 6876.73 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-05 19:42:43 | Epoch: 1 | Step: 181130 | Dataset: 0-535821 | Loss: 0.736 | 914 ms/step , 6880.68 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 19:42:52 | Epoch: 1 | Step: 181140 | Dataset: 0-536141 | Loss: 0.695 | 914 ms/step , 6884.97 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 19:43:02 | Epoch: 1 | Step: 181150 | Dataset: 0-536461 | Loss: 0.756 | 914 ms/step , 6881.20 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 19:43:11 | Epoch: 1 | Step: 181160 | Dataset: 0-536781 | Loss: 0.709 | 913 ms/step , 6886.02 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 19:43:20 | Epoch: 1 | Step: 181170 | Dataset: 0-537101 | Loss: 0.807 | 914 ms/step , 6883.33 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 19:43:29 | Epoch: 1 | Step: 181180 | Dataset: 0-537421 | Loss: 0.730 | 912 ms/step , 6892.95 GFLOP/s , 
17919.4 tokens/s INFO:__main__:2024-11-05 19:43:38 | Epoch: 1 | Step: 181190 | Dataset: 0-537741 | Loss: 0.773 | 914 ms/step , 6884.67 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-05 19:43:47 | Epoch: 1 | Step: 181200 | Dataset: 0-538061 | Loss: 0.721 | 914 ms/step , 6878.13 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 19:43:49 | Validation | Step: 181200 | Val_loss: 0.769 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:43:58 | Epoch: 1 | Step: 181210 | Dataset: 0-538381 | Loss: 0.826 | 914 ms/step , 6879.00 GFLOP/s , 15280.8 tokens/s INFO:__main__:2024-11-05 19:44:07 | Epoch: 1 | Step: 181220 | Dataset: 0-538701 | Loss: 0.679 | 915 ms/step , 6877.37 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-05 19:44:16 | Epoch: 1 | Step: 181230 | Dataset: 0-539021 | Loss: 0.659 | 914 ms/step , 6877.98 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 19:44:25 | Epoch: 1 | Step: 181240 | Dataset: 0-539341 | Loss: 0.717 | 914 ms/step , 6882.72 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 19:44:35 | Epoch: 1 | Step: 181250 | Dataset: 0-539661 | Loss: 0.687 | 913 ms/step , 6886.70 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 19:44:44 | Epoch: 1 | Step: 181260 | Dataset: 0-539981 | Loss: 0.716 | 914 ms/step , 6880.04 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 19:44:53 | Epoch: 1 | Step: 181270 | Dataset: 0-540301 | Loss: 0.749 | 914 ms/step , 6881.02 GFLOP/s , 17901.3 tokens/s INFO:__main__:2024-11-05 19:45:02 | Epoch: 1 | Step: 181280 | Dataset: 0-540621 | Loss: 0.809 | 916 ms/step , 6868.60 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-05 19:45:11 | Epoch: 1 | Step: 181290 | Dataset: 0-540941 | Loss: 0.847 | 913 ms/step , 6886.64 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 19:45:20 | Epoch: 1 | Step: 181300 | Dataset: 0-541261 | Loss: 0.766 | 915 ms/step , 6876.67 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 19:45:22 | Validation | Step: 181300 | Val_loss: 0.740 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:45:31 
| Epoch: 1 | Step: 181310 | Dataset: 0-541581 | Loss: 0.703 | 913 ms/step , 6887.78 GFLOP/s , 15284.4 tokens/s INFO:__main__:2024-11-05 19:45:40 | Epoch: 1 | Step: 181320 | Dataset: 0-541901 | Loss: 0.671 | 915 ms/step , 6871.09 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 19:45:49 | Epoch: 1 | Step: 181330 | Dataset: 0-542221 | Loss: 0.768 | 915 ms/step , 6876.67 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 19:45:58 | Epoch: 1 | Step: 181340 | Dataset: 0-542541 | Loss: 0.748 | 914 ms/step , 6880.86 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 19:46:08 | Epoch: 1 | Step: 181350 | Dataset: 0-542861 | Loss: 0.610 | 914 ms/step , 6881.12 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 19:46:17 | Epoch: 1 | Step: 181360 | Dataset: 0-543181 | Loss: 0.730 | 914 ms/step , 6884.22 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 19:46:26 | Epoch: 1 | Step: 181370 | Dataset: 0-543501 | Loss: 0.667 | 913 ms/step , 6887.60 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 19:46:35 | Epoch: 1 | Step: 181380 | Dataset: 0-543821 | Loss: 0.702 | 915 ms/step , 6876.92 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 19:46:44 | Epoch: 1 | Step: 181390 | Dataset: 0-544141 | Loss: 0.654 | 913 ms/step , 6890.74 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 19:46:53 | Epoch: 1 | Step: 181400 | Dataset: 0-544461 | Loss: 0.604 | 912 ms/step , 6892.63 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 19:46:55 | Validation | Step: 181400 | Val_loss: 0.808 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:47:04 | Epoch: 1 | Step: 181410 | Dataset: 0-544781 | Loss: 0.640 | 914 ms/step , 6878.94 GFLOP/s , 15257.1 tokens/s INFO:__main__:2024-11-05 19:47:13 | Epoch: 1 | Step: 181420 | Dataset: 0-545101 | Loss: 0.701 | 913 ms/step , 6886.93 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 19:47:22 | Epoch: 1 | Step: 181430 | Dataset: 0-545421 | Loss: 0.809 | 915 ms/step , 6871.88 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 19:47:31 | Epoch: 1 
| Step: 181440 | Dataset: 0-545741 | Loss: 0.825 | 914 ms/step , 6879.20 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 19:47:41 | Epoch: 1 | Step: 181450 | Dataset: 0-546061 | Loss: 0.722 | 914 ms/step , 6881.92 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 19:47:50 | Epoch: 1 | Step: 181460 | Dataset: 0-546381 | Loss: 0.803 | 915 ms/step , 6876.36 GFLOP/s , 17908.8 tokens/s INFO:__main__:2024-11-05 19:47:59 | Epoch: 1 | Step: 181470 | Dataset: 0-546701 | Loss: 0.642 | 915 ms/step , 6871.11 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 19:48:08 | Epoch: 1 | Step: 181480 | Dataset: 0-547021 | Loss: 0.780 | 915 ms/step , 6876.68 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 19:48:17 | Epoch: 1 | Step: 181490 | Dataset: 0-547341 | Loss: 0.692 | 913 ms/step , 6889.77 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 19:48:26 | Epoch: 1 | Step: 181500 | Dataset: 0-547661 | Loss: 0.761 | 915 ms/step , 6872.77 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-05 19:48:28 | Validation | Step: 181500 | Val_loss: 0.789 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:48:37 | Epoch: 1 | Step: 181510 | Dataset: 0-547981 | Loss: 0.670 | 913 ms/step , 6886.97 GFLOP/s , 15253.8 tokens/s INFO:__main__:2024-11-05 19:48:46 | Epoch: 1 | Step: 181520 | Dataset: 0-548301 | Loss: 0.721 | 914 ms/step , 6881.07 GFLOP/s , 17902.5 tokens/s INFO:__main__:2024-11-05 19:48:55 | Epoch: 1 | Step: 181530 | Dataset: 0-548621 | Loss: 0.680 | 914 ms/step , 6883.88 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 19:49:05 | Epoch: 1 | Step: 181540 | Dataset: 0-548941 | Loss: 0.628 | 914 ms/step , 6879.91 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 19:49:14 | Epoch: 1 | Step: 181550 | Dataset: 0-549261 | Loss: 0.683 | 913 ms/step , 6891.96 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 19:49:23 | Epoch: 1 | Step: 181560 | Dataset: 0-549581 | Loss: 0.734 | 913 ms/step , 6886.76 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 19:49:32 | Epoch: 1 | Step: 
181570 | Dataset: 0-549901 | Loss: 0.711 | 913 ms/step , 6888.94 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 19:49:41 | Epoch: 1 | Step: 181580 | Dataset: 0-550221 | Loss: 0.716 | 913 ms/step , 6889.00 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 19:49:50 | Epoch: 1 | Step: 181590 | Dataset: 0-550541 | Loss: 0.754 | 915 ms/step , 6875.19 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 19:49:59 | Epoch: 1 | Step: 181600 | Dataset: 0-550861 | Loss: 0.714 | 915 ms/step , 6875.35 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 19:50:01 | Validation | Step: 181600 | Val_loss: 0.773 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:50:10 | Epoch: 1 | Step: 181610 | Dataset: 0-551181 | Loss: 0.676 | 915 ms/step , 6876.97 GFLOP/s , 15263.2 tokens/s INFO:__main__:2024-11-05 19:50:19 | Epoch: 1 | Step: 181620 | Dataset: 0-551501 | Loss: 0.600 | 913 ms/step , 6885.51 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 19:50:28 | Epoch: 1 | Step: 181630 | Dataset: 0-551821 | Loss: 0.703 | 913 ms/step , 6885.85 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 19:50:38 | Epoch: 1 | Step: 181640 | Dataset: 0-552141 | Loss: 0.759 | 914 ms/step , 6884.24 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 19:50:47 | Epoch: 1 | Step: 181650 | Dataset: 0-552461 | Loss: 0.743 | 913 ms/step , 6890.60 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-05 19:50:56 | Epoch: 1 | Step: 181660 | Dataset: 0-552781 | Loss: 0.680 | 914 ms/step , 6880.73 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 19:51:05 | Epoch: 1 | Step: 181670 | Dataset: 0-553101 | Loss: 0.724 | 914 ms/step , 6881.72 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 19:51:14 | Epoch: 1 | Step: 181680 | Dataset: 0-553421 | Loss: 0.820 | 917 ms/step , 6861.56 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 19:51:23 | Epoch: 1 | Step: 181690 | Dataset: 0-553741 | Loss: 0.701 | 916 ms/step , 6868.67 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-05 19:51:32 | Epoch: 1 | Step: 181700 | 
Dataset: 0-554061 | Loss: 0.754 | 914 ms/step , 6883.76 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-05 19:51:34 | Validation | Step: 181700 | Val_loss: 0.715 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:51:43 | Epoch: 1 | Step: 181710 | Dataset: 0-554381 | Loss: 0.715 | 914 ms/step , 6880.38 GFLOP/s , 15257.9 tokens/s INFO:__main__:2024-11-05 19:51:52 | Epoch: 1 | Step: 181720 | Dataset: 0-554701 | Loss: 0.750 | 917 ms/step , 6858.20 GFLOP/s , 17903.3 tokens/s INFO:__main__:2024-11-05 19:52:01 | Epoch: 1 | Step: 181730 | Dataset: 0-555021 | Loss: 0.637 | 914 ms/step , 6884.82 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-05 19:52:11 | Epoch: 1 | Step: 181740 | Dataset: 0-555341 | Loss: 0.688 | 914 ms/step , 6878.20 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-05 19:52:20 | Epoch: 1 | Step: 181750 | Dataset: 0-555661 | Loss: 0.714 | 916 ms/step , 6863.32 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-05 19:52:29 | Epoch: 1 | Step: 181760 | Dataset: 0-555981 | Loss: 0.766 | 914 ms/step , 6880.09 GFLOP/s , 17907.7 tokens/s INFO:__main__:2024-11-05 19:52:38 | Epoch: 1 | Step: 181770 | Dataset: 0-556301 | Loss: 0.840 | 914 ms/step , 6883.47 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-05 19:52:47 | Epoch: 1 | Step: 181780 | Dataset: 0-556621 | Loss: 0.638 | 914 ms/step , 6882.93 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-05 19:52:56 | Epoch: 1 | Step: 181790 | Dataset: 0-556941 | Loss: 0.711 | 914 ms/step , 6882.67 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-05 19:53:05 | Epoch: 1 | Step: 181800 | Dataset: 0-557261 | Loss: 0.623 | 914 ms/step , 6883.96 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 19:53:07 | Validation | Step: 181800 | Val_loss: 0.468 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:53:16 | Epoch: 1 | Step: 181810 | Dataset: 0-557581 | Loss: 0.672 | 914 ms/step , 6880.66 GFLOP/s , 15275.1 tokens/s INFO:__main__:2024-11-05 19:53:25 | Epoch: 1 | Step: 181820 | Dataset: 0-557901 | Loss: 0.707 | 913 ms/step , 
6886.35 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 19:53:34 | Epoch: 1 | Step: 181830 | Dataset: 0-558221 | Loss: 0.614 | 914 ms/step , 6879.90 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 19:53:44 | Epoch: 1 | Step: 181840 | Dataset: 0-558541 | Loss: 0.723 | 913 ms/step , 6888.07 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 19:53:53 | Epoch: 1 | Step: 181850 | Dataset: 0-558861 | Loss: 0.696 | 915 ms/step , 6872.43 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 19:54:02 | Epoch: 1 | Step: 181860 | Dataset: 0-559181 | Loss: 0.677 | 913 ms/step , 6887.46 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 19:54:11 | Epoch: 1 | Step: 181870 | Dataset: 0-559501 | Loss: 0.719 | 914 ms/step , 6881.48 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 19:54:20 | Epoch: 1 | Step: 181880 | Dataset: 0-559821 | Loss: 0.713 | 914 ms/step , 6878.54 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 19:54:29 | Epoch: 1 | Step: 181890 | Dataset: 0-560141 | Loss: 0.719 | 915 ms/step , 6871.30 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-05 19:54:38 | Epoch: 1 | Step: 181900 | Dataset: 0-560461 | Loss: 0.699 | 914 ms/step , 6880.01 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 19:54:40 | Validation | Step: 181900 | Val_loss: 0.425 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:54:49 | Epoch: 1 | Step: 181910 | Dataset: 0-560781 | Loss: 0.640 | 914 ms/step , 6878.78 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-05 19:54:58 | Epoch: 1 | Step: 181920 | Dataset: 0-561101 | Loss: 0.644 | 913 ms/step , 6888.39 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 19:55:08 | Epoch: 1 | Step: 181930 | Dataset: 0-561421 | Loss: 0.688 | 913 ms/step , 6886.23 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 19:55:17 | Epoch: 1 | Step: 181940 | Dataset: 0-561741 | Loss: 0.829 | 913 ms/step , 6886.05 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-05 19:55:26 | Epoch: 1 | Step: 181950 | Dataset: 0-562061 | Loss: 0.693 | 914 ms/step , 6883.89 
GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-05 19:55:35 | Epoch: 1 | Step: 181960 | Dataset: 0-562381 | Loss: 0.690 | 912 ms/step , 6898.62 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 19:55:44 | Epoch: 1 | Step: 181970 | Dataset: 0-562701 | Loss: 0.785 | 914 ms/step , 6882.16 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 19:55:53 | Epoch: 1 | Step: 181980 | Dataset: 0-563021 | Loss: 0.738 | 914 ms/step , 6880.27 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 19:56:02 | Epoch: 1 | Step: 181990 | Dataset: 0-563341 | Loss: 0.710 | 913 ms/step , 6885.17 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-05 19:56:12 | Epoch: 1 | Step: 182000 | Dataset: 0-563661 | Loss: 0.660 | 913 ms/step , 6888.32 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 19:56:13 | Validation | Step: 182000 | Val_loss: 0.408 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:56:13 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_195613_step_182000.pt` INFO:__main__:2024-11-05 19:56:23 | Epoch: 1 | Step: 182010 | Dataset: 0-563981 | Loss: 0.688 | 914 ms/step , 6882.91 GFLOP/s , 13801.3 tokens/s INFO:__main__:2024-11-05 19:56:33 | Epoch: 1 | Step: 182020 | Dataset: 0-564301 | Loss: 0.685 | 915 ms/step , 6873.35 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 19:56:42 | Epoch: 1 | Step: 182030 | Dataset: 0-564621 | Loss: 0.762 | 915 ms/step , 6870.44 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-05 19:56:51 | Epoch: 1 | Step: 182040 | Dataset: 0-564941 | Loss: 0.666 | 914 ms/step , 6877.80 GFLOP/s , 17910.0 tokens/s INFO:__main__:2024-11-05 19:57:00 | Epoch: 1 | Step: 182050 | Dataset: 0-565261 | Loss: 0.657 | 915 ms/step , 6877.22 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 19:57:09 | Epoch: 1 | Step: 182060 | Dataset: 0-565581 | Loss: 0.731 | 913 ms/step , 6885.80 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 19:57:18 | Epoch: 1 | Step: 182070 | Dataset: 0-565901 | Loss: 0.728 | 913 ms/step , 6886.24 GFLOP/s , 
17938.1 tokens/s INFO:__main__:2024-11-05 19:57:27 | Epoch: 1 | Step: 182080 | Dataset: 0-566221 | Loss: 0.695 | 915 ms/step , 6876.68 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 19:57:37 | Epoch: 1 | Step: 182090 | Dataset: 0-566541 | Loss: 0.738 | 914 ms/step , 6881.59 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 19:57:46 | Epoch: 1 | Step: 182100 | Dataset: 0-566861 | Loss: 0.714 | 915 ms/step , 6877.36 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 19:57:47 | Validation | Step: 182100 | Val_loss: 0.692 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:57:56 | Epoch: 1 | Step: 182110 | Dataset: 0-567181 | Loss: 0.717 | 913 ms/step , 6885.50 GFLOP/s , 15270.7 tokens/s INFO:__main__:2024-11-05 19:58:06 | Epoch: 1 | Step: 182120 | Dataset: 0-567501 | Loss: 0.682 | 913 ms/step , 6891.75 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 19:58:15 | Epoch: 1 | Step: 182130 | Dataset: 0-567821 | Loss: 0.837 | 916 ms/step , 6864.11 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 19:58:24 | Epoch: 1 | Step: 182140 | Dataset: 0-568141 | Loss: 0.718 | 912 ms/step , 6896.87 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 19:58:33 | Epoch: 1 | Step: 182150 | Dataset: 0-568461 | Loss: 0.790 | 914 ms/step , 6884.41 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 19:58:42 | Epoch: 1 | Step: 182160 | Dataset: 0-568781 | Loss: 0.700 | 912 ms/step , 6893.29 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-05 19:58:51 | Epoch: 1 | Step: 182170 | Dataset: 0-569101 | Loss: 0.710 | 912 ms/step , 6892.93 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 19:59:00 | Epoch: 1 | Step: 182180 | Dataset: 0-569421 | Loss: 0.769 | 913 ms/step , 6887.49 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 19:59:09 | Epoch: 1 | Step: 182190 | Dataset: 0-569741 | Loss: 0.715 | 913 ms/step , 6886.09 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 19:59:19 | Epoch: 1 | Step: 182200 | Dataset: 0-570061 | Loss: 0.733 | 912 ms/step , 6898.08 GFLOP/s , 17930.1 
tokens/s INFO:__main__:2024-11-05 19:59:20 | Validation | Step: 182200 | Val_loss: 0.657 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 19:59:29 | Epoch: 1 | Step: 182210 | Dataset: 0-570381 | Loss: 0.695 | 913 ms/step , 6886.76 GFLOP/s , 15275.9 tokens/s INFO:__main__:2024-11-05 19:59:38 | Epoch: 1 | Step: 182220 | Dataset: 0-570701 | Loss: 0.704 | 914 ms/step , 6882.56 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 19:59:48 | Epoch: 1 | Step: 182230 | Dataset: 0-571021 | Loss: 0.726 | 913 ms/step , 6886.15 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 19:59:57 | Epoch: 1 | Step: 182240 | Dataset: 0-571341 | Loss: 0.535 | 914 ms/step , 6884.97 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 20:00:06 | Epoch: 1 | Step: 182250 | Dataset: 0-571661 | Loss: 0.758 | 913 ms/step , 6885.37 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 20:00:15 | Epoch: 1 | Step: 182260 | Dataset: 0-571981 | Loss: 0.631 | 914 ms/step , 6880.49 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 20:00:24 | Epoch: 1 | Step: 182270 | Dataset: 0-572301 | Loss: 0.759 | 917 ms/step , 6859.77 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 20:00:33 | Epoch: 1 | Step: 182280 | Dataset: 0-572621 | Loss: 0.607 | 913 ms/step , 6892.48 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-05 20:00:42 | Epoch: 1 | Step: 182290 | Dataset: 0-572941 | Loss: 0.738 | 914 ms/step , 6883.05 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 20:00:52 | Epoch: 1 | Step: 182300 | Dataset: 0-573261 | Loss: 0.705 | 913 ms/step , 6892.07 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 20:00:53 | Validation | Step: 182300 | Val_loss: 0.702 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:01:02 | Epoch: 1 | Step: 182310 | Dataset: 0-573581 | Loss: 0.769 | 914 ms/step , 6884.11 GFLOP/s , 15261.3 tokens/s INFO:__main__:2024-11-05 20:01:12 | Epoch: 1 | Step: 182320 | Dataset: 0-573901 | Loss: 0.779 | 914 ms/step , 6881.18 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 20:01:21 | Epoch: 
1 | Step: 182330 | Dataset: 0-574221 | Loss: 0.777 | 913 ms/step , 6886.82 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-05 20:01:30 | Epoch: 1 | Step: 182340 | Dataset: 0-574541 | Loss: 0.721 | 913 ms/step , 6886.72 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 20:01:39 | Epoch: 1 | Step: 182350 | Dataset: 0-574861 | Loss: 0.750 | 916 ms/step , 6867.93 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 20:01:48 | Epoch: 1 | Step: 182360 | Dataset: 0-575181 | Loss: 0.733 | 913 ms/step , 6889.73 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 20:01:57 | Epoch: 1 | Step: 182370 | Dataset: 0-575501 | Loss: 0.730 | 914 ms/step , 6883.95 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 20:02:06 | Epoch: 1 | Step: 182380 | Dataset: 0-575821 | Loss: 0.714 | 913 ms/step , 6890.21 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 20:02:15 | Epoch: 1 | Step: 182390 | Dataset: 0-576141 | Loss: 0.730 | 913 ms/step , 6892.04 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 20:02:25 | Epoch: 1 | Step: 182400 | Dataset: 0-576461 | Loss: 0.721 | 913 ms/step , 6891.69 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 20:02:26 | Validation | Step: 182400 | Val_loss: 0.688 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:02:35 | Epoch: 1 | Step: 182410 | Dataset: 0-576781 | Loss: 0.623 | 912 ms/step , 6893.57 GFLOP/s , 15275.3 tokens/s INFO:__main__:2024-11-05 20:02:45 | Epoch: 1 | Step: 182420 | Dataset: 0-577101 | Loss: 0.705 | 915 ms/step , 6874.54 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 20:02:54 | Epoch: 1 | Step: 182430 | Dataset: 0-577421 | Loss: 0.766 | 915 ms/step , 6872.25 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 20:03:03 | Epoch: 1 | Step: 182440 | Dataset: 0-577741 | Loss: 0.674 | 914 ms/step , 6881.21 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 20:03:12 | Epoch: 1 | Step: 182450 | Dataset: 0-578061 | Loss: 0.661 | 913 ms/step , 6887.78 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 20:03:21 | Epoch: 1 | Step: 
182460 | Dataset: 0-578381 | Loss: 0.763 | 916 ms/step , 6869.71 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 20:03:30 | Epoch: 1 | Step: 182470 | Dataset: 0-578701 | Loss: 0.714 | 915 ms/step , 6876.24 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 20:03:39 | Epoch: 1 | Step: 182480 | Dataset: 0-579021 | Loss: 0.812 | 914 ms/step , 6881.87 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 20:03:48 | Epoch: 1 | Step: 182490 | Dataset: 0-579341 | Loss: 0.718 | 914 ms/step , 6884.48 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 20:03:58 | Epoch: 1 | Step: 182500 | Dataset: 0-579661 | Loss: 0.725 | 913 ms/step , 6885.88 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 20:03:59 | Validation | Step: 182500 | Val_loss: 0.695 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:04:08 | Epoch: 1 | Step: 182510 | Dataset: 0-579981 | Loss: 0.646 | 913 ms/step , 6886.89 GFLOP/s , 15266.1 tokens/s INFO:__main__:2024-11-05 20:04:18 | Epoch: 1 | Step: 182520 | Dataset: 0-580301 | Loss: 0.705 | 914 ms/step , 6879.10 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 20:04:27 | Epoch: 1 | Step: 182530 | Dataset: 0-580621 | Loss: 0.597 | 914 ms/step , 6881.01 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 20:04:36 | Epoch: 1 | Step: 182540 | Dataset: 0-580941 | Loss: 0.678 | 914 ms/step , 6882.90 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 20:04:45 | Epoch: 1 | Step: 182550 | Dataset: 0-581261 | Loss: 0.739 | 914 ms/step , 6883.12 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 20:04:54 | Epoch: 1 | Step: 182560 | Dataset: 0-581581 | Loss: 0.717 | 914 ms/step , 6881.57 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 20:05:03 | Epoch: 1 | Step: 182570 | Dataset: 0-581901 | Loss: 0.718 | 915 ms/step , 6876.87 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 20:05:12 | Epoch: 1 | Step: 182580 | Dataset: 0-582221 | Loss: 0.750 | 913 ms/step , 6889.43 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 20:05:22 | Epoch: 1 | Step: 182590 | 
Dataset: 0-582541 | Loss: 0.748 | 913 ms/step , 6890.93 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 20:05:31 | Epoch: 1 | Step: 182600 | Dataset: 0-582861 | Loss: 0.708 | 914 ms/step , 6882.76 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 20:05:32 | Validation | Step: 182600 | Val_loss: 0.730 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:05:41 | Epoch: 1 | Step: 182610 | Dataset: 0-583181 | Loss: 0.715 | 915 ms/step , 6875.88 GFLOP/s , 15264.4 tokens/s INFO:__main__:2024-11-05 20:05:51 | Epoch: 1 | Step: 182620 | Dataset: 0-583501 | Loss: 0.607 | 913 ms/step , 6888.13 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 20:06:00 | Epoch: 1 | Step: 182630 | Dataset: 0-583821 | Loss: 0.688 | 914 ms/step , 6880.05 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 20:06:09 | Epoch: 1 | Step: 182640 | Dataset: 0-584141 | Loss: 0.697 | 914 ms/step , 6881.04 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 20:06:18 | Epoch: 1 | Step: 182650 | Dataset: 0-584461 | Loss: 0.635 | 913 ms/step , 6886.50 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 20:06:27 | Epoch: 1 | Step: 182660 | Dataset: 0-584781 | Loss: 0.708 | 913 ms/step , 6885.27 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 20:06:36 | Epoch: 1 | Step: 182670 | Dataset: 0-585101 | Loss: 0.669 | 914 ms/step , 6883.25 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 20:06:45 | Epoch: 1 | Step: 182680 | Dataset: 0-585421 | Loss: 0.726 | 915 ms/step , 6876.45 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 20:06:55 | Epoch: 1 | Step: 182690 | Dataset: 0-585741 | Loss: 0.676 | 914 ms/step , 6884.65 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-05 20:07:04 | Epoch: 1 | Step: 182700 | Dataset: 0-586061 | Loss: 0.779 | 913 ms/step , 6887.90 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 20:07:05 | Validation | Step: 182700 | Val_loss: 0.671 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:07:14 | Epoch: 1 | Step: 182710 | Dataset: 0-586381 | Loss: 0.731 | 914 ms/step , 
6882.16 GFLOP/s , 15264.2 tokens/s INFO:__main__:2024-11-05 20:07:24 | Epoch: 1 | Step: 182720 | Dataset: 0-586701 | Loss: 0.728 | 913 ms/step , 6886.50 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 20:07:33 | Epoch: 1 | Step: 182730 | Dataset: 0-587021 | Loss: 0.704 | 915 ms/step , 6874.76 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-05 20:07:42 | Epoch: 1 | Step: 182740 | Dataset: 0-587341 | Loss: 0.713 | 913 ms/step , 6887.44 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 20:07:51 | Epoch: 1 | Step: 182750 | Dataset: 0-587661 | Loss: 0.662 | 913 ms/step , 6888.89 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 20:08:00 | Epoch: 1 | Step: 182760 | Dataset: 0-587981 | Loss: 0.656 | 915 ms/step , 6875.14 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 20:08:09 | Epoch: 1 | Step: 182770 | Dataset: 0-588301 | Loss: 0.733 | 915 ms/step , 6876.70 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 20:08:18 | Epoch: 1 | Step: 182780 | Dataset: 0-588621 | Loss: 0.652 | 913 ms/step , 6887.97 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 20:08:28 | Epoch: 1 | Step: 182790 | Dataset: 0-588941 | Loss: 0.749 | 916 ms/step , 6867.12 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 20:08:37 | Epoch: 1 | Step: 182800 | Dataset: 0-589261 | Loss: 0.728 | 914 ms/step , 6882.77 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 20:08:38 | Validation | Step: 182800 | Val_loss: 0.695 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:08:47 | Epoch: 1 | Step: 182810 | Dataset: 0-589581 | Loss: 0.734 | 913 ms/step , 6891.22 GFLOP/s , 15265.1 tokens/s INFO:__main__:2024-11-05 20:08:57 | Epoch: 1 | Step: 182820 | Dataset: 0-589901 | Loss: 0.617 | 913 ms/step , 6885.14 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 20:09:06 | Epoch: 1 | Step: 182830 | Dataset: 0-590221 | Loss: 0.683 | 913 ms/step , 6890.98 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 20:09:15 | Epoch: 1 | Step: 182840 | Dataset: 0-590541 | Loss: 0.650 | 914 ms/step , 6882.66 
GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 20:09:24 | Epoch: 1 | Step: 182850 | Dataset: 0-590861 | Loss: 0.717 | 913 ms/step , 6885.45 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 20:09:33 | Epoch: 1 | Step: 182860 | Dataset: 0-591181 | Loss: 0.605 | 913 ms/step , 6886.45 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 20:09:42 | Epoch: 1 | Step: 182870 | Dataset: 0-591501 | Loss: 0.662 | 915 ms/step , 6875.41 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 20:09:51 | Epoch: 1 | Step: 182880 | Dataset: 0-591821 | Loss: 0.713 | 915 ms/step , 6874.09 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 20:10:01 | Epoch: 1 | Step: 182890 | Dataset: 0-592141 | Loss: 0.635 | 914 ms/step , 6882.84 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 20:10:10 | Epoch: 1 | Step: 182900 | Dataset: 0-592461 | Loss: 0.725 | 914 ms/step , 6879.92 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 20:10:11 | Validation | Step: 182900 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:10:20 | Epoch: 1 | Step: 182910 | Dataset: 0-592781 | Loss: 0.680 | 913 ms/step , 6885.40 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 20:10:30 | Epoch: 1 | Step: 182920 | Dataset: 0-593101 | Loss: 0.671 | 914 ms/step , 6880.46 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-05 20:10:39 | Epoch: 1 | Step: 182930 | Dataset: 0-593421 | Loss: 0.651 | 913 ms/step , 6890.62 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-05 20:10:48 | Epoch: 1 | Step: 182940 | Dataset: 0-593741 | Loss: 0.635 | 913 ms/step , 6886.08 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-05 20:10:57 | Epoch: 1 | Step: 182950 | Dataset: 0-594061 | Loss: 0.789 | 914 ms/step , 6880.66 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 20:11:06 | Epoch: 1 | Step: 182960 | Dataset: 0-594381 | Loss: 0.688 | 915 ms/step , 6876.61 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-05 20:11:15 | Epoch: 1 | Step: 182970 | Dataset: 0-594701 | Loss: 0.674 | 915 ms/step , 6870.05 GFLOP/s , 
17910.5 tokens/s INFO:__main__:2024-11-05 20:11:24 | Epoch: 1 | Step: 182980 | Dataset: 0-595021 | Loss: 0.697 | 914 ms/step , 6883.86 GFLOP/s , 17909.0 tokens/s INFO:__main__:2024-11-05 20:11:34 | Epoch: 1 | Step: 182990 | Dataset: 0-595341 | Loss: 0.559 | 913 ms/step , 6889.65 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 20:11:43 | Epoch: 1 | Step: 183000 | Dataset: 0-595661 | Loss: 0.697 | 914 ms/step , 6882.50 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 20:11:44 | Validation | Step: 183000 | Val_loss: 0.713 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:11:44 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_201144_step_183000.pt` INFO:__main__:2024-11-05 20:11:55 | Epoch: 1 | Step: 183010 | Dataset: 0-595981 | Loss: 0.751 | 914 ms/step , 6880.13 GFLOP/s , 13785.3 tokens/s INFO:__main__:2024-11-05 20:12:04 | Epoch: 1 | Step: 183020 | Dataset: 0-596301 | Loss: 0.658 | 913 ms/step , 6892.33 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 20:12:13 | Epoch: 1 | Step: 183030 | Dataset: 0-596621 | Loss: 0.605 | 914 ms/step , 6878.68 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 20:12:22 | Epoch: 1 | Step: 183040 | Dataset: 0-596941 | Loss: 0.681 | 913 ms/step , 6885.39 GFLOP/s , 17897.2 tokens/s INFO:__main__:2024-11-05 20:12:31 | Epoch: 1 | Step: 183050 | Dataset: 0-597261 | Loss: 0.773 | 914 ms/step , 6881.30 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-05 20:12:40 | Epoch: 1 | Step: 183060 | Dataset: 0-597581 | Loss: 0.716 | 914 ms/step , 6879.40 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 20:12:49 | Epoch: 1 | Step: 183070 | Dataset: 0-597901 | Loss: 0.564 | 914 ms/step , 6879.85 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 20:12:59 | Epoch: 1 | Step: 183080 | Dataset: 0-598221 | Loss: 0.836 | 914 ms/step , 6881.36 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 20:13:08 | Epoch: 1 | Step: 183090 | Dataset: 0-598541 | Loss: 0.625 | 914 ms/step , 6883.02 GFLOP/s , 17920.6 
tokens/s INFO:__main__:2024-11-05 20:13:17 | Epoch: 1 | Step: 183100 | Dataset: 0-598861 | Loss: 0.790 | 913 ms/step , 6886.25 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 20:13:18 | Validation | Step: 183100 | Val_loss: 0.658 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:13:28 | Epoch: 1 | Step: 183110 | Dataset: 0-599181 | Loss: 0.652 | 913 ms/step , 6890.30 GFLOP/s , 15267.3 tokens/s INFO:__main__:2024-11-05 20:13:37 | Epoch: 1 | Step: 183120 | Dataset: 0-599501 | Loss: 0.702 | 913 ms/step , 6887.09 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 20:13:46 | Epoch: 1 | Step: 183130 | Dataset: 0-599821 | Loss: 0.662 | 913 ms/step , 6892.55 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 20:13:55 | Epoch: 1 | Step: 183140 | Dataset: 0-600141 | Loss: 0.680 | 913 ms/step , 6889.53 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 20:14:04 | Epoch: 1 | Step: 183150 | Dataset: 0-600461 | Loss: 0.657 | 914 ms/step , 6881.27 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 20:14:13 | Epoch: 1 | Step: 183160 | Dataset: 0-600781 | Loss: 0.800 | 915 ms/step , 6873.53 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 20:14:22 | Epoch: 1 | Step: 183170 | Dataset: 0-601101 | Loss: 0.712 | 914 ms/step , 6878.25 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-05 20:14:32 | Epoch: 1 | Step: 183180 | Dataset: 0-601421 | Loss: 0.686 | 913 ms/step , 6889.73 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 20:14:41 | Epoch: 1 | Step: 183190 | Dataset: 0-601741 | Loss: 0.669 | 913 ms/step , 6885.49 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 20:14:50 | Epoch: 1 | Step: 183200 | Dataset: 0-602061 | Loss: 0.806 | 913 ms/step , 6885.89 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 20:14:51 | Validation | Step: 183200 | Val_loss: 0.674 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:15:01 | Epoch: 1 | Step: 183210 | Dataset: 0-602381 | Loss: 0.707 | 915 ms/step , 6874.63 GFLOP/s , 15269.0 tokens/s INFO:__main__:2024-11-05 20:15:10 | Epoch: 
1 | Step: 183220 | Dataset: 0-602701 | Loss: 0.808 | 915 ms/step , 6873.75 GFLOP/s , 17912.1 tokens/s INFO:__main__:2024-11-05 20:15:19 | Epoch: 1 | Step: 183230 | Dataset: 0-603021 | Loss: 0.649 | 913 ms/step , 6888.36 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 20:15:28 | Epoch: 1 | Step: 183240 | Dataset: 0-603341 | Loss: 0.768 | 915 ms/step , 6874.14 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 20:15:37 | Epoch: 1 | Step: 183250 | Dataset: 0-603661 | Loss: 0.705 | 915 ms/step , 6870.06 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 20:15:46 | Epoch: 1 | Step: 183260 | Dataset: 0-603981 | Loss: 0.745 | 914 ms/step , 6882.82 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 20:15:55 | Epoch: 1 | Step: 183270 | Dataset: 0-604301 | Loss: 0.666 | 914 ms/step , 6879.51 GFLOP/s , 17906.4 tokens/s INFO:__main__:2024-11-05 20:16:05 | Epoch: 1 | Step: 183280 | Dataset: 0-604621 | Loss: 0.776 | 913 ms/step , 6891.97 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 20:16:14 | Epoch: 1 | Step: 183290 | Dataset: 0-604941 | Loss: 0.722 | 915 ms/step , 6874.64 GFLOP/s , 17906.6 tokens/s INFO:__main__:2024-11-05 20:16:23 | Epoch: 1 | Step: 183300 | Dataset: 0-605261 | Loss: 0.670 | 914 ms/step , 6882.20 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-05 20:16:25 | Validation | Step: 183300 | Val_loss: 0.690 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:16:34 | Epoch: 1 | Step: 183310 | Dataset: 0-605581 | Loss: 0.630 | 914 ms/step , 6884.56 GFLOP/s , 15258.2 tokens/s INFO:__main__:2024-11-05 20:16:43 | Epoch: 1 | Step: 183320 | Dataset: 0-605901 | Loss: 0.813 | 913 ms/step , 6886.76 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 20:16:52 | Epoch: 1 | Step: 183330 | Dataset: 0-606221 | Loss: 0.687 | 913 ms/step , 6890.48 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 20:17:01 | Epoch: 1 | Step: 183340 | Dataset: 0-606541 | Loss: 0.703 | 913 ms/step , 6885.38 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 20:17:10 | Epoch: 1 | Step: 
183350 | Dataset: 0-606861 | Loss: 0.782 | 914 ms/step , 6882.22 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 20:17:19 | Epoch: 1 | Step: 183360 | Dataset: 0-607181 | Loss: 0.678 | 912 ms/step , 6893.99 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 20:17:29 | Epoch: 1 | Step: 183370 | Dataset: 0-607501 | Loss: 0.617 | 914 ms/step , 6883.68 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 20:17:38 | Epoch: 1 | Step: 183380 | Dataset: 0-607821 | Loss: 0.719 | 914 ms/step , 6883.98 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 20:17:47 | Epoch: 1 | Step: 183390 | Dataset: 0-608141 | Loss: 0.571 | 912 ms/step , 6892.88 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 20:17:56 | Epoch: 1 | Step: 183400 | Dataset: 0-608461 | Loss: 0.738 | 916 ms/step , 6868.36 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 20:17:58 | Validation | Step: 183400 | Val_loss: 0.757 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:18:07 | Epoch: 1 | Step: 183410 | Dataset: 0-608781 | Loss: 0.440 | 914 ms/step , 6881.74 GFLOP/s , 15267.1 tokens/s INFO:__main__:2024-11-05 20:18:16 | Epoch: 1 | Step: 183420 | Dataset: 0-609101 | Loss: 0.851 | 913 ms/step , 6887.28 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 20:18:25 | Epoch: 1 | Step: 183430 | Dataset: 0-609421 | Loss: 0.749 | 915 ms/step , 6876.74 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-05 20:18:34 | Epoch: 1 | Step: 183440 | Dataset: 0-609741 | Loss: 0.732 | 914 ms/step , 6881.74 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 20:18:43 | Epoch: 1 | Step: 183450 | Dataset: 0-610061 | Loss: 0.704 | 913 ms/step , 6889.42 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 20:18:52 | Epoch: 1 | Step: 183460 | Dataset: 0-610381 | Loss: 0.609 | 913 ms/step , 6885.34 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 20:19:02 | Epoch: 1 | Step: 183470 | Dataset: 0-610701 | Loss: 0.569 | 914 ms/step , 6885.00 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 20:19:11 | Epoch: 1 | Step: 183480 | 
Dataset: 0-611021 | Loss: 0.627 | 914 ms/step , 6882.00 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 20:19:20 | Epoch: 1 | Step: 183490 | Dataset: 0-611341 | Loss: 0.809 | 915 ms/step , 6870.38 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 20:19:29 | Epoch: 1 | Step: 183500 | Dataset: 0-611661 | Loss: 0.781 | 914 ms/step , 6878.60 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 20:19:31 | Validation | Step: 183500 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:19:40 | Epoch: 1 | Step: 183510 | Dataset: 0-611981 | Loss: 0.807 | 914 ms/step , 6881.66 GFLOP/s , 15268.9 tokens/s INFO:__main__:2024-11-05 20:19:49 | Epoch: 1 | Step: 183520 | Dataset: 0-612301 | Loss: 0.799 | 914 ms/step , 6879.69 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-05 20:19:58 | Epoch: 1 | Step: 183530 | Dataset: 0-612621 | Loss: 0.744 | 912 ms/step , 6894.10 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 20:20:07 | Epoch: 1 | Step: 183540 | Dataset: 0-612941 | Loss: 0.788 | 913 ms/step , 6886.15 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 20:20:16 | Epoch: 1 | Step: 183550 | Dataset: 0-613261 | Loss: 0.755 | 913 ms/step , 6888.70 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 20:20:25 | Epoch: 1 | Step: 183560 | Dataset: 0-613581 | Loss: 0.554 | 912 ms/step , 6898.15 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 20:20:34 | Epoch: 1 | Step: 183570 | Dataset: 0-613901 | Loss: 0.583 | 914 ms/step , 6882.18 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 20:20:44 | Epoch: 1 | Step: 183580 | Dataset: 0-614221 | Loss: 0.724 | 912 ms/step , 6893.27 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 20:20:53 | Epoch: 1 | Step: 183590 | Dataset: 0-614541 | Loss: 0.847 | 915 ms/step , 6873.41 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 20:21:02 | Epoch: 1 | Step: 183600 | Dataset: 0-614861 | Loss: 0.758 | 914 ms/step , 6884.85 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 20:21:04 | Validation | Step: 183600 | 
Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:21:13 | Epoch: 1 | Step: 183610 | Dataset: 0-615181 | Loss: 0.765 | 914 ms/step , 6882.17 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-05 20:21:22 | Epoch: 1 | Step: 183620 | Dataset: 0-615501 | Loss: 0.777 | 915 ms/step , 6874.25 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 20:21:31 | Epoch: 1 | Step: 183630 | Dataset: 0-615821 | Loss: 0.782 | 914 ms/step , 6881.30 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 20:21:40 | Epoch: 1 | Step: 183640 | Dataset: 0-616141 | Loss: 0.850 | 915 ms/step , 6875.06 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 20:21:49 | Epoch: 1 | Step: 183650 | Dataset: 0-616461 | Loss: 0.774 | 914 ms/step , 6878.60 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 20:21:58 | Epoch: 1 | Step: 183660 | Dataset: 0-616781 | Loss: 0.732 | 914 ms/step , 6878.50 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 20:22:07 | Epoch: 1 | Step: 183670 | Dataset: 0-617101 | Loss: 0.807 | 914 ms/step , 6882.24 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 20:22:17 | Epoch: 1 | Step: 183680 | Dataset: 0-617421 | Loss: 0.690 | 913 ms/step , 6887.78 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 20:22:26 | Epoch: 1 | Step: 183690 | Dataset: 0-617741 | Loss: 0.734 | 914 ms/step , 6884.63 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-05 20:22:35 | Epoch: 1 | Step: 183700 | Dataset: 0-618061 | Loss: 0.712 | 914 ms/step , 6881.35 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-05 20:22:37 | Validation | Step: 183700 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:22:46 | Epoch: 1 | Step: 183710 | Dataset: 0-618381 | Loss: 0.722 | 913 ms/step , 6886.13 GFLOP/s , 15267.4 tokens/s INFO:__main__:2024-11-05 20:22:55 | Epoch: 1 | Step: 183720 | Dataset: 0-618701 | Loss: 0.399 | 912 ms/step , 6897.04 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 20:23:04 | Epoch: 1 | Step: 183730 | Dataset: 0-619021 | Loss: 0.842 | 914 ms/step , 
6883.91 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 20:23:13 | Epoch: 1 | Step: 183740 | Dataset: 0-619341 | Loss: 0.791 | 914 ms/step , 6880.38 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 20:23:22 | Epoch: 1 | Step: 183750 | Dataset: 0-619661 | Loss: 0.779 | 912 ms/step , 6894.03 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 20:23:31 | Epoch: 1 | Step: 183760 | Dataset: 0-619981 | Loss: 0.416 | 911 ms/step , 6900.32 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 20:23:40 | Epoch: 1 | Step: 183770 | Dataset: 0-620301 | Loss: 0.729 | 914 ms/step , 6878.86 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 20:23:50 | Epoch: 1 | Step: 183780 | Dataset: 0-620621 | Loss: 0.703 | 914 ms/step , 6883.42 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 20:23:59 | Epoch: 1 | Step: 183790 | Dataset: 0-620941 | Loss: 0.657 | 913 ms/step , 6887.93 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 20:24:08 | Epoch: 1 | Step: 183800 | Dataset: 0-621261 | Loss: 0.702 | 914 ms/step , 6882.64 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 20:24:09 | Validation | Step: 183800 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:24:19 | Epoch: 1 | Step: 183810 | Dataset: 0-621581 | Loss: 0.808 | 914 ms/step , 6878.92 GFLOP/s , 15266.6 tokens/s INFO:__main__:2024-11-05 20:24:28 | Epoch: 1 | Step: 183820 | Dataset: 0-621901 | Loss: 0.624 | 913 ms/step , 6888.31 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 20:24:37 | Epoch: 1 | Step: 183830 | Dataset: 0-622221 | Loss: 0.791 | 914 ms/step , 6884.74 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 20:24:46 | Epoch: 1 | Step: 183840 | Dataset: 0-622541 | Loss: 0.803 | 916 ms/step , 6868.95 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 20:24:55 | Epoch: 1 | Step: 183850 | Dataset: 0-622861 | Loss: 0.882 | 914 ms/step , 6880.29 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 20:25:04 | Epoch: 1 | Step: 183860 | Dataset: 0-623181 | Loss: 0.830 | 916 ms/step , 6866.91 
GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 20:25:13 | Epoch: 1 | Step: 183870 | Dataset: 0-623501 | Loss: 0.800 | 913 ms/step , 6890.96 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 20:25:23 | Epoch: 1 | Step: 183880 | Dataset: 0-623821 | Loss: 0.540 | 913 ms/step , 6887.39 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 20:25:32 | Epoch: 1 | Step: 183890 | Dataset: 0-624141 | Loss: 0.699 | 914 ms/step , 6881.90 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 20:25:41 | Epoch: 1 | Step: 183900 | Dataset: 0-624461 | Loss: 0.727 | 914 ms/step , 6882.89 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 20:25:42 | Validation | Step: 183900 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:25:52 | Epoch: 1 | Step: 183910 | Dataset: 0-624781 | Loss: 0.632 | 913 ms/step , 6892.23 GFLOP/s , 15269.3 tokens/s INFO:__main__:2024-11-05 20:26:01 | Epoch: 1 | Step: 183920 | Dataset: 0-625101 | Loss: 0.836 | 914 ms/step , 6884.68 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 20:26:10 | Epoch: 1 | Step: 183930 | Dataset: 0-625421 | Loss: 0.862 | 914 ms/step , 6882.01 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 20:26:19 | Epoch: 1 | Step: 183940 | Dataset: 0-625741 | Loss: 0.700 | 913 ms/step , 6887.21 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 20:26:28 | Epoch: 1 | Step: 183950 | Dataset: 0-626061 | Loss: 0.654 | 914 ms/step , 6882.32 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 20:26:37 | Epoch: 1 | Step: 183960 | Dataset: 0-626381 | Loss: 0.520 | 913 ms/step , 6888.46 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 20:26:46 | Epoch: 1 | Step: 183970 | Dataset: 0-626701 | Loss: 0.718 | 913 ms/step , 6886.44 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 20:26:56 | Epoch: 1 | Step: 183980 | Dataset: 0-627021 | Loss: 0.737 | 913 ms/step , 6892.47 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 20:27:05 | Epoch: 1 | Step: 183990 | Dataset: 0-627341 | Loss: 0.819 | 913 ms/step , 6887.23 GFLOP/s , 
17928.0 tokens/s INFO:__main__:2024-11-05 20:27:14 | Epoch: 1 | Step: 184000 | Dataset: 0-627661 | Loss: 0.708 | 913 ms/step , 6885.13 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 20:27:15 | Validation | Step: 184000 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:27:15 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_202715_step_184000.pt` INFO:__main__:2024-11-05 20:27:26 | Epoch: 1 | Step: 184010 | Dataset: 0-627981 | Loss: 0.792 | 915 ms/step , 6876.93 GFLOP/s , 13804.2 tokens/s INFO:__main__:2024-11-05 20:27:35 | Epoch: 1 | Step: 184020 | Dataset: 0-628301 | Loss: 0.685 | 913 ms/step , 6889.03 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 20:27:44 | Epoch: 1 | Step: 184030 | Dataset: 0-628621 | Loss: 0.612 | 914 ms/step , 6883.90 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 20:27:53 | Epoch: 1 | Step: 184040 | Dataset: 0-628941 | Loss: 0.522 | 912 ms/step , 6893.96 GFLOP/s , 17905.1 tokens/s INFO:__main__:2024-11-05 20:28:02 | Epoch: 1 | Step: 184050 | Dataset: 0-629261 | Loss: 0.732 | 914 ms/step , 6881.80 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 20:28:11 | Epoch: 1 | Step: 184060 | Dataset: 0-629581 | Loss: 0.735 | 914 ms/step , 6882.98 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 20:28:21 | Epoch: 1 | Step: 184070 | Dataset: 0-629901 | Loss: 0.860 | 913 ms/step , 6886.82 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 20:28:30 | Epoch: 1 | Step: 184080 | Dataset: 0-630221 | Loss: 0.692 | 913 ms/step , 6886.85 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 20:28:39 | Epoch: 1 | Step: 184090 | Dataset: 0-630541 | Loss: 0.726 | 915 ms/step , 6877.29 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-05 20:28:48 | Epoch: 1 | Step: 184100 | Dataset: 0-630861 | Loss: 0.714 | 914 ms/step , 6879.94 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 20:28:50 | Validation | Step: 184100 | Val_loss: 0.777 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:28:59 | 
Epoch: 1 | Step: 184110 | Dataset: 0-631181 | Loss: 0.776 | 914 ms/step , 6880.88 GFLOP/s , 15270.1 tokens/s INFO:__main__:2024-11-05 20:29:08 | Epoch: 1 | Step: 184120 | Dataset: 0-631501 | Loss: 0.856 | 914 ms/step , 6880.70 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 20:29:17 | Epoch: 1 | Step: 184130 | Dataset: 0-631821 | Loss: 0.742 | 913 ms/step , 6888.17 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 20:29:26 | Epoch: 1 | Step: 184140 | Dataset: 0-632141 | Loss: 0.752 | 915 ms/step , 6875.95 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 20:29:35 | Epoch: 1 | Step: 184150 | Dataset: 0-632461 | Loss: 0.847 | 913 ms/step , 6885.62 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 20:29:44 | Epoch: 1 | Step: 184160 | Dataset: 0-632781 | Loss: 0.829 | 914 ms/step , 6878.37 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 20:29:54 | Epoch: 1 | Step: 184170 | Dataset: 0-633101 | Loss: 0.699 | 913 ms/step , 6892.43 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 20:30:03 | Epoch: 1 | Step: 184180 | Dataset: 0-633421 | Loss: 0.777 | 913 ms/step , 6888.55 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 20:30:12 | Epoch: 1 | Step: 184190 | Dataset: 0-633741 | Loss: 0.714 | 914 ms/step , 6883.85 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 20:30:21 | Epoch: 1 | Step: 184200 | Dataset: 0-634061 | Loss: 0.724 | 913 ms/step , 6890.31 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 20:30:23 | Validation | Step: 184200 | Val_loss: 0.741 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:30:32 | Epoch: 1 | Step: 184210 | Dataset: 0-634381 | Loss: 0.712 | 913 ms/step , 6887.72 GFLOP/s , 15263.6 tokens/s INFO:__main__:2024-11-05 20:30:41 | Epoch: 1 | Step: 184220 | Dataset: 0-634701 | Loss: 0.700 | 913 ms/step , 6889.05 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 20:30:50 | Epoch: 1 | Step: 184230 | Dataset: 0-635021 | Loss: 0.625 | 913 ms/step , 6887.80 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 20:30:59 | Epoch: 1 | 
Step: 184240 | Dataset: 0-635341 | Loss: 0.728 | 914 ms/step , 6883.94 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 20:31:08 | Epoch: 1 | Step: 184250 | Dataset: 0-635661 | Loss: 0.824 | 915 ms/step , 6875.07 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 20:31:17 | Epoch: 1 | Step: 184260 | Dataset: 0-635981 | Loss: 0.711 | 914 ms/step , 6881.11 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 20:31:27 | Epoch: 1 | Step: 184270 | Dataset: 0-636301 | Loss: 0.697 | 912 ms/step , 6895.83 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 20:31:36 | Epoch: 1 | Step: 184280 | Dataset: 0-636621 | Loss: 0.696 | 914 ms/step , 6883.74 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 20:31:45 | Epoch: 1 | Step: 184290 | Dataset: 0-636941 | Loss: 0.732 | 913 ms/step , 6888.22 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 20:31:54 | Epoch: 1 | Step: 184300 | Dataset: 0-637261 | Loss: 0.760 | 914 ms/step , 6884.02 GFLOP/s , 17912.9 tokens/s INFO:__main__:2024-11-05 20:31:56 | Validation | Step: 184300 | Val_loss: 0.732 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:32:05 | Epoch: 1 | Step: 184310 | Dataset: 0-637581 | Loss: 0.743 | 914 ms/step , 6884.90 GFLOP/s , 15272.5 tokens/s INFO:__main__:2024-11-05 20:32:14 | Epoch: 1 | Step: 184320 | Dataset: 0-637901 | Loss: 0.642 | 912 ms/step , 6896.47 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 20:32:23 | Epoch: 1 | Step: 184330 | Dataset: 0-638221 | Loss: 0.733 | 914 ms/step , 6880.65 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 20:32:32 | Epoch: 1 | Step: 184340 | Dataset: 0-638541 | Loss: 0.822 | 915 ms/step , 6873.66 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 20:32:41 | Epoch: 1 | Step: 184350 | Dataset: 0-638861 | Loss: 0.729 | 912 ms/step , 6894.61 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 20:32:50 | Epoch: 1 | Step: 184360 | Dataset: 0-639181 | Loss: 0.823 | 915 ms/step , 6873.06 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 20:33:00 | Epoch: 1 | Step: 
184370 | Dataset: 0-639501 | Loss: 0.770 | 914 ms/step , 6883.07 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 20:33:09 | Epoch: 1 | Step: 184380 | Dataset: 0-639821 | Loss: 0.679 | 913 ms/step , 6887.40 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 20:33:18 | Epoch: 1 | Step: 184390 | Dataset: 0-640141 | Loss: 0.712 | 914 ms/step , 6882.96 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 20:33:27 | Epoch: 1 | Step: 184400 | Dataset: 0-640461 | Loss: 0.674 | 913 ms/step , 6892.05 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 20:33:29 | Validation | Step: 184400 | Val_loss: 0.758 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:33:38 | Epoch: 1 | Step: 184410 | Dataset: 0-640781 | Loss: 0.953 | 915 ms/step , 6872.92 GFLOP/s , 15258.6 tokens/s INFO:__main__:2024-11-05 20:33:47 | Epoch: 1 | Step: 184420 | Dataset: 0-641101 | Loss: 0.610 | 913 ms/step , 6886.23 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 20:33:56 | Epoch: 1 | Step: 184430 | Dataset: 0-641421 | Loss: 0.708 | 913 ms/step , 6886.75 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 20:34:05 | Epoch: 1 | Step: 184440 | Dataset: 0-641741 | Loss: 0.792 | 914 ms/step , 6880.35 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 20:34:14 | Epoch: 1 | Step: 184450 | Dataset: 0-642061 | Loss: 0.732 | 914 ms/step , 6883.54 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 20:34:23 | Epoch: 1 | Step: 184460 | Dataset: 0-642381 | Loss: 0.754 | 914 ms/step , 6878.25 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 20:34:33 | Epoch: 1 | Step: 184470 | Dataset: 0-642701 | Loss: 0.746 | 915 ms/step , 6877.09 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 20:34:42 | Epoch: 1 | Step: 184480 | Dataset: 0-643021 | Loss: 0.781 | 913 ms/step , 6887.16 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 20:34:51 | Epoch: 1 | Step: 184490 | Dataset: 0-643341 | Loss: 0.699 | 913 ms/step , 6892.01 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 20:35:00 | Epoch: 1 | Step: 184500 | 
Dataset: 0-643661 | Loss: 0.619 | 912 ms/step , 6897.57 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 20:35:02 | Validation | Step: 184500 | Val_loss: 0.741 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:35:11 | Epoch: 1 | Step: 184510 | Dataset: 0-643981 | Loss: 0.741 | 913 ms/step , 6888.94 GFLOP/s , 15262.9 tokens/s INFO:__main__:2024-11-05 20:35:20 | Epoch: 1 | Step: 184520 | Dataset: 0-644301 | Loss: 0.727 | 914 ms/step , 6884.22 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 20:35:29 | Epoch: 1 | Step: 184530 | Dataset: 0-644621 | Loss: 0.782 | 913 ms/step , 6891.93 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 20:35:38 | Epoch: 1 | Step: 184540 | Dataset: 0-644941 | Loss: 0.716 | 913 ms/step , 6887.74 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 20:35:47 | Epoch: 1 | Step: 184550 | Dataset: 0-645261 | Loss: 0.864 | 913 ms/step , 6885.93 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 20:35:56 | Epoch: 1 | Step: 184560 | Dataset: 0-645581 | Loss: 0.534 | 912 ms/step , 6893.35 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 20:36:06 | Epoch: 1 | Step: 184570 | Dataset: 0-645901 | Loss: 0.676 | 915 ms/step , 6873.59 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-05 20:36:15 | Epoch: 1 | Step: 184580 | Dataset: 0-646221 | Loss: 0.656 | 914 ms/step , 6883.91 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 20:36:24 | Epoch: 1 | Step: 184590 | Dataset: 0-646541 | Loss: 0.751 | 914 ms/step , 6884.13 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 20:36:33 | Epoch: 1 | Step: 184600 | Dataset: 0-646861 | Loss: 0.594 | 913 ms/step , 6887.49 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 20:36:35 | Validation | Step: 184600 | Val_loss: 0.738 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:36:44 | Epoch: 1 | Step: 184610 | Dataset: 0-647181 | Loss: 0.773 | 915 ms/step , 6875.71 GFLOP/s , 15261.6 tokens/s INFO:__main__:2024-11-05 20:36:53 | Epoch: 1 | Step: 184620 | Dataset: 0-647501 | Loss: 0.854 | 914 ms/step , 
6878.44 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 20:37:02 | Epoch: 1 | Step: 184630 | Dataset: 0-647821 | Loss: 0.674 | 913 ms/step , 6888.18 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 20:37:11 | Epoch: 1 | Step: 184640 | Dataset: 0-648141 | Loss: 0.720 | 914 ms/step , 6884.69 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 20:37:20 | Epoch: 1 | Step: 184650 | Dataset: 0-648461 | Loss: 0.758 | 913 ms/step , 6885.93 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 20:37:29 | Epoch: 1 | Step: 184660 | Dataset: 0-648781 | Loss: 0.717 | 914 ms/step , 6879.27 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 20:37:39 | Epoch: 1 | Step: 184670 | Dataset: 0-649101 | Loss: 0.699 | 914 ms/step , 6879.96 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 20:37:48 | Epoch: 1 | Step: 184680 | Dataset: 0-649421 | Loss: 0.648 | 914 ms/step , 6882.47 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 20:37:57 | Epoch: 1 | Step: 184690 | Dataset: 0-649741 | Loss: 0.751 | 913 ms/step , 6887.06 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 20:38:06 | Epoch: 1 | Step: 184700 | Dataset: 0-650061 | Loss: 0.766 | 914 ms/step , 6878.62 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 20:38:08 | Validation | Step: 184700 | Val_loss: 0.735 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:38:17 | Epoch: 1 | Step: 184710 | Dataset: 0-650381 | Loss: 0.782 | 913 ms/step , 6886.11 GFLOP/s , 15282.2 tokens/s INFO:__main__:2024-11-05 20:38:26 | Epoch: 1 | Step: 184720 | Dataset: 0-650701 | Loss: 0.709 | 914 ms/step , 6882.88 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 20:38:35 | Epoch: 1 | Step: 184730 | Dataset: 0-651021 | Loss: 0.668 | 912 ms/step , 6892.71 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 20:38:44 | Epoch: 1 | Step: 184740 | Dataset: 0-651341 | Loss: 0.794 | 912 ms/step , 6895.85 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 20:38:53 | Epoch: 1 | Step: 184750 | Dataset: 0-651661 | Loss: 0.755 | 914 ms/step , 6884.55 
GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 20:39:02 | Epoch: 1 | Step: 184760 | Dataset: 0-651981 | Loss: 0.837 | 914 ms/step , 6881.99 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 20:39:12 | Epoch: 1 | Step: 184770 | Dataset: 0-652301 | Loss: 0.672 | 913 ms/step , 6885.36 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 20:39:21 | Epoch: 1 | Step: 184780 | Dataset: 0-652621 | Loss: 0.804 | 914 ms/step , 6880.49 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 20:39:30 | Epoch: 1 | Step: 184790 | Dataset: 0-652941 | Loss: 0.844 | 913 ms/step , 6887.50 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 20:39:39 | Epoch: 1 | Step: 184800 | Dataset: 0-653261 | Loss: 0.793 | 913 ms/step , 6885.88 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 20:39:41 | Validation | Step: 184800 | Val_loss: 0.756 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:39:50 | Epoch: 1 | Step: 184810 | Dataset: 0-653581 | Loss: 0.794 | 913 ms/step , 6886.49 GFLOP/s , 15272.5 tokens/s INFO:__main__:2024-11-05 20:39:59 | Epoch: 1 | Step: 184820 | Dataset: 0-653901 | Loss: 0.699 | 913 ms/step , 6886.19 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 20:40:08 | Epoch: 1 | Step: 184830 | Dataset: 0-654221 | Loss: 0.583 | 914 ms/step , 6883.49 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 20:40:17 | Epoch: 1 | Step: 184840 | Dataset: 0-654541 | Loss: 0.729 | 915 ms/step , 6873.86 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 20:40:26 | Epoch: 1 | Step: 184850 | Dataset: 0-654861 | Loss: 0.771 | 913 ms/step , 6889.91 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 20:40:35 | Epoch: 1 | Step: 184860 | Dataset: 0-655181 | Loss: 0.797 | 914 ms/step , 6883.51 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 20:40:44 | Epoch: 1 | Step: 184870 | Dataset: 0-655501 | Loss: 0.789 | 914 ms/step , 6882.13 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 20:40:54 | Epoch: 1 | Step: 184880 | Dataset: 0-655821 | Loss: 0.777 | 913 ms/step , 6891.00 GFLOP/s , 
17937.0 tokens/s INFO:__main__:2024-11-05 20:41:03 | Epoch: 1 | Step: 184890 | Dataset: 0-656141 | Loss: 0.726 | 914 ms/step , 6879.86 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 20:41:12 | Epoch: 1 | Step: 184900 | Dataset: 0-656461 | Loss: 0.738 | 914 ms/step , 6885.05 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 20:41:13 | Validation | Step: 184900 | Val_loss: 0.799 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:41:23 | Epoch: 1 | Step: 184910 | Dataset: 0-656781 | Loss: 0.759 | 913 ms/step , 6886.64 GFLOP/s , 15269.4 tokens/s INFO:__main__:2024-11-05 20:41:32 | Epoch: 1 | Step: 184920 | Dataset: 0-657101 | Loss: 0.706 | 914 ms/step , 6882.59 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 20:41:41 | Epoch: 1 | Step: 184930 | Dataset: 0-657421 | Loss: 0.722 | 913 ms/step , 6888.13 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 20:41:50 | Epoch: 1 | Step: 184940 | Dataset: 0-657741 | Loss: 0.783 | 912 ms/step , 6897.46 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 20:41:59 | Epoch: 1 | Step: 184950 | Dataset: 0-658061 | Loss: 0.666 | 913 ms/step , 6888.32 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 20:42:08 | Epoch: 1 | Step: 184960 | Dataset: 0-658381 | Loss: 0.828 | 913 ms/step , 6887.95 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 20:42:17 | Epoch: 1 | Step: 184970 | Dataset: 0-658701 | Loss: 0.732 | 913 ms/step , 6891.79 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 20:42:27 | Epoch: 1 | Step: 184980 | Dataset: 0-659021 | Loss: 0.581 | 913 ms/step , 6887.10 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 20:42:36 | Epoch: 1 | Step: 184990 | Dataset: 0-659341 | Loss: 0.714 | 914 ms/step , 6882.45 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 20:42:45 | Epoch: 1 | Step: 185000 | Dataset: 0-659661 | Loss: 0.756 | 913 ms/step , 6887.22 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 20:42:46 | Validation | Step: 185000 | Val_loss: 0.739 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:42:46 
| Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_204246_step_185000.pt` INFO:__main__:2024-11-05 20:42:57 | Epoch: 1 | Step: 185010 | Dataset: 0-659981 | Loss: 0.649 | 913 ms/step , 6886.21 GFLOP/s , 13826.5 tokens/s INFO:__main__:2024-11-05 20:43:06 | Epoch: 1 | Step: 185020 | Dataset: 0-660301 | Loss: 0.716 | 915 ms/step , 6873.57 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 20:43:15 | Epoch: 1 | Step: 185030 | Dataset: 0-660621 | Loss: 0.700 | 912 ms/step , 6895.49 GFLOP/s , 17945.9 tokens/s INFO:__main__:2024-11-05 20:43:24 | Epoch: 1 | Step: 185040 | Dataset: 0-660941 | Loss: 0.744 | 913 ms/step , 6885.31 GFLOP/s , 17898.0 tokens/s INFO:__main__:2024-11-05 20:43:33 | Epoch: 1 | Step: 185050 | Dataset: 0-661261 | Loss: 0.780 | 914 ms/step , 6882.78 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 20:43:42 | Epoch: 1 | Step: 185060 | Dataset: 0-661581 | Loss: 0.692 | 915 ms/step , 6875.75 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 20:43:52 | Epoch: 1 | Step: 185070 | Dataset: 0-661901 | Loss: 0.720 | 913 ms/step , 6892.40 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 20:44:01 | Epoch: 1 | Step: 185080 | Dataset: 0-662221 | Loss: 0.810 | 913 ms/step , 6886.59 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 20:44:10 | Epoch: 1 | Step: 185090 | Dataset: 0-662541 | Loss: 0.659 | 913 ms/step , 6891.42 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 20:44:19 | Epoch: 1 | Step: 185100 | Dataset: 0-662861 | Loss: 0.655 | 914 ms/step , 6884.94 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 20:44:21 | Validation | Step: 185100 | Val_loss: 0.775 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:44:30 | Epoch: 1 | Step: 185110 | Dataset: 0-663181 | Loss: 0.743 | 913 ms/step , 6891.37 GFLOP/s , 15281.1 tokens/s INFO:__main__:2024-11-05 20:44:39 | Epoch: 1 | Step: 185120 | Dataset: 0-663501 | Loss: 0.742 | 913 ms/step , 6886.41 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 20:44:48 | Epoch: 1 
| Step: 185130 | Dataset: 0-663821 | Loss: 0.900 | 914 ms/step , 6879.32 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 20:44:57 | Epoch: 1 | Step: 185140 | Dataset: 0-664141 | Loss: 0.679 | 914 ms/step , 6879.77 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 20:45:06 | Epoch: 1 | Step: 185150 | Dataset: 0-664461 | Loss: 0.731 | 912 ms/step , 6897.63 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 20:45:15 | Epoch: 1 | Step: 185160 | Dataset: 0-664781 | Loss: 0.716 | 913 ms/step , 6890.05 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 20:45:24 | Epoch: 1 | Step: 185170 | Dataset: 0-665101 | Loss: 0.752 | 912 ms/step , 6896.33 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 20:45:34 | Epoch: 1 | Step: 185180 | Dataset: 0-665421 | Loss: 0.727 | 913 ms/step , 6887.16 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 20:45:43 | Epoch: 1 | Step: 185190 | Dataset: 0-665741 | Loss: 0.553 | 912 ms/step , 6898.01 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 20:45:52 | Epoch: 1 | Step: 185200 | Dataset: 0-666061 | Loss: 0.791 | 913 ms/step , 6890.23 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-05 20:45:53 | Validation | Step: 185200 | Val_loss: 0.774 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:46:03 | Epoch: 1 | Step: 185210 | Dataset: 0-666381 | Loss: 0.672 | 914 ms/step , 6879.68 GFLOP/s , 15270.7 tokens/s INFO:__main__:2024-11-05 20:46:12 | Epoch: 1 | Step: 185220 | Dataset: 0-666701 | Loss: 0.644 | 913 ms/step , 6885.50 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 20:46:21 | Epoch: 1 | Step: 185230 | Dataset: 0-667021 | Loss: 0.752 | 915 ms/step , 6873.55 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 20:46:30 | Epoch: 1 | Step: 185240 | Dataset: 0-667341 | Loss: 0.665 | 915 ms/step , 6873.78 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 20:46:39 | Epoch: 1 | Step: 185250 | Dataset: 0-667661 | Loss: 0.792 | 913 ms/step , 6891.46 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 20:46:48 | Epoch: 1 | Step: 
185260 | Dataset: 0-667981 | Loss: 0.751 | 913 ms/step , 6887.83 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 20:46:57 | Epoch: 1 | Step: 185270 | Dataset: 0-668301 | Loss: 0.629 | 912 ms/step , 6895.34 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 20:47:07 | Epoch: 1 | Step: 185280 | Dataset: 0-668621 | Loss: 0.745 | 912 ms/step , 6893.58 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 20:47:16 | Epoch: 1 | Step: 185290 | Dataset: 0-668941 | Loss: 0.855 | 914 ms/step , 6884.75 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 20:47:25 | Epoch: 1 | Step: 185300 | Dataset: 0-669261 | Loss: 0.690 | 914 ms/step , 6877.84 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 20:47:26 | Validation | Step: 185300 | Val_loss: 0.765 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:47:36 | Epoch: 1 | Step: 185310 | Dataset: 0-669581 | Loss: 0.694 | 912 ms/step , 6897.06 GFLOP/s , 15275.6 tokens/s INFO:__main__:2024-11-05 20:47:45 | Epoch: 1 | Step: 185320 | Dataset: 0-669901 | Loss: 0.741 | 913 ms/step , 6885.77 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 20:47:54 | Epoch: 1 | Step: 185330 | Dataset: 0-670221 | Loss: 0.732 | 914 ms/step , 6883.92 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 20:48:03 | Epoch: 1 | Step: 185340 | Dataset: 0-670541 | Loss: 0.632 | 912 ms/step , 6896.59 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 20:48:12 | Epoch: 1 | Step: 185350 | Dataset: 0-670861 | Loss: 0.716 | 912 ms/step , 6893.02 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 20:48:21 | Epoch: 1 | Step: 185360 | Dataset: 0-671181 | Loss: 0.755 | 913 ms/step , 6886.07 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 20:48:30 | Epoch: 1 | Step: 185370 | Dataset: 0-671501 | Loss: 0.831 | 912 ms/step , 6893.16 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 20:48:40 | Epoch: 1 | Step: 185380 | Dataset: 0-671821 | Loss: 0.784 | 915 ms/step , 6877.35 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 20:48:49 | Epoch: 1 | Step: 185390 | 
Dataset: 0-672141 | Loss: 0.728 | 912 ms/step , 6898.09 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 20:48:58 | Epoch: 1 | Step: 185400 | Dataset: 0-672461 | Loss: 0.722 | 913 ms/step , 6885.80 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 20:48:59 | Validation | Step: 185400 | Val_loss: 0.762 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:49:09 | Epoch: 1 | Step: 185410 | Dataset: 0-672781 | Loss: 0.799 | 914 ms/step , 6884.56 GFLOP/s , 15276.2 tokens/s INFO:__main__:2024-11-05 20:49:18 | Epoch: 1 | Step: 185420 | Dataset: 0-673101 | Loss: 0.614 | 913 ms/step , 6887.60 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 20:49:27 | Epoch: 1 | Step: 185430 | Dataset: 0-673421 | Loss: 0.804 | 914 ms/step , 6878.48 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 20:49:36 | Epoch: 1 | Step: 185440 | Dataset: 0-673741 | Loss: 0.606 | 913 ms/step , 6888.88 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 20:49:45 | Epoch: 1 | Step: 185450 | Dataset: 0-674061 | Loss: 0.802 | 914 ms/step , 6882.63 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 20:49:54 | Epoch: 1 | Step: 185460 | Dataset: 0-674381 | Loss: 0.626 | 914 ms/step , 6884.71 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 20:50:03 | Epoch: 1 | Step: 185470 | Dataset: 0-674701 | Loss: 0.677 | 914 ms/step , 6881.02 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-05 20:50:12 | Epoch: 1 | Step: 185480 | Dataset: 0-675021 | Loss: 0.612 | 913 ms/step , 6889.11 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 20:50:22 | Epoch: 1 | Step: 185490 | Dataset: 0-675341 | Loss: 0.753 | 912 ms/step , 6896.40 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-05 20:50:31 | Epoch: 1 | Step: 185500 | Dataset: 0-675661 | Loss: 0.719 | 913 ms/step , 6890.78 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 20:50:32 | Validation | Step: 185500 | Val_loss: 0.755 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:50:41 | Epoch: 1 | Step: 185510 | Dataset: 0-675981 | Loss: 0.658 | 912 ms/step , 
6899.87 GFLOP/s , 15287.4 tokens/s INFO:__main__:2024-11-05 20:50:51 | Epoch: 1 | Step: 185520 | Dataset: 0-676301 | Loss: 0.783 | 913 ms/step , 6887.60 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 20:51:00 | Epoch: 1 | Step: 185530 | Dataset: 0-676621 | Loss: 0.698 | 913 ms/step , 6892.18 GFLOP/s , 17945.6 tokens/s INFO:__main__:2024-11-05 20:51:09 | Epoch: 1 | Step: 185540 | Dataset: 0-676941 | Loss: 0.660 | 912 ms/step , 6897.76 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 20:51:18 | Epoch: 1 | Step: 185550 | Dataset: 0-677261 | Loss: 0.742 | 913 ms/step , 6890.71 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 20:51:27 | Epoch: 1 | Step: 185560 | Dataset: 0-677581 | Loss: 0.749 | 914 ms/step , 6883.90 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 20:51:36 | Epoch: 1 | Step: 185570 | Dataset: 0-677901 | Loss: 0.781 | 914 ms/step , 6882.34 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 20:51:45 | Epoch: 1 | Step: 185580 | Dataset: 0-678221 | Loss: 0.751 | 913 ms/step , 6889.68 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 20:51:55 | Epoch: 1 | Step: 185590 | Dataset: 0-678541 | Loss: 0.688 | 912 ms/step , 6894.42 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 20:52:04 | Epoch: 1 | Step: 185600 | Dataset: 0-678861 | Loss: 0.714 | 913 ms/step , 6891.37 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 20:52:05 | Validation | Step: 185600 | Val_loss: 0.751 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:52:14 | Epoch: 1 | Step: 185610 | Dataset: 0-679181 | Loss: 0.763 | 915 ms/step , 6876.21 GFLOP/s , 15289.5 tokens/s INFO:__main__:2024-11-05 20:52:23 | Epoch: 1 | Step: 185620 | Dataset: 0-679501 | Loss: 0.652 | 914 ms/step , 6878.96 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 20:52:33 | Epoch: 1 | Step: 185630 | Dataset: 0-679821 | Loss: 0.686 | 913 ms/step , 6887.49 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 20:52:42 | Epoch: 1 | Step: 185640 | Dataset: 0-680141 | Loss: 0.646 | 913 ms/step , 6888.38 
GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 20:52:51 | Epoch: 1 | Step: 185650 | Dataset: 0-680461 | Loss: 0.630 | 913 ms/step , 6892.17 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 20:53:00 | Epoch: 1 | Step: 185660 | Dataset: 0-680781 | Loss: 0.749 | 914 ms/step , 6881.82 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 20:53:09 | Epoch: 1 | Step: 185670 | Dataset: 0-681101 | Loss: 0.719 | 915 ms/step , 6877.47 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 20:53:18 | Epoch: 1 | Step: 185680 | Dataset: 0-681421 | Loss: 0.847 | 913 ms/step , 6887.87 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 20:53:27 | Epoch: 1 | Step: 185690 | Dataset: 0-681741 | Loss: 0.663 | 912 ms/step , 6894.60 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 20:53:37 | Epoch: 1 | Step: 185700 | Dataset: 0-682061 | Loss: 0.772 | 914 ms/step , 6880.19 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 20:53:38 | Validation | Step: 185700 | Val_loss: 0.515 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:53:47 | Epoch: 1 | Step: 185710 | Dataset: 0-682381 | Loss: 0.681 | 912 ms/step , 6893.31 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-05 20:53:56 | Epoch: 1 | Step: 185720 | Dataset: 0-682701 | Loss: 0.780 | 913 ms/step , 6887.69 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 20:54:06 | Epoch: 1 | Step: 185730 | Dataset: 0-683021 | Loss: 0.776 | 914 ms/step , 6880.10 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 20:54:15 | Epoch: 1 | Step: 185740 | Dataset: 0-683341 | Loss: 0.608 | 913 ms/step , 6891.73 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 20:54:24 | Epoch: 1 | Step: 185750 | Dataset: 0-683661 | Loss: 0.679 | 914 ms/step , 6884.87 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 20:54:33 | Epoch: 1 | Step: 185760 | Dataset: 0-683981 | Loss: 0.751 | 914 ms/step , 6878.11 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 20:54:42 | Epoch: 1 | Step: 185770 | Dataset: 0-684301 | Loss: 0.735 | 913 ms/step , 6887.72 GFLOP/s , 
17924.9 tokens/s INFO:__main__:2024-11-05 20:54:51 | Epoch: 1 | Step: 185780 | Dataset: 0-684621 | Loss: 0.834 | 913 ms/step , 6885.99 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 20:55:00 | Epoch: 1 | Step: 185790 | Dataset: 0-684941 | Loss: 0.626 | 913 ms/step , 6887.77 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 20:55:10 | Epoch: 1 | Step: 185800 | Dataset: 0-685261 | Loss: 0.746 | 913 ms/step , 6885.83 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 20:55:11 | Validation | Step: 185800 | Val_loss: 0.422 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:55:20 | Epoch: 1 | Step: 185810 | Dataset: 0-685581 | Loss: 0.708 | 913 ms/step , 6888.12 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-05 20:55:29 | Epoch: 1 | Step: 185820 | Dataset: 0-685901 | Loss: 0.789 | 914 ms/step , 6880.97 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-05 20:55:39 | Epoch: 1 | Step: 185830 | Dataset: 0-686221 | Loss: 0.683 | 914 ms/step , 6880.62 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 20:55:48 | Epoch: 1 | Step: 185840 | Dataset: 0-686541 | Loss: 0.717 | 912 ms/step , 6894.00 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 20:55:57 | Epoch: 1 | Step: 185850 | Dataset: 0-686861 | Loss: 0.775 | 914 ms/step , 6878.27 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 20:56:06 | Epoch: 1 | Step: 185860 | Dataset: 0-687181 | Loss: 0.367 | 913 ms/step , 6890.56 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 20:56:15 | Epoch: 1 | Step: 185870 | Dataset: 0-687501 | Loss: 0.764 | 916 ms/step , 6867.59 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-05 20:56:24 | Epoch: 1 | Step: 185880 | Dataset: 0-687821 | Loss: 0.737 | 913 ms/step , 6887.88 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 20:56:33 | Epoch: 1 | Step: 185890 | Dataset: 0-688141 | Loss: 0.744 | 913 ms/step , 6885.30 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 20:56:43 | Epoch: 1 | Step: 185900 | Dataset: 0-688461 | Loss: 0.739 | 914 ms/step , 6881.33 GFLOP/s , 17916.7 
tokens/s INFO:__main__:2024-11-05 20:56:44 | Validation | Step: 185900 | Val_loss: 0.471 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:56:53 | Epoch: 1 | Step: 185910 | Dataset: 0-688781 | Loss: 0.558 | 912 ms/step , 6892.67 GFLOP/s , 15279.1 tokens/s INFO:__main__:2024-11-05 20:57:02 | Epoch: 1 | Step: 185920 | Dataset: 0-689101 | Loss: 0.864 | 913 ms/step , 6887.48 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 20:57:12 | Epoch: 1 | Step: 185930 | Dataset: 0-689421 | Loss: 0.606 | 913 ms/step , 6886.78 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 20:57:21 | Epoch: 1 | Step: 185940 | Dataset: 0-689741 | Loss: 0.682 | 913 ms/step , 6889.52 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 20:57:30 | Epoch: 1 | Step: 185950 | Dataset: 0-690061 | Loss: 0.814 | 913 ms/step , 6890.01 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 20:57:39 | Epoch: 1 | Step: 185960 | Dataset: 0-690381 | Loss: 0.708 | 913 ms/step , 6886.34 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 20:57:48 | Epoch: 1 | Step: 185970 | Dataset: 0-690701 | Loss: 0.831 | 913 ms/step , 6891.53 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 20:57:57 | Epoch: 1 | Step: 185980 | Dataset: 0-691021 | Loss: 0.672 | 913 ms/step , 6889.19 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 20:58:06 | Epoch: 1 | Step: 185990 | Dataset: 0-691341 | Loss: 0.589 | 914 ms/step , 6879.59 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 20:58:16 | Epoch: 1 | Step: 186000 | Dataset: 0-691661 | Loss: 0.691 | 914 ms/step , 6881.42 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 20:58:17 | Validation | Step: 186000 | Val_loss: 0.570 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 20:58:17 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_205817_step_186000.pt` INFO:__main__:2024-11-05 20:58:27 | Epoch: 1 | Step: 186010 | Dataset: 0-691981 | Loss: 0.612 | 913 ms/step , 6892.29 GFLOP/s , 13825.7 tokens/s INFO:__main__:2024-11-05 20:58:37 | Epoch: 
1 | Step: 186020 | Dataset: 0-692301 | Loss: 0.764 | 913 ms/step , 6887.66 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 20:58:46 | Epoch: 1 | Step: 186030 | Dataset: 0-692621 | Loss: 0.838 | 915 ms/step , 6872.46 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 20:58:55 | Epoch: 1 | Step: 186040 | Dataset: 0-692941 | Loss: 0.718 | 916 ms/step , 6869.59 GFLOP/s , 17897.6 tokens/s INFO:__main__:2024-11-05 20:59:04 | Epoch: 1 | Step: 186050 | Dataset: 0-693261 | Loss: 0.672 | 914 ms/step , 6878.36 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 20:59:13 | Epoch: 1 | Step: 186060 | Dataset: 0-693581 | Loss: 0.651 | 912 ms/step , 6898.01 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 20:59:22 | Epoch: 1 | Step: 186070 | Dataset: 0-693901 | Loss: 0.582 | 913 ms/step , 6891.90 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 20:59:31 | Epoch: 1 | Step: 186080 | Dataset: 0-694221 | Loss: 0.760 | 914 ms/step , 6883.46 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 20:59:41 | Epoch: 1 | Step: 186090 | Dataset: 0-694541 | Loss: 0.744 | 914 ms/step , 6881.73 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 20:59:50 | Epoch: 1 | Step: 186100 | Dataset: 0-694861 | Loss: 0.682 | 914 ms/step , 6883.34 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 20:59:51 | Validation | Step: 186100 | Val_loss: 0.723 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:00:00 | Epoch: 1 | Step: 186110 | Dataset: 0-695181 | Loss: 0.725 | 913 ms/step , 6889.79 GFLOP/s , 15266.6 tokens/s INFO:__main__:2024-11-05 21:00:10 | Epoch: 1 | Step: 186120 | Dataset: 0-695501 | Loss: 0.757 | 913 ms/step , 6886.80 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 21:00:19 | Epoch: 1 | Step: 186130 | Dataset: 0-695821 | Loss: 0.437 | 913 ms/step , 6890.48 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 21:00:28 | Epoch: 1 | Step: 186140 | Dataset: 0-696141 | Loss: 0.836 | 916 ms/step , 6866.25 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 21:00:37 | Epoch: 1 | Step: 
186150 | Dataset: 0-696461 | Loss: 0.808 | 913 ms/step , 6891.54 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 21:00:46 | Epoch: 1 | Step: 186160 | Dataset: 0-696781 | Loss: 0.604 | 913 ms/step , 6891.03 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 21:00:55 | Epoch: 1 | Step: 186170 | Dataset: 0-697101 | Loss: 0.691 | 912 ms/step , 6893.38 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 21:01:04 | Epoch: 1 | Step: 186180 | Dataset: 0-697421 | Loss: 0.899 | 914 ms/step , 6880.02 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 21:01:14 | Epoch: 1 | Step: 186190 | Dataset: 0-697741 | Loss: 0.705 | 913 ms/step , 6890.25 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 21:01:23 | Epoch: 1 | Step: 186200 | Dataset: 0-698061 | Loss: 0.710 | 913 ms/step , 6889.02 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 21:01:24 | Validation | Step: 186200 | Val_loss: 0.706 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:01:33 | Epoch: 1 | Step: 186210 | Dataset: 0-698381 | Loss: 0.818 | 913 ms/step , 6885.35 GFLOP/s , 15276.6 tokens/s INFO:__main__:2024-11-05 21:01:43 | Epoch: 1 | Step: 186220 | Dataset: 0-698701 | Loss: 0.749 | 915 ms/step , 6870.96 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 21:01:52 | Epoch: 1 | Step: 186230 | Dataset: 0-699021 | Loss: 0.502 | 913 ms/step , 6885.46 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 21:02:01 | Epoch: 1 | Step: 186240 | Dataset: 0-699341 | Loss: 0.885 | 914 ms/step , 6880.58 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 21:02:10 | Epoch: 1 | Step: 186250 | Dataset: 0-699661 | Loss: 0.896 | 915 ms/step , 6875.79 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 21:02:19 | Epoch: 1 | Step: 186260 | Dataset: 0-699981 | Loss: 0.692 | 913 ms/step , 6887.99 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 21:02:28 | Epoch: 1 | Step: 186270 | Dataset: 0-700301 | Loss: 0.803 | 914 ms/step , 6884.00 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 21:02:37 | Epoch: 1 | Step: 186280 | 
Dataset: 0-700621 | Loss: 0.549 | 914 ms/step , 6881.23 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 21:02:47 | Epoch: 1 | Step: 186290 | Dataset: 0-700941 | Loss: 0.752 | 913 ms/step , 6885.16 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 21:02:56 | Epoch: 1 | Step: 186300 | Dataset: 0-701261 | Loss: 0.918 | 914 ms/step , 6880.38 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 21:02:57 | Validation | Step: 186300 | Val_loss: 0.729 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:03:06 | Epoch: 1 | Step: 186310 | Dataset: 0-701581 | Loss: 0.865 | 913 ms/step , 6885.79 GFLOP/s , 15264.6 tokens/s INFO:__main__:2024-11-05 21:03:16 | Epoch: 1 | Step: 186320 | Dataset: 0-701901 | Loss: 0.832 | 915 ms/step , 6876.86 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 21:03:25 | Epoch: 1 | Step: 186330 | Dataset: 0-702221 | Loss: 0.753 | 912 ms/step , 6894.05 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-05 21:03:34 | Epoch: 1 | Step: 186340 | Dataset: 0-702541 | Loss: 0.723 | 913 ms/step , 6890.17 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-05 21:03:43 | Epoch: 1 | Step: 186350 | Dataset: 0-702861 | Loss: 0.774 | 913 ms/step , 6886.52 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 21:03:52 | Epoch: 1 | Step: 186360 | Dataset: 0-703181 | Loss: 0.748 | 913 ms/step , 6890.57 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 21:04:01 | Epoch: 1 | Step: 186370 | Dataset: 0-703501 | Loss: 0.745 | 914 ms/step , 6883.76 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 21:04:10 | Epoch: 1 | Step: 186380 | Dataset: 0-703821 | Loss: 0.749 | 914 ms/step , 6883.80 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 21:04:19 | Epoch: 1 | Step: 186390 | Dataset: 0-704141 | Loss: 0.745 | 913 ms/step , 6891.69 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 21:04:29 | Epoch: 1 | Step: 186400 | Dataset: 0-704461 | Loss: 0.759 | 912 ms/step , 6898.08 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 21:04:30 | Validation | Step: 186400 | 
Val_loss: 0.714 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:04:39 | Epoch: 1 | Step: 186410 | Dataset: 0-704781 | Loss: 0.758 | 914 ms/step , 6880.69 GFLOP/s , 15274.3 tokens/s INFO:__main__:2024-11-05 21:04:48 | Epoch: 1 | Step: 186420 | Dataset: 0-705101 | Loss: 0.749 | 913 ms/step , 6891.04 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 21:04:58 | Epoch: 1 | Step: 186430 | Dataset: 0-705421 | Loss: 0.710 | 915 ms/step , 6871.00 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 21:05:07 | Epoch: 1 | Step: 186440 | Dataset: 0-705741 | Loss: 0.853 | 914 ms/step , 6882.04 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 21:05:16 | Epoch: 1 | Step: 186450 | Dataset: 0-706061 | Loss: 0.710 | 913 ms/step , 6887.70 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 21:05:25 | Epoch: 1 | Step: 186460 | Dataset: 0-706381 | Loss: 0.596 | 914 ms/step , 6884.67 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 21:05:34 | Epoch: 1 | Step: 186470 | Dataset: 0-706701 | Loss: 0.784 | 914 ms/step , 6879.47 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 21:05:43 | Epoch: 1 | Step: 186480 | Dataset: 0-707021 | Loss: 0.729 | 913 ms/step , 6890.50 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 21:05:52 | Epoch: 1 | Step: 186490 | Dataset: 0-707341 | Loss: 0.821 | 914 ms/step , 6879.72 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 21:06:02 | Epoch: 1 | Step: 186500 | Dataset: 0-707661 | Loss: 0.634 | 913 ms/step , 6887.43 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 21:06:03 | Validation | Step: 186500 | Val_loss: 0.699 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:06:12 | Epoch: 1 | Step: 186510 | Dataset: 0-707981 | Loss: 0.692 | 914 ms/step , 6879.75 GFLOP/s , 15285.3 tokens/s INFO:__main__:2024-11-05 21:06:21 | Epoch: 1 | Step: 186520 | Dataset: 0-708301 | Loss: 0.552 | 912 ms/step , 6897.34 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 21:06:31 | Epoch: 1 | Step: 186530 | Dataset: 0-708621 | Loss: 0.663 | 912 ms/step , 
6895.94 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 21:06:40 | Epoch: 1 | Step: 186540 | Dataset: 0-708941 | Loss: 0.817 | 913 ms/step , 6887.36 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 21:06:49 | Epoch: 1 | Step: 186550 | Dataset: 0-709261 | Loss: 0.640 | 913 ms/step , 6888.64 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 21:06:58 | Epoch: 1 | Step: 186560 | Dataset: 0-709581 | Loss: 0.618 | 913 ms/step , 6889.06 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 21:07:07 | Epoch: 1 | Step: 186570 | Dataset: 0-709901 | Loss: 0.737 | 913 ms/step , 6886.02 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 21:07:16 | Epoch: 1 | Step: 186580 | Dataset: 0-710221 | Loss: 0.760 | 913 ms/step , 6885.67 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 21:07:25 | Epoch: 1 | Step: 186590 | Dataset: 0-710541 | Loss: 0.764 | 913 ms/step , 6885.08 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 21:07:35 | Epoch: 1 | Step: 186600 | Dataset: 0-710861 | Loss: 0.781 | 913 ms/step , 6888.85 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 21:07:36 | Validation | Step: 186600 | Val_loss: 0.709 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:07:45 | Epoch: 1 | Step: 186610 | Dataset: 0-711181 | Loss: 0.624 | 912 ms/step , 6893.76 GFLOP/s , 15265.3 tokens/s INFO:__main__:2024-11-05 21:07:54 | Epoch: 1 | Step: 186620 | Dataset: 0-711501 | Loss: 0.855 | 915 ms/step , 6870.18 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 21:08:04 | Epoch: 1 | Step: 186630 | Dataset: 0-711821 | Loss: 0.760 | 915 ms/step , 6873.60 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 21:08:13 | Epoch: 1 | Step: 186640 | Dataset: 0-712141 | Loss: 0.568 | 912 ms/step , 6892.66 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 21:08:22 | Epoch: 1 | Step: 186650 | Dataset: 0-712461 | Loss: 0.533 | 913 ms/step , 6891.26 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 21:08:31 | Epoch: 1 | Step: 186660 | Dataset: 0-712781 | Loss: 0.565 | 913 ms/step , 6888.63 
GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 21:08:40 | Epoch: 1 | Step: 186670 | Dataset: 0-713101 | Loss: 0.658 | 914 ms/step , 6884.99 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 21:08:49 | Epoch: 1 | Step: 186680 | Dataset: 0-713421 | Loss: 0.502 | 913 ms/step , 6886.07 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 21:08:58 | Epoch: 1 | Step: 186690 | Dataset: 0-713741 | Loss: 0.655 | 913 ms/step , 6887.23 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 21:09:08 | Epoch: 1 | Step: 186700 | Dataset: 0-714061 | Loss: 0.734 | 913 ms/step , 6887.12 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 21:09:09 | Validation | Step: 186700 | Val_loss: 0.718 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:09:18 | Epoch: 1 | Step: 186710 | Dataset: 0-714381 | Loss: 0.784 | 913 ms/step , 6887.12 GFLOP/s , 15274.0 tokens/s INFO:__main__:2024-11-05 21:09:27 | Epoch: 1 | Step: 186720 | Dataset: 0-714701 | Loss: 0.775 | 914 ms/step , 6881.07 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 21:09:37 | Epoch: 1 | Step: 186730 | Dataset: 0-715021 | Loss: 0.670 | 914 ms/step , 6881.97 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 21:09:46 | Epoch: 1 | Step: 186740 | Dataset: 0-715341 | Loss: 0.425 | 911 ms/step , 6900.92 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 21:09:55 | Epoch: 1 | Step: 186750 | Dataset: 0-715661 | Loss: 0.781 | 913 ms/step , 6890.75 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 21:10:04 | Epoch: 1 | Step: 186760 | Dataset: 0-715981 | Loss: 0.703 | 913 ms/step , 6889.37 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 21:10:13 | Epoch: 1 | Step: 186770 | Dataset: 0-716301 | Loss: 0.628 | 912 ms/step , 6894.83 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 21:10:22 | Epoch: 1 | Step: 186780 | Dataset: 0-716621 | Loss: 0.683 | 913 ms/step , 6891.12 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 21:10:31 | Epoch: 1 | Step: 186790 | Dataset: 0-716941 | Loss: 0.738 | 914 ms/step , 6882.61 GFLOP/s , 
17927.4 tokens/s INFO:__main__:2024-11-05 21:10:41 | Epoch: 1 | Step: 186800 | Dataset: 0-717261 | Loss: 0.823 | 913 ms/step , 6889.26 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 21:10:42 | Validation | Step: 186800 | Val_loss: 0.701 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:10:51 | Epoch: 1 | Step: 186810 | Dataset: 0-717581 | Loss: 0.798 | 914 ms/step , 6881.80 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 21:11:00 | Epoch: 1 | Step: 186820 | Dataset: 0-717901 | Loss: 0.845 | 913 ms/step , 6887.19 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 21:11:10 | Epoch: 1 | Step: 186830 | Dataset: 0-718221 | Loss: 0.849 | 914 ms/step , 6880.96 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-05 21:11:19 | Epoch: 1 | Step: 186840 | Dataset: 0-718541 | Loss: 0.731 | 914 ms/step , 6879.98 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 21:11:28 | Epoch: 1 | Step: 186850 | Dataset: 0-718861 | Loss: 0.675 | 913 ms/step , 6886.16 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 21:11:37 | Epoch: 1 | Step: 186860 | Dataset: 0-719181 | Loss: 0.620 | 913 ms/step , 6891.12 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 21:11:46 | Epoch: 1 | Step: 186870 | Dataset: 0-719501 | Loss: 0.783 | 914 ms/step , 6883.90 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 21:11:55 | Epoch: 1 | Step: 186880 | Dataset: 0-719821 | Loss: 0.824 | 913 ms/step , 6888.60 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 21:12:04 | Epoch: 1 | Step: 186890 | Dataset: 0-720141 | Loss: 0.783 | 914 ms/step , 6884.24 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 21:12:13 | Epoch: 1 | Step: 186900 | Dataset: 0-720461 | Loss: 0.698 | 914 ms/step , 6883.61 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 21:12:15 | Validation | Step: 186900 | Val_loss: 0.728 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:12:24 | Epoch: 1 | Step: 186910 | Dataset: 0-720781 | Loss: 0.776 | 912 ms/step , 6892.94 GFLOP/s , 15270.7 tokens/s INFO:__main__:2024-11-05 21:12:33 
| Epoch: 1 | Step: 186920 | Dataset: 0-721101 | Loss: 0.812 | 914 ms/step , 6880.10 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 21:12:42 | Epoch: 1 | Step: 186930 | Dataset: 0-721421 | Loss: 0.825 | 913 ms/step , 6888.04 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 21:12:52 | Epoch: 1 | Step: 186940 | Dataset: 0-721741 | Loss: 0.841 | 912 ms/step , 6892.73 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 21:13:01 | Epoch: 1 | Step: 186950 | Dataset: 0-722061 | Loss: 0.769 | 913 ms/step , 6887.26 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 21:13:10 | Epoch: 1 | Step: 186960 | Dataset: 0-722381 | Loss: 0.717 | 912 ms/step , 6893.33 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 21:13:19 | Epoch: 1 | Step: 186970 | Dataset: 0-722701 | Loss: 0.709 | 913 ms/step , 6887.35 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 21:13:28 | Epoch: 1 | Step: 186980 | Dataset: 0-723021 | Loss: 0.716 | 913 ms/step , 6887.37 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 21:13:37 | Epoch: 1 | Step: 186990 | Dataset: 0-723341 | Loss: 0.804 | 913 ms/step , 6886.17 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 21:13:46 | Epoch: 1 | Step: 187000 | Dataset: 0-723661 | Loss: 0.686 | 914 ms/step , 6881.37 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 21:13:48 | Validation | Step: 187000 | Val_loss: 0.756 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:13:48 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_211348_step_187000.pt` INFO:__main__:2024-11-05 21:13:58 | Epoch: 1 | Step: 187010 | Dataset: 0-723981 | Loss: 0.727 | 913 ms/step , 6888.23 GFLOP/s , 13794.5 tokens/s INFO:__main__:2024-11-05 21:14:07 | Epoch: 1 | Step: 187020 | Dataset: 0-724301 | Loss: 0.780 | 914 ms/step , 6878.11 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 21:14:17 | Epoch: 1 | Step: 187030 | Dataset: 0-724621 | Loss: 0.626 | 915 ms/step , 6875.03 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 21:14:26 | Epoch: 1 
| Step: 187040 | Dataset: 0-724941 | Loss: 0.852 | 914 ms/step , 6882.20 GFLOP/s , 17891.9 tokens/s INFO:__main__:2024-11-05 21:14:35 | Epoch: 1 | Step: 187050 | Dataset: 0-725261 | Loss: 0.787 | 914 ms/step , 6878.16 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 21:14:44 | Epoch: 1 | Step: 187060 | Dataset: 0-725581 | Loss: 0.857 | 915 ms/step , 6876.79 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 21:14:53 | Epoch: 1 | Step: 187070 | Dataset: 0-725901 | Loss: 0.736 | 914 ms/step , 6881.62 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-05 21:15:02 | Epoch: 1 | Step: 187080 | Dataset: 0-726221 | Loss: 0.711 | 913 ms/step , 6889.49 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 21:15:11 | Epoch: 1 | Step: 187090 | Dataset: 0-726541 | Loss: 0.717 | 913 ms/step , 6888.20 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 21:15:21 | Epoch: 1 | Step: 187100 | Dataset: 0-726861 | Loss: 0.708 | 915 ms/step , 6874.25 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 21:15:22 | Validation | Step: 187100 | Val_loss: 0.688 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:15:31 | Epoch: 1 | Step: 187110 | Dataset: 0-727181 | Loss: 0.643 | 913 ms/step , 6889.47 GFLOP/s , 15272.7 tokens/s INFO:__main__:2024-11-05 21:15:40 | Epoch: 1 | Step: 187120 | Dataset: 0-727501 | Loss: 0.672 | 913 ms/step , 6889.59 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 21:15:50 | Epoch: 1 | Step: 187130 | Dataset: 0-727821 | Loss: 0.650 | 913 ms/step , 6885.20 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 21:15:59 | Epoch: 1 | Step: 187140 | Dataset: 0-728141 | Loss: 0.705 | 913 ms/step , 6887.27 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 21:16:08 | Epoch: 1 | Step: 187150 | Dataset: 0-728461 | Loss: 0.770 | 913 ms/step , 6888.30 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 21:16:17 | Epoch: 1 | Step: 187160 | Dataset: 0-728781 | Loss: 0.809 | 913 ms/step , 6888.56 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 21:16:26 | Epoch: 1 | Step: 
187170 | Dataset: 0-729101 | Loss: 0.788 | 913 ms/step , 6885.32 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-05 21:16:35 | Epoch: 1 | Step: 187180 | Dataset: 0-729421 | Loss: 0.729 | 914 ms/step , 6880.81 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-05 21:16:44 | Epoch: 1 | Step: 187190 | Dataset: 0-729741 | Loss: 0.665 | 913 ms/step , 6885.38 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-05 21:16:54 | Epoch: 1 | Step: 187200 | Dataset: 0-730061 | Loss: 0.673 | 914 ms/step , 6879.26 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-05 21:16:55 | Validation | Step: 187200 | Val_loss: 0.735 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:17:04 | Epoch: 1 | Step: 187210 | Dataset: 0-730381 | Loss: 0.649 | 916 ms/step , 6867.62 GFLOP/s , 15259.5 tokens/s INFO:__main__:2024-11-05 21:17:14 | Epoch: 1 | Step: 187220 | Dataset: 0-730701 | Loss: 0.676 | 914 ms/step , 6883.56 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-05 21:17:23 | Epoch: 1 | Step: 187230 | Dataset: 0-731021 | Loss: 0.716 | 914 ms/step , 6879.12 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 21:17:32 | Epoch: 1 | Step: 187240 | Dataset: 0-731341 | Loss: 0.688 | 913 ms/step , 6887.57 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 21:17:41 | Epoch: 1 | Step: 187250 | Dataset: 0-731661 | Loss: 0.735 | 914 ms/step , 6880.43 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-05 21:17:50 | Epoch: 1 | Step: 187260 | Dataset: 0-731981 | Loss: 0.662 | 914 ms/step , 6879.72 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 21:17:59 | Epoch: 1 | Step: 187270 | Dataset: 0-732301 | Loss: 0.691 | 915 ms/step , 6874.37 GFLOP/s , 17910.7 tokens/s INFO:__main__:2024-11-05 21:18:08 | Epoch: 1 | Step: 187280 | Dataset: 0-732621 | Loss: 0.746 | 913 ms/step , 6887.35 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 21:18:18 | Epoch: 1 | Step: 187290 | Dataset: 0-732941 | Loss: 0.794 | 914 ms/step , 6883.06 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-05 21:18:27 | Epoch: 1 | Step: 187300 | 
Dataset: 0-733261 | Loss: 0.722 | 913 ms/step , 6887.61 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-05 21:18:28 | Validation | Step: 187300 | Val_loss: 0.684 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:18:37 | Epoch: 1 | Step: 187310 | Dataset: 0-733581 | Loss: 0.713 | 913 ms/step , 6886.65 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-05 21:18:47 | Epoch: 1 | Step: 187320 | Dataset: 0-733901 | Loss: 0.711 | 913 ms/step , 6887.93 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 21:18:56 | Epoch: 1 | Step: 187330 | Dataset: 0-734221 | Loss: 0.846 | 915 ms/step , 6870.60 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 21:19:05 | Epoch: 1 | Step: 187340 | Dataset: 0-734541 | Loss: 0.670 | 915 ms/step , 6876.39 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-05 21:19:14 | Epoch: 1 | Step: 187350 | Dataset: 0-734861 | Loss: 0.637 | 916 ms/step , 6869.01 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 21:19:23 | Epoch: 1 | Step: 187360 | Dataset: 0-735181 | Loss: 0.732 | 916 ms/step , 6868.02 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-05 21:19:32 | Epoch: 1 | Step: 187370 | Dataset: 0-735501 | Loss: 0.650 | 913 ms/step , 6885.26 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 21:19:41 | Epoch: 1 | Step: 187380 | Dataset: 0-735821 | Loss: 0.663 | 914 ms/step , 6880.38 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 21:19:51 | Epoch: 1 | Step: 187390 | Dataset: 0-736141 | Loss: 0.652 | 913 ms/step , 6887.71 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 21:20:00 | Epoch: 1 | Step: 187400 | Dataset: 0-736461 | Loss: 0.753 | 914 ms/step , 6882.28 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 21:20:01 | Validation | Step: 187400 | Val_loss: 0.853 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:20:10 | Epoch: 1 | Step: 187410 | Dataset: 0-736781 | Loss: 0.770 | 915 ms/step , 6876.64 GFLOP/s , 15260.8 tokens/s INFO:__main__:2024-11-05 21:20:20 | Epoch: 1 | Step: 187420 | Dataset: 0-737101 | Loss: 0.717 | 913 ms/step , 
6886.11 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 21:20:29 | Epoch: 1 | Step: 187430 | Dataset: 0-737421 | Loss: 0.772 | 915 ms/step , 6873.74 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-05 21:20:38 | Epoch: 1 | Step: 187440 | Dataset: 0-737741 | Loss: 0.733 | 914 ms/step , 6880.05 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 21:20:47 | Epoch: 1 | Step: 187450 | Dataset: 0-738061 | Loss: 0.675 | 913 ms/step , 6891.80 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 21:20:56 | Epoch: 1 | Step: 187460 | Dataset: 0-738381 | Loss: 0.672 | 915 ms/step , 6876.01 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 21:21:05 | Epoch: 1 | Step: 187470 | Dataset: 0-738701 | Loss: 0.642 | 914 ms/step , 6884.53 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 21:21:14 | Epoch: 1 | Step: 187480 | Dataset: 0-739021 | Loss: 0.783 | 913 ms/step , 6888.29 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 21:21:24 | Epoch: 1 | Step: 187490 | Dataset: 0-739341 | Loss: 0.586 | 914 ms/step , 6880.15 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 21:21:33 | Epoch: 1 | Step: 187500 | Dataset: 0-739661 | Loss: 0.750 | 914 ms/step , 6880.56 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 21:21:34 | Validation | Step: 187500 | Val_loss: 0.735 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:21:43 | Epoch: 1 | Step: 187510 | Dataset: 0-739981 | Loss: 0.717 | 913 ms/step , 6885.47 GFLOP/s , 15267.1 tokens/s INFO:__main__:2024-11-05 21:21:53 | Epoch: 1 | Step: 187520 | Dataset: 0-740301 | Loss: 0.786 | 914 ms/step , 6881.63 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 21:22:02 | Epoch: 1 | Step: 187530 | Dataset: 0-740621 | Loss: 0.772 | 914 ms/step , 6881.21 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-05 21:22:11 | Epoch: 1 | Step: 187540 | Dataset: 0-740941 | Loss: 0.815 | 914 ms/step , 6882.88 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-05 21:22:20 | Epoch: 1 | Step: 187550 | Dataset: 0-741261 | Loss: 0.745 | 913 ms/step , 6891.91 
GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 21:22:29 | Epoch: 1 | Step: 187560 | Dataset: 0-741581 | Loss: 0.768 | 914 ms/step , 6883.82 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 21:22:38 | Epoch: 1 | Step: 187570 | Dataset: 0-741901 | Loss: 0.710 | 914 ms/step , 6882.67 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 21:22:47 | Epoch: 1 | Step: 187580 | Dataset: 0-742221 | Loss: 0.622 | 913 ms/step , 6886.28 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 21:22:57 | Epoch: 1 | Step: 187590 | Dataset: 0-742541 | Loss: 0.641 | 913 ms/step , 6890.77 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 21:23:06 | Epoch: 1 | Step: 187600 | Dataset: 0-742861 | Loss: 0.783 | 913 ms/step , 6892.13 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 21:23:07 | Validation | Step: 187600 | Val_loss: 0.804 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:23:16 | Epoch: 1 | Step: 187610 | Dataset: 0-743181 | Loss: 0.771 | 914 ms/step , 6881.50 GFLOP/s , 15269.9 tokens/s INFO:__main__:2024-11-05 21:23:26 | Epoch: 1 | Step: 187620 | Dataset: 0-743501 | Loss: 0.749 | 914 ms/step , 6879.17 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 21:23:35 | Epoch: 1 | Step: 187630 | Dataset: 0-743821 | Loss: 0.756 | 913 ms/step , 6885.38 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 21:23:44 | Epoch: 1 | Step: 187640 | Dataset: 0-744141 | Loss: 0.693 | 913 ms/step , 6889.20 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 21:23:53 | Epoch: 1 | Step: 187650 | Dataset: 0-744461 | Loss: 0.719 | 915 ms/step , 6870.33 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 21:24:02 | Epoch: 1 | Step: 187660 | Dataset: 0-744781 | Loss: 0.667 | 914 ms/step , 6879.07 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 21:24:11 | Epoch: 1 | Step: 187670 | Dataset: 0-745101 | Loss: 0.796 | 913 ms/step , 6890.40 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 21:24:20 | Epoch: 1 | Step: 187680 | Dataset: 0-745421 | Loss: 0.680 | 914 ms/step , 6879.32 GFLOP/s , 
17922.2 tokens/s INFO:__main__:2024-11-05 21:24:30 | Epoch: 1 | Step: 187690 | Dataset: 0-745741 | Loss: 0.605 | 913 ms/step , 6885.98 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 21:24:39 | Epoch: 1 | Step: 187700 | Dataset: 0-746061 | Loss: 0.628 | 913 ms/step , 6891.39 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-05 21:24:40 | Validation | Step: 187700 | Val_loss: 0.827 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:24:49 | Epoch: 1 | Step: 187710 | Dataset: 0-746381 | Loss: 0.723 | 914 ms/step , 6881.62 GFLOP/s , 15288.2 tokens/s INFO:__main__:2024-11-05 21:24:59 | Epoch: 1 | Step: 187720 | Dataset: 0-746701 | Loss: 0.742 | 913 ms/step , 6892.52 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 21:25:08 | Epoch: 1 | Step: 187730 | Dataset: 0-747021 | Loss: 0.806 | 913 ms/step , 6886.21 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 21:25:17 | Epoch: 1 | Step: 187740 | Dataset: 0-747341 | Loss: 0.805 | 913 ms/step , 6891.89 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 21:25:26 | Epoch: 1 | Step: 187750 | Dataset: 0-747661 | Loss: 0.831 | 913 ms/step , 6888.64 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 21:25:35 | Epoch: 1 | Step: 187760 | Dataset: 0-747981 | Loss: 0.675 | 914 ms/step , 6881.56 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 21:25:44 | Epoch: 1 | Step: 187770 | Dataset: 0-748301 | Loss: 0.738 | 912 ms/step , 6895.48 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 21:25:53 | Epoch: 1 | Step: 187780 | Dataset: 0-748621 | Loss: 0.689 | 913 ms/step , 6890.32 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 21:26:02 | Epoch: 1 | Step: 187790 | Dataset: 0-748941 | Loss: 0.579 | 915 ms/step , 6876.30 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 21:26:12 | Epoch: 1 | Step: 187800 | Dataset: 0-749261 | Loss: 0.636 | 913 ms/step , 6885.58 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 21:26:13 | Validation | Step: 187800 | Val_loss: 0.840 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:26:22 
| Epoch: 1 | Step: 187810 | Dataset: 0-749581 | Loss: 0.657 | 914 ms/step , 6881.23 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-05 21:26:31 | Epoch: 1 | Step: 187820 | Dataset: 0-749901 | Loss: 0.685 | 914 ms/step , 6882.50 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 21:26:41 | Epoch: 1 | Step: 187830 | Dataset: 0-750221 | Loss: 0.734 | 913 ms/step , 6885.51 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 21:26:50 | Epoch: 1 | Step: 187840 | Dataset: 0-750541 | Loss: 0.724 | 914 ms/step , 6882.81 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 21:26:59 | Epoch: 1 | Step: 187850 | Dataset: 0-750861 | Loss: 0.716 | 915 ms/step , 6874.47 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 21:27:08 | Epoch: 1 | Step: 187860 | Dataset: 0-751181 | Loss: 0.709 | 913 ms/step , 6888.38 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 21:27:17 | Epoch: 1 | Step: 187870 | Dataset: 0-751501 | Loss: 0.731 | 915 ms/step , 6877.08 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 21:27:26 | Epoch: 1 | Step: 187880 | Dataset: 0-751821 | Loss: 0.618 | 913 ms/step , 6888.93 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-05 21:27:35 | Epoch: 1 | Step: 187890 | Dataset: 0-752141 | Loss: 0.758 | 914 ms/step , 6880.30 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 21:27:45 | Epoch: 1 | Step: 187900 | Dataset: 0-752461 | Loss: 0.684 | 914 ms/step , 6879.80 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-05 21:27:46 | Validation | Step: 187900 | Val_loss: 0.692 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:27:55 | Epoch: 1 | Step: 187910 | Dataset: 0-752781 | Loss: 0.772 | 913 ms/step , 6885.48 GFLOP/s , 15274.4 tokens/s INFO:__main__:2024-11-05 21:28:04 | Epoch: 1 | Step: 187920 | Dataset: 0-753101 | Loss: 0.790 | 913 ms/step , 6888.84 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 21:28:14 | Epoch: 1 | Step: 187930 | Dataset: 0-753421 | Loss: 0.712 | 913 ms/step , 6887.81 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 21:28:23 | Epoch: 1 
| Step: 187940 | Dataset: 0-753741 | Loss: 0.633 | 915 ms/step , 6875.24 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 21:28:32 | Epoch: 1 | Step: 187950 | Dataset: 0-754061 | Loss: 0.710 | 913 ms/step , 6889.88 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 21:28:41 | Epoch: 1 | Step: 187960 | Dataset: 0-754381 | Loss: 0.789 | 915 ms/step , 6872.76 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 21:28:50 | Epoch: 1 | Step: 187970 | Dataset: 0-754701 | Loss: 0.810 | 914 ms/step , 6879.77 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 21:28:59 | Epoch: 1 | Step: 187980 | Dataset: 0-755021 | Loss: 0.662 | 913 ms/step , 6885.91 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 21:29:08 | Epoch: 1 | Step: 187990 | Dataset: 0-755341 | Loss: 0.722 | 913 ms/step , 6885.21 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-05 21:29:18 | Epoch: 1 | Step: 188000 | Dataset: 0-755661 | Loss: 0.623 | 913 ms/step , 6891.34 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-05 21:29:19 | Validation | Step: 188000 | Val_loss: 0.712 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:29:19 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_212919_step_188000.pt` INFO:__main__:2024-11-05 21:29:29 | Epoch: 1 | Step: 188010 | Dataset: 0-755981 | Loss: 0.651 | 914 ms/step , 6881.44 GFLOP/s , 13805.5 tokens/s INFO:__main__:2024-11-05 21:29:39 | Epoch: 1 | Step: 188020 | Dataset: 0-756301 | Loss: 0.703 | 912 ms/step , 6892.77 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 21:29:48 | Epoch: 1 | Step: 188030 | Dataset: 0-756621 | Loss: 0.732 | 913 ms/step , 6886.40 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 21:29:57 | Epoch: 1 | Step: 188040 | Dataset: 0-756941 | Loss: 0.611 | 913 ms/step , 6890.07 GFLOP/s , 17905.2 tokens/s INFO:__main__:2024-11-05 21:30:06 | Epoch: 1 | Step: 188050 | Dataset: 0-757261 | Loss: 0.479 | 913 ms/step , 6891.58 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 21:30:15 | Epoch: 1 | Step: 
188060 | Dataset: 0-757581 | Loss: 0.725 | 914 ms/step , 6879.76 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 21:30:24 | Epoch: 1 | Step: 188070 | Dataset: 0-757901 | Loss: 0.764 | 913 ms/step , 6887.99 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 21:30:33 | Epoch: 1 | Step: 188080 | Dataset: 0-758221 | Loss: 0.766 | 915 ms/step , 6875.45 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 21:30:43 | Epoch: 1 | Step: 188090 | Dataset: 0-758541 | Loss: 0.659 | 914 ms/step , 6881.88 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 21:30:52 | Epoch: 1 | Step: 188100 | Dataset: 0-758861 | Loss: 0.714 | 913 ms/step , 6889.52 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 21:30:53 | Validation | Step: 188100 | Val_loss: 0.837 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:31:02 | Epoch: 1 | Step: 188110 | Dataset: 0-759181 | Loss: 0.671 | 913 ms/step , 6891.95 GFLOP/s , 15272.9 tokens/s INFO:__main__:2024-11-05 21:31:12 | Epoch: 1 | Step: 188120 | Dataset: 0-759501 | Loss: 0.642 | 913 ms/step , 6887.54 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 21:31:21 | Epoch: 1 | Step: 188130 | Dataset: 0-759821 | Loss: 0.674 | 912 ms/step , 6892.65 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 21:31:30 | Epoch: 1 | Step: 188140 | Dataset: 0-760141 | Loss: 0.709 | 913 ms/step , 6889.58 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 21:31:39 | Epoch: 1 | Step: 188150 | Dataset: 0-760461 | Loss: 0.681 | 913 ms/step , 6886.24 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 21:31:48 | Epoch: 1 | Step: 188160 | Dataset: 0-760781 | Loss: 0.420 | 912 ms/step , 6893.53 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-05 21:31:57 | Epoch: 1 | Step: 188170 | Dataset: 0-761101 | Loss: 0.642 | 914 ms/step , 6881.22 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 21:32:06 | Epoch: 1 | Step: 188180 | Dataset: 0-761421 | Loss: 0.751 | 914 ms/step , 6880.63 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 21:32:16 | Epoch: 1 | Step: 188190 | 
Dataset: 0-761741 | Loss: 0.658 | 914 ms/step , 6879.93 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 21:32:25 | Epoch: 1 | Step: 188200 | Dataset: 0-762061 | Loss: 0.646 | 913 ms/step , 6887.46 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 21:32:26 | Validation | Step: 188200 | Val_loss: 0.811 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:32:35 | Epoch: 1 | Step: 188210 | Dataset: 0-762381 | Loss: 0.682 | 913 ms/step , 6885.24 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-05 21:32:45 | Epoch: 1 | Step: 188220 | Dataset: 0-762701 | Loss: 0.749 | 913 ms/step , 6891.23 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 21:32:54 | Epoch: 1 | Step: 188230 | Dataset: 0-763021 | Loss: 0.620 | 914 ms/step , 6883.23 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 21:33:03 | Epoch: 1 | Step: 188240 | Dataset: 0-763341 | Loss: 0.752 | 913 ms/step , 6890.00 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 21:33:12 | Epoch: 1 | Step: 188250 | Dataset: 0-763661 | Loss: 0.710 | 914 ms/step , 6879.01 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-05 21:33:21 | Epoch: 1 | Step: 188260 | Dataset: 0-763981 | Loss: 0.819 | 914 ms/step , 6879.60 GFLOP/s , 17912.6 tokens/s INFO:__main__:2024-11-05 21:33:30 | Epoch: 1 | Step: 188270 | Dataset: 0-764301 | Loss: 0.660 | 913 ms/step , 6886.00 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-05 21:33:39 | Epoch: 1 | Step: 188280 | Dataset: 0-764621 | Loss: 0.674 | 914 ms/step , 6882.67 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 21:33:49 | Epoch: 1 | Step: 188290 | Dataset: 0-764941 | Loss: 0.674 | 913 ms/step , 6889.51 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 21:33:58 | Epoch: 1 | Step: 188300 | Dataset: 0-765261 | Loss: 0.645 | 913 ms/step , 6887.44 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 21:33:59 | Validation | Step: 188300 | Val_loss: 0.846 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:34:08 | Epoch: 1 | Step: 188310 | Dataset: 0-765581 | Loss: 0.693 | 913 ms/step , 
6890.81 GFLOP/s , 15268.1 tokens/s INFO:__main__:2024-11-05 21:34:18 | Epoch: 1 | Step: 188320 | Dataset: 0-765901 | Loss: 0.706 | 915 ms/step , 6875.99 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 21:34:27 | Epoch: 1 | Step: 188330 | Dataset: 0-766221 | Loss: 0.670 | 913 ms/step , 6889.64 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 21:34:36 | Epoch: 1 | Step: 188340 | Dataset: 0-766541 | Loss: 0.706 | 913 ms/step , 6887.22 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 21:34:45 | Epoch: 1 | Step: 188350 | Dataset: 0-766861 | Loss: 0.774 | 914 ms/step , 6881.81 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-05 21:34:54 | Epoch: 1 | Step: 188360 | Dataset: 0-767181 | Loss: 0.725 | 914 ms/step , 6877.70 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 21:35:03 | Epoch: 1 | Step: 188370 | Dataset: 0-767501 | Loss: 0.704 | 913 ms/step , 6891.60 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 21:35:12 | Epoch: 1 | Step: 188380 | Dataset: 0-767821 | Loss: 0.697 | 913 ms/step , 6885.41 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-05 21:35:22 | Epoch: 1 | Step: 188390 | Dataset: 0-768141 | Loss: 0.782 | 915 ms/step , 6875.81 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-05 21:35:31 | Epoch: 1 | Step: 188400 | Dataset: 0-768461 | Loss: 0.650 | 915 ms/step , 6876.54 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 21:35:32 | Validation | Step: 188400 | Val_loss: 0.842 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:35:41 | Epoch: 1 | Step: 188410 | Dataset: 0-768781 | Loss: 0.696 | 916 ms/step , 6868.70 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-05 21:35:51 | Epoch: 1 | Step: 188420 | Dataset: 0-769101 | Loss: 0.634 | 915 ms/step , 6874.94 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 21:36:00 | Epoch: 1 | Step: 188430 | Dataset: 0-769421 | Loss: 0.713 | 913 ms/step , 6889.10 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 21:36:09 | Epoch: 1 | Step: 188440 | Dataset: 0-769741 | Loss: 0.633 | 913 ms/step , 6886.39 
GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 21:36:18 | Epoch: 1 | Step: 188450 | Dataset: 0-770061 | Loss: 0.578 | 912 ms/step , 6898.97 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 21:36:27 | Epoch: 1 | Step: 188460 | Dataset: 0-770381 | Loss: 0.721 | 913 ms/step , 6886.80 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 21:36:36 | Epoch: 1 | Step: 188470 | Dataset: 0-770701 | Loss: 0.801 | 915 ms/step , 6875.02 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 21:36:45 | Epoch: 1 | Step: 188480 | Dataset: 0-771021 | Loss: 0.743 | 913 ms/step , 6886.21 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 21:36:55 | Epoch: 1 | Step: 188490 | Dataset: 0-771341 | Loss: 0.801 | 914 ms/step , 6883.96 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 21:37:04 | Epoch: 1 | Step: 188500 | Dataset: 0-771661 | Loss: 0.849 | 913 ms/step , 6891.36 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 21:37:05 | Validation | Step: 188500 | Val_loss: 0.785 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:37:14 | Epoch: 1 | Step: 188510 | Dataset: 0-771981 | Loss: 0.713 | 912 ms/step , 6894.35 GFLOP/s , 15281.3 tokens/s INFO:__main__:2024-11-05 21:37:24 | Epoch: 1 | Step: 188520 | Dataset: 0-772301 | Loss: 0.668 | 912 ms/step , 6893.91 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 21:37:33 | Epoch: 1 | Step: 188530 | Dataset: 0-772621 | Loss: 0.826 | 913 ms/step , 6885.72 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 21:37:42 | Epoch: 1 | Step: 188540 | Dataset: 0-772941 | Loss: 0.795 | 914 ms/step , 6880.05 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 21:37:51 | Epoch: 1 | Step: 188550 | Dataset: 0-773261 | Loss: 0.793 | 913 ms/step , 6888.64 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 21:38:00 | Epoch: 1 | Step: 188560 | Dataset: 0-773581 | Loss: 0.608 | 911 ms/step , 6905.28 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 21:38:09 | Epoch: 1 | Step: 188570 | Dataset: 0-773901 | Loss: 0.726 | 913 ms/step , 6887.43 GFLOP/s , 
17939.0 tokens/s INFO:__main__:2024-11-05 21:38:18 | Epoch: 1 | Step: 188580 | Dataset: 0-774221 | Loss: 0.732 | 913 ms/step , 6888.65 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 21:38:28 | Epoch: 1 | Step: 188590 | Dataset: 0-774541 | Loss: 0.675 | 913 ms/step , 6891.70 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 21:38:37 | Epoch: 1 | Step: 188600 | Dataset: 0-774861 | Loss: 0.852 | 915 ms/step , 6875.40 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 21:38:38 | Validation | Step: 188600 | Val_loss: 0.853 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:38:47 | Epoch: 1 | Step: 188610 | Dataset: 0-775181 | Loss: 0.728 | 914 ms/step , 6884.58 GFLOP/s , 15271.2 tokens/s INFO:__main__:2024-11-05 21:38:57 | Epoch: 1 | Step: 188620 | Dataset: 0-775501 | Loss: 0.755 | 913 ms/step , 6886.94 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 21:39:06 | Epoch: 1 | Step: 188630 | Dataset: 0-775821 | Loss: 0.807 | 913 ms/step , 6888.22 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 21:39:15 | Epoch: 1 | Step: 188640 | Dataset: 0-776141 | Loss: 0.730 | 914 ms/step , 6883.98 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 21:39:24 | Epoch: 1 | Step: 188650 | Dataset: 0-776461 | Loss: 0.777 | 912 ms/step , 6896.61 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 21:39:33 | Epoch: 1 | Step: 188660 | Dataset: 0-776781 | Loss: 0.755 | 913 ms/step , 6886.27 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 21:39:42 | Epoch: 1 | Step: 188670 | Dataset: 0-777101 | Loss: 0.607 | 911 ms/step , 6902.67 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 21:39:51 | Epoch: 1 | Step: 188680 | Dataset: 0-777421 | Loss: 0.704 | 913 ms/step , 6892.14 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 21:40:00 | Epoch: 1 | Step: 188690 | Dataset: 0-777741 | Loss: 0.629 | 912 ms/step , 6893.59 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 21:40:10 | Epoch: 1 | Step: 188700 | Dataset: 0-778061 | Loss: 0.686 | 912 ms/step , 6894.05 GFLOP/s , 17932.2 
tokens/s INFO:__main__:2024-11-05 21:40:11 | Validation | Step: 188700 | Val_loss: 0.838 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:40:20 | Epoch: 1 | Step: 188710 | Dataset: 0-778381 | Loss: 0.730 | 913 ms/step , 6886.27 GFLOP/s , 15276.9 tokens/s INFO:__main__:2024-11-05 21:40:29 | Epoch: 1 | Step: 188720 | Dataset: 0-778701 | Loss: 0.670 | 913 ms/step , 6885.23 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 21:40:39 | Epoch: 1 | Step: 188730 | Dataset: 0-779021 | Loss: 0.552 | 913 ms/step , 6892.37 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 21:40:48 | Epoch: 1 | Step: 188740 | Dataset: 0-779341 | Loss: 0.705 | 914 ms/step , 6885.03 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 21:40:57 | Epoch: 1 | Step: 188750 | Dataset: 0-779661 | Loss: 0.606 | 913 ms/step , 6887.58 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 21:41:06 | Epoch: 1 | Step: 188760 | Dataset: 0-779981 | Loss: 0.755 | 912 ms/step , 6896.95 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 21:41:15 | Epoch: 1 | Step: 188770 | Dataset: 0-780301 | Loss: 0.580 | 914 ms/step , 6883.52 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 21:41:24 | Epoch: 1 | Step: 188780 | Dataset: 0-780621 | Loss: 0.675 | 912 ms/step , 6893.44 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 21:41:33 | Epoch: 1 | Step: 188790 | Dataset: 0-780941 | Loss: 0.757 | 912 ms/step , 6895.29 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-05 21:41:43 | Epoch: 1 | Step: 188800 | Dataset: 0-781261 | Loss: 0.710 | 912 ms/step , 6900.02 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 21:41:44 | Validation | Step: 188800 | Val_loss: 0.813 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:41:53 | Epoch: 1 | Step: 188810 | Dataset: 0-781581 | Loss: 0.683 | 913 ms/step , 6889.84 GFLOP/s , 15279.1 tokens/s INFO:__main__:2024-11-05 21:42:02 | Epoch: 1 | Step: 188820 | Dataset: 0-781901 | Loss: 0.781 | 913 ms/step , 6887.40 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 21:42:12 | Epoch: 
1 | Step: 188830 | Dataset: 0-782221 | Loss: 0.671 | 913 ms/step , 6888.51 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 21:42:21 | Epoch: 1 | Step: 188840 | Dataset: 0-782541 | Loss: 0.735 | 913 ms/step , 6889.33 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 21:42:30 | Epoch: 1 | Step: 188850 | Dataset: 0-782861 | Loss: 0.836 | 914 ms/step , 6877.80 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 21:42:39 | Epoch: 1 | Step: 188860 | Dataset: 0-783181 | Loss: 0.742 | 914 ms/step , 6883.80 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 21:42:48 | Epoch: 1 | Step: 188870 | Dataset: 0-783501 | Loss: 0.618 | 913 ms/step , 6889.95 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 21:42:57 | Epoch: 1 | Step: 188880 | Dataset: 0-783821 | Loss: 0.693 | 915 ms/step , 6873.93 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 21:43:06 | Epoch: 1 | Step: 188890 | Dataset: 0-784141 | Loss: 0.619 | 913 ms/step , 6890.17 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 21:43:15 | Epoch: 1 | Step: 188900 | Dataset: 0-784461 | Loss: 0.661 | 912 ms/step , 6893.54 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 21:43:17 | Validation | Step: 188900 | Val_loss: 0.826 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:43:26 | Epoch: 1 | Step: 188910 | Dataset: 0-784781 | Loss: 0.790 | 913 ms/step , 6888.46 GFLOP/s , 15277.9 tokens/s INFO:__main__:2024-11-05 21:43:35 | Epoch: 1 | Step: 188920 | Dataset: 0-785101 | Loss: 0.807 | 913 ms/step , 6891.69 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 21:43:44 | Epoch: 1 | Step: 188930 | Dataset: 0-785421 | Loss: 0.762 | 912 ms/step , 6897.53 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 21:43:54 | Epoch: 1 | Step: 188940 | Dataset: 0-785741 | Loss: 0.775 | 913 ms/step , 6891.94 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 21:44:03 | Epoch: 1 | Step: 188950 | Dataset: 0-786061 | Loss: 0.825 | 913 ms/step , 6890.89 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 21:44:12 | Epoch: 1 | Step: 
188960 | Dataset: 0-786381 | Loss: 0.752 | 911 ms/step , 6900.32 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 21:44:21 | Epoch: 1 | Step: 188970 | Dataset: 0-786701 | Loss: 0.746 | 914 ms/step , 6884.94 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 21:44:30 | Epoch: 1 | Step: 188980 | Dataset: 0-787021 | Loss: 0.713 | 914 ms/step , 6880.05 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-05 21:44:39 | Epoch: 1 | Step: 188990 | Dataset: 0-787341 | Loss: 0.739 | 912 ms/step , 6894.91 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 21:44:48 | Epoch: 1 | Step: 189000 | Dataset: 0-787661 | Loss: 0.655 | 913 ms/step , 6891.62 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 21:44:50 | Validation | Step: 189000 | Val_loss: 0.744 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:44:50 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_214450_step_189000.pt` INFO:__main__:2024-11-05 21:45:00 | Epoch: 1 | Step: 189010 | Dataset: 0-787981 | Loss: 0.725 | 914 ms/step , 6880.98 GFLOP/s , 13790.8 tokens/s INFO:__main__:2024-11-05 21:45:09 | Epoch: 1 | Step: 189020 | Dataset: 0-788301 | Loss: 0.753 | 912 ms/step , 6894.26 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-05 21:45:19 | Epoch: 1 | Step: 189030 | Dataset: 0-788621 | Loss: 0.705 | 914 ms/step , 6883.94 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 21:45:28 | Epoch: 1 | Step: 189040 | Dataset: 0-788941 | Loss: 0.769 | 913 ms/step , 6885.28 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 21:45:37 | Epoch: 1 | Step: 189050 | Dataset: 0-789261 | Loss: 0.700 | 913 ms/step , 6889.30 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 21:45:46 | Epoch: 1 | Step: 189060 | Dataset: 0-789581 | Loss: 0.783 | 913 ms/step , 6888.05 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 21:45:55 | Epoch: 1 | Step: 189070 | Dataset: 0-789901 | Loss: 0.720 | 913 ms/step , 6889.55 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 21:46:04 | Epoch: 1 | Step: 189080 | 
Dataset: 0-790221 | Loss: 0.687 | 913 ms/step , 6887.06 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 21:46:13 | Epoch: 1 | Step: 189090 | Dataset: 0-790541 | Loss: 0.819 | 914 ms/step , 6884.12 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 21:46:23 | Epoch: 1 | Step: 189100 | Dataset: 0-790861 | Loss: 0.717 | 914 ms/step , 6882.94 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 21:46:24 | Validation | Step: 189100 | Val_loss: 0.820 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:46:33 | Epoch: 1 | Step: 189110 | Dataset: 0-791181 | Loss: 0.750 | 913 ms/step , 6887.34 GFLOP/s , 15286.0 tokens/s INFO:__main__:2024-11-05 21:46:42 | Epoch: 1 | Step: 189120 | Dataset: 0-791501 | Loss: 0.637 | 912 ms/step , 6899.68 GFLOP/s , 17947.7 tokens/s INFO:__main__:2024-11-05 21:46:51 | Epoch: 1 | Step: 189130 | Dataset: 0-791821 | Loss: 0.780 | 912 ms/step , 6892.92 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 21:47:01 | Epoch: 1 | Step: 189140 | Dataset: 0-792141 | Loss: 0.714 | 912 ms/step , 6894.29 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 21:47:10 | Epoch: 1 | Step: 189150 | Dataset: 0-792461 | Loss: 0.607 | 914 ms/step , 6881.72 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 21:47:19 | Epoch: 1 | Step: 189160 | Dataset: 0-792781 | Loss: 0.721 | 914 ms/step , 6881.92 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 21:47:28 | Epoch: 1 | Step: 189170 | Dataset: 0-793101 | Loss: 0.765 | 912 ms/step , 6894.98 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 21:47:37 | Epoch: 1 | Step: 189180 | Dataset: 0-793421 | Loss: 0.750 | 913 ms/step , 6886.16 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 21:47:46 | Epoch: 1 | Step: 189190 | Dataset: 0-793741 | Loss: 0.775 | 912 ms/step , 6892.62 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 21:47:55 | Epoch: 1 | Step: 189200 | Dataset: 0-794061 | Loss: 0.847 | 913 ms/step , 6888.14 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 21:47:57 | Validation | Step: 189200 | 
Val_loss: 0.838 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:48:06 | Epoch: 1 | Step: 189210 | Dataset: 0-794381 | Loss: 0.737 | 913 ms/step , 6888.30 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 21:48:15 | Epoch: 1 | Step: 189220 | Dataset: 0-794701 | Loss: 0.693 | 913 ms/step , 6889.55 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 21:48:24 | Epoch: 1 | Step: 189230 | Dataset: 0-795021 | Loss: 0.840 | 914 ms/step , 6877.91 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 21:48:34 | Epoch: 1 | Step: 189240 | Dataset: 0-795341 | Loss: 0.737 | 912 ms/step , 6893.32 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-05 21:48:43 | Epoch: 1 | Step: 189250 | Dataset: 0-795661 | Loss: 0.562 | 912 ms/step , 6896.37 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 21:48:52 | Epoch: 1 | Step: 189260 | Dataset: 0-795981 | Loss: 0.712 | 912 ms/step , 6897.86 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 21:49:01 | Epoch: 1 | Step: 189270 | Dataset: 0-796301 | Loss: 0.649 | 913 ms/step , 6891.35 GFLOP/s , 17946.5 tokens/s INFO:__main__:2024-11-05 21:49:10 | Epoch: 1 | Step: 189280 | Dataset: 0-796621 | Loss: 0.676 | 913 ms/step , 6887.48 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 21:49:19 | Epoch: 1 | Step: 189290 | Dataset: 0-796941 | Loss: 0.692 | 914 ms/step , 6881.20 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 21:49:28 | Epoch: 1 | Step: 189300 | Dataset: 0-797261 | Loss: 0.767 | 914 ms/step , 6884.48 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 21:49:30 | Validation | Step: 189300 | Val_loss: 0.853 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:49:39 | Epoch: 1 | Step: 189310 | Dataset: 0-797581 | Loss: 0.734 | 914 ms/step , 6879.46 GFLOP/s , 15269.6 tokens/s INFO:__main__:2024-11-05 21:49:48 | Epoch: 1 | Step: 189320 | Dataset: 0-797901 | Loss: 0.703 | 912 ms/step , 6896.76 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 21:49:57 | Epoch: 1 | Step: 189330 | Dataset: 0-798221 | Loss: 0.656 | 914 ms/step , 
6883.88 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 21:50:06 | Epoch: 1 | Step: 189340 | Dataset: 0-798541 | Loss: 0.775 | 912 ms/step , 6899.13 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 21:50:16 | Epoch: 1 | Step: 189350 | Dataset: 0-798861 | Loss: 0.741 | 914 ms/step , 6882.91 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 21:50:25 | Epoch: 1 | Step: 189360 | Dataset: 0-799181 | Loss: 0.729 | 912 ms/step , 6897.15 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 21:50:34 | Epoch: 1 | Step: 189370 | Dataset: 0-799501 | Loss: 0.651 | 914 ms/step , 6880.27 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 21:50:43 | Epoch: 1 | Step: 189380 | Dataset: 0-799821 | Loss: 0.693 | 913 ms/step , 6886.00 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 21:50:52 | Epoch: 1 | Step: 189390 | Dataset: 0-800141 | Loss: 0.738 | 912 ms/step , 6898.25 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 21:51:01 | Epoch: 1 | Step: 189400 | Dataset: 0-800461 | Loss: 0.725 | 913 ms/step , 6886.89 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-05 21:51:03 | Validation | Step: 189400 | Val_loss: 0.821 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:51:12 | Epoch: 1 | Step: 189410 | Dataset: 0-800781 | Loss: 0.680 | 913 ms/step , 6889.14 GFLOP/s , 15266.7 tokens/s INFO:__main__:2024-11-05 21:51:21 | Epoch: 1 | Step: 189420 | Dataset: 0-801101 | Loss: 0.691 | 916 ms/step , 6865.65 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-05 21:51:30 | Epoch: 1 | Step: 189430 | Dataset: 0-801421 | Loss: 0.663 | 915 ms/step , 6873.36 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 21:51:39 | Epoch: 1 | Step: 189440 | Dataset: 0-801741 | Loss: 0.682 | 914 ms/step , 6879.95 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 21:51:49 | Epoch: 1 | Step: 189450 | Dataset: 0-802061 | Loss: 0.737 | 917 ms/step , 6862.44 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 21:51:58 | Epoch: 1 | Step: 189460 | Dataset: 0-802381 | Loss: 0.809 | 915 ms/step , 6876.45 
GFLOP/s , 17908.1 tokens/s INFO:__main__:2024-11-05 21:52:07 | Epoch: 1 | Step: 189470 | Dataset: 0-802701 | Loss: 0.730 | 922 ms/step , 6824.89 GFLOP/s , 17695.2 tokens/s INFO:__main__:2024-11-05 21:52:16 | Epoch: 1 | Step: 189480 | Dataset: 0-803021 | Loss: 0.740 | 922 ms/step , 6819.45 GFLOP/s , 17749.8 tokens/s INFO:__main__:2024-11-05 21:52:25 | Epoch: 1 | Step: 189490 | Dataset: 0-803341 | Loss: 0.750 | 914 ms/step , 6879.92 GFLOP/s , 17824.1 tokens/s INFO:__main__:2024-11-05 21:52:35 | Epoch: 1 | Step: 189500 | Dataset: 0-803661 | Loss: 0.624 | 914 ms/step , 6878.32 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 21:52:36 | Validation | Step: 189500 | Val_loss: 0.838 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:52:45 | Epoch: 1 | Step: 189510 | Dataset: 0-803981 | Loss: 0.594 | 913 ms/step , 6891.26 GFLOP/s , 15271.6 tokens/s INFO:__main__:2024-11-05 21:52:54 | Epoch: 1 | Step: 189520 | Dataset: 0-804301 | Loss: 0.905 | 914 ms/step , 6881.92 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 21:53:04 | Epoch: 1 | Step: 189530 | Dataset: 0-804621 | Loss: 0.673 | 914 ms/step , 6879.84 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 21:53:13 | Epoch: 1 | Step: 189540 | Dataset: 0-804941 | Loss: 0.797 | 913 ms/step , 6892.08 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-05 21:53:22 | Epoch: 1 | Step: 189550 | Dataset: 0-805261 | Loss: 0.726 | 915 ms/step , 6875.20 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 21:53:31 | Epoch: 1 | Step: 189560 | Dataset: 0-805581 | Loss: 0.737 | 912 ms/step , 6893.07 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 21:53:40 | Epoch: 1 | Step: 189570 | Dataset: 0-805901 | Loss: 0.699 | 913 ms/step , 6886.78 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 21:53:49 | Epoch: 1 | Step: 189580 | Dataset: 0-806221 | Loss: 0.715 | 913 ms/step , 6887.95 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 21:53:58 | Epoch: 1 | Step: 189590 | Dataset: 0-806541 | Loss: 0.683 | 914 ms/step , 6882.22 GFLOP/s , 
17928.9 tokens/s INFO:__main__:2024-11-05 21:54:08 | Epoch: 1 | Step: 189600 | Dataset: 0-806861 | Loss: 0.803 | 914 ms/step , 6881.61 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 21:54:09 | Validation | Step: 189600 | Val_loss: 0.872 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:54:18 | Epoch: 1 | Step: 189610 | Dataset: 0-807181 | Loss: 0.655 | 912 ms/step , 6892.81 GFLOP/s , 15290.5 tokens/s INFO:__main__:2024-11-05 21:54:27 | Epoch: 1 | Step: 189620 | Dataset: 0-807501 | Loss: 0.780 | 913 ms/step , 6887.55 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 21:54:36 | Epoch: 1 | Step: 189630 | Dataset: 0-807821 | Loss: 0.720 | 912 ms/step , 6893.72 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 21:54:46 | Epoch: 1 | Step: 189640 | Dataset: 0-808141 | Loss: 0.672 | 913 ms/step , 6885.47 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 21:54:55 | Epoch: 1 | Step: 189650 | Dataset: 0-808461 | Loss: 0.654 | 912 ms/step , 6892.66 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 21:55:04 | Epoch: 1 | Step: 189660 | Dataset: 0-808781 | Loss: 0.694 | 913 ms/step , 6889.88 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 21:55:13 | Epoch: 1 | Step: 189670 | Dataset: 0-809101 | Loss: 0.602 | 912 ms/step , 6896.66 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 21:55:22 | Epoch: 1 | Step: 189680 | Dataset: 0-809421 | Loss: 0.739 | 914 ms/step , 6880.58 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 21:55:31 | Epoch: 1 | Step: 189690 | Dataset: 0-809741 | Loss: 0.683 | 913 ms/step , 6890.90 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 21:55:40 | Epoch: 1 | Step: 189700 | Dataset: 0-810061 | Loss: 0.627 | 913 ms/step , 6892.32 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 21:55:42 | Validation | Step: 189700 | Val_loss: 0.866 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:55:51 | Epoch: 1 | Step: 189710 | Dataset: 0-810381 | Loss: 0.776 | 914 ms/step , 6883.81 GFLOP/s , 15269.4 tokens/s INFO:__main__:2024-11-05 21:56:00 
| Epoch: 1 | Step: 189720 | Dataset: 0-810701 | Loss: 0.596 | 912 ms/step , 6892.67 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 21:56:09 | Epoch: 1 | Step: 189730 | Dataset: 0-811021 | Loss: 0.757 | 913 ms/step , 6889.10 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 21:56:19 | Epoch: 1 | Step: 189740 | Dataset: 0-811341 | Loss: 0.732 | 913 ms/step , 6892.45 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 21:56:28 | Epoch: 1 | Step: 189750 | Dataset: 0-811661 | Loss: 0.669 | 915 ms/step , 6871.06 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 21:56:37 | Epoch: 1 | Step: 189760 | Dataset: 0-811981 | Loss: 0.740 | 913 ms/step , 6888.31 GFLOP/s , 17944.2 tokens/s INFO:__main__:2024-11-05 21:56:46 | Epoch: 1 | Step: 189770 | Dataset: 0-812301 | Loss: 0.730 | 913 ms/step , 6887.18 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 21:56:55 | Epoch: 1 | Step: 189780 | Dataset: 0-812621 | Loss: 0.745 | 912 ms/step , 6896.84 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 21:57:04 | Epoch: 1 | Step: 189790 | Dataset: 0-812941 | Loss: 0.699 | 914 ms/step , 6878.87 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 21:57:13 | Epoch: 1 | Step: 189800 | Dataset: 0-813261 | Loss: 0.679 | 913 ms/step , 6890.89 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 21:57:15 | Validation | Step: 189800 | Val_loss: 0.726 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:57:24 | Epoch: 1 | Step: 189810 | Dataset: 0-813581 | Loss: 0.717 | 913 ms/step , 6885.38 GFLOP/s , 15272.7 tokens/s INFO:__main__:2024-11-05 21:57:33 | Epoch: 1 | Step: 189820 | Dataset: 0-813901 | Loss: 0.663 | 913 ms/step , 6888.82 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 21:57:42 | Epoch: 1 | Step: 189830 | Dataset: 0-814221 | Loss: 0.740 | 914 ms/step , 6884.14 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 21:57:52 | Epoch: 1 | Step: 189840 | Dataset: 0-814541 | Loss: 0.639 | 912 ms/step , 6893.76 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 21:58:01 | Epoch: 1 
| Step: 189850 | Dataset: 0-814861 | Loss: 0.779 | 912 ms/step , 6893.12 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 21:58:10 | Epoch: 1 | Step: 189860 | Dataset: 0-815181 | Loss: 0.690 | 916 ms/step , 6867.93 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 21:58:19 | Epoch: 1 | Step: 189870 | Dataset: 0-815501 | Loss: 0.702 | 913 ms/step , 6885.99 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 21:58:28 | Epoch: 1 | Step: 189880 | Dataset: 0-815821 | Loss: 0.827 | 913 ms/step , 6886.11 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 21:58:37 | Epoch: 1 | Step: 189890 | Dataset: 0-816141 | Loss: 0.673 | 912 ms/step , 6894.28 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 21:58:46 | Epoch: 1 | Step: 189900 | Dataset: 0-816461 | Loss: 0.761 | 914 ms/step , 6883.60 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 21:58:48 | Validation | Step: 189900 | Val_loss: 0.713 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 21:58:57 | Epoch: 1 | Step: 189910 | Dataset: 0-816781 | Loss: 0.773 | 914 ms/step , 6881.76 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-05 21:59:06 | Epoch: 1 | Step: 189920 | Dataset: 0-817101 | Loss: 0.731 | 912 ms/step , 6894.00 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 21:59:15 | Epoch: 1 | Step: 189930 | Dataset: 0-817421 | Loss: 0.740 | 914 ms/step , 6882.63 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 21:59:24 | Epoch: 1 | Step: 189940 | Dataset: 0-817741 | Loss: 0.682 | 913 ms/step , 6891.35 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 21:59:34 | Epoch: 1 | Step: 189950 | Dataset: 0-818061 | Loss: 0.769 | 913 ms/step , 6890.65 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 21:59:43 | Epoch: 1 | Step: 189960 | Dataset: 0-818381 | Loss: 0.794 | 913 ms/step , 6887.02 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 21:59:52 | Epoch: 1 | Step: 189970 | Dataset: 0-818701 | Loss: 0.764 | 914 ms/step , 6882.33 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 22:00:01 | Epoch: 1 | Step: 
189980 | Dataset: 0-819021 | Loss: 0.639 | 913 ms/step , 6885.70 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 22:00:10 | Epoch: 1 | Step: 189990 | Dataset: 0-819341 | Loss: 0.687 | 913 ms/step , 6887.32 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-05 22:00:19 | Epoch: 1 | Step: 190000 | Dataset: 0-819661 | Loss: 0.696 | 911 ms/step , 6903.92 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 22:00:21 | Validation | Step: 190000 | Val_loss: 0.716 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:00:21 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_220021_step_190000.pt` INFO:__main__:2024-11-05 22:00:31 | Epoch: 1 | Step: 190010 | Dataset: 0-819981 | Loss: 0.720 | 913 ms/step , 6887.20 GFLOP/s , 13762.2 tokens/s INFO:__main__:2024-11-05 22:00:40 | Epoch: 1 | Step: 190020 | Dataset: 0-820301 | Loss: 0.756 | 913 ms/step , 6891.94 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-05 22:00:49 | Epoch: 1 | Step: 190030 | Dataset: 0-820621 | Loss: 0.794 | 913 ms/step , 6886.17 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 22:00:59 | Epoch: 1 | Step: 190040 | Dataset: 0-820941 | Loss: 0.873 | 913 ms/step , 6885.70 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 22:01:08 | Epoch: 1 | Step: 190050 | Dataset: 0-821261 | Loss: 0.757 | 912 ms/step , 6894.38 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 22:01:17 | Epoch: 1 | Step: 190060 | Dataset: 0-821581 | Loss: 0.790 | 915 ms/step , 6872.28 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 22:01:26 | Epoch: 1 | Step: 190070 | Dataset: 0-821901 | Loss: 0.671 | 913 ms/step , 6888.62 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 22:01:35 | Epoch: 1 | Step: 190080 | Dataset: 0-822221 | Loss: 0.739 | 913 ms/step , 6891.44 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 22:01:44 | Epoch: 1 | Step: 190090 | Dataset: 0-822541 | Loss: 0.719 | 913 ms/step , 6889.61 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 22:01:53 | Epoch: 1 | Step: 190100 | 
Dataset: 0-822861 | Loss: 0.680 | 913 ms/step , 6889.42 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 22:01:55 | Validation | Step: 190100 | Val_loss: 0.711 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:02:04 | Epoch: 1 | Step: 190110 | Dataset: 0-823181 | Loss: 0.745 | 914 ms/step , 6884.92 GFLOP/s , 15275.1 tokens/s INFO:__main__:2024-11-05 22:02:13 | Epoch: 1 | Step: 190120 | Dataset: 0-823501 | Loss: 0.769 | 913 ms/step , 6891.63 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-05 22:02:22 | Epoch: 1 | Step: 190130 | Dataset: 0-823821 | Loss: 0.819 | 913 ms/step , 6886.59 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 22:02:32 | Epoch: 1 | Step: 190140 | Dataset: 0-824141 | Loss: 0.653 | 913 ms/step , 6891.33 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 22:02:41 | Epoch: 1 | Step: 190150 | Dataset: 0-824461 | Loss: 0.805 | 914 ms/step , 6878.22 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 22:02:50 | Epoch: 1 | Step: 190160 | Dataset: 0-824781 | Loss: 0.739 | 913 ms/step , 6892.27 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 22:02:59 | Epoch: 1 | Step: 190170 | Dataset: 0-825101 | Loss: 0.678 | 914 ms/step , 6884.39 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 22:03:08 | Epoch: 1 | Step: 190180 | Dataset: 0-825421 | Loss: 0.716 | 914 ms/step , 6884.49 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 22:03:17 | Epoch: 1 | Step: 190190 | Dataset: 0-825741 | Loss: 0.711 | 913 ms/step , 6889.17 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 22:03:26 | Epoch: 1 | Step: 190200 | Dataset: 0-826061 | Loss: 0.597 | 912 ms/step , 6897.79 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 22:03:28 | Validation | Step: 190200 | Val_loss: 0.691 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:03:37 | Epoch: 1 | Step: 190210 | Dataset: 0-826381 | Loss: 0.550 | 914 ms/step , 6884.93 GFLOP/s , 15280.6 tokens/s INFO:__main__:2024-11-05 22:03:46 | Epoch: 1 | Step: 190220 | Dataset: 0-826701 | Loss: 0.649 | 913 ms/step , 
6889.80 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 22:03:55 | Epoch: 1 | Step: 190230 | Dataset: 0-827021 | Loss: 0.649 | 913 ms/step , 6890.14 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 22:04:04 | Epoch: 1 | Step: 190240 | Dataset: 0-827341 | Loss: 0.609 | 912 ms/step , 6894.69 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-05 22:04:14 | Epoch: 1 | Step: 190250 | Dataset: 0-827661 | Loss: 0.759 | 913 ms/step , 6890.56 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 22:04:23 | Epoch: 1 | Step: 190260 | Dataset: 0-827981 | Loss: 0.803 | 915 ms/step , 6874.18 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 22:04:32 | Epoch: 1 | Step: 190270 | Dataset: 0-828301 | Loss: 0.686 | 913 ms/step , 6885.70 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 22:04:41 | Epoch: 1 | Step: 190280 | Dataset: 0-828621 | Loss: 0.723 | 914 ms/step , 6884.32 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 22:04:50 | Epoch: 1 | Step: 190290 | Dataset: 0-828941 | Loss: 0.742 | 913 ms/step , 6887.46 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 22:04:59 | Epoch: 1 | Step: 190300 | Dataset: 0-829261 | Loss: 0.703 | 913 ms/step , 6889.95 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-05 22:05:01 | Validation | Step: 190300 | Val_loss: 0.759 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:05:10 | Epoch: 1 | Step: 190310 | Dataset: 0-829581 | Loss: 0.754 | 916 ms/step , 6869.77 GFLOP/s , 15283.1 tokens/s INFO:__main__:2024-11-05 22:05:19 | Epoch: 1 | Step: 190320 | Dataset: 0-829901 | Loss: 0.599 | 914 ms/step , 6882.89 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 22:05:28 | Epoch: 1 | Step: 190330 | Dataset: 0-830221 | Loss: 0.761 | 914 ms/step , 6881.67 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 22:05:37 | Epoch: 1 | Step: 190340 | Dataset: 0-830541 | Loss: 0.716 | 912 ms/step , 6896.84 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 22:05:47 | Epoch: 1 | Step: 190350 | Dataset: 0-830861 | Loss: 0.652 | 911 ms/step , 6902.70 
GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-05 22:05:56 | Epoch: 1 | Step: 190360 | Dataset: 0-831181 | Loss: 0.717 | 913 ms/step , 6887.21 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 22:06:05 | Epoch: 1 | Step: 190370 | Dataset: 0-831501 | Loss: 0.772 | 914 ms/step , 6883.14 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 22:06:14 | Epoch: 1 | Step: 190380 | Dataset: 0-831821 | Loss: 0.705 | 912 ms/step , 6896.97 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 22:06:23 | Epoch: 1 | Step: 190390 | Dataset: 0-832141 | Loss: 0.747 | 913 ms/step , 6890.44 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 22:06:32 | Epoch: 1 | Step: 190400 | Dataset: 0-832461 | Loss: 0.652 | 913 ms/step , 6886.69 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 22:06:34 | Validation | Step: 190400 | Val_loss: 0.665 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:06:43 | Epoch: 1 | Step: 190410 | Dataset: 0-832781 | Loss: 0.619 | 913 ms/step , 6885.96 GFLOP/s , 15276.0 tokens/s INFO:__main__:2024-11-05 22:06:52 | Epoch: 1 | Step: 190420 | Dataset: 0-833101 | Loss: 0.705 | 915 ms/step , 6875.18 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 22:07:01 | Epoch: 1 | Step: 190430 | Dataset: 0-833421 | Loss: 0.612 | 914 ms/step , 6881.01 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 22:07:10 | Epoch: 1 | Step: 190440 | Dataset: 0-833741 | Loss: 0.713 | 913 ms/step , 6886.38 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-05 22:07:19 | Epoch: 1 | Step: 190450 | Dataset: 0-834061 | Loss: 0.785 | 914 ms/step , 6880.84 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 22:07:29 | Epoch: 1 | Step: 190460 | Dataset: 0-834381 | Loss: 0.776 | 912 ms/step , 6892.90 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 22:07:38 | Epoch: 1 | Step: 190470 | Dataset: 0-834701 | Loss: 0.600 | 913 ms/step , 6892.15 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 22:07:47 | Epoch: 1 | Step: 190480 | Dataset: 0-835021 | Loss: 0.672 | 914 ms/step , 6884.03 GFLOP/s , 
17931.5 tokens/s INFO:__main__:2024-11-05 22:07:56 | Epoch: 1 | Step: 190490 | Dataset: 0-835341 | Loss: 0.822 | 912 ms/step , 6895.94 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 22:08:05 | Epoch: 1 | Step: 190500 | Dataset: 0-835661 | Loss: 0.705 | 914 ms/step , 6881.48 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 22:08:07 | Validation | Step: 190500 | Val_loss: 0.717 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:08:16 | Epoch: 1 | Step: 190510 | Dataset: 0-835981 | Loss: 0.657 | 912 ms/step , 6896.79 GFLOP/s , 15283.6 tokens/s INFO:__main__:2024-11-05 22:08:25 | Epoch: 1 | Step: 190520 | Dataset: 0-836301 | Loss: 0.698 | 913 ms/step , 6890.31 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 22:08:34 | Epoch: 1 | Step: 190530 | Dataset: 0-836621 | Loss: 0.709 | 913 ms/step , 6891.27 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 22:08:43 | Epoch: 1 | Step: 190540 | Dataset: 0-836941 | Loss: 0.772 | 913 ms/step , 6885.23 GFLOP/s , 17946.8 tokens/s INFO:__main__:2024-11-05 22:08:52 | Epoch: 1 | Step: 190550 | Dataset: 0-837261 | Loss: 0.621 | 912 ms/step , 6893.12 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 22:09:02 | Epoch: 1 | Step: 190560 | Dataset: 0-837581 | Loss: 0.618 | 913 ms/step , 6890.91 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 22:09:11 | Epoch: 1 | Step: 190570 | Dataset: 0-837901 | Loss: 0.646 | 913 ms/step , 6890.36 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 22:09:20 | Epoch: 1 | Step: 190580 | Dataset: 0-838221 | Loss: 0.620 | 913 ms/step , 6890.53 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 22:09:29 | Epoch: 1 | Step: 190590 | Dataset: 0-838541 | Loss: 0.598 | 912 ms/step , 6897.19 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-05 22:09:38 | Epoch: 1 | Step: 190600 | Dataset: 0-838861 | Loss: 0.787 | 915 ms/step , 6875.77 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 22:09:40 | Validation | Step: 190600 | Val_loss: 0.539 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:09:49 
| Epoch: 1 | Step: 190610 | Dataset: 0-839181 | Loss: 0.694 | 913 ms/step , 6889.55 GFLOP/s , 15288.1 tokens/s INFO:__main__:2024-11-05 22:09:58 | Epoch: 1 | Step: 190620 | Dataset: 0-839501 | Loss: 0.686 | 913 ms/step , 6890.35 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 22:10:07 | Epoch: 1 | Step: 190630 | Dataset: 0-839821 | Loss: 0.776 | 913 ms/step , 6888.80 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 22:10:16 | Epoch: 1 | Step: 190640 | Dataset: 0-840141 | Loss: 0.825 | 914 ms/step , 6882.45 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 22:10:25 | Epoch: 1 | Step: 190650 | Dataset: 0-840461 | Loss: 0.696 | 912 ms/step , 6893.02 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-05 22:10:34 | Epoch: 1 | Step: 190660 | Dataset: 0-840781 | Loss: 0.707 | 913 ms/step , 6885.71 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 22:10:44 | Epoch: 1 | Step: 190670 | Dataset: 0-841101 | Loss: 0.810 | 913 ms/step , 6885.70 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-05 22:10:53 | Epoch: 1 | Step: 190680 | Dataset: 0-841421 | Loss: 0.662 | 913 ms/step , 6890.82 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 22:11:02 | Epoch: 1 | Step: 190690 | Dataset: 0-841741 | Loss: 0.745 | 912 ms/step , 6893.80 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 22:11:11 | Epoch: 1 | Step: 190700 | Dataset: 0-842061 | Loss: 0.743 | 914 ms/step , 6883.90 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 22:11:13 | Validation | Step: 190700 | Val_loss: 0.355 | Best_val_loss: 0.3570 INFO:__main__:2024-11-05 22:11:13 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_221113_step_190700.pt` INFO:__main__:2024-11-05 22:11:23 | Epoch: 1 | Step: 190710 | Dataset: 0-842381 | Loss: 0.752 | 914 ms/step , 6882.74 GFLOP/s , 13748.3 tokens/s INFO:__main__:2024-11-05 22:11:32 | Epoch: 1 | Step: 190720 | Dataset: 0-842701 | Loss: 0.663 | 914 ms/step , 6883.99 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 22:11:41 | Epoch: 1 
| Step: 190730 | Dataset: 0-843021 | Loss: 0.813 | 913 ms/step , 6887.23 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 22:11:50 | Epoch: 1 | Step: 190740 | Dataset: 0-843341 | Loss: 0.575 | 913 ms/step , 6885.60 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 22:11:59 | Epoch: 1 | Step: 190750 | Dataset: 0-843661 | Loss: 0.774 | 914 ms/step , 6883.24 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-05 22:12:09 | Epoch: 1 | Step: 190760 | Dataset: 0-843981 | Loss: 0.735 | 915 ms/step , 6877.12 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 22:12:18 | Epoch: 1 | Step: 190770 | Dataset: 0-844301 | Loss: 0.791 | 915 ms/step , 6875.33 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 22:12:27 | Epoch: 1 | Step: 190780 | Dataset: 0-844621 | Loss: 0.733 | 912 ms/step , 6896.77 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 22:12:36 | Epoch: 1 | Step: 190790 | Dataset: 0-844941 | Loss: 0.766 | 913 ms/step , 6887.08 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 22:12:45 | Epoch: 1 | Step: 190800 | Dataset: 0-845261 | Loss: 0.786 | 913 ms/step , 6890.23 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 22:12:47 | Validation | Step: 190800 | Val_loss: 0.367 | Best_val_loss: 0.3552 INFO:__main__:2024-11-05 22:12:56 | Epoch: 1 | Step: 190810 | Dataset: 0-845581 | Loss: 0.628 | 912 ms/step , 6898.60 GFLOP/s , 15284.5 tokens/s INFO:__main__:2024-11-05 22:13:05 | Epoch: 1 | Step: 190820 | Dataset: 0-845901 | Loss: 0.750 | 914 ms/step , 6881.34 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 22:13:14 | Epoch: 1 | Step: 190830 | Dataset: 0-846221 | Loss: 0.710 | 913 ms/step , 6885.08 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 22:13:23 | Epoch: 1 | Step: 190840 | Dataset: 0-846541 | Loss: 0.734 | 912 ms/step , 6895.50 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 22:13:32 | Epoch: 1 | Step: 190850 | Dataset: 0-846861 | Loss: 0.705 | 913 ms/step , 6890.19 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 22:13:42 | Epoch: 1 | Step: 
190860 | Dataset: 0-847181 | Loss: 0.722 | 913 ms/step , 6885.85 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 22:13:51 | Epoch: 1 | Step: 190870 | Dataset: 0-847501 | Loss: 0.586 | 911 ms/step , 6900.78 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 22:14:00 | Epoch: 1 | Step: 190880 | Dataset: 0-847821 | Loss: 0.830 | 915 ms/step , 6876.32 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 22:14:09 | Epoch: 1 | Step: 190890 | Dataset: 0-848141 | Loss: 0.705 | 913 ms/step , 6891.67 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 22:14:18 | Epoch: 1 | Step: 190900 | Dataset: 0-848461 | Loss: 0.652 | 913 ms/step , 6891.06 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 22:14:20 | Validation | Step: 190900 | Val_loss: 0.347 | Best_val_loss: 0.3552 INFO:__main__:2024-11-05 22:14:20 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_221420_step_190900.pt` INFO:__main__:2024-11-05 22:14:30 | Epoch: 1 | Step: 190910 | Dataset: 0-848781 | Loss: 0.727 | 913 ms/step , 6888.91 GFLOP/s , 13750.5 tokens/s INFO:__main__:2024-11-05 22:14:39 | Epoch: 1 | Step: 190920 | Dataset: 0-849101 | Loss: 0.665 | 913 ms/step , 6885.95 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 22:14:48 | Epoch: 1 | Step: 190930 | Dataset: 0-849421 | Loss: 0.713 | 912 ms/step , 6895.95 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 22:14:57 | Epoch: 1 | Step: 190940 | Dataset: 0-849741 | Loss: 0.706 | 913 ms/step , 6889.67 GFLOP/s , 17889.5 tokens/s INFO:__main__:2024-11-05 22:15:07 | Epoch: 1 | Step: 190950 | Dataset: 0-850061 | Loss: 0.801 | 913 ms/step , 6890.21 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 22:15:16 | Epoch: 1 | Step: 190960 | Dataset: 0-850381 | Loss: 0.652 | 912 ms/step , 6894.30 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 22:15:25 | Epoch: 1 | Step: 190970 | Dataset: 0-850701 | Loss: 0.830 | 913 ms/step , 6885.50 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 22:15:34 | Epoch: 1 | Step: 190980 | 
Dataset: 0-851021 | Loss: 0.811 | 914 ms/step , 6884.39 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 22:15:43 | Epoch: 1 | Step: 190990 | Dataset: 0-851341 | Loss: 0.910 | 915 ms/step , 6877.31 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 22:15:52 | Epoch: 1 | Step: 191000 | Dataset: 0-851661 | Loss: 0.767 | 913 ms/step , 6890.36 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 22:15:54 | Validation | Step: 191000 | Val_loss: 0.343 | Best_val_loss: 0.3466 INFO:__main__:2024-11-05 22:15:54 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_221554_step_191000.pt` INFO:__main__:2024-11-05 22:16:04 | Epoch: 1 | Step: 191010 | Dataset: 0-851981 | Loss: 0.777 | 914 ms/step , 6882.56 GFLOP/s , 13757.8 tokens/s INFO:__main__:2024-11-05 22:16:13 | Epoch: 1 | Step: 191020 | Dataset: 0-852301 | Loss: 0.836 | 914 ms/step , 6882.42 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 22:16:22 | Epoch: 1 | Step: 191030 | Dataset: 0-852621 | Loss: 0.922 | 913 ms/step , 6891.14 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 22:16:32 | Epoch: 1 | Step: 191040 | Dataset: 0-852941 | Loss: 0.825 | 914 ms/step , 6884.28 GFLOP/s , 17899.4 tokens/s INFO:__main__:2024-11-05 22:16:41 | Epoch: 1 | Step: 191050 | Dataset: 0-853261 | Loss: 0.932 | 914 ms/step , 6884.88 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 22:16:50 | Epoch: 1 | Step: 191060 | Dataset: 0-853581 | Loss: 0.947 | 913 ms/step , 6891.75 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 22:16:59 | Epoch: 1 | Step: 191070 | Dataset: 0-853901 | Loss: 0.869 | 914 ms/step , 6878.20 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-05 22:17:08 | Epoch: 1 | Step: 191080 | Dataset: 0-854221 | Loss: 0.885 | 913 ms/step , 6888.16 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 22:17:17 | Epoch: 1 | Step: 191090 | Dataset: 0-854541 | Loss: 0.700 | 913 ms/step , 6888.98 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-05 22:17:26 | Epoch: 1 | Step: 191100 | Dataset: 
0-854861 | Loss: 0.708 | 912 ms/step , 6896.24 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 22:17:28 | Validation | Step: 191100 | Val_loss: 0.345 | Best_val_loss: 0.3427 INFO:__main__:2024-11-05 22:17:37 | Epoch: 1 | Step: 191110 | Dataset: 0-855181 | Loss: 0.858 | 913 ms/step , 6891.21 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-05 22:17:46 | Epoch: 1 | Step: 191120 | Dataset: 0-855501 | Loss: 0.778 | 912 ms/step , 6894.33 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 22:17:55 | Epoch: 1 | Step: 191130 | Dataset: 0-855821 | Loss: 0.923 | 914 ms/step , 6881.73 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 22:18:05 | Epoch: 1 | Step: 191140 | Dataset: 0-856141 | Loss: 0.710 | 913 ms/step , 6890.91 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 22:18:14 | Epoch: 1 | Step: 191150 | Dataset: 0-856461 | Loss: 0.865 | 913 ms/step , 6886.29 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 22:18:23 | Epoch: 1 | Step: 191160 | Dataset: 0-856781 | Loss: 0.741 | 913 ms/step , 6888.54 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 22:18:32 | Epoch: 1 | Step: 191170 | Dataset: 0-857101 | Loss: 0.884 | 915 ms/step , 6876.15 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 22:18:41 | Epoch: 1 | Step: 191180 | Dataset: 0-857421 | Loss: 0.784 | 913 ms/step , 6889.24 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 22:18:50 | Epoch: 1 | Step: 191190 | Dataset: 0-857741 | Loss: 0.921 | 915 ms/step , 6873.68 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 22:18:59 | Epoch: 1 | Step: 191200 | Dataset: 0-858061 | Loss: 0.910 | 913 ms/step , 6892.25 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 22:19:01 | Validation | Step: 191200 | Val_loss: 0.355 | Best_val_loss: 0.3427 INFO:__main__:2024-11-05 22:19:10 | Epoch: 1 | Step: 191210 | Dataset: 0-858381 | Loss: 0.723 | 913 ms/step , 6887.15 GFLOP/s , 15279.7 tokens/s INFO:__main__:2024-11-05 22:19:19 | Epoch: 1 | Step: 191220 | Dataset: 0-858701 | Loss: 0.831 | 914 ms/step , 6881.67 
GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 22:19:28 | Epoch: 1 | Step: 191230 | Dataset: 0-859021 | Loss: 0.839 | 914 ms/step , 6880.08 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 22:19:38 | Epoch: 1 | Step: 191240 | Dataset: 0-859341 | Loss: 0.743 | 913 ms/step , 6886.69 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 22:19:47 | Epoch: 1 | Step: 191250 | Dataset: 0-859661 | Loss: 0.816 | 913 ms/step , 6889.17 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 22:19:56 | Epoch: 1 | Step: 191260 | Dataset: 0-859981 | Loss: 0.759 | 914 ms/step , 6881.93 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 22:20:05 | Epoch: 1 | Step: 191270 | Dataset: 0-860301 | Loss: 0.764 | 913 ms/step , 6885.38 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 22:20:14 | Epoch: 1 | Step: 191280 | Dataset: 0-860621 | Loss: 0.903 | 914 ms/step , 6885.04 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 22:20:23 | Epoch: 1 | Step: 191290 | Dataset: 0-860941 | Loss: 0.874 | 914 ms/step , 6881.52 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 22:20:32 | Epoch: 1 | Step: 191300 | Dataset: 0-861261 | Loss: 0.903 | 915 ms/step , 6871.96 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 22:20:34 | Validation | Step: 191300 | Val_loss: 0.325 | Best_val_loss: 0.3427 INFO:__main__:2024-11-05 22:20:34 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_222034_step_191300.pt` INFO:__main__:2024-11-05 22:20:44 | Epoch: 1 | Step: 191310 | Dataset: 0-861581 | Loss: 0.874 | 913 ms/step , 6889.31 GFLOP/s , 13801.2 tokens/s INFO:__main__:2024-11-05 22:20:53 | Epoch: 1 | Step: 191320 | Dataset: 0-861901 | Loss: 0.852 | 914 ms/step , 6879.03 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-05 22:21:03 | Epoch: 1 | Step: 191330 | Dataset: 0-862221 | Loss: 0.746 | 913 ms/step , 6885.32 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 22:21:12 | Epoch: 1 | Step: 191340 | Dataset: 0-862541 | Loss: 0.731 | 914 ms/step , 6882.71 GFLOP/s , 
17891.3 tokens/s INFO:__main__:2024-11-05 22:21:21 | Epoch: 1 | Step: 191350 | Dataset: 0-862861 | Loss: 0.703 | 914 ms/step , 6883.19 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 22:21:30 | Epoch: 1 | Step: 191360 | Dataset: 0-863181 | Loss: 0.782 | 913 ms/step , 6887.32 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 22:21:39 | Epoch: 1 | Step: 191370 | Dataset: 0-863501 | Loss: 0.884 | 913 ms/step , 6886.56 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 22:21:48 | Epoch: 1 | Step: 191380 | Dataset: 0-863821 | Loss: 0.774 | 913 ms/step , 6891.27 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 22:21:57 | Epoch: 1 | Step: 191390 | Dataset: 0-864141 | Loss: 0.811 | 912 ms/step , 6894.00 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-05 22:22:07 | Epoch: 1 | Step: 191400 | Dataset: 0-864461 | Loss: 0.836 | 914 ms/step , 6880.13 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 22:22:08 | Validation | Step: 191400 | Val_loss: 0.336 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:22:17 | Epoch: 1 | Step: 191410 | Dataset: 0-864781 | Loss: 0.748 | 914 ms/step , 6883.20 GFLOP/s , 15282.0 tokens/s INFO:__main__:2024-11-05 22:22:26 | Epoch: 1 | Step: 191420 | Dataset: 0-865101 | Loss: 0.838 | 912 ms/step , 6893.73 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 22:22:35 | Epoch: 1 | Step: 191430 | Dataset: 0-865421 | Loss: 0.795 | 913 ms/step , 6885.80 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 22:22:45 | Epoch: 1 | Step: 191440 | Dataset: 0-865741 | Loss: 0.905 | 913 ms/step , 6887.43 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 22:22:54 | Epoch: 1 | Step: 191450 | Dataset: 0-866061 | Loss: 0.810 | 914 ms/step , 6878.88 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 22:23:03 | Epoch: 1 | Step: 191460 | Dataset: 0-866381 | Loss: 0.858 | 914 ms/step , 6879.87 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 22:23:12 | Epoch: 1 | Step: 191470 | Dataset: 0-866701 | Loss: 0.899 | 914 ms/step , 6878.18 GFLOP/s , 17922.6 
tokens/s INFO:__main__:2024-11-05 22:23:21 | Epoch: 1 | Step: 191480 | Dataset: 0-867021 | Loss: 0.826 | 916 ms/step , 6869.24 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 22:23:30 | Epoch: 1 | Step: 191490 | Dataset: 0-867341 | Loss: 0.802 | 913 ms/step , 6888.42 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 22:23:39 | Epoch: 1 | Step: 191500 | Dataset: 0-867661 | Loss: 0.638 | 912 ms/step , 6895.84 GFLOP/s , 17943.5 tokens/s INFO:__main__:2024-11-05 22:23:41 | Validation | Step: 191500 | Val_loss: 0.333 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:23:50 | Epoch: 1 | Step: 191510 | Dataset: 0-867981 | Loss: 0.784 | 913 ms/step , 6888.20 GFLOP/s , 15275.7 tokens/s INFO:__main__:2024-11-05 22:23:59 | Epoch: 1 | Step: 191520 | Dataset: 0-868301 | Loss: 1.017 | 914 ms/step , 6882.47 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 22:24:08 | Epoch: 1 | Step: 191530 | Dataset: 0-868621 | Loss: 0.934 | 913 ms/step , 6892.34 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 22:24:18 | Epoch: 1 | Step: 191540 | Dataset: 0-868941 | Loss: 0.877 | 913 ms/step , 6886.32 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 22:24:27 | Epoch: 1 | Step: 191550 | Dataset: 0-869261 | Loss: 0.682 | 911 ms/step , 6901.58 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 22:24:36 | Epoch: 1 | Step: 191560 | Dataset: 0-869581 | Loss: 0.761 | 913 ms/step , 6885.77 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 22:24:45 | Epoch: 1 | Step: 191570 | Dataset: 0-869901 | Loss: 0.738 | 914 ms/step , 6878.79 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 22:24:54 | Epoch: 1 | Step: 191580 | Dataset: 0-870221 | Loss: 0.953 | 914 ms/step , 6877.96 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 22:25:03 | Epoch: 1 | Step: 191590 | Dataset: 0-870541 | Loss: 0.772 | 912 ms/step , 6896.20 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 22:25:12 | Epoch: 1 | Step: 191600 | Dataset: 0-870861 | Loss: 0.411 | 911 ms/step , 6903.66 GFLOP/s , 17931.9 tokens/s 
INFO:__main__:2024-11-05 22:25:14 | Validation | Step: 191600 | Val_loss: 0.346 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:25:23 | Epoch: 1 | Step: 191610 | Dataset: 0-871181 | Loss: 0.882 | 914 ms/step , 6880.65 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-05 22:25:32 | Epoch: 1 | Step: 191620 | Dataset: 0-871501 | Loss: 0.703 | 914 ms/step , 6878.24 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 22:25:41 | Epoch: 1 | Step: 191630 | Dataset: 0-871821 | Loss: 0.814 | 914 ms/step , 6880.36 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 22:25:51 | Epoch: 1 | Step: 191640 | Dataset: 0-872141 | Loss: 0.752 | 913 ms/step , 6886.12 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-05 22:26:00 | Epoch: 1 | Step: 191650 | Dataset: 0-872461 | Loss: 0.652 | 913 ms/step , 6885.78 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-05 22:26:09 | Epoch: 1 | Step: 191660 | Dataset: 0-872781 | Loss: 0.687 | 914 ms/step , 6884.24 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 22:26:18 | Epoch: 1 | Step: 191670 | Dataset: 0-873101 | Loss: 0.728 | 913 ms/step , 6887.40 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 22:26:27 | Epoch: 1 | Step: 191680 | Dataset: 0-873421 | Loss: 0.871 | 915 ms/step , 6876.20 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 22:26:36 | Epoch: 1 | Step: 191690 | Dataset: 0-873741 | Loss: 0.838 | 915 ms/step , 6873.90 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 22:26:45 | Epoch: 1 | Step: 191700 | Dataset: 0-874061 | Loss: 0.660 | 912 ms/step , 6894.99 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 22:26:47 | Validation | Step: 191700 | Val_loss: 0.348 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:26:56 | Epoch: 1 | Step: 191710 | Dataset: 0-874381 | Loss: 0.797 | 914 ms/step , 6879.82 GFLOP/s , 15271.7 tokens/s INFO:__main__:2024-11-05 22:27:05 | Epoch: 1 | Step: 191720 | Dataset: 0-874701 | Loss: 0.863 | 914 ms/step , 6881.91 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 22:27:14 | Epoch: 1 | 
Step: 191730 | Dataset: 0-875021 | Loss: 0.834 | 912 ms/step , 6895.30 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 22:27:24 | Epoch: 1 | Step: 191740 | Dataset: 0-875341 | Loss: 0.816 | 913 ms/step , 6891.52 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 22:27:33 | Epoch: 1 | Step: 191750 | Dataset: 0-875661 | Loss: 0.746 | 912 ms/step , 6898.40 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 22:27:42 | Epoch: 1 | Step: 191760 | Dataset: 0-875981 | Loss: 0.829 | 913 ms/step , 6891.02 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 22:27:51 | Epoch: 1 | Step: 191770 | Dataset: 0-876301 | Loss: 0.802 | 913 ms/step , 6892.36 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 22:28:00 | Epoch: 1 | Step: 191780 | Dataset: 0-876621 | Loss: 0.759 | 913 ms/step , 6891.66 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 22:28:09 | Epoch: 1 | Step: 191790 | Dataset: 0-876941 | Loss: 0.965 | 915 ms/step , 6870.62 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 22:28:18 | Epoch: 1 | Step: 191800 | Dataset: 0-877261 | Loss: 0.819 | 914 ms/step , 6880.51 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 22:28:20 | Validation | Step: 191800 | Val_loss: 0.353 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:28:29 | Epoch: 1 | Step: 191810 | Dataset: 0-877581 | Loss: 0.901 | 913 ms/step , 6888.91 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-05 22:28:38 | Epoch: 1 | Step: 191820 | Dataset: 0-877901 | Loss: 0.711 | 914 ms/step , 6878.47 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 22:28:47 | Epoch: 1 | Step: 191830 | Dataset: 0-878221 | Loss: 0.827 | 915 ms/step , 6876.12 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-05 22:28:56 | Epoch: 1 | Step: 191840 | Dataset: 0-878541 | Loss: 0.624 | 913 ms/step , 6890.03 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-05 22:29:06 | Epoch: 1 | Step: 191850 | Dataset: 0-878861 | Loss: 0.859 | 914 ms/step , 6883.02 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 22:29:15 | Epoch: 1 | Step: 
191860 | Dataset: 0-879181 | Loss: 0.893 | 913 ms/step , 6887.85 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 22:29:24 | Epoch: 1 | Step: 191870 | Dataset: 0-879501 | Loss: 0.838 | 915 ms/step , 6872.35 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 22:29:33 | Epoch: 1 | Step: 191880 | Dataset: 0-879821 | Loss: 0.747 | 913 ms/step , 6887.20 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 22:29:42 | Epoch: 1 | Step: 191890 | Dataset: 0-880141 | Loss: 0.795 | 913 ms/step , 6889.43 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 22:29:51 | Epoch: 1 | Step: 191900 | Dataset: 0-880461 | Loss: 0.850 | 915 ms/step , 6871.29 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 22:29:53 | Validation | Step: 191900 | Val_loss: 0.354 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:30:02 | Epoch: 1 | Step: 191910 | Dataset: 0-880781 | Loss: 0.902 | 914 ms/step , 6884.41 GFLOP/s , 15276.8 tokens/s INFO:__main__:2024-11-05 22:30:11 | Epoch: 1 | Step: 191920 | Dataset: 0-881101 | Loss: 0.867 | 913 ms/step , 6886.55 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 22:30:20 | Epoch: 1 | Step: 191930 | Dataset: 0-881421 | Loss: 0.656 | 913 ms/step , 6890.18 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 22:30:29 | Epoch: 1 | Step: 191940 | Dataset: 0-881741 | Loss: 0.835 | 914 ms/step , 6883.70 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 22:30:39 | Epoch: 1 | Step: 191950 | Dataset: 0-882061 | Loss: 0.768 | 915 ms/step , 6877.42 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-05 22:30:48 | Epoch: 1 | Step: 191960 | Dataset: 0-882381 | Loss: 0.781 | 914 ms/step , 6883.44 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 22:30:57 | Epoch: 1 | Step: 191970 | Dataset: 0-882701 | Loss: 0.786 | 913 ms/step , 6891.16 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 22:31:06 | Epoch: 1 | Step: 191980 | Dataset: 0-883021 | Loss: 0.752 | 913 ms/step , 6888.12 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 22:31:15 | Epoch: 1 | Step: 191990 | 
Dataset: 0-883341 | Loss: 0.895 | 913 ms/step , 6890.99 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 22:31:24 | Epoch: 1 | Step: 192000 | Dataset: 0-883661 | Loss: 0.838 | 915 ms/step , 6874.38 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 22:31:26 | Validation | Step: 192000 | Val_loss: 0.366 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:31:26 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_223126_step_192000.pt` INFO:__main__:2024-11-05 22:31:36 | Epoch: 1 | Step: 192010 | Dataset: 0-883981 | Loss: 0.853 | 914 ms/step , 6883.69 GFLOP/s , 13801.1 tokens/s INFO:__main__:2024-11-05 22:31:45 | Epoch: 1 | Step: 192020 | Dataset: 0-884301 | Loss: 0.852 | 913 ms/step , 6885.47 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 22:31:54 | Epoch: 1 | Step: 192030 | Dataset: 0-884621 | Loss: 0.785 | 913 ms/step , 6889.08 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 22:32:04 | Epoch: 1 | Step: 192040 | Dataset: 0-884941 | Loss: 0.929 | 912 ms/step , 6893.65 GFLOP/s , 17846.0 tokens/s INFO:__main__:2024-11-05 22:32:13 | Epoch: 1 | Step: 192050 | Dataset: 0-885261 | Loss: 0.971 | 915 ms/step , 6871.71 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-05 22:32:22 | Epoch: 1 | Step: 192060 | Dataset: 0-885581 | Loss: 0.856 | 913 ms/step , 6886.19 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 22:32:31 | Epoch: 1 | Step: 192070 | Dataset: 0-885901 | Loss: 0.523 | 914 ms/step , 6882.83 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 22:32:40 | Epoch: 1 | Step: 192080 | Dataset: 0-886221 | Loss: 0.841 | 914 ms/step , 6880.98 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-05 22:32:49 | Epoch: 1 | Step: 192090 | Dataset: 0-886541 | Loss: 0.819 | 913 ms/step , 6889.15 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-05 22:32:58 | Epoch: 1 | Step: 192100 | Dataset: 0-886861 | Loss: 0.891 | 914 ms/step , 6879.90 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 22:33:00 | Validation | Step: 192100 | Val_loss: 
0.330 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:33:09 | Epoch: 1 | Step: 192110 | Dataset: 0-887181 | Loss: 0.848 | 913 ms/step , 6892.14 GFLOP/s , 15271.0 tokens/s INFO:__main__:2024-11-05 22:33:18 | Epoch: 1 | Step: 192120 | Dataset: 0-887501 | Loss: 0.546 | 913 ms/step , 6889.25 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 22:33:27 | Epoch: 1 | Step: 192130 | Dataset: 0-887821 | Loss: 0.851 | 914 ms/step , 6883.43 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 22:33:37 | Epoch: 1 | Step: 192140 | Dataset: 0-888141 | Loss: 0.699 | 913 ms/step , 6886.89 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-05 22:33:46 | Epoch: 1 | Step: 192150 | Dataset: 0-888461 | Loss: 0.862 | 914 ms/step , 6881.33 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 22:33:55 | Epoch: 1 | Step: 192160 | Dataset: 0-888781 | Loss: 0.753 | 915 ms/step , 6877.08 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-05 22:34:04 | Epoch: 1 | Step: 192170 | Dataset: 0-889101 | Loss: 0.878 | 914 ms/step , 6884.73 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 22:34:13 | Epoch: 1 | Step: 192180 | Dataset: 0-889421 | Loss: 0.621 | 913 ms/step , 6887.06 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 22:34:22 | Epoch: 1 | Step: 192190 | Dataset: 0-889741 | Loss: 0.851 | 913 ms/step , 6889.07 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 22:34:31 | Epoch: 1 | Step: 192200 | Dataset: 0-890061 | Loss: 0.862 | 913 ms/step , 6891.22 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 22:34:33 | Validation | Step: 192200 | Val_loss: 0.388 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:34:42 | Epoch: 1 | Step: 192210 | Dataset: 0-890381 | Loss: 0.661 | 913 ms/step , 6887.35 GFLOP/s , 15267.4 tokens/s INFO:__main__:2024-11-05 22:34:51 | Epoch: 1 | Step: 192220 | Dataset: 0-890701 | Loss: 0.895 | 914 ms/step , 6881.56 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 22:35:00 | Epoch: 1 | Step: 192230 | Dataset: 0-891021 | Loss: 0.631 | 913 ms/step , 6887.51 GFLOP/s 
, 17925.6 tokens/s INFO:__main__:2024-11-05 22:35:10 | Epoch: 1 | Step: 192240 | Dataset: 0-891341 | Loss: 0.869 | 914 ms/step , 6878.65 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-05 22:35:19 | Epoch: 1 | Step: 192250 | Dataset: 0-891661 | Loss: 0.736 | 913 ms/step , 6889.07 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 22:35:28 | Epoch: 1 | Step: 192260 | Dataset: 0-891981 | Loss: 0.750 | 914 ms/step , 6884.71 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 22:35:37 | Epoch: 1 | Step: 192270 | Dataset: 0-892301 | Loss: 0.768 | 914 ms/step , 6880.41 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 22:35:46 | Epoch: 1 | Step: 192280 | Dataset: 0-892621 | Loss: 0.827 | 914 ms/step , 6882.40 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 22:35:55 | Epoch: 1 | Step: 192290 | Dataset: 0-892941 | Loss: 0.885 | 914 ms/step , 6881.59 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 22:36:04 | Epoch: 1 | Step: 192300 | Dataset: 0-893261 | Loss: 0.860 | 915 ms/step , 6874.22 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 22:36:06 | Validation | Step: 192300 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:36:15 | Epoch: 1 | Step: 192310 | Dataset: 0-893581 | Loss: 0.827 | 914 ms/step , 6879.76 GFLOP/s , 15275.9 tokens/s INFO:__main__:2024-11-05 22:36:24 | Epoch: 1 | Step: 192320 | Dataset: 0-893901 | Loss: 0.788 | 914 ms/step , 6883.66 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-05 22:36:33 | Epoch: 1 | Step: 192330 | Dataset: 0-894221 | Loss: 0.791 | 912 ms/step , 6892.90 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 22:36:43 | Epoch: 1 | Step: 192340 | Dataset: 0-894541 | Loss: 0.604 | 913 ms/step , 6886.22 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-05 22:36:52 | Epoch: 1 | Step: 192350 | Dataset: 0-894861 | Loss: 0.935 | 914 ms/step , 6881.58 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-05 22:37:01 | Epoch: 1 | Step: 192360 | Dataset: 0-895181 | Loss: 0.741 | 913 ms/step , 6887.65 GFLOP/s , 17931.9 
tokens/s INFO:__main__:2024-11-05 22:37:10 | Epoch: 1 | Step: 192370 | Dataset: 0-895501 | Loss: 0.890 | 912 ms/step , 6893.50 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 22:37:19 | Epoch: 1 | Step: 192380 | Dataset: 0-895821 | Loss: 0.796 | 913 ms/step , 6887.13 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-05 22:37:28 | Epoch: 1 | Step: 192390 | Dataset: 0-896141 | Loss: 0.866 | 914 ms/step , 6884.12 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-05 22:37:37 | Epoch: 1 | Step: 192400 | Dataset: 0-896461 | Loss: 0.713 | 912 ms/step , 6892.70 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 22:37:39 | Validation | Step: 192400 | Val_loss: 0.714 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:37:48 | Epoch: 1 | Step: 192410 | Dataset: 0-896781 | Loss: 0.860 | 913 ms/step , 6891.83 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-05 22:37:57 | Epoch: 1 | Step: 192420 | Dataset: 0-897101 | Loss: 0.855 | 912 ms/step , 6894.81 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 22:38:06 | Epoch: 1 | Step: 192430 | Dataset: 0-897421 | Loss: 0.874 | 914 ms/step , 6884.24 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 22:38:16 | Epoch: 1 | Step: 192440 | Dataset: 0-897741 | Loss: 0.756 | 913 ms/step , 6887.45 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 22:38:25 | Epoch: 1 | Step: 192450 | Dataset: 0-898061 | Loss: 0.914 | 916 ms/step , 6869.93 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 22:38:34 | Epoch: 1 | Step: 192460 | Dataset: 0-898381 | Loss: 0.928 | 914 ms/step , 6881.69 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 22:38:43 | Epoch: 1 | Step: 192470 | Dataset: 0-898701 | Loss: 0.823 | 914 ms/step , 6880.51 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 22:38:52 | Epoch: 1 | Step: 192480 | Dataset: 0-899021 | Loss: 0.750 | 914 ms/step , 6884.65 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 22:39:01 | Epoch: 1 | Step: 192490 | Dataset: 0-899341 | Loss: 0.814 | 913 ms/step , 6891.26 GFLOP/s , 17941.6 tokens/s 
INFO:__main__:2024-11-05 22:39:10 | Epoch: 1 | Step: 192500 | Dataset: 0-899661 | Loss: 0.905 | 913 ms/step , 6891.75 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 22:39:12 | Validation | Step: 192500 | Val_loss: 0.735 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:39:21 | Epoch: 1 | Step: 192510 | Dataset: 0-899981 | Loss: 0.726 | 914 ms/step , 6884.13 GFLOP/s , 15277.9 tokens/s INFO:__main__:2024-11-05 22:39:30 | Epoch: 1 | Step: 192520 | Dataset: 0-900301 | Loss: 0.887 | 915 ms/step , 6877.44 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 22:39:39 | Epoch: 1 | Step: 192530 | Dataset: 0-900621 | Loss: 0.821 | 912 ms/step , 6895.69 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 22:39:48 | Epoch: 1 | Step: 192540 | Dataset: 0-900941 | Loss: 0.708 | 913 ms/step , 6891.63 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 22:39:58 | Epoch: 1 | Step: 192550 | Dataset: 0-901261 | Loss: 0.805 | 912 ms/step , 6894.70 GFLOP/s , 17943.8 tokens/s INFO:__main__:2024-11-05 22:40:07 | Epoch: 1 | Step: 192560 | Dataset: 0-901581 | Loss: 0.845 | 915 ms/step , 6877.40 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-05 22:40:16 | Epoch: 1 | Step: 192570 | Dataset: 0-901901 | Loss: 0.512 | 913 ms/step , 6891.23 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 22:40:25 | Epoch: 1 | Step: 192580 | Dataset: 0-902221 | Loss: 0.778 | 913 ms/step , 6888.66 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-05 22:40:34 | Epoch: 1 | Step: 192590 | Dataset: 0-902541 | Loss: 0.883 | 914 ms/step , 6884.31 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 22:40:43 | Epoch: 1 | Step: 192600 | Dataset: 0-902861 | Loss: 0.788 | 914 ms/step , 6884.00 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-05 22:40:45 | Validation | Step: 192600 | Val_loss: 0.748 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:40:54 | Epoch: 1 | Step: 192610 | Dataset: 0-903181 | Loss: 0.680 | 912 ms/step , 6893.19 GFLOP/s , 15276.4 tokens/s INFO:__main__:2024-11-05 22:41:03 | Epoch: 1 | 
Step: 192620 | Dataset: 0-903501 | Loss: 0.912 | 914 ms/step , 6883.45 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 22:41:12 | Epoch: 1 | Step: 192630 | Dataset: 0-903821 | Loss: 0.682 | 913 ms/step , 6892.28 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 22:41:21 | Epoch: 1 | Step: 192640 | Dataset: 0-904141 | Loss: 0.862 | 913 ms/step , 6886.34 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 22:41:31 | Epoch: 1 | Step: 192650 | Dataset: 0-904461 | Loss: 0.874 | 914 ms/step , 6884.40 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-05 22:41:40 | Epoch: 1 | Step: 192660 | Dataset: 0-904781 | Loss: 0.919 | 915 ms/step , 6873.99 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 22:41:49 | Epoch: 1 | Step: 192670 | Dataset: 0-905101 | Loss: 0.841 | 915 ms/step , 6873.00 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 22:41:58 | Epoch: 1 | Step: 192680 | Dataset: 0-905421 | Loss: 0.890 | 913 ms/step , 6887.70 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 22:42:07 | Epoch: 1 | Step: 192690 | Dataset: 0-905741 | Loss: 0.858 | 914 ms/step , 6881.30 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-05 22:42:16 | Epoch: 1 | Step: 192700 | Dataset: 0-906061 | Loss: 0.699 | 914 ms/step , 6880.51 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 22:42:18 | Validation | Step: 192700 | Val_loss: 0.686 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:42:27 | Epoch: 1 | Step: 192710 | Dataset: 0-906381 | Loss: 0.837 | 914 ms/step , 6883.52 GFLOP/s , 15281.0 tokens/s INFO:__main__:2024-11-05 22:42:36 | Epoch: 1 | Step: 192720 | Dataset: 0-906701 | Loss: 0.695 | 913 ms/step , 6889.77 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 22:42:45 | Epoch: 1 | Step: 192730 | Dataset: 0-907021 | Loss: 0.801 | 913 ms/step , 6892.45 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-05 22:42:54 | Epoch: 1 | Step: 192740 | Dataset: 0-907341 | Loss: 0.746 | 913 ms/step , 6892.39 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 22:43:04 | Epoch: 1 | Step: 
192750 | Dataset: 0-907661 | Loss: 0.709 | 914 ms/step , 6884.03 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 22:43:13 | Epoch: 1 | Step: 192760 | Dataset: 0-907981 | Loss: 0.603 | 913 ms/step , 6891.55 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 22:43:22 | Epoch: 1 | Step: 192770 | Dataset: 0-908301 | Loss: 0.850 | 914 ms/step , 6882.20 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 22:43:31 | Epoch: 1 | Step: 192780 | Dataset: 0-908621 | Loss: 0.735 | 913 ms/step , 6886.99 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 22:43:40 | Epoch: 1 | Step: 192790 | Dataset: 0-908941 | Loss: 0.881 | 916 ms/step , 6864.45 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 22:43:49 | Epoch: 1 | Step: 192800 | Dataset: 0-909261 | Loss: 0.726 | 912 ms/step , 6893.30 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 22:43:51 | Validation | Step: 192800 | Val_loss: 0.742 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:44:00 | Epoch: 1 | Step: 192810 | Dataset: 0-909581 | Loss: 0.785 | 913 ms/step , 6891.85 GFLOP/s , 15272.6 tokens/s INFO:__main__:2024-11-05 22:44:09 | Epoch: 1 | Step: 192820 | Dataset: 0-909901 | Loss: 0.860 | 913 ms/step , 6886.94 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 22:44:18 | Epoch: 1 | Step: 192830 | Dataset: 0-910221 | Loss: 0.756 | 914 ms/step , 6878.76 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 22:44:27 | Epoch: 1 | Step: 192840 | Dataset: 0-910541 | Loss: 0.916 | 914 ms/step , 6883.96 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-05 22:44:36 | Epoch: 1 | Step: 192850 | Dataset: 0-910861 | Loss: 0.846 | 912 ms/step , 6894.11 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 22:44:46 | Epoch: 1 | Step: 192860 | Dataset: 0-911181 | Loss: 0.717 | 912 ms/step , 6893.56 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 22:44:55 | Epoch: 1 | Step: 192870 | Dataset: 0-911501 | Loss: 0.773 | 914 ms/step , 6878.28 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 22:45:04 | Epoch: 1 | Step: 192880 | 
Dataset: 0-911821 | Loss: 0.792 | 913 ms/step , 6891.96 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 22:45:13 | Epoch: 1 | Step: 192890 | Dataset: 0-912141 | Loss: 0.837 | 914 ms/step , 6881.28 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-05 22:45:22 | Epoch: 1 | Step: 192900 | Dataset: 0-912461 | Loss: 0.732 | 913 ms/step , 6885.39 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 22:45:24 | Validation | Step: 192900 | Val_loss: 0.661 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:45:33 | Epoch: 1 | Step: 192910 | Dataset: 0-912781 | Loss: 0.753 | 912 ms/step , 6894.14 GFLOP/s , 15286.4 tokens/s INFO:__main__:2024-11-05 22:45:42 | Epoch: 1 | Step: 192920 | Dataset: 0-913101 | Loss: 0.770 | 920 ms/step , 6838.31 GFLOP/s , 17807.6 tokens/s INFO:__main__:2024-11-05 22:45:51 | Epoch: 1 | Step: 192930 | Dataset: 0-913421 | Loss: 0.802 | 913 ms/step , 6890.62 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 22:46:00 | Epoch: 1 | Step: 192940 | Dataset: 0-913741 | Loss: 0.826 | 913 ms/step , 6889.56 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 22:46:10 | Epoch: 1 | Step: 192950 | Dataset: 0-914061 | Loss: 0.779 | 913 ms/step , 6885.99 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-05 22:46:19 | Epoch: 1 | Step: 192960 | Dataset: 0-914381 | Loss: 0.649 | 912 ms/step , 6894.80 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 22:46:28 | Epoch: 1 | Step: 192970 | Dataset: 0-914701 | Loss: 0.747 | 914 ms/step , 6878.77 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 22:46:37 | Epoch: 1 | Step: 192980 | Dataset: 0-915021 | Loss: 0.774 | 913 ms/step , 6892.01 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 22:46:46 | Epoch: 1 | Step: 192990 | Dataset: 0-915341 | Loss: 0.778 | 913 ms/step , 6888.75 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 22:46:55 | Epoch: 1 | Step: 193000 | Dataset: 0-915661 | Loss: 0.775 | 915 ms/step , 6877.24 GFLOP/s , 17906.3 tokens/s INFO:__main__:2024-11-05 22:46:57 | Validation | Step: 193000 | 
Val_loss: 0.687 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:46:57 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_224657_step_193000.pt` INFO:__main__:2024-11-05 22:47:07 | Epoch: 1 | Step: 193010 | Dataset: 0-915981 | Loss: 0.761 | 913 ms/step , 6888.40 GFLOP/s , 13805.3 tokens/s INFO:__main__:2024-11-05 22:47:16 | Epoch: 1 | Step: 193020 | Dataset: 0-916301 | Loss: 0.766 | 913 ms/step , 6885.87 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 22:47:25 | Epoch: 1 | Step: 193030 | Dataset: 0-916621 | Loss: 0.648 | 912 ms/step , 6896.63 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 22:47:35 | Epoch: 1 | Step: 193040 | Dataset: 0-916941 | Loss: 0.762 | 915 ms/step , 6871.99 GFLOP/s , 17871.1 tokens/s INFO:__main__:2024-11-05 22:47:44 | Epoch: 1 | Step: 193050 | Dataset: 0-917261 | Loss: 0.677 | 913 ms/step , 6885.80 GFLOP/s , 17904.4 tokens/s INFO:__main__:2024-11-05 22:47:53 | Epoch: 1 | Step: 193060 | Dataset: 0-917581 | Loss: 0.811 | 914 ms/step , 6881.44 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-05 22:48:02 | Epoch: 1 | Step: 193070 | Dataset: 0-917901 | Loss: 0.874 | 912 ms/step , 6893.58 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-05 22:48:11 | Epoch: 1 | Step: 193080 | Dataset: 0-918221 | Loss: 0.885 | 913 ms/step , 6888.17 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 22:48:20 | Epoch: 1 | Step: 193090 | Dataset: 0-918541 | Loss: 0.799 | 913 ms/step , 6885.98 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 22:48:29 | Epoch: 1 | Step: 193100 | Dataset: 0-918861 | Loss: 0.836 | 914 ms/step , 6884.67 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 22:48:31 | Validation | Step: 193100 | Val_loss: 0.697 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:48:40 | Epoch: 1 | Step: 193110 | Dataset: 0-919181 | Loss: 0.894 | 913 ms/step , 6888.84 GFLOP/s , 15266.7 tokens/s INFO:__main__:2024-11-05 22:48:49 | Epoch: 1 | Step: 193120 | Dataset: 0-919501 | Loss: 0.797 | 914 ms/step , 
6877.96 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-05 22:48:58 | Epoch: 1 | Step: 193130 | Dataset: 0-919821 | Loss: 0.869 | 914 ms/step , 6879.19 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 22:49:08 | Epoch: 1 | Step: 193140 | Dataset: 0-920141 | Loss: 0.886 | 914 ms/step , 6878.95 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 22:49:17 | Epoch: 1 | Step: 193150 | Dataset: 0-920461 | Loss: 0.823 | 913 ms/step , 6891.17 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 22:49:26 | Epoch: 1 | Step: 193160 | Dataset: 0-920781 | Loss: 0.799 | 913 ms/step , 6892.03 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 22:49:35 | Epoch: 1 | Step: 193170 | Dataset: 0-921101 | Loss: 0.937 | 915 ms/step , 6874.70 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 22:49:44 | Epoch: 1 | Step: 193180 | Dataset: 0-921421 | Loss: 0.780 | 913 ms/step , 6885.09 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 22:49:53 | Epoch: 1 | Step: 193190 | Dataset: 0-921741 | Loss: 0.968 | 914 ms/step , 6879.77 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 22:50:02 | Epoch: 1 | Step: 193200 | Dataset: 0-922061 | Loss: 0.813 | 913 ms/step , 6889.29 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 22:50:04 | Validation | Step: 193200 | Val_loss: 0.754 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:50:13 | Epoch: 1 | Step: 193210 | Dataset: 0-922381 | Loss: 0.810 | 914 ms/step , 6879.94 GFLOP/s , 15268.4 tokens/s INFO:__main__:2024-11-05 22:50:22 | Epoch: 1 | Step: 193220 | Dataset: 0-922701 | Loss: 0.882 | 915 ms/step , 6872.41 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 22:50:31 | Epoch: 1 | Step: 193230 | Dataset: 0-923021 | Loss: 0.857 | 914 ms/step , 6880.54 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 22:50:41 | Epoch: 1 | Step: 193240 | Dataset: 0-923341 | Loss: 0.569 | 914 ms/step , 6880.73 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-05 22:50:50 | Epoch: 1 | Step: 193250 | Dataset: 0-923661 | Loss: 0.772 | 914 ms/step , 6882.97 
GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 22:50:59 | Epoch: 1 | Step: 193260 | Dataset: 0-923981 | Loss: 0.745 | 915 ms/step , 6872.80 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 22:51:08 | Epoch: 1 | Step: 193270 | Dataset: 0-924301 | Loss: 0.870 | 913 ms/step , 6890.64 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 22:51:17 | Epoch: 1 | Step: 193280 | Dataset: 0-924621 | Loss: 0.837 | 914 ms/step , 6883.01 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 22:51:26 | Epoch: 1 | Step: 193290 | Dataset: 0-924941 | Loss: 0.857 | 913 ms/step , 6886.59 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 22:51:35 | Epoch: 1 | Step: 193300 | Dataset: 0-925261 | Loss: 0.846 | 914 ms/step , 6882.06 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 22:51:37 | Validation | Step: 193300 | Val_loss: 0.713 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:51:46 | Epoch: 1 | Step: 193310 | Dataset: 0-925581 | Loss: 0.899 | 913 ms/step , 6890.61 GFLOP/s , 15276.3 tokens/s INFO:__main__:2024-11-05 22:51:55 | Epoch: 1 | Step: 193320 | Dataset: 0-925901 | Loss: 0.763 | 911 ms/step , 6900.66 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 22:52:04 | Epoch: 1 | Step: 193330 | Dataset: 0-926221 | Loss: 0.814 | 914 ms/step , 6883.55 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 22:52:13 | Epoch: 1 | Step: 193340 | Dataset: 0-926541 | Loss: 0.819 | 914 ms/step , 6880.40 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 22:52:23 | Epoch: 1 | Step: 193350 | Dataset: 0-926861 | Loss: 0.654 | 912 ms/step , 6892.76 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-05 22:52:32 | Epoch: 1 | Step: 193360 | Dataset: 0-927181 | Loss: 0.798 | 913 ms/step , 6891.09 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 22:52:41 | Epoch: 1 | Step: 193370 | Dataset: 0-927501 | Loss: 0.563 | 912 ms/step , 6892.93 GFLOP/s , 17944.3 tokens/s INFO:__main__:2024-11-05 22:52:50 | Epoch: 1 | Step: 193380 | Dataset: 0-927821 | Loss: 0.608 | 912 ms/step , 6894.85 GFLOP/s , 
17934.6 tokens/s INFO:__main__:2024-11-05 22:52:59 | Epoch: 1 | Step: 193390 | Dataset: 0-928141 | Loss: 0.768 | 912 ms/step , 6892.64 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 22:53:08 | Epoch: 1 | Step: 193400 | Dataset: 0-928461 | Loss: 0.793 | 913 ms/step , 6887.26 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 22:53:10 | Validation | Step: 193400 | Val_loss: 0.726 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:53:19 | Epoch: 1 | Step: 193410 | Dataset: 0-928781 | Loss: 0.835 | 913 ms/step , 6888.97 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-05 22:53:28 | Epoch: 1 | Step: 193420 | Dataset: 0-929101 | Loss: 0.777 | 915 ms/step , 6876.45 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 22:53:37 | Epoch: 1 | Step: 193430 | Dataset: 0-929421 | Loss: 0.839 | 913 ms/step , 6885.32 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 22:53:46 | Epoch: 1 | Step: 193440 | Dataset: 0-929741 | Loss: 0.805 | 914 ms/step , 6883.80 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 22:53:56 | Epoch: 1 | Step: 193450 | Dataset: 0-930061 | Loss: 0.847 | 916 ms/step , 6869.96 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 22:54:05 | Epoch: 1 | Step: 193460 | Dataset: 0-930381 | Loss: 0.751 | 913 ms/step , 6886.71 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-05 22:54:14 | Epoch: 1 | Step: 193470 | Dataset: 0-930701 | Loss: 0.842 | 915 ms/step , 6874.03 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 22:54:23 | Epoch: 1 | Step: 193480 | Dataset: 0-931021 | Loss: 0.806 | 912 ms/step , 6893.33 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-05 22:54:32 | Epoch: 1 | Step: 193490 | Dataset: 0-931341 | Loss: 0.493 | 913 ms/step , 6889.16 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-05 22:54:41 | Epoch: 1 | Step: 193500 | Dataset: 0-931661 | Loss: 0.748 | 913 ms/step , 6888.25 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 22:54:43 | Validation | Step: 193500 | Val_loss: 0.717 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:54:52 
| Epoch: 1 | Step: 193510 | Dataset: 0-931981 | Loss: 0.687 | 913 ms/step , 6890.98 GFLOP/s , 15276.4 tokens/s INFO:__main__:2024-11-05 22:55:01 | Epoch: 1 | Step: 193520 | Dataset: 0-932301 | Loss: 0.802 | 914 ms/step , 6883.22 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 22:55:10 | Epoch: 1 | Step: 193530 | Dataset: 0-932621 | Loss: 0.666 | 913 ms/step , 6888.57 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 22:55:19 | Epoch: 1 | Step: 193540 | Dataset: 0-932941 | Loss: 0.794 | 914 ms/step , 6882.19 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 22:55:29 | Epoch: 1 | Step: 193550 | Dataset: 0-933261 | Loss: 0.786 | 914 ms/step , 6879.23 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 22:55:38 | Epoch: 1 | Step: 193560 | Dataset: 0-933581 | Loss: 0.722 | 914 ms/step , 6881.88 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 22:55:47 | Epoch: 1 | Step: 193570 | Dataset: 0-933901 | Loss: 0.704 | 916 ms/step , 6869.35 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 22:55:56 | Epoch: 1 | Step: 193580 | Dataset: 0-934221 | Loss: 0.771 | 914 ms/step , 6881.64 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 22:56:05 | Epoch: 1 | Step: 193590 | Dataset: 0-934541 | Loss: 0.654 | 912 ms/step , 6894.65 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 22:56:14 | Epoch: 1 | Step: 193600 | Dataset: 0-934861 | Loss: 0.755 | 913 ms/step , 6886.18 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 22:56:16 | Validation | Step: 193600 | Val_loss: 0.754 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:56:25 | Epoch: 1 | Step: 193610 | Dataset: 0-935181 | Loss: 0.689 | 914 ms/step , 6880.87 GFLOP/s , 15277.9 tokens/s INFO:__main__:2024-11-05 22:56:34 | Epoch: 1 | Step: 193620 | Dataset: 0-935501 | Loss: 0.772 | 913 ms/step , 6887.38 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-05 22:56:43 | Epoch: 1 | Step: 193630 | Dataset: 0-935821 | Loss: 0.686 | 913 ms/step , 6885.76 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-05 22:56:52 | Epoch: 1 
| Step: 193640 | Dataset: 0-936141 | Loss: 0.739 | 913 ms/step , 6888.89 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 22:57:01 | Epoch: 1 | Step: 193650 | Dataset: 0-936461 | Loss: 0.586 | 912 ms/step , 6893.53 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 22:57:11 | Epoch: 1 | Step: 193660 | Dataset: 0-936781 | Loss: 0.850 | 914 ms/step , 6879.35 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 22:57:20 | Epoch: 1 | Step: 193670 | Dataset: 0-937101 | Loss: 0.735 | 915 ms/step , 6871.11 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 22:57:29 | Epoch: 1 | Step: 193680 | Dataset: 0-937421 | Loss: 0.717 | 913 ms/step , 6892.25 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-05 22:57:38 | Epoch: 1 | Step: 193690 | Dataset: 0-937741 | Loss: 0.664 | 914 ms/step , 6885.02 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 22:57:47 | Epoch: 1 | Step: 193700 | Dataset: 0-938061 | Loss: 0.719 | 913 ms/step , 6887.35 GFLOP/s , 17939.7 tokens/s INFO:__main__:2024-11-05 22:57:49 | Validation | Step: 193700 | Val_loss: 0.734 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:57:58 | Epoch: 1 | Step: 193710 | Dataset: 0-938381 | Loss: 0.773 | 914 ms/step , 6883.11 GFLOP/s , 15293.0 tokens/s INFO:__main__:2024-11-05 22:58:07 | Epoch: 1 | Step: 193720 | Dataset: 0-938701 | Loss: 0.599 | 911 ms/step , 6902.32 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 22:58:16 | Epoch: 1 | Step: 193730 | Dataset: 0-939021 | Loss: 0.816 | 915 ms/step , 6870.83 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 22:58:25 | Epoch: 1 | Step: 193740 | Dataset: 0-939341 | Loss: 0.827 | 912 ms/step , 6893.08 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 22:58:34 | Epoch: 1 | Step: 193750 | Dataset: 0-939661 | Loss: 0.407 | 912 ms/step , 6895.65 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 22:58:44 | Epoch: 1 | Step: 193760 | Dataset: 0-939981 | Loss: 0.836 | 914 ms/step , 6883.88 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 22:58:53 | Epoch: 1 | Step: 
193770 | Dataset: 0-940301 | Loss: 0.664 | 912 ms/step , 6896.37 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 22:59:02 | Epoch: 1 | Step: 193780 | Dataset: 0-940621 | Loss: 0.706 | 914 ms/step , 6881.85 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 22:59:11 | Epoch: 1 | Step: 193790 | Dataset: 0-940941 | Loss: 0.727 | 913 ms/step , 6890.61 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 22:59:20 | Epoch: 1 | Step: 193800 | Dataset: 0-941261 | Loss: 0.609 | 913 ms/step , 6890.43 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 22:59:22 | Validation | Step: 193800 | Val_loss: 0.691 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 22:59:31 | Epoch: 1 | Step: 193810 | Dataset: 0-941581 | Loss: 0.729 | 914 ms/step , 6882.70 GFLOP/s , 15276.2 tokens/s INFO:__main__:2024-11-05 22:59:40 | Epoch: 1 | Step: 193820 | Dataset: 0-941901 | Loss: 0.632 | 915 ms/step , 6871.46 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 22:59:49 | Epoch: 1 | Step: 193830 | Dataset: 0-942221 | Loss: 0.743 | 913 ms/step , 6885.94 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 22:59:58 | Epoch: 1 | Step: 193840 | Dataset: 0-942541 | Loss: 0.780 | 914 ms/step , 6880.57 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 23:00:07 | Epoch: 1 | Step: 193850 | Dataset: 0-942861 | Loss: 0.612 | 913 ms/step , 6885.58 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 23:00:17 | Epoch: 1 | Step: 193860 | Dataset: 0-943181 | Loss: 0.765 | 913 ms/step , 6889.76 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 23:00:26 | Epoch: 1 | Step: 193870 | Dataset: 0-943501 | Loss: 0.793 | 915 ms/step , 6874.76 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 23:00:35 | Epoch: 1 | Step: 193880 | Dataset: 0-943821 | Loss: 0.753 | 914 ms/step , 6877.63 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 23:00:44 | Epoch: 1 | Step: 193890 | Dataset: 0-944141 | Loss: 0.734 | 913 ms/step , 6891.43 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 23:00:53 | Epoch: 1 | Step: 193900 | 
Dataset: 0-944461 | Loss: 0.721 | 913 ms/step , 6889.86 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 23:00:55 | Validation | Step: 193900 | Val_loss: 0.748 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:01:04 | Epoch: 1 | Step: 193910 | Dataset: 0-944781 | Loss: 0.574 | 912 ms/step , 6898.29 GFLOP/s , 15278.1 tokens/s INFO:__main__:2024-11-05 23:01:13 | Epoch: 1 | Step: 193920 | Dataset: 0-945101 | Loss: 0.658 | 913 ms/step , 6890.45 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 23:01:22 | Epoch: 1 | Step: 193930 | Dataset: 0-945421 | Loss: 0.688 | 913 ms/step , 6885.74 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 23:01:31 | Epoch: 1 | Step: 193940 | Dataset: 0-945741 | Loss: 0.755 | 913 ms/step , 6885.66 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-05 23:01:40 | Epoch: 1 | Step: 193950 | Dataset: 0-946061 | Loss: 0.644 | 914 ms/step , 6883.76 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-05 23:01:49 | Epoch: 1 | Step: 193960 | Dataset: 0-946381 | Loss: 0.533 | 913 ms/step , 6890.58 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-05 23:01:59 | Epoch: 1 | Step: 193970 | Dataset: 0-946701 | Loss: 0.740 | 912 ms/step , 6893.82 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-05 23:02:08 | Epoch: 1 | Step: 193980 | Dataset: 0-947021 | Loss: 0.705 | 913 ms/step , 6890.02 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 23:02:17 | Epoch: 1 | Step: 193990 | Dataset: 0-947341 | Loss: 0.700 | 914 ms/step , 6884.34 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 23:02:26 | Epoch: 1 | Step: 194000 | Dataset: 0-947661 | Loss: 0.762 | 915 ms/step , 6876.82 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 23:02:28 | Validation | Step: 194000 | Val_loss: 0.694 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:02:28 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_230228_step_194000.pt` INFO:__main__:2024-11-05 23:02:38 | Epoch: 1 | Step: 194010 | Dataset: 0-947981 | Loss: 0.809 | 914 ms/step , 
6882.29 GFLOP/s , 13814.0 tokens/s INFO:__main__:2024-11-05 23:02:47 | Epoch: 1 | Step: 194020 | Dataset: 0-948301 | Loss: 0.769 | 914 ms/step , 6884.35 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-05 23:02:56 | Epoch: 1 | Step: 194030 | Dataset: 0-948621 | Loss: 0.623 | 914 ms/step , 6883.14 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 23:03:05 | Epoch: 1 | Step: 194040 | Dataset: 0-948941 | Loss: 0.735 | 912 ms/step , 6893.74 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-05 23:03:14 | Epoch: 1 | Step: 194050 | Dataset: 0-949261 | Loss: 0.648 | 913 ms/step , 6889.37 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 23:03:24 | Epoch: 1 | Step: 194060 | Dataset: 0-949581 | Loss: 0.458 | 913 ms/step , 6889.38 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 23:03:33 | Epoch: 1 | Step: 194070 | Dataset: 0-949901 | Loss: 0.828 | 914 ms/step , 6883.07 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-05 23:03:42 | Epoch: 1 | Step: 194080 | Dataset: 0-950221 | Loss: 0.512 | 912 ms/step , 6899.29 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-05 23:03:51 | Epoch: 1 | Step: 194090 | Dataset: 0-950541 | Loss: 0.723 | 913 ms/step , 6887.51 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 23:04:00 | Epoch: 1 | Step: 194100 | Dataset: 0-950861 | Loss: 0.855 | 915 ms/step , 6876.41 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 23:04:02 | Validation | Step: 194100 | Val_loss: 0.632 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:04:11 | Epoch: 1 | Step: 194110 | Dataset: 0-951181 | Loss: 0.715 | 914 ms/step , 6880.81 GFLOP/s , 15276.3 tokens/s INFO:__main__:2024-11-05 23:04:20 | Epoch: 1 | Step: 194120 | Dataset: 0-951501 | Loss: 0.673 | 913 ms/step , 6892.04 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-05 23:04:29 | Epoch: 1 | Step: 194130 | Dataset: 0-951821 | Loss: 0.732 | 914 ms/step , 6881.79 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-05 23:04:38 | Epoch: 1 | Step: 194140 | Dataset: 0-952141 | Loss: 0.765 | 912 ms/step , 6892.68 
GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 23:04:47 | Epoch: 1 | Step: 194150 | Dataset: 0-952461 | Loss: 0.567 | 912 ms/step , 6892.97 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 23:04:57 | Epoch: 1 | Step: 194160 | Dataset: 0-952781 | Loss: 0.443 | 912 ms/step , 6894.95 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 23:05:06 | Epoch: 1 | Step: 194170 | Dataset: 0-953101 | Loss: 0.789 | 913 ms/step , 6890.90 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-05 23:05:15 | Epoch: 1 | Step: 194180 | Dataset: 0-953421 | Loss: 0.624 | 913 ms/step , 6887.69 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-05 23:05:24 | Epoch: 1 | Step: 194190 | Dataset: 0-953741 | Loss: 0.744 | 913 ms/step , 6887.35 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-05 23:05:33 | Epoch: 1 | Step: 194200 | Dataset: 0-954061 | Loss: 0.727 | 913 ms/step , 6889.83 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-05 23:05:35 | Validation | Step: 194200 | Val_loss: 0.634 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:05:44 | Epoch: 1 | Step: 194210 | Dataset: 0-954381 | Loss: 0.540 | 912 ms/step , 6892.75 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-05 23:05:53 | Epoch: 1 | Step: 194220 | Dataset: 0-954701 | Loss: 0.671 | 913 ms/step , 6888.13 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-05 23:06:02 | Epoch: 1 | Step: 194230 | Dataset: 0-955021 | Loss: 0.750 | 913 ms/step , 6886.65 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-05 23:06:11 | Epoch: 1 | Step: 194240 | Dataset: 0-955341 | Loss: 0.768 | 914 ms/step , 6878.75 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 23:06:20 | Epoch: 1 | Step: 194250 | Dataset: 0-955661 | Loss: 0.681 | 913 ms/step , 6891.43 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-05 23:06:29 | Epoch: 1 | Step: 194260 | Dataset: 0-955981 | Loss: 0.760 | 913 ms/step , 6885.18 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-05 23:06:39 | Epoch: 1 | Step: 194270 | Dataset: 0-956301 | Loss: 0.694 | 913 ms/step , 6889.92 GFLOP/s , 
17945.8 tokens/s INFO:__main__:2024-11-05 23:06:48 | Epoch: 1 | Step: 194280 | Dataset: 0-956621 | Loss: 0.755 | 914 ms/step , 6880.05 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 23:06:57 | Epoch: 1 | Step: 194290 | Dataset: 0-956941 | Loss: 0.782 | 913 ms/step , 6886.50 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 23:07:06 | Epoch: 1 | Step: 194300 | Dataset: 0-957261 | Loss: 0.483 | 912 ms/step , 6893.53 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 23:07:08 | Validation | Step: 194300 | Val_loss: 0.739 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:07:17 | Epoch: 1 | Step: 194310 | Dataset: 0-957581 | Loss: 0.833 | 913 ms/step , 6889.58 GFLOP/s , 15281.9 tokens/s INFO:__main__:2024-11-05 23:07:26 | Epoch: 1 | Step: 194320 | Dataset: 0-957901 | Loss: 0.721 | 913 ms/step , 6890.14 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 23:07:35 | Epoch: 1 | Step: 194330 | Dataset: 0-958221 | Loss: 0.800 | 913 ms/step , 6886.78 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-05 23:07:44 | Epoch: 1 | Step: 194340 | Dataset: 0-958541 | Loss: 0.716 | 913 ms/step , 6885.58 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-05 23:07:53 | Epoch: 1 | Step: 194350 | Dataset: 0-958861 | Loss: 0.722 | 914 ms/step , 6882.24 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 23:08:02 | Epoch: 1 | Step: 194360 | Dataset: 0-959181 | Loss: 0.790 | 913 ms/step , 6885.95 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-05 23:08:12 | Epoch: 1 | Step: 194370 | Dataset: 0-959501 | Loss: 0.773 | 913 ms/step , 6890.74 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 23:08:21 | Epoch: 1 | Step: 194380 | Dataset: 0-959821 | Loss: 0.853 | 913 ms/step , 6889.33 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 23:08:30 | Epoch: 1 | Step: 194390 | Dataset: 0-960141 | Loss: 0.828 | 913 ms/step , 6887.61 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-05 23:08:39 | Epoch: 1 | Step: 194400 | Dataset: 0-960461 | Loss: 0.766 | 915 ms/step , 6873.07 GFLOP/s , 17939.1 
tokens/s INFO:__main__:2024-11-05 23:08:40 | Validation | Step: 194400 | Val_loss: 0.732 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:08:50 | Epoch: 1 | Step: 194410 | Dataset: 0-960781 | Loss: 0.758 | 915 ms/step , 6875.89 GFLOP/s , 15279.4 tokens/s INFO:__main__:2024-11-05 23:08:59 | Epoch: 1 | Step: 194420 | Dataset: 0-961101 | Loss: 0.711 | 913 ms/step , 6889.20 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 23:09:08 | Epoch: 1 | Step: 194430 | Dataset: 0-961421 | Loss: 0.810 | 913 ms/step , 6886.42 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-05 23:09:17 | Epoch: 1 | Step: 194440 | Dataset: 0-961741 | Loss: 0.832 | 914 ms/step , 6880.52 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 23:09:26 | Epoch: 1 | Step: 194450 | Dataset: 0-962061 | Loss: 0.704 | 913 ms/step , 6892.00 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 23:09:35 | Epoch: 1 | Step: 194460 | Dataset: 0-962381 | Loss: 0.765 | 914 ms/step , 6879.37 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 23:09:44 | Epoch: 1 | Step: 194470 | Dataset: 0-962701 | Loss: 0.749 | 913 ms/step , 6887.98 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 23:09:54 | Epoch: 1 | Step: 194480 | Dataset: 0-963021 | Loss: 0.799 | 915 ms/step , 6877.18 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 23:10:03 | Epoch: 1 | Step: 194490 | Dataset: 0-963341 | Loss: 0.678 | 913 ms/step , 6885.75 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 23:10:12 | Epoch: 1 | Step: 194500 | Dataset: 0-963661 | Loss: 0.674 | 913 ms/step , 6891.44 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 23:10:13 | Validation | Step: 194500 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:10:23 | Epoch: 1 | Step: 194510 | Dataset: 0-963981 | Loss: 0.790 | 912 ms/step , 6896.30 GFLOP/s , 15286.5 tokens/s INFO:__main__:2024-11-05 23:10:32 | Epoch: 1 | Step: 194520 | Dataset: 0-964301 | Loss: 0.550 | 912 ms/step , 6894.53 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 23:10:41 | Epoch: 
1 | Step: 194530 | Dataset: 0-964621 | Loss: 0.681 | 914 ms/step , 6882.71 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 23:10:50 | Epoch: 1 | Step: 194540 | Dataset: 0-964941 | Loss: 0.804 | 914 ms/step , 6878.56 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 23:10:59 | Epoch: 1 | Step: 194550 | Dataset: 0-965261 | Loss: 0.781 | 914 ms/step , 6881.17 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 23:11:08 | Epoch: 1 | Step: 194560 | Dataset: 0-965581 | Loss: 0.779 | 913 ms/step , 6890.05 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 23:11:17 | Epoch: 1 | Step: 194570 | Dataset: 0-965901 | Loss: 0.764 | 914 ms/step , 6882.07 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-05 23:11:27 | Epoch: 1 | Step: 194580 | Dataset: 0-966221 | Loss: 0.591 | 912 ms/step , 6892.85 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 23:11:36 | Epoch: 1 | Step: 194590 | Dataset: 0-966541 | Loss: 0.662 | 912 ms/step , 6895.99 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 23:11:45 | Epoch: 1 | Step: 194600 | Dataset: 0-966861 | Loss: 0.510 | 912 ms/step , 6896.77 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 23:11:46 | Validation | Step: 194600 | Val_loss: 0.668 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:11:56 | Epoch: 1 | Step: 194610 | Dataset: 0-967181 | Loss: 0.686 | 913 ms/step , 6885.24 GFLOP/s , 15282.0 tokens/s INFO:__main__:2024-11-05 23:12:05 | Epoch: 1 | Step: 194620 | Dataset: 0-967501 | Loss: 0.922 | 913 ms/step , 6886.89 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 23:12:14 | Epoch: 1 | Step: 194630 | Dataset: 0-967821 | Loss: 0.587 | 913 ms/step , 6890.09 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 23:12:23 | Epoch: 1 | Step: 194640 | Dataset: 0-968141 | Loss: 0.680 | 913 ms/step , 6888.67 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 23:12:32 | Epoch: 1 | Step: 194650 | Dataset: 0-968461 | Loss: 0.423 | 914 ms/step , 6880.11 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 23:12:41 | Epoch: 1 | Step: 
194660 | Dataset: 0-968781 | Loss: 0.641 | 913 ms/step , 6892.31 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 23:12:50 | Epoch: 1 | Step: 194670 | Dataset: 0-969101 | Loss: 0.615 | 913 ms/step , 6889.35 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 23:13:00 | Epoch: 1 | Step: 194680 | Dataset: 0-969421 | Loss: 0.809 | 914 ms/step , 6882.40 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 23:13:09 | Epoch: 1 | Step: 194690 | Dataset: 0-969741 | Loss: 0.693 | 914 ms/step , 6883.57 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 23:13:18 | Epoch: 1 | Step: 194700 | Dataset: 0-970061 | Loss: 0.632 | 913 ms/step , 6887.27 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 23:13:19 | Validation | Step: 194700 | Val_loss: 0.720 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:13:29 | Epoch: 1 | Step: 194710 | Dataset: 0-970381 | Loss: 0.763 | 913 ms/step , 6885.30 GFLOP/s , 15270.8 tokens/s INFO:__main__:2024-11-05 23:13:38 | Epoch: 1 | Step: 194720 | Dataset: 0-970701 | Loss: 0.753 | 915 ms/step , 6875.01 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-05 23:13:47 | Epoch: 1 | Step: 194730 | Dataset: 0-971021 | Loss: 0.776 | 913 ms/step , 6889.46 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 23:13:56 | Epoch: 1 | Step: 194740 | Dataset: 0-971341 | Loss: 0.678 | 913 ms/step , 6890.18 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 23:14:05 | Epoch: 1 | Step: 194750 | Dataset: 0-971661 | Loss: 0.649 | 913 ms/step , 6885.16 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 23:14:14 | Epoch: 1 | Step: 194760 | Dataset: 0-971981 | Loss: 0.651 | 913 ms/step , 6887.81 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 23:14:23 | Epoch: 1 | Step: 194770 | Dataset: 0-972301 | Loss: 0.768 | 914 ms/step , 6883.78 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 23:14:32 | Epoch: 1 | Step: 194780 | Dataset: 0-972621 | Loss: 0.751 | 914 ms/step , 6883.74 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-05 23:14:42 | Epoch: 1 | Step: 194790 | 
Dataset: 0-972941 | Loss: 0.775 | 915 ms/step , 6875.72 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 23:14:51 | Epoch: 1 | Step: 194800 | Dataset: 0-973261 | Loss: 0.829 | 915 ms/step , 6873.31 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 23:14:52 | Validation | Step: 194800 | Val_loss: 0.735 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:15:01 | Epoch: 1 | Step: 194810 | Dataset: 0-973581 | Loss: 0.640 | 914 ms/step , 6882.19 GFLOP/s , 15274.9 tokens/s INFO:__main__:2024-11-05 23:15:11 | Epoch: 1 | Step: 194820 | Dataset: 0-973901 | Loss: 0.789 | 914 ms/step , 6884.09 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 23:15:20 | Epoch: 1 | Step: 194830 | Dataset: 0-974221 | Loss: 0.801 | 914 ms/step , 6878.26 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-05 23:15:29 | Epoch: 1 | Step: 194840 | Dataset: 0-974541 | Loss: 0.800 | 914 ms/step , 6883.90 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 23:15:38 | Epoch: 1 | Step: 194850 | Dataset: 0-974861 | Loss: 0.740 | 914 ms/step , 6884.42 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 23:15:47 | Epoch: 1 | Step: 194860 | Dataset: 0-975181 | Loss: 0.574 | 912 ms/step , 6896.27 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-05 23:15:56 | Epoch: 1 | Step: 194870 | Dataset: 0-975501 | Loss: 0.781 | 914 ms/step , 6884.33 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 23:16:05 | Epoch: 1 | Step: 194880 | Dataset: 0-975821 | Loss: 0.737 | 913 ms/step , 6887.16 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-05 23:16:15 | Epoch: 1 | Step: 194890 | Dataset: 0-976141 | Loss: 0.741 | 913 ms/step , 6888.62 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-05 23:16:24 | Epoch: 1 | Step: 194900 | Dataset: 0-976461 | Loss: 0.778 | 914 ms/step , 6878.55 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 23:16:25 | Validation | Step: 194900 | Val_loss: 0.740 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:16:34 | Epoch: 1 | Step: 194910 | Dataset: 0-976781 | Loss: 0.685 | 913 ms/step , 
6892.29 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-05 23:16:44 | Epoch: 1 | Step: 194920 | Dataset: 0-977101 | Loss: 0.671 | 913 ms/step , 6890.47 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 23:16:53 | Epoch: 1 | Step: 194930 | Dataset: 0-977421 | Loss: 0.755 | 914 ms/step , 6884.64 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:17:02 | Epoch: 1 | Step: 194940 | Dataset: 0-977741 | Loss: 0.774 | 914 ms/step , 6879.97 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 23:17:11 | Epoch: 1 | Step: 194950 | Dataset: 0-978061 | Loss: 0.792 | 913 ms/step , 6891.30 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-05 23:17:20 | Epoch: 1 | Step: 194960 | Dataset: 0-978381 | Loss: 0.720 | 913 ms/step , 6889.90 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 23:17:29 | Epoch: 1 | Step: 194970 | Dataset: 0-978701 | Loss: 0.683 | 913 ms/step , 6886.55 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 23:17:38 | Epoch: 1 | Step: 194980 | Dataset: 0-979021 | Loss: 0.716 | 913 ms/step , 6889.34 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 23:17:48 | Epoch: 1 | Step: 194990 | Dataset: 0-979341 | Loss: 0.759 | 913 ms/step , 6889.13 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 23:17:57 | Epoch: 1 | Step: 195000 | Dataset: 0-979661 | Loss: 0.713 | 912 ms/step , 6894.13 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:17:58 | Validation | Step: 195000 | Val_loss: 0.722 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:17:58 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_231758_step_195000.pt` INFO:__main__:2024-11-05 23:18:09 | Epoch: 1 | Step: 195010 | Dataset: 0-979981 | Loss: 0.620 | 914 ms/step , 6880.74 GFLOP/s , 13822.3 tokens/s INFO:__main__:2024-11-05 23:18:18 | Epoch: 1 | Step: 195020 | Dataset: 0-980301 | Loss: 0.669 | 913 ms/step , 6885.45 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-05 23:18:27 | Epoch: 1 | Step: 195030 | Dataset: 0-980621 | Loss: 0.665 | 913 ms/step , 6891.71 
GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-05 23:18:36 | Epoch: 1 | Step: 195040 | Dataset: 0-980941 | Loss: 0.719 | 914 ms/step , 6883.99 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-05 23:18:45 | Epoch: 1 | Step: 195050 | Dataset: 0-981261 | Loss: 0.693 | 913 ms/step , 6887.18 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 23:18:54 | Epoch: 1 | Step: 195060 | Dataset: 0-981581 | Loss: 0.640 | 912 ms/step , 6899.08 GFLOP/s , 17940.9 tokens/s INFO:__main__:2024-11-05 23:19:03 | Epoch: 1 | Step: 195070 | Dataset: 0-981901 | Loss: 0.803 | 915 ms/step , 6876.65 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 23:19:13 | Epoch: 1 | Step: 195080 | Dataset: 0-982221 | Loss: 0.772 | 914 ms/step , 6883.14 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-05 23:19:22 | Epoch: 1 | Step: 195090 | Dataset: 0-982541 | Loss: 0.698 | 913 ms/step , 6886.43 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-05 23:19:31 | Epoch: 1 | Step: 195100 | Dataset: 0-982861 | Loss: 0.712 | 913 ms/step , 6885.21 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-05 23:19:32 | Validation | Step: 195100 | Val_loss: 0.727 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:19:42 | Epoch: 1 | Step: 195110 | Dataset: 0-983181 | Loss: 0.613 | 913 ms/step , 6887.17 GFLOP/s , 15285.0 tokens/s INFO:__main__:2024-11-05 23:19:51 | Epoch: 1 | Step: 195120 | Dataset: 0-983501 | Loss: 0.767 | 913 ms/step , 6886.59 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-05 23:20:00 | Epoch: 1 | Step: 195130 | Dataset: 0-983821 | Loss: 0.732 | 914 ms/step , 6882.35 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-05 23:20:09 | Epoch: 1 | Step: 195140 | Dataset: 0-984141 | Loss: 0.828 | 913 ms/step , 6887.30 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 23:20:18 | Epoch: 1 | Step: 195150 | Dataset: 0-984461 | Loss: 0.630 | 914 ms/step , 6879.78 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 23:20:27 | Epoch: 1 | Step: 195160 | Dataset: 0-984781 | Loss: 0.804 | 915 ms/step , 6877.47 GFLOP/s , 
17931.7 tokens/s INFO:__main__:2024-11-05 23:20:36 | Epoch: 1 | Step: 195170 | Dataset: 0-985101 | Loss: 0.668 | 912 ms/step , 6894.28 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 23:20:45 | Epoch: 1 | Step: 195180 | Dataset: 0-985421 | Loss: 0.815 | 914 ms/step , 6884.35 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-05 23:20:55 | Epoch: 1 | Step: 195190 | Dataset: 0-985741 | Loss: 0.738 | 914 ms/step , 6884.25 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 23:21:04 | Epoch: 1 | Step: 195200 | Dataset: 0-986061 | Loss: 0.740 | 913 ms/step , 6891.10 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 23:21:05 | Validation | Step: 195200 | Val_loss: 0.757 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:21:14 | Epoch: 1 | Step: 195210 | Dataset: 0-986381 | Loss: 0.644 | 913 ms/step , 6892.23 GFLOP/s , 15279.6 tokens/s INFO:__main__:2024-11-05 23:21:24 | Epoch: 1 | Step: 195220 | Dataset: 0-986701 | Loss: 0.754 | 914 ms/step , 6877.68 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-05 23:21:33 | Epoch: 1 | Step: 195230 | Dataset: 0-987021 | Loss: 0.818 | 915 ms/step , 6876.66 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-05 23:21:42 | Epoch: 1 | Step: 195240 | Dataset: 0-987341 | Loss: 0.786 | 913 ms/step , 6885.79 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-05 23:21:51 | Epoch: 1 | Step: 195250 | Dataset: 0-987661 | Loss: 0.754 | 914 ms/step , 6878.05 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 23:22:00 | Epoch: 1 | Step: 195260 | Dataset: 0-987981 | Loss: 0.710 | 914 ms/step , 6884.39 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 23:22:09 | Epoch: 1 | Step: 195270 | Dataset: 0-988301 | Loss: 0.746 | 914 ms/step , 6882.49 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-05 23:22:18 | Epoch: 1 | Step: 195280 | Dataset: 0-988621 | Loss: 0.741 | 913 ms/step , 6888.49 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-05 23:22:28 | Epoch: 1 | Step: 195290 | Dataset: 0-988941 | Loss: 0.723 | 913 ms/step , 6891.81 GFLOP/s , 17920.4 
tokens/s INFO:__main__:2024-11-05 23:22:37 | Epoch: 1 | Step: 195300 | Dataset: 0-989261 | Loss: 0.663 | 913 ms/step , 6888.14 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 23:22:38 | Validation | Step: 195300 | Val_loss: 0.730 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:22:47 | Epoch: 1 | Step: 195310 | Dataset: 0-989581 | Loss: 0.750 | 912 ms/step , 6892.74 GFLOP/s , 15262.8 tokens/s INFO:__main__:2024-11-05 23:22:57 | Epoch: 1 | Step: 195320 | Dataset: 0-989901 | Loss: 0.804 | 913 ms/step , 6889.15 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 23:23:06 | Epoch: 1 | Step: 195330 | Dataset: 0-990221 | Loss: 0.705 | 914 ms/step , 6884.22 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-05 23:23:15 | Epoch: 1 | Step: 195340 | Dataset: 0-990541 | Loss: 0.712 | 913 ms/step , 6891.14 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-05 23:23:24 | Epoch: 1 | Step: 195350 | Dataset: 0-990861 | Loss: 0.728 | 915 ms/step , 6877.04 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-05 23:23:33 | Epoch: 1 | Step: 195360 | Dataset: 0-991181 | Loss: 0.723 | 914 ms/step , 6885.02 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 23:23:42 | Epoch: 1 | Step: 195370 | Dataset: 0-991501 | Loss: 0.644 | 912 ms/step , 6892.84 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 23:23:51 | Epoch: 1 | Step: 195380 | Dataset: 0-991821 | Loss: 0.795 | 914 ms/step , 6880.77 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 23:24:01 | Epoch: 1 | Step: 195390 | Dataset: 0-992141 | Loss: 0.757 | 913 ms/step , 6888.08 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 23:24:10 | Epoch: 1 | Step: 195400 | Dataset: 0-992461 | Loss: 0.686 | 913 ms/step , 6885.79 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 23:24:11 | Validation | Step: 195400 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:24:20 | Epoch: 1 | Step: 195410 | Dataset: 0-992781 | Loss: 0.699 | 914 ms/step , 6878.18 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-05 23:24:30 | Epoch: 
1 | Step: 195420 | Dataset: 0-993101 | Loss: 0.787 | 915 ms/step , 6875.82 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-05 23:24:39 | Epoch: 1 | Step: 195430 | Dataset: 0-993421 | Loss: 0.743 | 913 ms/step , 6888.02 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 23:24:48 | Epoch: 1 | Step: 195440 | Dataset: 0-993741 | Loss: 0.747 | 914 ms/step , 6882.35 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-05 23:24:57 | Epoch: 1 | Step: 195450 | Dataset: 0-994061 | Loss: 0.681 | 914 ms/step , 6879.51 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-05 23:25:06 | Epoch: 1 | Step: 195460 | Dataset: 0-994381 | Loss: 0.649 | 914 ms/step , 6884.34 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 23:25:15 | Epoch: 1 | Step: 195470 | Dataset: 0-994701 | Loss: 0.763 | 914 ms/step , 6882.59 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-05 23:25:24 | Epoch: 1 | Step: 195480 | Dataset: 0-995021 | Loss: 0.750 | 913 ms/step , 6890.37 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:25:34 | Epoch: 1 | Step: 195490 | Dataset: 0-995341 | Loss: 0.729 | 913 ms/step , 6885.68 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-05 23:25:43 | Epoch: 1 | Step: 195500 | Dataset: 0-995661 | Loss: 0.778 | 914 ms/step , 6882.40 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 23:25:44 | Validation | Step: 195500 | Val_loss: 0.704 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:25:53 | Epoch: 1 | Step: 195510 | Dataset: 0-995981 | Loss: 0.723 | 913 ms/step , 6885.84 GFLOP/s , 15265.9 tokens/s INFO:__main__:2024-11-05 23:26:03 | Epoch: 1 | Step: 195520 | Dataset: 0-996301 | Loss: 0.809 | 915 ms/step , 6877.16 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 23:26:12 | Epoch: 1 | Step: 195530 | Dataset: 0-996621 | Loss: 0.795 | 915 ms/step , 6876.61 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-05 23:26:21 | Epoch: 1 | Step: 195540 | Dataset: 0-996941 | Loss: 0.672 | 914 ms/step , 6883.34 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-05 23:26:30 | Epoch: 1 | Step: 
195550 | Dataset: 0-997261 | Loss: 0.731 | 914 ms/step , 6880.38 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-05 23:26:39 | Epoch: 1 | Step: 195560 | Dataset: 0-997581 | Loss: 0.650 | 913 ms/step , 6886.11 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 23:26:48 | Epoch: 1 | Step: 195570 | Dataset: 0-997901 | Loss: 0.677 | 913 ms/step , 6885.31 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 23:26:57 | Epoch: 1 | Step: 195580 | Dataset: 0-998221 | Loss: 0.644 | 912 ms/step , 6895.35 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 23:27:07 | Epoch: 1 | Step: 195590 | Dataset: 0-998541 | Loss: 0.711 | 912 ms/step , 6893.53 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 23:27:16 | Epoch: 1 | Step: 195600 | Dataset: 0-998861 | Loss: 0.809 | 914 ms/step , 6878.79 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 23:27:17 | Validation | Step: 195600 | Val_loss: 0.686 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:27:26 | Epoch: 1 | Step: 195610 | Dataset: 0-999181 | Loss: 0.692 | 914 ms/step , 6880.25 GFLOP/s , 15263.5 tokens/s INFO:__main__:2024-11-05 23:27:36 | Epoch: 1 | Step: 195620 | Dataset: 0-999501 | Loss: 0.746 | 915 ms/step , 6873.89 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-05 23:27:45 | Epoch: 1 | Step: 195630 | Dataset: 0-999821 | Loss: 0.648 | 912 ms/step , 6896.01 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-05 23:27:54 | Epoch: 1 | Step: 195640 | Dataset: 0-1000141 | Loss: 0.705 | 916 ms/step , 6867.96 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 23:28:03 | Epoch: 1 | Step: 195650 | Dataset: 0-1000461 | Loss: 0.725 | 915 ms/step , 6875.85 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-05 23:28:12 | Epoch: 1 | Step: 195660 | Dataset: 0-1000781 | Loss: 0.755 | 913 ms/step , 6889.95 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 23:28:21 | Epoch: 1 | Step: 195670 | Dataset: 0-1001101 | Loss: 0.707 | 914 ms/step , 6884.03 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 23:28:30 | Epoch: 1 | Step: 195680 
| Dataset: 0-1001421 | Loss: 0.680 | 914 ms/step , 6881.73 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 23:28:40 | Epoch: 1 | Step: 195690 | Dataset: 0-1001741 | Loss: 0.758 | 914 ms/step , 6882.79 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 23:28:49 | Epoch: 1 | Step: 195700 | Dataset: 0-1002061 | Loss: 0.758 | 914 ms/step , 6879.25 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 23:28:50 | Validation | Step: 195700 | Val_loss: 0.739 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:28:59 | Epoch: 1 | Step: 195710 | Dataset: 0-1002381 | Loss: 0.805 | 914 ms/step , 6881.27 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-05 23:29:09 | Epoch: 1 | Step: 195720 | Dataset: 0-1002701 | Loss: 0.711 | 913 ms/step , 6887.36 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-05 23:29:18 | Epoch: 1 | Step: 195730 | Dataset: 0-1003021 | Loss: 0.691 | 913 ms/step , 6888.92 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 23:29:27 | Epoch: 1 | Step: 195740 | Dataset: 0-1003341 | Loss: 0.677 | 913 ms/step , 6890.16 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 23:29:36 | Epoch: 1 | Step: 195750 | Dataset: 0-1003661 | Loss: 0.700 | 914 ms/step , 6884.17 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 23:29:45 | Epoch: 1 | Step: 195760 | Dataset: 0-1003981 | Loss: 0.738 | 914 ms/step , 6878.00 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-05 23:29:54 | Epoch: 1 | Step: 195770 | Dataset: 0-1004301 | Loss: 0.738 | 914 ms/step , 6883.49 GFLOP/s , 17911.8 tokens/s INFO:__main__:2024-11-05 23:30:03 | Epoch: 1 | Step: 195780 | Dataset: 0-1004621 | Loss: 0.697 | 915 ms/step , 6876.44 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-05 23:30:13 | Epoch: 1 | Step: 195790 | Dataset: 0-1004941 | Loss: 0.646 | 914 ms/step , 6880.85 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-05 23:30:22 | Epoch: 1 | Step: 195800 | Dataset: 0-1005261 | Loss: 0.748 | 913 ms/step , 6888.97 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-05 23:30:23 | Validation | Step: 
195800 | Val_loss: 0.726 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:30:32 | Epoch: 1 | Step: 195810 | Dataset: 0-1005581 | Loss: 0.715 | 913 ms/step , 6887.18 GFLOP/s , 15281.4 tokens/s INFO:__main__:2024-11-05 23:30:42 | Epoch: 1 | Step: 195820 | Dataset: 0-1005901 | Loss: 0.657 | 914 ms/step , 6882.63 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-05 23:30:51 | Epoch: 1 | Step: 195830 | Dataset: 0-1006221 | Loss: 0.724 | 914 ms/step , 6881.68 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-05 23:31:00 | Epoch: 1 | Step: 195840 | Dataset: 0-1006541 | Loss: 0.693 | 914 ms/step , 6881.37 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-05 23:31:09 | Epoch: 1 | Step: 195850 | Dataset: 0-1006861 | Loss: 0.600 | 914 ms/step , 6882.68 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-05 23:31:18 | Epoch: 1 | Step: 195860 | Dataset: 0-1007181 | Loss: 0.721 | 915 ms/step , 6876.22 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-05 23:31:27 | Epoch: 1 | Step: 195870 | Dataset: 0-1007501 | Loss: 0.729 | 915 ms/step , 6874.00 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 23:31:36 | Epoch: 1 | Step: 195880 | Dataset: 0-1007821 | Loss: 0.766 | 914 ms/step , 6879.57 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-05 23:31:46 | Epoch: 1 | Step: 195890 | Dataset: 0-1008141 | Loss: 0.675 | 914 ms/step , 6877.58 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-05 23:31:55 | Epoch: 1 | Step: 195900 | Dataset: 0-1008461 | Loss: 0.653 | 914 ms/step , 6879.86 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-05 23:31:56 | Validation | Step: 195900 | Val_loss: 0.683 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:32:05 | Epoch: 1 | Step: 195910 | Dataset: 0-1008781 | Loss: 0.703 | 914 ms/step , 6877.85 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-05 23:32:15 | Epoch: 1 | Step: 195920 | Dataset: 0-1009101 | Loss: 0.708 | 914 ms/step , 6882.90 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 23:32:24 | Epoch: 1 | Step: 195930 | Dataset: 0-1009421 | Loss: 0.702 
| 913 ms/step , 6888.81 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-05 23:32:33 | Epoch: 1 | Step: 195940 | Dataset: 0-1009741 | Loss: 0.792 | 915 ms/step , 6873.67 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-05 23:32:42 | Epoch: 1 | Step: 195950 | Dataset: 0-1010061 | Loss: 0.646 | 913 ms/step , 6885.92 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-05 23:32:51 | Epoch: 1 | Step: 195960 | Dataset: 0-1010381 | Loss: 0.624 | 913 ms/step , 6890.19 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-05 23:33:00 | Epoch: 1 | Step: 195970 | Dataset: 0-1010701 | Loss: 0.755 | 914 ms/step , 6880.76 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-05 23:33:09 | Epoch: 1 | Step: 195980 | Dataset: 0-1011021 | Loss: 0.661 | 914 ms/step , 6879.08 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-05 23:33:19 | Epoch: 1 | Step: 195990 | Dataset: 0-1011341 | Loss: 0.672 | 914 ms/step , 6883.08 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-05 23:33:28 | Epoch: 1 | Step: 196000 | Dataset: 0-1011661 | Loss: 0.742 | 914 ms/step , 6881.37 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-05 23:33:29 | Validation | Step: 196000 | Val_loss: 0.739 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:33:29 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_233329_step_196000.pt` INFO:__main__:2024-11-05 23:33:40 | Epoch: 1 | Step: 196010 | Dataset: 0-1011981 | Loss: 0.773 | 914 ms/step , 6877.92 GFLOP/s , 13816.1 tokens/s INFO:__main__:2024-11-05 23:33:49 | Epoch: 1 | Step: 196020 | Dataset: 0-1012301 | Loss: 0.749 | 914 ms/step , 6878.52 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-05 23:33:58 | Epoch: 1 | Step: 196030 | Dataset: 0-1012621 | Loss: 0.787 | 914 ms/step , 6883.27 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 23:34:07 | Epoch: 1 | Step: 196040 | Dataset: 0-1012941 | Loss: 0.638 | 913 ms/step , 6886.01 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 23:34:16 | Epoch: 1 | Step: 196050 | Dataset: 0-1013261 | Loss: 0.677 
| 913 ms/step , 6885.22 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-05 23:34:25 | Epoch: 1 | Step: 196060 | Dataset: 0-1013581 | Loss: 0.729 | 914 ms/step , 6883.50 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-05 23:34:34 | Epoch: 1 | Step: 196070 | Dataset: 0-1013901 | Loss: 0.736 | 914 ms/step , 6878.98 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-05 23:34:44 | Epoch: 1 | Step: 196080 | Dataset: 0-1014221 | Loss: 0.738 | 915 ms/step , 6873.15 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 23:34:53 | Epoch: 1 | Step: 196090 | Dataset: 0-1014541 | Loss: 0.703 | 914 ms/step , 6879.26 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-05 23:35:02 | Epoch: 1 | Step: 196100 | Dataset: 0-1014861 | Loss: 0.656 | 914 ms/step , 6884.98 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 23:35:03 | Validation | Step: 196100 | Val_loss: 0.755 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:35:13 | Epoch: 1 | Step: 196110 | Dataset: 0-1015181 | Loss: 0.807 | 913 ms/step , 6886.70 GFLOP/s , 15272.8 tokens/s INFO:__main__:2024-11-05 23:35:22 | Epoch: 1 | Step: 196120 | Dataset: 0-1015501 | Loss: 0.601 | 913 ms/step , 6892.22 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 23:35:31 | Epoch: 1 | Step: 196130 | Dataset: 0-1015821 | Loss: 0.682 | 913 ms/step , 6889.73 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-05 23:35:40 | Epoch: 1 | Step: 196140 | Dataset: 0-1016141 | Loss: 0.616 | 913 ms/step , 6892.10 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 23:35:49 | Epoch: 1 | Step: 196150 | Dataset: 0-1016461 | Loss: 0.740 | 912 ms/step , 6895.37 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-05 23:35:58 | Epoch: 1 | Step: 196160 | Dataset: 0-1016781 | Loss: 0.658 | 915 ms/step , 6873.74 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 23:36:07 | Epoch: 1 | Step: 196170 | Dataset: 0-1017101 | Loss: 0.688 | 913 ms/step , 6891.71 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 23:36:17 | Epoch: 1 | Step: 196180 | Dataset: 0-1017421 | Loss: 
0.650 | 913 ms/step , 6885.29 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-05 23:36:26 | Epoch: 1 | Step: 196190 | Dataset: 0-1017741 | Loss: 0.663 | 913 ms/step , 6886.83 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-05 23:36:35 | Epoch: 1 | Step: 196200 | Dataset: 0-1018061 | Loss: 0.728 | 914 ms/step , 6884.69 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-05 23:36:36 | Validation | Step: 196200 | Val_loss: 0.740 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:36:46 | Epoch: 1 | Step: 196210 | Dataset: 0-1018381 | Loss: 0.829 | 914 ms/step , 6882.83 GFLOP/s , 15269.9 tokens/s INFO:__main__:2024-11-05 23:36:55 | Epoch: 1 | Step: 196220 | Dataset: 0-1018701 | Loss: 0.702 | 914 ms/step , 6878.15 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 23:37:04 | Epoch: 1 | Step: 196230 | Dataset: 0-1019021 | Loss: 0.823 | 912 ms/step , 6894.14 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-05 23:37:13 | Epoch: 1 | Step: 196240 | Dataset: 0-1019341 | Loss: 0.777 | 913 ms/step , 6885.33 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-05 23:37:22 | Epoch: 1 | Step: 196250 | Dataset: 0-1019661 | Loss: 0.647 | 913 ms/step , 6886.21 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:37:31 | Epoch: 1 | Step: 196260 | Dataset: 0-1019981 | Loss: 0.737 | 913 ms/step , 6889.82 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-05 23:37:40 | Epoch: 1 | Step: 196270 | Dataset: 0-1020301 | Loss: 0.841 | 913 ms/step , 6887.39 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-05 23:37:50 | Epoch: 1 | Step: 196280 | Dataset: 0-1020621 | Loss: 0.673 | 913 ms/step , 6887.29 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-05 23:37:59 | Epoch: 1 | Step: 196290 | Dataset: 0-1020941 | Loss: 0.734 | 914 ms/step , 6883.04 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-05 23:38:08 | Epoch: 1 | Step: 196300 | Dataset: 0-1021261 | Loss: 0.513 | 913 ms/step , 6890.51 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 23:38:09 | Validation | Step: 196300 | Val_loss: 0.670 | 
Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:38:19 | Epoch: 1 | Step: 196310 | Dataset: 0-1021581 | Loss: 0.768 | 913 ms/step , 6889.24 GFLOP/s , 15287.9 tokens/s INFO:__main__:2024-11-05 23:38:28 | Epoch: 1 | Step: 196320 | Dataset: 0-1021901 | Loss: 0.747 | 912 ms/step , 6892.94 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 23:38:37 | Epoch: 1 | Step: 196330 | Dataset: 0-1022221 | Loss: 0.811 | 912 ms/step , 6894.56 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-05 23:38:46 | Epoch: 1 | Step: 196340 | Dataset: 0-1022541 | Loss: 0.542 | 912 ms/step , 6893.81 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-05 23:38:55 | Epoch: 1 | Step: 196350 | Dataset: 0-1022861 | Loss: 0.829 | 914 ms/step , 6884.60 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-05 23:39:04 | Epoch: 1 | Step: 196360 | Dataset: 0-1023181 | Loss: 0.521 | 912 ms/step , 6893.33 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 23:39:13 | Epoch: 1 | Step: 196370 | Dataset: 0-1023501 | Loss: 0.779 | 913 ms/step , 6890.59 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-05 23:39:22 | Epoch: 1 | Step: 196380 | Dataset: 0-1023821 | Loss: 0.715 | 913 ms/step , 6887.33 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-05 23:39:32 | Epoch: 1 | Step: 196390 | Dataset: 0-1024141 | Loss: 0.762 | 913 ms/step , 6888.21 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 23:39:41 | Epoch: 1 | Step: 196400 | Dataset: 0-1024461 | Loss: 0.676 | 913 ms/step , 6890.87 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-05 23:39:42 | Validation | Step: 196400 | Val_loss: 0.626 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:39:51 | Epoch: 1 | Step: 196410 | Dataset: 0-1024781 | Loss: 0.765 | 914 ms/step , 6881.09 GFLOP/s , 15265.1 tokens/s INFO:__main__:2024-11-05 23:40:01 | Epoch: 1 | Step: 196420 | Dataset: 0-1025101 | Loss: 0.739 | 914 ms/step , 6878.93 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 23:40:10 | Epoch: 1 | Step: 196430 | Dataset: 0-1025421 | Loss: 0.765 | 913 ms/step , 6889.42 
GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 23:40:19 | Epoch: 1 | Step: 196440 | Dataset: 0-1025741 | Loss: 0.639 | 913 ms/step , 6886.21 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 23:40:28 | Epoch: 1 | Step: 196450 | Dataset: 0-1026061 | Loss: 0.704 | 914 ms/step , 6881.39 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-05 23:40:37 | Epoch: 1 | Step: 196460 | Dataset: 0-1026381 | Loss: 0.743 | 914 ms/step , 6880.27 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 23:40:46 | Epoch: 1 | Step: 196470 | Dataset: 0-1026701 | Loss: 0.640 | 913 ms/step , 6889.95 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-05 23:40:55 | Epoch: 1 | Step: 196480 | Dataset: 0-1027021 | Loss: 0.721 | 914 ms/step , 6881.79 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-05 23:41:05 | Epoch: 1 | Step: 196490 | Dataset: 0-1027341 | Loss: 0.782 | 913 ms/step , 6885.77 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 23:41:14 | Epoch: 1 | Step: 196500 | Dataset: 0-1027661 | Loss: 0.695 | 914 ms/step , 6882.29 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 23:41:15 | Validation | Step: 196500 | Val_loss: 0.618 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:41:24 | Epoch: 1 | Step: 196510 | Dataset: 0-1027981 | Loss: 0.710 | 913 ms/step , 6890.54 GFLOP/s , 15266.6 tokens/s INFO:__main__:2024-11-05 23:41:34 | Epoch: 1 | Step: 196520 | Dataset: 0-1028301 | Loss: 0.829 | 913 ms/step , 6885.91 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-05 23:41:43 | Epoch: 1 | Step: 196530 | Dataset: 0-1028621 | Loss: 0.677 | 913 ms/step , 6890.57 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 23:41:52 | Epoch: 1 | Step: 196540 | Dataset: 0-1028941 | Loss: 0.785 | 913 ms/step , 6887.23 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 23:42:01 | Epoch: 1 | Step: 196550 | Dataset: 0-1029261 | Loss: 0.730 | 913 ms/step , 6885.29 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 23:42:10 | Epoch: 1 | Step: 196560 | Dataset: 0-1029581 | Loss: 0.734 | 914 ms/step , 
6884.98 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-05 23:42:19 | Epoch: 1 | Step: 196570 | Dataset: 0-1029901 | Loss: 0.798 | 914 ms/step , 6880.33 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 23:42:28 | Epoch: 1 | Step: 196580 | Dataset: 0-1030221 | Loss: 0.632 | 912 ms/step , 6894.24 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-05 23:42:38 | Epoch: 1 | Step: 196590 | Dataset: 0-1030541 | Loss: 0.802 | 912 ms/step , 6892.76 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 23:42:47 | Epoch: 1 | Step: 196600 | Dataset: 0-1030861 | Loss: 0.732 | 913 ms/step , 6890.00 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 23:42:48 | Validation | Step: 196600 | Val_loss: 0.643 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:42:57 | Epoch: 1 | Step: 196610 | Dataset: 0-1031181 | Loss: 0.729 | 914 ms/step , 6884.16 GFLOP/s , 15283.4 tokens/s INFO:__main__:2024-11-05 23:43:07 | Epoch: 1 | Step: 196620 | Dataset: 0-1031501 | Loss: 0.800 | 912 ms/step , 6895.41 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-05 23:43:16 | Epoch: 1 | Step: 196630 | Dataset: 0-1031821 | Loss: 0.863 | 914 ms/step , 6882.30 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-05 23:43:25 | Epoch: 1 | Step: 196640 | Dataset: 0-1032141 | Loss: 0.854 | 915 ms/step , 6876.20 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 23:43:34 | Epoch: 1 | Step: 196650 | Dataset: 0-1032461 | Loss: 0.689 | 913 ms/step , 6886.91 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-05 23:43:43 | Epoch: 1 | Step: 196660 | Dataset: 0-1032781 | Loss: 0.739 | 914 ms/step , 6882.25 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-05 23:43:52 | Epoch: 1 | Step: 196670 | Dataset: 0-1033101 | Loss: 0.655 | 913 ms/step , 6892.53 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 23:44:01 | Epoch: 1 | Step: 196680 | Dataset: 0-1033421 | Loss: 0.728 | 913 ms/step , 6886.50 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-05 23:44:11 | Epoch: 1 | Step: 196690 | Dataset: 0-1033741 | Loss: 0.823 | 914 ms/step 
, 6878.96 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:44:20 | Epoch: 1 | Step: 196700 | Dataset: 0-1034061 | Loss: 0.728 | 913 ms/step , 6887.10 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 23:44:21 | Validation | Step: 196700 | Val_loss: 0.647 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:44:30 | Epoch: 1 | Step: 196710 | Dataset: 0-1034381 | Loss: 0.768 | 913 ms/step , 6890.76 GFLOP/s , 15278.7 tokens/s INFO:__main__:2024-11-05 23:44:40 | Epoch: 1 | Step: 196720 | Dataset: 0-1034701 | Loss: 0.885 | 913 ms/step , 6887.14 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-05 23:44:49 | Epoch: 1 | Step: 196730 | Dataset: 0-1035021 | Loss: 0.712 | 913 ms/step , 6891.28 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-05 23:44:58 | Epoch: 1 | Step: 196740 | Dataset: 0-1035341 | Loss: 0.813 | 914 ms/step , 6881.69 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 23:45:07 | Epoch: 1 | Step: 196750 | Dataset: 0-1035661 | Loss: 0.679 | 913 ms/step , 6887.70 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 23:45:16 | Epoch: 1 | Step: 196760 | Dataset: 0-1035981 | Loss: 0.659 | 913 ms/step , 6889.01 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-05 23:45:25 | Epoch: 1 | Step: 196770 | Dataset: 0-1036301 | Loss: 0.638 | 913 ms/step , 6892.18 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-05 23:45:34 | Epoch: 1 | Step: 196780 | Dataset: 0-1036621 | Loss: 0.724 | 914 ms/step , 6882.65 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-05 23:45:43 | Epoch: 1 | Step: 196790 | Dataset: 0-1036941 | Loss: 0.649 | 913 ms/step , 6886.49 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-05 23:45:53 | Epoch: 1 | Step: 196800 | Dataset: 0-1037261 | Loss: 0.649 | 912 ms/step , 6892.71 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 23:45:54 | Validation | Step: 196800 | Val_loss: 0.645 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:46:03 | Epoch: 1 | Step: 196810 | Dataset: 0-1037581 | Loss: 0.842 | 914 ms/step , 6879.24 GFLOP/s , 15280.2 tokens/s 
INFO:__main__:2024-11-05 23:46:12 | Epoch: 1 | Step: 196820 | Dataset: 0-1037901 | Loss: 0.736 | 913 ms/step , 6889.84 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-05 23:46:22 | Epoch: 1 | Step: 196830 | Dataset: 0-1038221 | Loss: 0.795 | 914 ms/step , 6884.61 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 23:46:31 | Epoch: 1 | Step: 196840 | Dataset: 0-1038541 | Loss: 0.547 | 913 ms/step , 6892.10 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 23:46:40 | Epoch: 1 | Step: 196850 | Dataset: 0-1038861 | Loss: 0.760 | 915 ms/step , 6877.43 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-05 23:46:49 | Epoch: 1 | Step: 196860 | Dataset: 0-1039181 | Loss: 0.593 | 912 ms/step , 6894.05 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-05 23:46:58 | Epoch: 1 | Step: 196870 | Dataset: 0-1039501 | Loss: 0.778 | 914 ms/step , 6878.99 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:47:07 | Epoch: 1 | Step: 196880 | Dataset: 0-1039821 | Loss: 0.714 | 913 ms/step , 6885.82 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 23:47:16 | Epoch: 1 | Step: 196890 | Dataset: 0-1040141 | Loss: 0.680 | 912 ms/step , 6896.40 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-05 23:47:26 | Epoch: 1 | Step: 196900 | Dataset: 0-1040461 | Loss: 0.713 | 914 ms/step , 6881.00 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 23:47:27 | Validation | Step: 196900 | Val_loss: 0.642 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:47:36 | Epoch: 1 | Step: 196910 | Dataset: 0-1040781 | Loss: 0.671 | 913 ms/step , 6891.01 GFLOP/s , 15274.7 tokens/s INFO:__main__:2024-11-05 23:47:45 | Epoch: 1 | Step: 196920 | Dataset: 0-1041101 | Loss: 0.882 | 914 ms/step , 6878.38 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-05 23:47:55 | Epoch: 1 | Step: 196930 | Dataset: 0-1041421 | Loss: 0.771 | 914 ms/step , 6880.68 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-05 23:48:04 | Epoch: 1 | Step: 196940 | Dataset: 0-1041741 | Loss: 0.798 | 914 ms/step , 6884.19 GFLOP/s , 17923.6 
tokens/s INFO:__main__:2024-11-05 23:48:13 | Epoch: 1 | Step: 196950 | Dataset: 0-1042061 | Loss: 0.635 | 913 ms/step , 6891.32 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-05 23:48:22 | Epoch: 1 | Step: 196960 | Dataset: 0-1042381 | Loss: 0.725 | 914 ms/step , 6883.82 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-05 23:48:31 | Epoch: 1 | Step: 196970 | Dataset: 0-1042701 | Loss: 0.871 | 915 ms/step , 6875.75 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-05 23:48:40 | Epoch: 1 | Step: 196980 | Dataset: 0-1043021 | Loss: 0.690 | 914 ms/step , 6879.73 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-05 23:48:49 | Epoch: 1 | Step: 196990 | Dataset: 0-1043341 | Loss: 0.790 | 916 ms/step , 6867.11 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-05 23:48:59 | Epoch: 1 | Step: 197000 | Dataset: 0-1043661 | Loss: 0.692 | 913 ms/step , 6886.59 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-05 23:49:00 | Validation | Step: 197000 | Val_loss: 0.631 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:49:00 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241105_234900_step_197000.pt` INFO:__main__:2024-11-05 23:49:10 | Epoch: 1 | Step: 197010 | Dataset: 0-1043981 | Loss: 0.758 | 915 ms/step , 6875.56 GFLOP/s , 13770.2 tokens/s INFO:__main__:2024-11-05 23:49:20 | Epoch: 1 | Step: 197020 | Dataset: 0-1044301 | Loss: 0.734 | 912 ms/step , 6895.95 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 23:49:29 | Epoch: 1 | Step: 197030 | Dataset: 0-1044621 | Loss: 0.722 | 914 ms/step , 6885.04 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-05 23:49:38 | Epoch: 1 | Step: 197040 | Dataset: 0-1044941 | Loss: 0.662 | 912 ms/step , 6896.64 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-05 23:49:47 | Epoch: 1 | Step: 197050 | Dataset: 0-1045261 | Loss: 0.860 | 915 ms/step , 6876.88 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-05 23:49:56 | Epoch: 1 | Step: 197060 | Dataset: 0-1045581 | Loss: 0.626 | 914 ms/step , 6883.04 GFLOP/s , 17929.8 
tokens/s INFO:__main__:2024-11-05 23:50:05 | Epoch: 1 | Step: 197070 | Dataset: 0-1045901 | Loss: 0.788 | 913 ms/step , 6888.56 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-05 23:50:14 | Epoch: 1 | Step: 197080 | Dataset: 0-1046221 | Loss: 0.701 | 914 ms/step , 6878.72 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-05 23:50:24 | Epoch: 1 | Step: 197090 | Dataset: 0-1046541 | Loss: 0.865 | 914 ms/step , 6883.01 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 23:50:33 | Epoch: 1 | Step: 197100 | Dataset: 0-1046861 | Loss: 0.690 | 914 ms/step , 6881.65 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 23:50:34 | Validation | Step: 197100 | Val_loss: 0.644 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:50:43 | Epoch: 1 | Step: 197110 | Dataset: 0-1047181 | Loss: 0.714 | 913 ms/step , 6886.14 GFLOP/s , 15279.2 tokens/s INFO:__main__:2024-11-05 23:50:53 | Epoch: 1 | Step: 197120 | Dataset: 0-1047501 | Loss: 0.820 | 913 ms/step , 6886.15 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-05 23:51:02 | Epoch: 1 | Step: 197130 | Dataset: 0-1047821 | Loss: 0.494 | 912 ms/step , 6894.78 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-05 23:51:11 | Epoch: 1 | Step: 197140 | Dataset: 0-1048141 | Loss: 0.780 | 914 ms/step , 6879.82 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-05 23:51:20 | Epoch: 1 | Step: 197150 | Dataset: 0-1048461 | Loss: 0.620 | 912 ms/step , 6898.19 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 23:51:29 | Epoch: 1 | Step: 197160 | Dataset: 0-1048781 | Loss: 0.772 | 913 ms/step , 6892.23 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 23:51:38 | Epoch: 1 | Step: 197170 | Dataset: 0-1049101 | Loss: 0.546 | 913 ms/step , 6890.39 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-05 23:51:47 | Epoch: 1 | Step: 197180 | Dataset: 0-1049421 | Loss: 0.736 | 913 ms/step , 6891.39 GFLOP/s , 17943.2 tokens/s INFO:__main__:2024-11-05 23:51:56 | Epoch: 1 | Step: 197190 | Dataset: 0-1049741 | Loss: 0.801 | 914 ms/step , 6881.41 GFLOP/s , 
17936.2 tokens/s INFO:__main__:2024-11-05 23:52:06 | Epoch: 1 | Step: 197200 | Dataset: 0-1050061 | Loss: 0.726 | 913 ms/step , 6887.37 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-05 23:52:07 | Validation | Step: 197200 | Val_loss: 0.635 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:52:16 | Epoch: 1 | Step: 197210 | Dataset: 0-1050381 | Loss: 0.648 | 914 ms/step , 6883.17 GFLOP/s , 15279.0 tokens/s INFO:__main__:2024-11-05 23:52:25 | Epoch: 1 | Step: 197220 | Dataset: 0-1050701 | Loss: 0.760 | 915 ms/step , 6876.48 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-05 23:52:35 | Epoch: 1 | Step: 197230 | Dataset: 0-1051021 | Loss: 0.784 | 914 ms/step , 6882.39 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-05 23:52:44 | Epoch: 1 | Step: 197240 | Dataset: 0-1051341 | Loss: 0.648 | 913 ms/step , 6892.06 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 23:52:53 | Epoch: 1 | Step: 197250 | Dataset: 0-1051661 | Loss: 0.608 | 913 ms/step , 6887.94 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-05 23:53:02 | Epoch: 1 | Step: 197260 | Dataset: 0-1051981 | Loss: 0.771 | 912 ms/step , 6894.51 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-05 23:53:11 | Epoch: 1 | Step: 197270 | Dataset: 0-1052301 | Loss: 0.776 | 914 ms/step , 6882.73 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-05 23:53:20 | Epoch: 1 | Step: 197280 | Dataset: 0-1052621 | Loss: 0.648 | 913 ms/step , 6889.80 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-05 23:53:29 | Epoch: 1 | Step: 197290 | Dataset: 0-1052941 | Loss: 0.844 | 915 ms/step , 6874.74 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-05 23:53:39 | Epoch: 1 | Step: 197300 | Dataset: 0-1053261 | Loss: 0.760 | 914 ms/step , 6884.88 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-05 23:53:40 | Validation | Step: 197300 | Val_loss: 0.634 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:53:49 | Epoch: 1 | Step: 197310 | Dataset: 0-1053581 | Loss: 0.740 | 913 ms/step , 6891.82 GFLOP/s , 15289.7 tokens/s 
INFO:__main__:2024-11-05 23:53:58 | Epoch: 1 | Step: 197320 | Dataset: 0-1053901 | Loss: 0.747 | 912 ms/step , 6892.60 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-05 23:54:08 | Epoch: 1 | Step: 197330 | Dataset: 0-1054221 | Loss: 0.675 | 913 ms/step , 6885.59 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-05 23:54:17 | Epoch: 1 | Step: 197340 | Dataset: 0-1054541 | Loss: 0.577 | 912 ms/step , 6898.43 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-05 23:54:26 | Epoch: 1 | Step: 197350 | Dataset: 0-1054861 | Loss: 0.546 | 912 ms/step , 6895.34 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-05 23:54:35 | Epoch: 1 | Step: 197360 | Dataset: 0-1055181 | Loss: 0.692 | 913 ms/step , 6889.89 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-05 23:54:44 | Epoch: 1 | Step: 197370 | Dataset: 0-1055501 | Loss: 0.812 | 913 ms/step , 6889.11 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-05 23:54:53 | Epoch: 1 | Step: 197380 | Dataset: 0-1055821 | Loss: 0.525 | 914 ms/step , 6884.19 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-05 23:55:02 | Epoch: 1 | Step: 197390 | Dataset: 0-1056141 | Loss: 0.668 | 913 ms/step , 6888.11 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-05 23:55:12 | Epoch: 1 | Step: 197400 | Dataset: 0-1056461 | Loss: 0.690 | 913 ms/step , 6890.78 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 23:55:13 | Validation | Step: 197400 | Val_loss: 0.632 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:55:22 | Epoch: 1 | Step: 197410 | Dataset: 0-1056781 | Loss: 0.687 | 913 ms/step , 6890.47 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-05 23:55:31 | Epoch: 1 | Step: 197420 | Dataset: 0-1057101 | Loss: 0.745 | 913 ms/step , 6890.50 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-05 23:55:41 | Epoch: 1 | Step: 197430 | Dataset: 0-1057421 | Loss: 0.729 | 913 ms/step , 6886.98 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-05 23:55:50 | Epoch: 1 | Step: 197440 | Dataset: 0-1057741 | Loss: 0.785 | 915 ms/step , 6874.85 GFLOP/s , 17933.8 
tokens/s INFO:__main__:2024-11-05 23:55:59 | Epoch: 1 | Step: 197450 | Dataset: 0-1058061 | Loss: 0.677 | 913 ms/step , 6886.66 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-05 23:56:08 | Epoch: 1 | Step: 197460 | Dataset: 0-1058381 | Loss: 0.795 | 912 ms/step , 6897.21 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-05 23:56:17 | Epoch: 1 | Step: 197470 | Dataset: 0-1058701 | Loss: 0.749 | 914 ms/step , 6884.60 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 23:56:26 | Epoch: 1 | Step: 197480 | Dataset: 0-1059021 | Loss: 0.762 | 911 ms/step , 6900.36 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 23:56:35 | Epoch: 1 | Step: 197490 | Dataset: 0-1059341 | Loss: 0.526 | 912 ms/step , 6893.58 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-05 23:56:44 | Epoch: 1 | Step: 197500 | Dataset: 0-1059661 | Loss: 0.765 | 912 ms/step , 6893.41 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-05 23:56:46 | Validation | Step: 197500 | Val_loss: 0.609 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:56:55 | Epoch: 1 | Step: 197510 | Dataset: 0-1059981 | Loss: 0.889 | 913 ms/step , 6890.16 GFLOP/s , 15287.1 tokens/s INFO:__main__:2024-11-05 23:57:04 | Epoch: 1 | Step: 197520 | Dataset: 0-1060301 | Loss: 0.667 | 914 ms/step , 6883.69 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-05 23:57:13 | Epoch: 1 | Step: 197530 | Dataset: 0-1060621 | Loss: 0.874 | 913 ms/step , 6890.98 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-05 23:57:23 | Epoch: 1 | Step: 197540 | Dataset: 0-1060941 | Loss: 0.801 | 912 ms/step , 6896.44 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-05 23:57:32 | Epoch: 1 | Step: 197550 | Dataset: 0-1061261 | Loss: 0.737 | 914 ms/step , 6883.40 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-05 23:57:41 | Epoch: 1 | Step: 197560 | Dataset: 0-1061581 | Loss: 0.735 | 913 ms/step , 6892.23 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-05 23:57:50 | Epoch: 1 | Step: 197570 | Dataset: 0-1061901 | Loss: 0.760 | 913 ms/step , 6885.36 GFLOP/s , 
17932.6 tokens/s INFO:__main__:2024-11-05 23:57:59 | Epoch: 1 | Step: 197580 | Dataset: 0-1062221 | Loss: 0.582 | 912 ms/step , 6897.21 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-05 23:58:08 | Epoch: 1 | Step: 197590 | Dataset: 0-1062541 | Loss: 0.832 | 913 ms/step , 6890.78 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-05 23:58:17 | Epoch: 1 | Step: 197600 | Dataset: 0-1062861 | Loss: 0.654 | 913 ms/step , 6890.03 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-05 23:58:19 | Validation | Step: 197600 | Val_loss: 0.666 | Best_val_loss: 0.3249 INFO:__main__:2024-11-05 23:58:28 | Epoch: 1 | Step: 197610 | Dataset: 0-1063181 | Loss: 0.618 | 913 ms/step , 6888.52 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-05 23:58:37 | Epoch: 1 | Step: 197620 | Dataset: 0-1063501 | Loss: 0.575 | 912 ms/step , 6899.08 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-05 23:58:46 | Epoch: 1 | Step: 197630 | Dataset: 0-1063821 | Loss: 0.753 | 914 ms/step , 6879.95 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-05 23:58:56 | Epoch: 1 | Step: 197640 | Dataset: 0-1064141 | Loss: 0.757 | 915 ms/step , 6877.44 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-05 23:59:05 | Epoch: 1 | Step: 197650 | Dataset: 0-1064461 | Loss: 0.760 | 913 ms/step , 6887.64 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-05 23:59:14 | Epoch: 1 | Step: 197660 | Dataset: 0-1064781 | Loss: 0.699 | 913 ms/step , 6886.73 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-05 23:59:23 | Epoch: 1 | Step: 197670 | Dataset: 0-1065101 | Loss: 0.773 | 915 ms/step , 6877.33 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-05 23:59:32 | Epoch: 1 | Step: 197680 | Dataset: 0-1065421 | Loss: 0.703 | 914 ms/step , 6881.99 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-05 23:59:41 | Epoch: 1 | Step: 197690 | Dataset: 0-1065741 | Loss: 0.767 | 914 ms/step , 6878.83 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-05 23:59:50 | Epoch: 1 | Step: 197700 | Dataset: 0-1066061 | Loss: 0.709 | 911 ms/step , 6900.70 GFLOP/s 
, 17946.1 tokens/s INFO:__main__:2024-11-05 23:59:52 | Validation | Step: 197700 | Val_loss: 0.667 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:00:01 | Epoch: 1 | Step: 197710 | Dataset: 0-1066381 | Loss: 0.782 | 915 ms/step , 6874.19 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-06 00:00:10 | Epoch: 1 | Step: 197720 | Dataset: 0-1066701 | Loss: 0.754 | 914 ms/step , 6884.18 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-06 00:00:19 | Epoch: 1 | Step: 197730 | Dataset: 0-1067021 | Loss: 0.783 | 912 ms/step , 6893.53 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 00:00:28 | Epoch: 1 | Step: 197740 | Dataset: 0-1067341 | Loss: 0.741 | 914 ms/step , 6884.08 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-06 00:00:38 | Epoch: 1 | Step: 197750 | Dataset: 0-1067661 | Loss: 0.767 | 914 ms/step , 6879.99 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 00:00:47 | Epoch: 1 | Step: 197760 | Dataset: 0-1067981 | Loss: 0.751 | 914 ms/step , 6884.32 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-06 00:00:56 | Epoch: 1 | Step: 197770 | Dataset: 0-1068301 | Loss: 0.766 | 914 ms/step , 6878.20 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-06 00:01:05 | Epoch: 1 | Step: 197780 | Dataset: 0-1068621 | Loss: 0.745 | 914 ms/step , 6878.72 GFLOP/s , 17910.9 tokens/s INFO:__main__:2024-11-06 00:01:14 | Epoch: 1 | Step: 197790 | Dataset: 0-1068941 | Loss: 0.701 | 913 ms/step , 6885.71 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-06 00:01:23 | Epoch: 1 | Step: 197800 | Dataset: 0-1069261 | Loss: 0.710 | 914 ms/step , 6880.47 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-06 00:01:25 | Validation | Step: 197800 | Val_loss: 0.638 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:01:34 | Epoch: 1 | Step: 197810 | Dataset: 0-1069581 | Loss: 0.759 | 913 ms/step , 6888.18 GFLOP/s , 15266.6 tokens/s INFO:__main__:2024-11-06 00:01:43 | Epoch: 1 | Step: 197820 | Dataset: 0-1069901 | Loss: 0.798 | 916 ms/step , 6867.00 GFLOP/s , 17908.9 tokens/s 
INFO:__main__:2024-11-06 00:01:52 | Epoch: 1 | Step: 197830 | Dataset: 0-1070221 | Loss: 0.739 | 914 ms/step , 6881.58 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-06 00:02:02 | Epoch: 1 | Step: 197840 | Dataset: 0-1070541 | Loss: 0.769 | 914 ms/step , 6879.80 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-06 00:02:11 | Epoch: 1 | Step: 197850 | Dataset: 0-1070861 | Loss: 0.761 | 915 ms/step , 6870.36 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-06 00:02:20 | Epoch: 1 | Step: 197860 | Dataset: 0-1071181 | Loss: 0.684 | 915 ms/step , 6877.48 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 00:02:29 | Epoch: 1 | Step: 197870 | Dataset: 0-1071501 | Loss: 0.720 | 914 ms/step , 6884.18 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-06 00:02:38 | Epoch: 1 | Step: 197880 | Dataset: 0-1071821 | Loss: 0.784 | 914 ms/step , 6877.68 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-06 00:02:47 | Epoch: 1 | Step: 197890 | Dataset: 0-1072141 | Loss: 0.774 | 914 ms/step , 6879.91 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-06 00:02:56 | Epoch: 1 | Step: 197900 | Dataset: 0-1072461 | Loss: 0.676 | 915 ms/step , 6874.40 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 00:02:58 | Validation | Step: 197900 | Val_loss: 0.667 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:03:07 | Epoch: 1 | Step: 197910 | Dataset: 0-1072781 | Loss: 0.827 | 914 ms/step , 6880.40 GFLOP/s , 15279.1 tokens/s INFO:__main__:2024-11-06 00:03:16 | Epoch: 1 | Step: 197920 | Dataset: 0-1073101 | Loss: 0.697 | 913 ms/step , 6890.07 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 00:03:25 | Epoch: 1 | Step: 197930 | Dataset: 0-1073421 | Loss: 0.746 | 913 ms/step , 6890.46 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 00:03:35 | Epoch: 1 | Step: 197940 | Dataset: 0-1073741 | Loss: 0.743 | 913 ms/step , 6890.58 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 00:03:44 | Epoch: 1 | Step: 197950 | Dataset: 0-1074061 | Loss: 0.730 | 913 ms/step , 6892.09 GFLOP/s , 17929.0 
tokens/s INFO:__main__:2024-11-06 00:03:53 | Epoch: 1 | Step: 197960 | Dataset: 0-1074381 | Loss: 0.716 | 914 ms/step , 6883.17 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-06 00:04:02 | Epoch: 1 | Step: 197970 | Dataset: 0-1074701 | Loss: 0.716 | 913 ms/step , 6885.69 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 00:04:11 | Epoch: 1 | Step: 197980 | Dataset: 0-1075021 | Loss: 0.708 | 913 ms/step , 6890.02 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 00:04:20 | Epoch: 1 | Step: 197990 | Dataset: 0-1075341 | Loss: 0.705 | 915 ms/step , 6875.60 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-06 00:04:29 | Epoch: 1 | Step: 198000 | Dataset: 0-1075661 | Loss: 0.739 | 915 ms/step , 6876.17 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-06 00:04:31 | Validation | Step: 198000 | Val_loss: 0.638 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:04:31 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_000431_step_198000.pt` INFO:__main__:2024-11-06 00:04:41 | Epoch: 1 | Step: 198010 | Dataset: 0-1075981 | Loss: 0.758 | 914 ms/step , 6880.80 GFLOP/s , 13751.5 tokens/s INFO:__main__:2024-11-06 00:04:50 | Epoch: 1 | Step: 198020 | Dataset: 0-1076301 | Loss: 0.659 | 915 ms/step , 6872.11 GFLOP/s , 17903.3 tokens/s INFO:__main__:2024-11-06 00:05:00 | Epoch: 1 | Step: 198030 | Dataset: 0-1076621 | Loss: 0.738 | 915 ms/step , 6871.73 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-06 00:05:09 | Epoch: 1 | Step: 198040 | Dataset: 0-1076941 | Loss: 0.782 | 913 ms/step , 6885.72 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 00:05:18 | Epoch: 1 | Step: 198050 | Dataset: 0-1077261 | Loss: 0.783 | 916 ms/step , 6869.30 GFLOP/s , 17907.5 tokens/s INFO:__main__:2024-11-06 00:05:27 | Epoch: 1 | Step: 198060 | Dataset: 0-1077581 | Loss: 0.726 | 913 ms/step , 6887.34 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 00:05:36 | Epoch: 1 | Step: 198070 | Dataset: 0-1077901 | Loss: 0.755 | 914 ms/step , 6883.50 GFLOP/s , 17921.6 
tokens/s INFO:__main__:2024-11-06 00:05:45 | Epoch: 1 | Step: 198080 | Dataset: 0-1078221 | Loss: 0.686 | 914 ms/step , 6879.22 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 00:05:54 | Epoch: 1 | Step: 198090 | Dataset: 0-1078541 | Loss: 0.758 | 912 ms/step , 6893.76 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-06 00:06:04 | Epoch: 1 | Step: 198100 | Dataset: 0-1078861 | Loss: 0.759 | 914 ms/step , 6880.35 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 00:06:05 | Validation | Step: 198100 | Val_loss: 0.657 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:06:14 | Epoch: 1 | Step: 198110 | Dataset: 0-1079181 | Loss: 0.759 | 914 ms/step , 6878.95 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-06 00:06:23 | Epoch: 1 | Step: 198120 | Dataset: 0-1079501 | Loss: 0.774 | 914 ms/step , 6879.22 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 00:06:33 | Epoch: 1 | Step: 198130 | Dataset: 0-1079821 | Loss: 0.656 | 914 ms/step , 6884.27 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 00:06:42 | Epoch: 1 | Step: 198140 | Dataset: 0-1080141 | Loss: 0.778 | 912 ms/step , 6893.75 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-06 00:06:51 | Epoch: 1 | Step: 198150 | Dataset: 0-1080461 | Loss: 0.743 | 914 ms/step , 6881.81 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-06 00:07:00 | Epoch: 1 | Step: 198160 | Dataset: 0-1080781 | Loss: 0.639 | 913 ms/step , 6885.98 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-06 00:07:09 | Epoch: 1 | Step: 198170 | Dataset: 0-1081101 | Loss: 0.721 | 914 ms/step , 6880.86 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 00:07:18 | Epoch: 1 | Step: 198180 | Dataset: 0-1081421 | Loss: 0.743 | 913 ms/step , 6887.68 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-06 00:07:27 | Epoch: 1 | Step: 198190 | Dataset: 0-1081741 | Loss: 0.718 | 913 ms/step , 6886.73 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-06 00:07:37 | Epoch: 1 | Step: 198200 | Dataset: 0-1082061 | Loss: 0.740 | 914 ms/step , 6878.68 GFLOP/s , 
17917.4 tokens/s INFO:__main__:2024-11-06 00:07:38 | Validation | Step: 198200 | Val_loss: 0.637 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:07:47 | Epoch: 1 | Step: 198210 | Dataset: 0-1082381 | Loss: 0.748 | 914 ms/step , 6879.57 GFLOP/s , 15284.3 tokens/s INFO:__main__:2024-11-06 00:07:56 | Epoch: 1 | Step: 198220 | Dataset: 0-1082701 | Loss: 0.640 | 913 ms/step , 6889.03 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 00:08:06 | Epoch: 1 | Step: 198230 | Dataset: 0-1083021 | Loss: 0.783 | 914 ms/step , 6883.93 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 00:08:15 | Epoch: 1 | Step: 198240 | Dataset: 0-1083341 | Loss: 0.770 | 913 ms/step , 6886.92 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 00:08:24 | Epoch: 1 | Step: 198250 | Dataset: 0-1083661 | Loss: 0.753 | 914 ms/step , 6880.51 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-06 00:08:33 | Epoch: 1 | Step: 198260 | Dataset: 0-1083981 | Loss: 0.666 | 915 ms/step , 6872.93 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-06 00:08:42 | Epoch: 1 | Step: 198270 | Dataset: 0-1084301 | Loss: 0.680 | 913 ms/step , 6885.89 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 00:08:51 | Epoch: 1 | Step: 198280 | Dataset: 0-1084621 | Loss: 0.723 | 914 ms/step , 6879.40 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 00:09:00 | Epoch: 1 | Step: 198290 | Dataset: 0-1084941 | Loss: 0.736 | 914 ms/step , 6878.39 GFLOP/s , 17910.1 tokens/s INFO:__main__:2024-11-06 00:09:10 | Epoch: 1 | Step: 198300 | Dataset: 0-1085261 | Loss: 0.729 | 915 ms/step , 6872.74 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 00:09:11 | Validation | Step: 198300 | Val_loss: 0.640 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:09:20 | Epoch: 1 | Step: 198310 | Dataset: 0-1085581 | Loss: 0.667 | 913 ms/step , 6885.34 GFLOP/s , 15270.9 tokens/s INFO:__main__:2024-11-06 00:09:29 | Epoch: 1 | Step: 198320 | Dataset: 0-1085901 | Loss: 0.770 | 913 ms/step , 6888.46 GFLOP/s , 17915.0 tokens/s 
INFO:__main__:2024-11-06 00:09:39 | Epoch: 1 | Step: 198330 | Dataset: 0-1086221 | Loss: 0.733 | 912 ms/step , 6894.37 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 00:09:48 | Epoch: 1 | Step: 198340 | Dataset: 0-1086541 | Loss: 0.800 | 914 ms/step , 6881.22 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 00:09:57 | Epoch: 1 | Step: 198350 | Dataset: 0-1086861 | Loss: 0.683 | 913 ms/step , 6886.06 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 00:10:06 | Epoch: 1 | Step: 198360 | Dataset: 0-1087181 | Loss: 0.727 | 914 ms/step , 6880.19 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-06 00:10:15 | Epoch: 1 | Step: 198370 | Dataset: 0-1087501 | Loss: 0.745 | 914 ms/step , 6882.02 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 00:10:24 | Epoch: 1 | Step: 198380 | Dataset: 0-1087821 | Loss: 0.700 | 913 ms/step , 6886.75 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 00:10:33 | Epoch: 1 | Step: 198390 | Dataset: 0-1088141 | Loss: 0.689 | 914 ms/step , 6882.42 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 00:10:43 | Epoch: 1 | Step: 198400 | Dataset: 0-1088461 | Loss: 0.818 | 914 ms/step , 6882.69 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 00:10:44 | Validation | Step: 198400 | Val_loss: 0.637 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:10:53 | Epoch: 1 | Step: 198410 | Dataset: 0-1088781 | Loss: 0.674 | 914 ms/step , 6880.53 GFLOP/s , 15267.4 tokens/s INFO:__main__:2024-11-06 00:11:03 | Epoch: 1 | Step: 198420 | Dataset: 0-1089101 | Loss: 0.808 | 915 ms/step , 6877.17 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-06 00:11:12 | Epoch: 1 | Step: 198430 | Dataset: 0-1089421 | Loss: 0.798 | 915 ms/step , 6877.03 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-06 00:11:21 | Epoch: 1 | Step: 198440 | Dataset: 0-1089741 | Loss: 0.740 | 915 ms/step , 6876.48 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 00:11:30 | Epoch: 1 | Step: 198450 | Dataset: 0-1090061 | Loss: 0.689 | 914 ms/step , 6884.83 GFLOP/s , 17925.4 
tokens/s INFO:__main__:2024-11-06 00:11:39 | Epoch: 1 | Step: 198460 | Dataset: 0-1090381 | Loss: 0.812 | 915 ms/step , 6873.16 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 00:11:48 | Epoch: 1 | Step: 198470 | Dataset: 0-1090701 | Loss: 0.698 | 913 ms/step , 6886.53 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 00:11:57 | Epoch: 1 | Step: 198480 | Dataset: 0-1091021 | Loss: 0.763 | 914 ms/step , 6881.57 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 00:12:07 | Epoch: 1 | Step: 198490 | Dataset: 0-1091341 | Loss: 0.787 | 914 ms/step , 6879.57 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 00:12:16 | Epoch: 1 | Step: 198500 | Dataset: 0-1091661 | Loss: 0.793 | 915 ms/step , 6872.55 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 00:12:17 | Validation | Step: 198500 | Val_loss: 0.625 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:12:26 | Epoch: 1 | Step: 198510 | Dataset: 0-1091981 | Loss: 0.709 | 912 ms/step , 6898.54 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-06 00:12:36 | Epoch: 1 | Step: 198520 | Dataset: 0-1092301 | Loss: 0.771 | 914 ms/step , 6877.75 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 00:12:45 | Epoch: 1 | Step: 198530 | Dataset: 0-1092621 | Loss: 0.743 | 915 ms/step , 6870.98 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-06 00:12:54 | Epoch: 1 | Step: 198540 | Dataset: 0-1092941 | Loss: 0.767 | 913 ms/step , 6888.43 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 00:13:03 | Epoch: 1 | Step: 198550 | Dataset: 0-1093261 | Loss: 0.760 | 913 ms/step , 6887.88 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 00:13:12 | Epoch: 1 | Step: 198560 | Dataset: 0-1093581 | Loss: 0.769 | 913 ms/step , 6887.77 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-06 00:13:21 | Epoch: 1 | Step: 198570 | Dataset: 0-1093901 | Loss: 0.719 | 914 ms/step , 6878.16 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 00:13:30 | Epoch: 1 | Step: 198580 | Dataset: 0-1094221 | Loss: 0.724 | 914 ms/step , 6882.88 GFLOP/s , 
17919.6 tokens/s INFO:__main__:2024-11-06 00:13:40 | Epoch: 1 | Step: 198590 | Dataset: 0-1094541 | Loss: 0.756 | 916 ms/step , 6863.34 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-06 00:13:49 | Epoch: 1 | Step: 198600 | Dataset: 0-1094861 | Loss: 0.690 | 913 ms/step , 6888.37 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-06 00:13:50 | Validation | Step: 198600 | Val_loss: 0.655 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:13:59 | Epoch: 1 | Step: 198610 | Dataset: 0-1095181 | Loss: 0.803 | 914 ms/step , 6880.29 GFLOP/s , 15265.0 tokens/s INFO:__main__:2024-11-06 00:14:09 | Epoch: 1 | Step: 198620 | Dataset: 0-1095501 | Loss: 0.716 | 913 ms/step , 6888.06 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-06 00:14:18 | Epoch: 1 | Step: 198630 | Dataset: 0-1095821 | Loss: 0.717 | 914 ms/step , 6881.63 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-06 00:14:27 | Epoch: 1 | Step: 198640 | Dataset: 0-1096141 | Loss: 0.731 | 915 ms/step , 6874.56 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 00:14:36 | Epoch: 1 | Step: 198650 | Dataset: 0-1096461 | Loss: 0.770 | 915 ms/step , 6875.87 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-06 00:14:45 | Epoch: 1 | Step: 198660 | Dataset: 0-1096781 | Loss: 0.699 | 913 ms/step , 6885.80 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 00:14:54 | Epoch: 1 | Step: 198670 | Dataset: 0-1097101 | Loss: 0.695 | 914 ms/step , 6880.55 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 00:15:03 | Epoch: 1 | Step: 198680 | Dataset: 0-1097421 | Loss: 0.694 | 914 ms/step , 6882.63 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-06 00:15:13 | Epoch: 1 | Step: 198690 | Dataset: 0-1097741 | Loss: 0.808 | 914 ms/step , 6881.83 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-06 00:15:22 | Epoch: 1 | Step: 198700 | Dataset: 0-1098061 | Loss: 0.731 | 915 ms/step , 6877.10 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 00:15:23 | Validation | Step: 198700 | Val_loss: 0.683 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 00:15:32 | Epoch: 1 | Step: 198710 | Dataset: 0-1098381 | Loss: 0.794 | 913 ms/step , 6891.13 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-06 00:15:42 | Epoch: 1 | Step: 198720 | Dataset: 0-1098701 | Loss: 0.686 | 914 ms/step , 6883.96 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 00:15:51 | Epoch: 1 | Step: 198730 | Dataset: 0-1099021 | Loss: 0.771 | 914 ms/step , 6882.07 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 00:16:00 | Epoch: 1 | Step: 198740 | Dataset: 0-1099341 | Loss: 0.685 | 915 ms/step , 6873.42 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 00:16:09 | Epoch: 1 | Step: 198750 | Dataset: 0-1099661 | Loss: 0.762 | 915 ms/step , 6874.94 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 00:16:18 | Epoch: 1 | Step: 198760 | Dataset: 0-1099981 | Loss: 0.739 | 913 ms/step , 6889.38 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 00:16:27 | Epoch: 1 | Step: 198770 | Dataset: 0-1100301 | Loss: 0.671 | 913 ms/step , 6889.46 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-06 00:16:36 | Epoch: 1 | Step: 198780 | Dataset: 0-1100621 | Loss: 0.804 | 915 ms/step , 6873.68 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 00:16:46 | Epoch: 1 | Step: 198790 | Dataset: 0-1100941 | Loss: 0.717 | 913 ms/step , 6887.97 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-06 00:16:55 | Epoch: 1 | Step: 198800 | Dataset: 0-1101261 | Loss: 0.771 | 915 ms/step , 6876.18 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-06 00:16:56 | Validation | Step: 198800 | Val_loss: 0.652 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:17:05 | Epoch: 1 | Step: 198810 | Dataset: 0-1101581 | Loss: 0.762 | 917 ms/step , 6861.32 GFLOP/s , 15269.2 tokens/s INFO:__main__:2024-11-06 00:17:15 | Epoch: 1 | Step: 198820 | Dataset: 0-1101901 | Loss: 0.724 | 914 ms/step , 6879.92 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 00:17:24 | Epoch: 1 | Step: 198830 | Dataset: 0-1102221 | Loss: 0.769 | 915 ms/step , 6876.90 GFLOP/s , 17916.5 
tokens/s INFO:__main__:2024-11-06 00:17:33 | Epoch: 1 | Step: 198840 | Dataset: 0-1102541 | Loss: 0.761 | 915 ms/step , 6873.63 GFLOP/s , 17906.5 tokens/s INFO:__main__:2024-11-06 00:17:42 | Epoch: 1 | Step: 198850 | Dataset: 0-1102861 | Loss: 0.751 | 913 ms/step , 6885.60 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-06 00:17:51 | Epoch: 1 | Step: 198860 | Dataset: 0-1103181 | Loss: 0.735 | 914 ms/step , 6881.07 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-06 00:18:00 | Epoch: 1 | Step: 198870 | Dataset: 0-1103501 | Loss: 0.817 | 915 ms/step , 6871.64 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 00:18:09 | Epoch: 1 | Step: 198880 | Dataset: 0-1103821 | Loss: 0.729 | 913 ms/step , 6885.77 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 00:18:19 | Epoch: 1 | Step: 198890 | Dataset: 0-1104141 | Loss: 0.704 | 914 ms/step , 6883.46 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-06 00:18:28 | Epoch: 1 | Step: 198900 | Dataset: 0-1104461 | Loss: 0.682 | 914 ms/step , 6878.77 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-06 00:18:29 | Validation | Step: 198900 | Val_loss: 0.655 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:18:38 | Epoch: 1 | Step: 198910 | Dataset: 0-1104781 | Loss: 0.744 | 914 ms/step , 6884.81 GFLOP/s , 15263.7 tokens/s INFO:__main__:2024-11-06 00:18:48 | Epoch: 1 | Step: 198920 | Dataset: 0-1105101 | Loss: 0.712 | 914 ms/step , 6877.67 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 00:18:57 | Epoch: 1 | Step: 198930 | Dataset: 0-1105421 | Loss: 0.698 | 913 ms/step , 6886.44 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 00:19:06 | Epoch: 1 | Step: 198940 | Dataset: 0-1105741 | Loss: 0.766 | 914 ms/step , 6884.25 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-06 00:19:15 | Epoch: 1 | Step: 198950 | Dataset: 0-1106061 | Loss: 0.649 | 914 ms/step , 6881.02 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-06 00:19:24 | Epoch: 1 | Step: 198960 | Dataset: 0-1106381 | Loss: 0.758 | 915 ms/step , 6876.24 GFLOP/s , 
17910.8 tokens/s INFO:__main__:2024-11-06 00:19:33 | Epoch: 1 | Step: 198970 | Dataset: 0-1106701 | Loss: 0.680 | 914 ms/step , 6883.13 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 00:19:43 | Epoch: 1 | Step: 198980 | Dataset: 0-1107021 | Loss: 0.736 | 913 ms/step , 6885.50 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-06 00:19:52 | Epoch: 1 | Step: 198990 | Dataset: 0-1107341 | Loss: 0.788 | 915 ms/step , 6871.73 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 00:20:01 | Epoch: 1 | Step: 199000 | Dataset: 0-1107661 | Loss: 0.779 | 914 ms/step , 6884.77 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-06 00:20:02 | Validation | Step: 199000 | Val_loss: 0.666 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:20:02 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_002002_step_199000.pt` INFO:__main__:2024-11-06 00:20:13 | Epoch: 1 | Step: 199010 | Dataset: 0-1107981 | Loss: 0.711 | 914 ms/step , 6882.80 GFLOP/s , 13808.0 tokens/s INFO:__main__:2024-11-06 00:20:22 | Epoch: 1 | Step: 199020 | Dataset: 0-1108301 | Loss: 0.778 | 914 ms/step , 6879.85 GFLOP/s , 17862.9 tokens/s INFO:__main__:2024-11-06 00:20:31 | Epoch: 1 | Step: 199030 | Dataset: 0-1108621 | Loss: 0.694 | 914 ms/step , 6879.62 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-06 00:20:40 | Epoch: 1 | Step: 199040 | Dataset: 0-1108941 | Loss: 0.687 | 915 ms/step , 6873.99 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-06 00:20:49 | Epoch: 1 | Step: 199050 | Dataset: 0-1109261 | Loss: 0.729 | 914 ms/step , 6882.43 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 00:20:58 | Epoch: 1 | Step: 199060 | Dataset: 0-1109581 | Loss: 0.719 | 914 ms/step , 6880.14 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 00:21:08 | Epoch: 1 | Step: 199070 | Dataset: 0-1109901 | Loss: 0.792 | 914 ms/step , 6879.04 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-06 00:21:17 | Epoch: 1 | Step: 199080 | Dataset: 0-1110221 | Loss: 0.706 | 913 ms/step , 6885.22 GFLOP/s , 
17916.6 tokens/s INFO:__main__:2024-11-06 00:21:26 | Epoch: 1 | Step: 199090 | Dataset: 0-1110541 | Loss: 0.813 | 913 ms/step , 6885.11 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-06 00:21:35 | Epoch: 1 | Step: 199100 | Dataset: 0-1110861 | Loss: 0.773 | 914 ms/step , 6884.35 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 00:21:37 | Validation | Step: 199100 | Val_loss: 0.647 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:21:46 | Epoch: 1 | Step: 199110 | Dataset: 0-1111181 | Loss: 0.746 | 914 ms/step , 6881.32 GFLOP/s , 15272.8 tokens/s INFO:__main__:2024-11-06 00:21:55 | Epoch: 1 | Step: 199120 | Dataset: 0-1111501 | Loss: 0.778 | 913 ms/step , 6888.55 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 00:22:04 | Epoch: 1 | Step: 199130 | Dataset: 0-1111821 | Loss: 0.716 | 915 ms/step , 6876.30 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 00:22:13 | Epoch: 1 | Step: 199140 | Dataset: 0-1112141 | Loss: 0.713 | 915 ms/step , 6873.43 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 00:22:22 | Epoch: 1 | Step: 199150 | Dataset: 0-1112461 | Loss: 0.723 | 914 ms/step , 6881.35 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-06 00:22:31 | Epoch: 1 | Step: 199160 | Dataset: 0-1112781 | Loss: 0.742 | 915 ms/step , 6874.65 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-06 00:22:41 | Epoch: 1 | Step: 199170 | Dataset: 0-1113101 | Loss: 0.703 | 914 ms/step , 6881.98 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-06 00:22:50 | Epoch: 1 | Step: 199180 | Dataset: 0-1113421 | Loss: 0.755 | 913 ms/step , 6886.93 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-06 00:22:59 | Epoch: 1 | Step: 199190 | Dataset: 0-1113741 | Loss: 0.728 | 914 ms/step , 6877.82 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 00:23:08 | Epoch: 1 | Step: 199200 | Dataset: 0-1114061 | Loss: 0.703 | 915 ms/step , 6875.73 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 00:23:10 | Validation | Step: 199200 | Val_loss: 0.703 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 00:23:19 | Epoch: 1 | Step: 199210 | Dataset: 0-1114381 | Loss: 0.722 | 915 ms/step , 6875.69 GFLOP/s , 15267.2 tokens/s INFO:__main__:2024-11-06 00:23:28 | Epoch: 1 | Step: 199220 | Dataset: 0-1114701 | Loss: 0.776 | 914 ms/step , 6881.57 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 00:23:37 | Epoch: 1 | Step: 199230 | Dataset: 0-1115021 | Loss: 0.776 | 915 ms/step , 6873.50 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-06 00:23:46 | Epoch: 1 | Step: 199240 | Dataset: 0-1115341 | Loss: 0.664 | 914 ms/step , 6881.56 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 00:23:55 | Epoch: 1 | Step: 199250 | Dataset: 0-1115661 | Loss: 0.758 | 914 ms/step , 6881.20 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 00:24:04 | Epoch: 1 | Step: 199260 | Dataset: 0-1115981 | Loss: 0.751 | 914 ms/step , 6884.22 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-06 00:24:14 | Epoch: 1 | Step: 199270 | Dataset: 0-1116301 | Loss: 0.679 | 912 ms/step , 6892.81 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 00:24:23 | Epoch: 1 | Step: 199280 | Dataset: 0-1116621 | Loss: 0.732 | 914 ms/step , 6878.56 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-06 00:24:32 | Epoch: 1 | Step: 199290 | Dataset: 0-1116941 | Loss: 0.705 | 915 ms/step , 6876.34 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 00:24:41 | Epoch: 1 | Step: 199300 | Dataset: 0-1117261 | Loss: 0.750 | 914 ms/step , 6880.66 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-06 00:24:43 | Validation | Step: 199300 | Val_loss: 0.664 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:24:52 | Epoch: 1 | Step: 199310 | Dataset: 0-1117581 | Loss: 0.725 | 913 ms/step , 6889.89 GFLOP/s , 15272.2 tokens/s INFO:__main__:2024-11-06 00:25:01 | Epoch: 1 | Step: 199320 | Dataset: 0-1117901 | Loss: 0.718 | 913 ms/step , 6892.31 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 00:25:10 | Epoch: 1 | Step: 199330 | Dataset: 0-1118221 | Loss: 0.675 | 912 ms/step , 6894.65 GFLOP/s , 17919.3 
tokens/s INFO:__main__:2024-11-06 00:25:19 | Epoch: 1 | Step: 199340 | Dataset: 0-1118541 | Loss: 0.672 | 913 ms/step , 6886.73 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 00:25:28 | Epoch: 1 | Step: 199350 | Dataset: 0-1118861 | Loss: 0.791 | 915 ms/step , 6876.92 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-06 00:25:37 | Epoch: 1 | Step: 199360 | Dataset: 0-1119181 | Loss: 0.716 | 914 ms/step , 6877.77 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 00:25:47 | Epoch: 1 | Step: 199370 | Dataset: 0-1119501 | Loss: 0.732 | 914 ms/step , 6884.19 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 00:25:56 | Epoch: 1 | Step: 199380 | Dataset: 0-1119821 | Loss: 0.757 | 914 ms/step , 6879.44 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-06 00:26:05 | Epoch: 1 | Step: 199390 | Dataset: 0-1120141 | Loss: 0.737 | 915 ms/step , 6875.06 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-06 00:26:14 | Epoch: 1 | Step: 199400 | Dataset: 0-1120461 | Loss: 0.756 | 914 ms/step , 6878.68 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-06 00:26:16 | Validation | Step: 199400 | Val_loss: 0.712 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:26:25 | Epoch: 1 | Step: 199410 | Dataset: 0-1120781 | Loss: 0.727 | 914 ms/step , 6884.27 GFLOP/s , 15260.2 tokens/s INFO:__main__:2024-11-06 00:26:34 | Epoch: 1 | Step: 199420 | Dataset: 0-1121101 | Loss: 0.778 | 915 ms/step , 6873.72 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-06 00:26:43 | Epoch: 1 | Step: 199430 | Dataset: 0-1121421 | Loss: 0.771 | 915 ms/step , 6875.66 GFLOP/s , 17908.1 tokens/s INFO:__main__:2024-11-06 00:26:52 | Epoch: 1 | Step: 199440 | Dataset: 0-1121741 | Loss: 0.702 | 915 ms/step , 6870.02 GFLOP/s , 17906.1 tokens/s INFO:__main__:2024-11-06 00:27:01 | Epoch: 1 | Step: 199450 | Dataset: 0-1122061 | Loss: 0.757 | 915 ms/step , 6875.26 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-06 00:27:11 | Epoch: 1 | Step: 199460 | Dataset: 0-1122381 | Loss: 0.726 | 915 ms/step , 6875.92 GFLOP/s , 
17911.3 tokens/s INFO:__main__:2024-11-06 00:27:20 | Epoch: 1 | Step: 199470 | Dataset: 0-1122701 | Loss: 0.700 | 913 ms/step , 6886.77 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 00:27:29 | Epoch: 1 | Step: 199480 | Dataset: 0-1123021 | Loss: 0.772 | 914 ms/step , 6877.86 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-06 00:27:38 | Epoch: 1 | Step: 199490 | Dataset: 0-1123341 | Loss: 0.742 | 915 ms/step , 6873.11 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-06 00:27:47 | Epoch: 1 | Step: 199500 | Dataset: 0-1123661 | Loss: 0.726 | 915 ms/step , 6874.94 GFLOP/s , 17910.2 tokens/s INFO:__main__:2024-11-06 00:27:49 | Validation | Step: 199500 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:27:58 | Epoch: 1 | Step: 199510 | Dataset: 0-1123981 | Loss: 0.813 | 915 ms/step , 6877.36 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-06 00:28:07 | Epoch: 1 | Step: 199520 | Dataset: 0-1124301 | Loss: 0.751 | 913 ms/step , 6886.62 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-06 00:28:16 | Epoch: 1 | Step: 199530 | Dataset: 0-1124621 | Loss: 0.767 | 914 ms/step , 6881.77 GFLOP/s , 17917.2 tokens/s INFO:__main__:2024-11-06 00:28:25 | Epoch: 1 | Step: 199540 | Dataset: 0-1124941 | Loss: 0.800 | 914 ms/step , 6879.91 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-06 00:28:34 | Epoch: 1 | Step: 199550 | Dataset: 0-1125261 | Loss: 0.729 | 913 ms/step , 6885.58 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-06 00:28:44 | Epoch: 1 | Step: 199560 | Dataset: 0-1125581 | Loss: 0.735 | 914 ms/step , 6883.81 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-06 00:28:53 | Epoch: 1 | Step: 199570 | Dataset: 0-1125901 | Loss: 0.759 | 914 ms/step , 6879.22 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 00:29:02 | Epoch: 1 | Step: 199580 | Dataset: 0-1126221 | Loss: 0.761 | 913 ms/step , 6885.66 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-06 00:29:11 | Epoch: 1 | Step: 199590 | Dataset: 0-1126541 | Loss: 0.707 | 914 ms/step , 6880.60 GFLOP/s 
, 17913.7 tokens/s INFO:__main__:2024-11-06 00:29:20 | Epoch: 1 | Step: 199600 | Dataset: 0-1126861 | Loss: 0.750 | 914 ms/step , 6880.77 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-06 00:29:22 | Validation | Step: 199600 | Val_loss: 0.694 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:29:31 | Epoch: 1 | Step: 199610 | Dataset: 0-1127181 | Loss: 0.768 | 914 ms/step , 6881.01 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-06 00:29:40 | Epoch: 1 | Step: 199620 | Dataset: 0-1127501 | Loss: 0.691 | 914 ms/step , 6881.07 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-06 00:29:49 | Epoch: 1 | Step: 199630 | Dataset: 0-1127821 | Loss: 0.768 | 913 ms/step , 6886.73 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-06 00:29:58 | Epoch: 1 | Step: 199640 | Dataset: 0-1128141 | Loss: 0.743 | 914 ms/step , 6879.47 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-06 00:30:07 | Epoch: 1 | Step: 199650 | Dataset: 0-1128461 | Loss: 0.701 | 913 ms/step , 6890.31 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-06 00:30:17 | Epoch: 1 | Step: 199660 | Dataset: 0-1128781 | Loss: 0.722 | 914 ms/step , 6878.52 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-06 00:30:26 | Epoch: 1 | Step: 199670 | Dataset: 0-1129101 | Loss: 0.727 | 915 ms/step , 6872.12 GFLOP/s , 17898.7 tokens/s INFO:__main__:2024-11-06 00:30:35 | Epoch: 1 | Step: 199680 | Dataset: 0-1129421 | Loss: 0.684 | 914 ms/step , 6877.99 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-06 00:30:44 | Epoch: 1 | Step: 199690 | Dataset: 0-1129741 | Loss: 0.718 | 914 ms/step , 6879.11 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 00:30:53 | Epoch: 1 | Step: 199700 | Dataset: 0-1130061 | Loss: 0.740 | 916 ms/step , 6866.10 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-06 00:30:55 | Validation | Step: 199700 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:31:04 | Epoch: 1 | Step: 199710 | Dataset: 0-1130381 | Loss: 0.732 | 915 ms/step , 6871.64 GFLOP/s , 15270.5 tokens/s 
INFO:__main__:2024-11-06 00:31:13 | Epoch: 1 | Step: 199720 | Dataset: 0-1130701 | Loss: 0.807 | 917 ms/step , 6860.41 GFLOP/s , 17901.7 tokens/s INFO:__main__:2024-11-06 00:31:22 | Epoch: 1 | Step: 199730 | Dataset: 0-1131021 | Loss: 0.741 | 914 ms/step , 6881.20 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-06 00:31:32 | Epoch: 1 | Step: 199740 | Dataset: 0-1131341 | Loss: 0.718 | 943 ms/step , 6667.73 GFLOP/s , 17579.2 tokens/s INFO:__main__:2024-11-06 00:31:41 | Epoch: 1 | Step: 199750 | Dataset: 0-1131661 | Loss: 0.763 | 915 ms/step , 6875.39 GFLOP/s , 17818.6 tokens/s INFO:__main__:2024-11-06 00:31:50 | Epoch: 1 | Step: 199760 | Dataset: 0-1131981 | Loss: 0.785 | 913 ms/step , 6887.80 GFLOP/s , 17873.4 tokens/s INFO:__main__:2024-11-06 00:31:59 | Epoch: 1 | Step: 199770 | Dataset: 0-1132301 | Loss: 0.752 | 914 ms/step , 6883.78 GFLOP/s , 17841.9 tokens/s INFO:__main__:2024-11-06 00:32:08 | Epoch: 1 | Step: 199780 | Dataset: 0-1132621 | Loss: 0.657 | 914 ms/step , 6880.40 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 00:32:17 | Epoch: 1 | Step: 199790 | Dataset: 0-1132941 | Loss: 0.708 | 915 ms/step , 6872.33 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 00:32:27 | Epoch: 1 | Step: 199800 | Dataset: 0-1133261 | Loss: 0.730 | 914 ms/step , 6878.24 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 00:32:28 | Validation | Step: 199800 | Val_loss: 0.688 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:32:37 | Epoch: 1 | Step: 199810 | Dataset: 0-1133581 | Loss: 0.708 | 914 ms/step , 6881.37 GFLOP/s , 15268.4 tokens/s INFO:__main__:2024-11-06 00:32:46 | Epoch: 1 | Step: 199820 | Dataset: 0-1133901 | Loss: 0.656 | 914 ms/step , 6884.53 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-06 00:32:56 | Epoch: 1 | Step: 199830 | Dataset: 0-1134221 | Loss: 0.712 | 915 ms/step , 6876.84 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-06 00:33:05 | Epoch: 1 | Step: 199840 | Dataset: 0-1134541 | Loss: 0.736 | 915 ms/step , 6873.25 GFLOP/s , 17915.3 
tokens/s INFO:__main__:2024-11-06 00:33:14 | Epoch: 1 | Step: 199850 | Dataset: 0-1134861 | Loss: 0.681 | 913 ms/step , 6886.74 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-06 00:33:23 | Epoch: 1 | Step: 199860 | Dataset: 0-1135181 | Loss: 0.739 | 916 ms/step , 6866.35 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-06 00:33:32 | Epoch: 1 | Step: 199870 | Dataset: 0-1135501 | Loss: 0.685 | 913 ms/step , 6890.71 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-06 00:33:41 | Epoch: 1 | Step: 199880 | Dataset: 0-1135821 | Loss: 0.787 | 913 ms/step , 6885.91 GFLOP/s , 17912.5 tokens/s INFO:__main__:2024-11-06 00:33:50 | Epoch: 1 | Step: 199890 | Dataset: 0-1136141 | Loss: 0.797 | 914 ms/step , 6878.87 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-06 00:34:00 | Epoch: 1 | Step: 199900 | Dataset: 0-1136461 | Loss: 0.662 | 914 ms/step , 6881.27 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 00:34:01 | Validation | Step: 199900 | Val_loss: 0.742 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:34:10 | Epoch: 1 | Step: 199910 | Dataset: 0-1136781 | Loss: 0.730 | 913 ms/step , 6885.72 GFLOP/s , 15270.7 tokens/s INFO:__main__:2024-11-06 00:34:19 | Epoch: 1 | Step: 199920 | Dataset: 0-1137101 | Loss: 0.778 | 914 ms/step , 6881.74 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-06 00:34:29 | Epoch: 1 | Step: 199930 | Dataset: 0-1137421 | Loss: 0.689 | 914 ms/step , 6883.01 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-06 00:34:38 | Epoch: 1 | Step: 199940 | Dataset: 0-1137741 | Loss: 0.676 | 914 ms/step , 6881.88 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 00:34:47 | Epoch: 1 | Step: 199950 | Dataset: 0-1138061 | Loss: 0.740 | 913 ms/step , 6888.55 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-06 00:34:56 | Epoch: 1 | Step: 199960 | Dataset: 0-1138381 | Loss: 0.665 | 914 ms/step , 6884.65 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-06 00:35:05 | Epoch: 1 | Step: 199970 | Dataset: 0-1138701 | Loss: 0.802 | 913 ms/step , 6891.81 GFLOP/s , 
17920.6 tokens/s INFO:__main__:2024-11-06 00:35:14 | Epoch: 1 | Step: 199980 | Dataset: 0-1139021 | Loss: 0.748 | 914 ms/step , 6882.00 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-06 00:35:23 | Epoch: 1 | Step: 199990 | Dataset: 0-1139341 | Loss: 0.719 | 913 ms/step , 6890.01 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-06 00:35:33 | Epoch: 1 | Step: 200000 | Dataset: 0-1139661 | Loss: 0.606 | 914 ms/step , 6880.74 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 00:35:34 | Validation | Step: 200000 | Val_loss: 0.702 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:35:34 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_003534_step_200000.pt` INFO:__main__:2024-11-06 00:35:44 | Epoch: 1 | Step: 200010 | Dataset: 0-1139981 | Loss: 0.654 | 913 ms/step , 6889.17 GFLOP/s , 13758.2 tokens/s INFO:__main__:2024-11-06 00:35:54 | Epoch: 1 | Step: 200020 | Dataset: 0-1140301 | Loss: 0.612 | 913 ms/step , 6885.19 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 00:36:03 | Epoch: 1 | Step: 200030 | Dataset: 0-1140621 | Loss: 0.727 | 913 ms/step , 6886.67 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 00:36:12 | Epoch: 1 | Step: 200040 | Dataset: 0-1140941 | Loss: 0.691 | 915 ms/step , 6871.05 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-06 00:36:21 | Epoch: 1 | Step: 200050 | Dataset: 0-1141261 | Loss: 0.750 | 913 ms/step , 6886.71 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 00:36:30 | Epoch: 1 | Step: 200060 | Dataset: 0-1141581 | Loss: 0.546 | 913 ms/step , 6885.08 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-06 00:36:39 | Epoch: 1 | Step: 200070 | Dataset: 0-1141901 | Loss: 0.661 | 913 ms/step , 6888.51 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 00:36:48 | Epoch: 1 | Step: 200080 | Dataset: 0-1142221 | Loss: 0.690 | 914 ms/step , 6884.12 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 00:36:58 | Epoch: 1 | Step: 200090 | Dataset: 0-1142541 | Loss: 0.786 | 913 ms/step , 6885.95 GFLOP/s , 
17925.5 tokens/s INFO:__main__:2024-11-06 00:37:07 | Epoch: 1 | Step: 200100 | Dataset: 0-1142861 | Loss: 0.605 | 913 ms/step , 6886.48 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 00:37:08 | Validation | Step: 200100 | Val_loss: 0.761 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:37:17 | Epoch: 1 | Step: 200110 | Dataset: 0-1143181 | Loss: 0.791 | 913 ms/step , 6890.90 GFLOP/s , 15275.0 tokens/s INFO:__main__:2024-11-06 00:37:27 | Epoch: 1 | Step: 200120 | Dataset: 0-1143501 | Loss: 0.768 | 913 ms/step , 6887.93 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 00:37:36 | Epoch: 1 | Step: 200130 | Dataset: 0-1143821 | Loss: 0.664 | 912 ms/step , 6895.01 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 00:37:45 | Epoch: 1 | Step: 200140 | Dataset: 0-1144141 | Loss: 0.619 | 914 ms/step , 6884.73 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 00:37:54 | Epoch: 1 | Step: 200150 | Dataset: 0-1144461 | Loss: 0.634 | 912 ms/step , 6893.02 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 00:38:03 | Epoch: 1 | Step: 200160 | Dataset: 0-1144781 | Loss: 0.692 | 912 ms/step , 6893.78 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 00:38:12 | Epoch: 1 | Step: 200170 | Dataset: 0-1145101 | Loss: 0.748 | 913 ms/step , 6885.96 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 00:38:21 | Epoch: 1 | Step: 200180 | Dataset: 0-1145421 | Loss: 0.766 | 913 ms/step , 6892.52 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 00:38:31 | Epoch: 1 | Step: 200190 | Dataset: 0-1145741 | Loss: 0.649 | 913 ms/step , 6891.23 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 00:38:40 | Epoch: 1 | Step: 200200 | Dataset: 0-1146061 | Loss: 0.679 | 912 ms/step , 6893.13 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 00:38:41 | Validation | Step: 200200 | Val_loss: 0.748 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:38:50 | Epoch: 1 | Step: 200210 | Dataset: 0-1146381 | Loss: 0.682 | 913 ms/step , 6885.06 GFLOP/s , 15271.1 tokens/s 
INFO:__main__:2024-11-06 00:39:00 | Epoch: 1 | Step: 200220 | Dataset: 0-1146701 | Loss: 0.711 | 913 ms/step , 6887.63 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 00:39:09 | Epoch: 1 | Step: 200230 | Dataset: 0-1147021 | Loss: 0.728 | 915 ms/step , 6877.42 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-06 00:39:18 | Epoch: 1 | Step: 200240 | Dataset: 0-1147341 | Loss: 0.720 | 915 ms/step , 6874.08 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 00:39:27 | Epoch: 1 | Step: 200250 | Dataset: 0-1147661 | Loss: 0.655 | 913 ms/step , 6886.77 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 00:39:36 | Epoch: 1 | Step: 200260 | Dataset: 0-1147981 | Loss: 0.713 | 913 ms/step , 6888.59 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 00:39:45 | Epoch: 1 | Step: 200270 | Dataset: 0-1148301 | Loss: 0.674 | 913 ms/step , 6887.88 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 00:39:54 | Epoch: 1 | Step: 200280 | Dataset: 0-1148621 | Loss: 0.596 | 913 ms/step , 6887.89 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 00:40:04 | Epoch: 1 | Step: 200290 | Dataset: 0-1148941 | Loss: 0.871 | 914 ms/step , 6879.13 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 00:40:13 | Epoch: 1 | Step: 200300 | Dataset: 0-1149261 | Loss: 0.684 | 914 ms/step , 6878.45 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 00:40:14 | Validation | Step: 200300 | Val_loss: 0.739 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:40:23 | Epoch: 1 | Step: 200310 | Dataset: 0-1149581 | Loss: 0.617 | 913 ms/step , 6886.24 GFLOP/s , 15267.5 tokens/s INFO:__main__:2024-11-06 00:40:33 | Epoch: 1 | Step: 200320 | Dataset: 0-1149901 | Loss: 0.749 | 914 ms/step , 6880.31 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 00:40:42 | Epoch: 1 | Step: 200330 | Dataset: 0-1150221 | Loss: 0.755 | 912 ms/step , 6895.04 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 00:40:51 | Epoch: 1 | Step: 200340 | Dataset: 0-1150541 | Loss: 0.696 | 912 ms/step , 6893.13 GFLOP/s , 17924.3 
tokens/s INFO:__main__:2024-11-06 00:41:00 | Epoch: 1 | Step: 200350 | Dataset: 0-1150861 | Loss: 0.701 | 914 ms/step , 6880.73 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 00:41:09 | Epoch: 1 | Step: 200360 | Dataset: 0-1151181 | Loss: 0.817 | 914 ms/step , 6881.30 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 00:41:18 | Epoch: 1 | Step: 200370 | Dataset: 0-1151501 | Loss: 0.676 | 914 ms/step , 6884.00 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 00:41:27 | Epoch: 1 | Step: 200380 | Dataset: 0-1151821 | Loss: 0.631 | 913 ms/step , 6890.93 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 00:41:37 | Epoch: 1 | Step: 200390 | Dataset: 0-1152141 | Loss: 0.685 | 915 ms/step , 6877.23 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 00:41:46 | Epoch: 1 | Step: 200400 | Dataset: 0-1152461 | Loss: 0.649 | 912 ms/step , 6893.94 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 00:41:47 | Validation | Step: 200400 | Val_loss: 0.703 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:41:56 | Epoch: 1 | Step: 200410 | Dataset: 0-1152781 | Loss: 0.752 | 913 ms/step , 6886.85 GFLOP/s , 15271.3 tokens/s INFO:__main__:2024-11-06 00:42:06 | Epoch: 1 | Step: 200420 | Dataset: 0-1153101 | Loss: 0.630 | 913 ms/step , 6891.95 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 00:42:15 | Epoch: 1 | Step: 200430 | Dataset: 0-1153421 | Loss: 0.600 | 913 ms/step , 6887.22 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 00:42:24 | Epoch: 1 | Step: 200440 | Dataset: 0-1153741 | Loss: 0.769 | 915 ms/step , 6872.65 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 00:42:33 | Epoch: 1 | Step: 200450 | Dataset: 0-1154061 | Loss: 0.748 | 914 ms/step , 6880.70 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 00:42:42 | Epoch: 1 | Step: 200460 | Dataset: 0-1154381 | Loss: 0.659 | 914 ms/step , 6878.83 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 00:42:51 | Epoch: 1 | Step: 200470 | Dataset: 0-1154701 | Loss: 0.688 | 913 ms/step , 6887.71 GFLOP/s , 
17929.1 tokens/s INFO:__main__:2024-11-06 00:43:00 | Epoch: 1 | Step: 200480 | Dataset: 0-1155021 | Loss: 0.753 | 915 ms/step , 6870.53 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 00:43:10 | Epoch: 1 | Step: 200490 | Dataset: 0-1155341 | Loss: 0.782 | 913 ms/step , 6887.87 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 00:43:19 | Epoch: 1 | Step: 200500 | Dataset: 0-1155661 | Loss: 0.660 | 913 ms/step , 6887.34 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 00:43:20 | Validation | Step: 200500 | Val_loss: 0.767 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:43:29 | Epoch: 1 | Step: 200510 | Dataset: 0-1155981 | Loss: 0.776 | 914 ms/step , 6881.34 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-06 00:43:39 | Epoch: 1 | Step: 200520 | Dataset: 0-1156301 | Loss: 0.625 | 914 ms/step , 6882.17 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 00:43:48 | Epoch: 1 | Step: 200530 | Dataset: 0-1156621 | Loss: 0.629 | 913 ms/step , 6885.90 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 00:43:57 | Epoch: 1 | Step: 200540 | Dataset: 0-1156941 | Loss: 0.776 | 914 ms/step , 6883.86 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 00:44:06 | Epoch: 1 | Step: 200550 | Dataset: 0-1157261 | Loss: 0.674 | 914 ms/step , 6884.12 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 00:44:15 | Epoch: 1 | Step: 200560 | Dataset: 0-1157581 | Loss: 0.715 | 914 ms/step , 6881.92 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 00:44:24 | Epoch: 1 | Step: 200570 | Dataset: 0-1157901 | Loss: 0.643 | 912 ms/step , 6892.61 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 00:44:33 | Epoch: 1 | Step: 200580 | Dataset: 0-1158221 | Loss: 0.698 | 913 ms/step , 6886.72 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-06 00:44:42 | Epoch: 1 | Step: 200590 | Dataset: 0-1158541 | Loss: 0.688 | 914 ms/step , 6882.09 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 00:44:52 | Epoch: 1 | Step: 200600 | Dataset: 0-1158861 | Loss: 0.668 | 913 ms/step , 6886.72 GFLOP/s 
, 17927.7 tokens/s INFO:__main__:2024-11-06 00:44:53 | Validation | Step: 200600 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:45:02 | Epoch: 1 | Step: 200610 | Dataset: 0-1159181 | Loss: 0.728 | 912 ms/step , 6892.81 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-06 00:45:11 | Epoch: 1 | Step: 200620 | Dataset: 0-1159501 | Loss: 0.693 | 914 ms/step , 6882.03 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 00:45:21 | Epoch: 1 | Step: 200630 | Dataset: 0-1159821 | Loss: 0.764 | 914 ms/step , 6884.20 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 00:45:30 | Epoch: 1 | Step: 200640 | Dataset: 0-1160141 | Loss: 0.718 | 913 ms/step , 6889.41 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 00:45:39 | Epoch: 1 | Step: 200650 | Dataset: 0-1160461 | Loss: 0.736 | 914 ms/step , 6884.87 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 00:45:48 | Epoch: 1 | Step: 200660 | Dataset: 0-1160781 | Loss: 0.782 | 913 ms/step , 6885.68 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 00:45:57 | Epoch: 1 | Step: 200670 | Dataset: 0-1161101 | Loss: 0.752 | 914 ms/step , 6878.43 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 00:46:06 | Epoch: 1 | Step: 200680 | Dataset: 0-1161421 | Loss: 0.760 | 916 ms/step , 6869.60 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 00:46:15 | Epoch: 1 | Step: 200690 | Dataset: 0-1161741 | Loss: 0.649 | 913 ms/step , 6891.35 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 00:46:25 | Epoch: 1 | Step: 200700 | Dataset: 0-1162061 | Loss: 0.667 | 913 ms/step , 6888.16 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 00:46:26 | Validation | Step: 200700 | Val_loss: 0.759 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:46:35 | Epoch: 1 | Step: 200710 | Dataset: 0-1162381 | Loss: 0.691 | 913 ms/step , 6890.76 GFLOP/s , 15291.8 tokens/s INFO:__main__:2024-11-06 00:46:44 | Epoch: 1 | Step: 200720 | Dataset: 0-1162701 | Loss: 0.767 | 913 ms/step , 6885.71 GFLOP/s , 17931.8 tokens/s 
INFO:__main__:2024-11-06 00:46:54 | Epoch: 1 | Step: 200730 | Dataset: 0-1163021 | Loss: 0.715 | 912 ms/step , 6895.85 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 00:47:03 | Epoch: 1 | Step: 200740 | Dataset: 0-1163341 | Loss: 0.655 | 913 ms/step , 6890.80 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 00:47:12 | Epoch: 1 | Step: 200750 | Dataset: 0-1163661 | Loss: 0.665 | 913 ms/step , 6886.67 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 00:47:21 | Epoch: 1 | Step: 200760 | Dataset: 0-1163981 | Loss: 0.693 | 913 ms/step , 6888.31 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 00:47:30 | Epoch: 1 | Step: 200770 | Dataset: 0-1164301 | Loss: 0.734 | 913 ms/step , 6892.59 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 00:47:39 | Epoch: 1 | Step: 200780 | Dataset: 0-1164621 | Loss: 0.666 | 915 ms/step , 6874.03 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 00:47:48 | Epoch: 1 | Step: 200790 | Dataset: 0-1164941 | Loss: 0.786 | 913 ms/step , 6885.70 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-06 00:47:58 | Epoch: 1 | Step: 200800 | Dataset: 0-1165261 | Loss: 0.658 | 913 ms/step , 6886.43 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 00:47:59 | Validation | Step: 200800 | Val_loss: 0.725 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:48:08 | Epoch: 1 | Step: 200810 | Dataset: 0-1165581 | Loss: 0.608 | 914 ms/step , 6883.68 GFLOP/s , 15268.5 tokens/s INFO:__main__:2024-11-06 00:48:17 | Epoch: 1 | Step: 200820 | Dataset: 0-1165901 | Loss: 0.756 | 915 ms/step , 6875.24 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 00:48:27 | Epoch: 1 | Step: 200830 | Dataset: 0-1166221 | Loss: 0.669 | 911 ms/step , 6901.23 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 00:48:36 | Epoch: 1 | Step: 200840 | Dataset: 0-1166541 | Loss: 0.684 | 914 ms/step , 6884.04 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 00:48:45 | Epoch: 1 | Step: 200850 | Dataset: 0-1166861 | Loss: 0.759 | 916 ms/step , 6869.28 GFLOP/s , 17930.6 
tokens/s INFO:__main__:2024-11-06 00:48:54 | Epoch: 1 | Step: 200860 | Dataset: 0-1167181 | Loss: 0.717 | 913 ms/step , 6888.85 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 00:49:03 | Epoch: 1 | Step: 200870 | Dataset: 0-1167501 | Loss: 0.692 | 913 ms/step , 6888.97 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 00:49:12 | Epoch: 1 | Step: 200880 | Dataset: 0-1167821 | Loss: 0.649 | 912 ms/step , 6896.58 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 00:49:21 | Epoch: 1 | Step: 200890 | Dataset: 0-1168141 | Loss: 0.754 | 913 ms/step , 6887.16 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 00:49:31 | Epoch: 1 | Step: 200900 | Dataset: 0-1168461 | Loss: 0.753 | 914 ms/step , 6880.41 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 00:49:32 | Validation | Step: 200900 | Val_loss: 0.758 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:49:41 | Epoch: 1 | Step: 200910 | Dataset: 0-1168781 | Loss: 0.640 | 913 ms/step , 6889.61 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-06 00:49:50 | Epoch: 1 | Step: 200920 | Dataset: 0-1169101 | Loss: 0.648 | 913 ms/step , 6888.24 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 00:50:00 | Epoch: 1 | Step: 200930 | Dataset: 0-1169421 | Loss: 0.608 | 914 ms/step , 6878.68 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 00:50:09 | Epoch: 1 | Step: 200940 | Dataset: 0-1169741 | Loss: 0.600 | 913 ms/step , 6886.78 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 00:50:18 | Epoch: 1 | Step: 200950 | Dataset: 0-1170061 | Loss: 0.699 | 912 ms/step , 6895.83 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 00:50:27 | Epoch: 1 | Step: 200960 | Dataset: 0-1170381 | Loss: 0.582 | 911 ms/step , 6901.41 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 00:50:36 | Epoch: 1 | Step: 200970 | Dataset: 0-1170701 | Loss: 0.627 | 913 ms/step , 6891.39 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-06 00:50:45 | Epoch: 1 | Step: 200980 | Dataset: 0-1171021 | Loss: 0.674 | 913 ms/step , 6891.91 GFLOP/s , 
17930.3 tokens/s INFO:__main__:2024-11-06 00:50:54 | Epoch: 1 | Step: 200990 | Dataset: 0-1171341 | Loss: 0.676 | 912 ms/step , 6893.39 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 00:51:03 | Epoch: 1 | Step: 201000 | Dataset: 0-1171661 | Loss: 0.670 | 915 ms/step , 6876.58 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 00:51:05 | Validation | Step: 201000 | Val_loss: 0.718 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:51:05 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_005105_step_201000.pt` INFO:__main__:2024-11-06 00:51:15 | Epoch: 1 | Step: 201010 | Dataset: 0-1171981 | Loss: 0.707 | 913 ms/step , 6889.95 GFLOP/s , 13729.6 tokens/s INFO:__main__:2024-11-06 00:51:25 | Epoch: 1 | Step: 201020 | Dataset: 0-1172301 | Loss: 0.667 | 913 ms/step , 6888.83 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 00:51:34 | Epoch: 1 | Step: 201030 | Dataset: 0-1172621 | Loss: 0.775 | 914 ms/step , 6884.27 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 00:51:43 | Epoch: 1 | Step: 201040 | Dataset: 0-1172941 | Loss: 0.736 | 914 ms/step , 6881.01 GFLOP/s , 17887.0 tokens/s INFO:__main__:2024-11-06 00:51:52 | Epoch: 1 | Step: 201050 | Dataset: 0-1173261 | Loss: 0.774 | 912 ms/step , 6897.01 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 00:52:01 | Epoch: 1 | Step: 201060 | Dataset: 0-1173581 | Loss: 0.676 | 913 ms/step , 6890.02 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-06 00:52:10 | Epoch: 1 | Step: 201070 | Dataset: 0-1173901 | Loss: 0.775 | 914 ms/step , 6881.97 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-06 00:52:19 | Epoch: 1 | Step: 201080 | Dataset: 0-1174221 | Loss: 0.679 | 912 ms/step , 6895.41 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 00:52:29 | Epoch: 1 | Step: 201090 | Dataset: 0-1174541 | Loss: 0.640 | 914 ms/step , 6883.51 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 00:52:38 | Epoch: 1 | Step: 201100 | Dataset: 0-1174861 | Loss: 0.637 | 914 ms/step , 6882.10 GFLOP/s , 
17929.7 tokens/s INFO:__main__:2024-11-06 00:52:39 | Validation | Step: 201100 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:52:48 | Epoch: 1 | Step: 201110 | Dataset: 0-1175181 | Loss: 0.696 | 914 ms/step , 6878.92 GFLOP/s , 15263.9 tokens/s INFO:__main__:2024-11-06 00:52:58 | Epoch: 1 | Step: 201120 | Dataset: 0-1175501 | Loss: 0.708 | 913 ms/step , 6891.82 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 00:53:07 | Epoch: 1 | Step: 201130 | Dataset: 0-1175821 | Loss: 0.697 | 913 ms/step , 6888.45 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 00:53:16 | Epoch: 1 | Step: 201140 | Dataset: 0-1176141 | Loss: 0.627 | 914 ms/step , 6880.19 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 00:53:25 | Epoch: 1 | Step: 201150 | Dataset: 0-1176461 | Loss: 0.554 | 913 ms/step , 6889.23 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 00:53:34 | Epoch: 1 | Step: 201160 | Dataset: 0-1176781 | Loss: 0.635 | 912 ms/step , 6898.57 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 00:53:43 | Epoch: 1 | Step: 201170 | Dataset: 0-1177101 | Loss: 0.673 | 915 ms/step , 6875.01 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 00:53:52 | Epoch: 1 | Step: 201180 | Dataset: 0-1177421 | Loss: 0.727 | 913 ms/step , 6886.08 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 00:54:02 | Epoch: 1 | Step: 201190 | Dataset: 0-1177741 | Loss: 0.691 | 912 ms/step , 6894.45 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 00:54:11 | Epoch: 1 | Step: 201200 | Dataset: 0-1178061 | Loss: 0.802 | 913 ms/step , 6885.39 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 00:54:12 | Validation | Step: 201200 | Val_loss: 0.703 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:54:21 | Epoch: 1 | Step: 201210 | Dataset: 0-1178381 | Loss: 0.728 | 914 ms/step , 6883.40 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-06 00:54:31 | Epoch: 1 | Step: 201220 | Dataset: 0-1178701 | Loss: 0.696 | 913 ms/step , 6886.32 GFLOP/s , 17916.1 tokens/s 
INFO:__main__:2024-11-06 00:54:40 | Epoch: 1 | Step: 201230 | Dataset: 0-1179021 | Loss: 0.661 | 914 ms/step , 6883.84 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-06 00:54:49 | Epoch: 1 | Step: 201240 | Dataset: 0-1179341 | Loss: 0.748 | 915 ms/step , 6870.15 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 00:54:58 | Epoch: 1 | Step: 201250 | Dataset: 0-1179661 | Loss: 0.772 | 913 ms/step , 6886.34 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 00:55:07 | Epoch: 1 | Step: 201260 | Dataset: 0-1179981 | Loss: 0.746 | 912 ms/step , 6892.92 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 00:55:16 | Epoch: 1 | Step: 201270 | Dataset: 0-1180301 | Loss: 0.679 | 913 ms/step , 6889.18 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 00:55:25 | Epoch: 1 | Step: 201280 | Dataset: 0-1180621 | Loss: 0.632 | 912 ms/step , 6892.97 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 00:55:34 | Epoch: 1 | Step: 201290 | Dataset: 0-1180941 | Loss: 0.779 | 913 ms/step , 6885.33 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-06 00:55:44 | Epoch: 1 | Step: 201300 | Dataset: 0-1181261 | Loss: 0.725 | 913 ms/step , 6890.64 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 00:55:45 | Validation | Step: 201300 | Val_loss: 0.732 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:55:54 | Epoch: 1 | Step: 201310 | Dataset: 0-1181581 | Loss: 0.686 | 914 ms/step , 6882.44 GFLOP/s , 15278.4 tokens/s INFO:__main__:2024-11-06 00:56:03 | Epoch: 1 | Step: 201320 | Dataset: 0-1181901 | Loss: 0.620 | 913 ms/step , 6887.18 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 00:56:13 | Epoch: 1 | Step: 201330 | Dataset: 0-1182221 | Loss: 0.664 | 913 ms/step , 6888.79 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 00:56:22 | Epoch: 1 | Step: 201340 | Dataset: 0-1182541 | Loss: 0.782 | 914 ms/step , 6881.11 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 00:56:31 | Epoch: 1 | Step: 201350 | Dataset: 0-1182861 | Loss: 0.772 | 914 ms/step , 6882.14 GFLOP/s , 17930.2 
tokens/s INFO:__main__:2024-11-06 00:56:40 | Epoch: 1 | Step: 201360 | Dataset: 0-1183181 | Loss: 0.795 | 913 ms/step , 6886.58 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 00:56:49 | Epoch: 1 | Step: 201370 | Dataset: 0-1183501 | Loss: 0.723 | 913 ms/step , 6885.32 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-06 00:56:58 | Epoch: 1 | Step: 201380 | Dataset: 0-1183821 | Loss: 0.716 | 913 ms/step , 6887.80 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-06 00:57:07 | Epoch: 1 | Step: 201390 | Dataset: 0-1184141 | Loss: 0.771 | 912 ms/step , 6896.38 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 00:57:17 | Epoch: 1 | Step: 201400 | Dataset: 0-1184461 | Loss: 0.796 | 912 ms/step , 6896.24 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-06 00:57:18 | Validation | Step: 201400 | Val_loss: 0.690 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:57:27 | Epoch: 1 | Step: 201410 | Dataset: 0-1184781 | Loss: 0.665 | 912 ms/step , 6894.48 GFLOP/s , 15285.2 tokens/s INFO:__main__:2024-11-06 00:57:36 | Epoch: 1 | Step: 201420 | Dataset: 0-1185101 | Loss: 0.752 | 914 ms/step , 6882.70 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 00:57:46 | Epoch: 1 | Step: 201430 | Dataset: 0-1185421 | Loss: 0.507 | 914 ms/step , 6881.76 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 00:57:55 | Epoch: 1 | Step: 201440 | Dataset: 0-1185741 | Loss: 0.741 | 914 ms/step , 6881.84 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 00:58:04 | Epoch: 1 | Step: 201450 | Dataset: 0-1186061 | Loss: 0.793 | 913 ms/step , 6891.98 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 00:58:13 | Epoch: 1 | Step: 201460 | Dataset: 0-1186381 | Loss: 0.813 | 913 ms/step , 6887.04 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 00:58:22 | Epoch: 1 | Step: 201470 | Dataset: 0-1186701 | Loss: 0.728 | 914 ms/step , 6881.11 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 00:58:31 | Epoch: 1 | Step: 201480 | Dataset: 0-1187021 | Loss: 0.708 | 913 ms/step , 6888.39 GFLOP/s , 
17933.9 tokens/s INFO:__main__:2024-11-06 00:58:40 | Epoch: 1 | Step: 201490 | Dataset: 0-1187341 | Loss: 0.690 | 913 ms/step , 6887.76 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 00:58:50 | Epoch: 1 | Step: 201500 | Dataset: 0-1187661 | Loss: 0.827 | 913 ms/step , 6890.43 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 00:58:51 | Validation | Step: 201500 | Val_loss: 0.673 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 00:59:00 | Epoch: 1 | Step: 201510 | Dataset: 0-1187981 | Loss: 0.577 | 912 ms/step , 6895.33 GFLOP/s , 15274.0 tokens/s INFO:__main__:2024-11-06 00:59:09 | Epoch: 1 | Step: 201520 | Dataset: 0-1188301 | Loss: 0.721 | 913 ms/step , 6890.07 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-06 00:59:19 | Epoch: 1 | Step: 201530 | Dataset: 0-1188621 | Loss: 0.719 | 915 ms/step , 6874.83 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 00:59:28 | Epoch: 1 | Step: 201540 | Dataset: 0-1188941 | Loss: 0.699 | 913 ms/step , 6890.91 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 00:59:37 | Epoch: 1 | Step: 201550 | Dataset: 0-1189261 | Loss: 0.629 | 912 ms/step , 6894.83 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 00:59:46 | Epoch: 1 | Step: 201560 | Dataset: 0-1189581 | Loss: 0.702 | 913 ms/step , 6889.73 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 00:59:55 | Epoch: 1 | Step: 201570 | Dataset: 0-1189901 | Loss: 0.793 | 914 ms/step , 6882.59 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 01:00:04 | Epoch: 1 | Step: 201580 | Dataset: 0-1190221 | Loss: 0.706 | 913 ms/step , 6889.80 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 01:00:13 | Epoch: 1 | Step: 201590 | Dataset: 0-1190541 | Loss: 0.838 | 914 ms/step , 6879.21 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 01:00:22 | Epoch: 1 | Step: 201600 | Dataset: 0-1190861 | Loss: 0.664 | 913 ms/step , 6887.44 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 01:00:24 | Validation | Step: 201600 | Val_loss: 0.674 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 01:00:33 | Epoch: 1 | Step: 201610 | Dataset: 0-1191181 | Loss: 0.725 | 914 ms/step , 6883.85 GFLOP/s , 15274.9 tokens/s INFO:__main__:2024-11-06 01:00:42 | Epoch: 1 | Step: 201620 | Dataset: 0-1191501 | Loss: 0.832 | 914 ms/step , 6878.47 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 01:00:51 | Epoch: 1 | Step: 201630 | Dataset: 0-1191821 | Loss: 0.816 | 914 ms/step , 6884.42 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 01:01:01 | Epoch: 1 | Step: 201640 | Dataset: 0-1192141 | Loss: 0.824 | 914 ms/step , 6883.77 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 01:01:10 | Epoch: 1 | Step: 201650 | Dataset: 0-1192461 | Loss: 0.807 | 913 ms/step , 6886.91 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 01:01:19 | Epoch: 1 | Step: 201660 | Dataset: 0-1192781 | Loss: 0.700 | 913 ms/step , 6885.97 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 01:01:28 | Epoch: 1 | Step: 201670 | Dataset: 0-1193101 | Loss: 0.811 | 913 ms/step , 6885.55 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 01:01:37 | Epoch: 1 | Step: 201680 | Dataset: 0-1193421 | Loss: 0.795 | 915 ms/step , 6877.19 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-06 01:01:46 | Epoch: 1 | Step: 201690 | Dataset: 0-1193741 | Loss: 0.747 | 913 ms/step , 6885.64 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 01:01:55 | Epoch: 1 | Step: 201700 | Dataset: 0-1194061 | Loss: 0.698 | 913 ms/step , 6888.42 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 01:01:57 | Validation | Step: 201700 | Val_loss: 0.749 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:02:06 | Epoch: 1 | Step: 201710 | Dataset: 0-1194381 | Loss: 0.789 | 913 ms/step , 6885.87 GFLOP/s , 15286.7 tokens/s INFO:__main__:2024-11-06 01:02:15 | Epoch: 1 | Step: 201720 | Dataset: 0-1194701 | Loss: 0.744 | 913 ms/step , 6890.07 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 01:02:24 | Epoch: 1 | Step: 201730 | Dataset: 0-1195021 | Loss: 0.675 | 913 ms/step , 6889.26 GFLOP/s , 17932.2 
tokens/s INFO:__main__:2024-11-06 01:02:34 | Epoch: 1 | Step: 201740 | Dataset: 0-1195341 | Loss: 0.700 | 911 ms/step , 6902.71 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-06 01:02:43 | Epoch: 1 | Step: 201750 | Dataset: 0-1195661 | Loss: 0.631 | 911 ms/step , 6900.89 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 01:02:52 | Epoch: 1 | Step: 201760 | Dataset: 0-1195981 | Loss: 0.724 | 914 ms/step , 6880.99 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 01:03:01 | Epoch: 1 | Step: 201770 | Dataset: 0-1196301 | Loss: 0.610 | 912 ms/step , 6897.69 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 01:03:10 | Epoch: 1 | Step: 201780 | Dataset: 0-1196621 | Loss: 0.865 | 914 ms/step , 6884.73 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 01:03:19 | Epoch: 1 | Step: 201790 | Dataset: 0-1196941 | Loss: 0.695 | 912 ms/step , 6894.15 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 01:03:28 | Epoch: 1 | Step: 201800 | Dataset: 0-1197261 | Loss: 0.711 | 914 ms/step , 6880.95 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 01:03:30 | Validation | Step: 201800 | Val_loss: 0.665 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:03:39 | Epoch: 1 | Step: 201810 | Dataset: 0-1197581 | Loss: 0.678 | 914 ms/step , 6881.13 GFLOP/s , 15281.2 tokens/s INFO:__main__:2024-11-06 01:03:48 | Epoch: 1 | Step: 201820 | Dataset: 0-1197901 | Loss: 0.612 | 914 ms/step , 6880.84 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 01:03:57 | Epoch: 1 | Step: 201830 | Dataset: 0-1198221 | Loss: 0.633 | 912 ms/step , 6893.46 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 01:04:06 | Epoch: 1 | Step: 201840 | Dataset: 0-1198541 | Loss: 0.629 | 914 ms/step , 6883.67 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 01:04:16 | Epoch: 1 | Step: 201850 | Dataset: 0-1198861 | Loss: 0.614 | 912 ms/step , 6893.23 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 01:04:25 | Epoch: 1 | Step: 201860 | Dataset: 0-1199181 | Loss: 0.720 | 913 ms/step , 6889.29 GFLOP/s , 
17931.2 tokens/s INFO:__main__:2024-11-06 01:04:34 | Epoch: 1 | Step: 201870 | Dataset: 0-1199501 | Loss: 0.741 | 914 ms/step , 6882.81 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 01:04:43 | Epoch: 1 | Step: 201880 | Dataset: 0-1199821 | Loss: 0.528 | 913 ms/step , 6885.33 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 01:04:52 | Epoch: 1 | Step: 201890 | Dataset: 0-1200141 | Loss: 0.619 | 913 ms/step , 6891.77 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 01:05:01 | Epoch: 1 | Step: 201900 | Dataset: 0-1200461 | Loss: 0.643 | 913 ms/step , 6886.32 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 01:05:03 | Validation | Step: 201900 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:05:12 | Epoch: 1 | Step: 201910 | Dataset: 0-1200781 | Loss: 0.791 | 914 ms/step , 6884.76 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-06 01:05:21 | Epoch: 1 | Step: 201920 | Dataset: 0-1201101 | Loss: 0.733 | 912 ms/step , 6892.60 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 01:05:30 | Epoch: 1 | Step: 201930 | Dataset: 0-1201421 | Loss: 0.684 | 914 ms/step , 6883.36 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 01:05:39 | Epoch: 1 | Step: 201940 | Dataset: 0-1201741 | Loss: 0.628 | 912 ms/step , 6898.88 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 01:05:49 | Epoch: 1 | Step: 201950 | Dataset: 0-1202061 | Loss: 0.753 | 912 ms/step , 6893.29 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 01:05:58 | Epoch: 1 | Step: 201960 | Dataset: 0-1202381 | Loss: 0.726 | 913 ms/step , 6892.25 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-06 01:06:07 | Epoch: 1 | Step: 201970 | Dataset: 0-1202701 | Loss: 0.679 | 913 ms/step , 6889.43 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 01:06:16 | Epoch: 1 | Step: 201980 | Dataset: 0-1203021 | Loss: 0.689 | 913 ms/step , 6890.30 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 01:06:25 | Epoch: 1 | Step: 201990 | Dataset: 0-1203341 | Loss: 0.753 | 914 ms/step , 6878.77 GFLOP/s 
, 17926.0 tokens/s INFO:__main__:2024-11-06 01:06:34 | Epoch: 1 | Step: 202000 | Dataset: 0-1203661 | Loss: 0.722 | 913 ms/step , 6887.93 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 01:06:36 | Validation | Step: 202000 | Val_loss: 0.698 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:06:36 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_010636_step_202000.pt` INFO:__main__:2024-11-06 01:06:46 | Epoch: 1 | Step: 202010 | Dataset: 0-1203981 | Loss: 0.714 | 913 ms/step , 6891.72 GFLOP/s , 13760.1 tokens/s INFO:__main__:2024-11-06 01:06:55 | Epoch: 1 | Step: 202020 | Dataset: 0-1204301 | Loss: 0.676 | 914 ms/step , 6879.19 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 01:07:04 | Epoch: 1 | Step: 202030 | Dataset: 0-1204621 | Loss: 0.666 | 913 ms/step , 6888.93 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 01:07:14 | Epoch: 1 | Step: 202040 | Dataset: 0-1204941 | Loss: 0.735 | 913 ms/step , 6887.50 GFLOP/s , 17893.2 tokens/s INFO:__main__:2024-11-06 01:07:23 | Epoch: 1 | Step: 202050 | Dataset: 0-1205261 | Loss: 0.768 | 913 ms/step , 6890.63 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-06 01:07:32 | Epoch: 1 | Step: 202060 | Dataset: 0-1205581 | Loss: 0.780 | 912 ms/step , 6895.96 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 01:07:41 | Epoch: 1 | Step: 202070 | Dataset: 0-1205901 | Loss: 0.746 | 913 ms/step , 6890.18 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 01:07:50 | Epoch: 1 | Step: 202080 | Dataset: 0-1206221 | Loss: 0.795 | 913 ms/step , 6886.62 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:07:59 | Epoch: 1 | Step: 202090 | Dataset: 0-1206541 | Loss: 0.610 | 913 ms/step , 6888.64 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 01:08:08 | Epoch: 1 | Step: 202100 | Dataset: 0-1206861 | Loss: 0.785 | 914 ms/step , 6878.10 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 01:08:10 | Validation | Step: 202100 | Val_loss: 0.756 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 01:08:19 | Epoch: 1 | Step: 202110 | Dataset: 0-1207181 | Loss: 0.767 | 914 ms/step , 6885.04 GFLOP/s , 15268.3 tokens/s INFO:__main__:2024-11-06 01:08:28 | Epoch: 1 | Step: 202120 | Dataset: 0-1207501 | Loss: 0.818 | 914 ms/step , 6880.46 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 01:08:37 | Epoch: 1 | Step: 202130 | Dataset: 0-1207821 | Loss: 0.734 | 915 ms/step , 6870.83 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 01:08:47 | Epoch: 1 | Step: 202140 | Dataset: 0-1208141 | Loss: 0.737 | 912 ms/step , 6893.79 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-06 01:08:56 | Epoch: 1 | Step: 202150 | Dataset: 0-1208461 | Loss: 0.540 | 914 ms/step , 6884.90 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 01:09:05 | Epoch: 1 | Step: 202160 | Dataset: 0-1208781 | Loss: 0.742 | 912 ms/step , 6897.86 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-06 01:09:14 | Epoch: 1 | Step: 202170 | Dataset: 0-1209101 | Loss: 0.790 | 913 ms/step , 6890.92 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 01:09:23 | Epoch: 1 | Step: 202180 | Dataset: 0-1209421 | Loss: 0.859 | 913 ms/step , 6889.19 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-06 01:09:32 | Epoch: 1 | Step: 202190 | Dataset: 0-1209741 | Loss: 0.727 | 913 ms/step , 6888.84 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 01:09:41 | Epoch: 1 | Step: 202200 | Dataset: 0-1210061 | Loss: 0.667 | 913 ms/step , 6891.00 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 01:09:43 | Validation | Step: 202200 | Val_loss: 0.742 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:09:52 | Epoch: 1 | Step: 202210 | Dataset: 0-1210381 | Loss: 0.803 | 914 ms/step , 6878.55 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-06 01:10:01 | Epoch: 1 | Step: 202220 | Dataset: 0-1210701 | Loss: 0.636 | 913 ms/step , 6889.04 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 01:10:10 | Epoch: 1 | Step: 202230 | Dataset: 0-1211021 | Loss: 0.839 | 915 ms/step , 6870.48 GFLOP/s , 17923.2 
tokens/s INFO:__main__:2024-11-06 01:10:20 | Epoch: 1 | Step: 202240 | Dataset: 0-1211341 | Loss: 0.794 | 913 ms/step , 6885.39 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 01:10:29 | Epoch: 1 | Step: 202250 | Dataset: 0-1211661 | Loss: 0.670 | 913 ms/step , 6891.19 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 01:10:38 | Epoch: 1 | Step: 202260 | Dataset: 0-1211981 | Loss: 0.679 | 913 ms/step , 6892.30 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-06 01:10:47 | Epoch: 1 | Step: 202270 | Dataset: 0-1212301 | Loss: 0.747 | 913 ms/step , 6889.12 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 01:10:56 | Epoch: 1 | Step: 202280 | Dataset: 0-1212621 | Loss: 0.682 | 913 ms/step , 6889.69 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 01:11:05 | Epoch: 1 | Step: 202290 | Dataset: 0-1212941 | Loss: 0.783 | 915 ms/step , 6875.75 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 01:11:14 | Epoch: 1 | Step: 202300 | Dataset: 0-1213261 | Loss: 0.649 | 913 ms/step , 6886.57 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 01:11:16 | Validation | Step: 202300 | Val_loss: 0.744 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:11:25 | Epoch: 1 | Step: 202310 | Dataset: 0-1213581 | Loss: 0.687 | 913 ms/step , 6890.67 GFLOP/s , 15278.8 tokens/s INFO:__main__:2024-11-06 01:11:34 | Epoch: 1 | Step: 202320 | Dataset: 0-1213901 | Loss: 0.703 | 913 ms/step , 6890.42 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 01:11:43 | Epoch: 1 | Step: 202330 | Dataset: 0-1214221 | Loss: 0.804 | 913 ms/step , 6885.54 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 01:11:52 | Epoch: 1 | Step: 202340 | Dataset: 0-1214541 | Loss: 0.815 | 914 ms/step , 6882.79 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 01:12:02 | Epoch: 1 | Step: 202350 | Dataset: 0-1214861 | Loss: 0.843 | 913 ms/step , 6890.09 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-06 01:12:11 | Epoch: 1 | Step: 202360 | Dataset: 0-1215181 | Loss: 0.688 | 912 ms/step , 6894.06 GFLOP/s , 
17939.0 tokens/s INFO:__main__:2024-11-06 01:12:20 | Epoch: 1 | Step: 202370 | Dataset: 0-1215501 | Loss: 0.665 | 913 ms/step , 6887.46 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 01:12:29 | Epoch: 1 | Step: 202380 | Dataset: 0-1215821 | Loss: 0.807 | 913 ms/step , 6891.40 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 01:12:38 | Epoch: 1 | Step: 202390 | Dataset: 0-1216141 | Loss: 0.830 | 913 ms/step , 6887.12 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 01:12:47 | Epoch: 1 | Step: 202400 | Dataset: 0-1216461 | Loss: 0.546 | 912 ms/step , 6897.76 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 01:12:49 | Validation | Step: 202400 | Val_loss: 0.749 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:12:58 | Epoch: 1 | Step: 202410 | Dataset: 0-1216781 | Loss: 0.667 | 912 ms/step , 6895.04 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-06 01:13:07 | Epoch: 1 | Step: 202420 | Dataset: 0-1217101 | Loss: 0.652 | 913 ms/step , 6889.46 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 01:13:16 | Epoch: 1 | Step: 202430 | Dataset: 0-1217421 | Loss: 0.823 | 913 ms/step , 6888.27 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 01:13:25 | Epoch: 1 | Step: 202440 | Dataset: 0-1217741 | Loss: 0.753 | 914 ms/step , 6884.69 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 01:13:35 | Epoch: 1 | Step: 202450 | Dataset: 0-1218061 | Loss: 0.824 | 914 ms/step , 6880.53 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 01:13:44 | Epoch: 1 | Step: 202460 | Dataset: 0-1218381 | Loss: 0.518 | 914 ms/step , 6883.66 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 01:13:53 | Epoch: 1 | Step: 202470 | Dataset: 0-1218701 | Loss: 0.621 | 914 ms/step , 6883.55 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 01:14:02 | Epoch: 1 | Step: 202480 | Dataset: 0-1219021 | Loss: 0.644 | 913 ms/step , 6890.15 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-06 01:14:11 | Epoch: 1 | Step: 202490 | Dataset: 0-1219341 | Loss: 0.724 | 913 ms/step , 6892.01 GFLOP/s 
, 17934.1 tokens/s INFO:__main__:2024-11-06 01:14:20 | Epoch: 1 | Step: 202500 | Dataset: 0-1219661 | Loss: 0.699 | 913 ms/step , 6890.59 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 01:14:22 | Validation | Step: 202500 | Val_loss: 0.732 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:14:31 | Epoch: 1 | Step: 202510 | Dataset: 0-1219981 | Loss: 0.674 | 912 ms/step , 6892.85 GFLOP/s , 15283.9 tokens/s INFO:__main__:2024-11-06 01:14:40 | Epoch: 1 | Step: 202520 | Dataset: 0-1220301 | Loss: 0.750 | 913 ms/step , 6889.36 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-06 01:14:49 | Epoch: 1 | Step: 202530 | Dataset: 0-1220621 | Loss: 0.720 | 913 ms/step , 6889.29 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 01:14:58 | Epoch: 1 | Step: 202540 | Dataset: 0-1220941 | Loss: 0.615 | 913 ms/step , 6890.38 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 01:15:08 | Epoch: 1 | Step: 202550 | Dataset: 0-1221261 | Loss: 0.707 | 912 ms/step , 6892.63 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 01:15:17 | Epoch: 1 | Step: 202560 | Dataset: 0-1221581 | Loss: 0.643 | 914 ms/step , 6882.48 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 01:15:26 | Epoch: 1 | Step: 202570 | Dataset: 0-1221901 | Loss: 0.679 | 912 ms/step , 6893.42 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 01:15:35 | Epoch: 1 | Step: 202580 | Dataset: 0-1222221 | Loss: 0.798 | 913 ms/step , 6891.55 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 01:15:44 | Epoch: 1 | Step: 202590 | Dataset: 0-1222541 | Loss: 0.673 | 912 ms/step , 6894.08 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 01:15:53 | Epoch: 1 | Step: 202600 | Dataset: 0-1222861 | Loss: 0.445 | 911 ms/step , 6902.11 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 01:15:55 | Validation | Step: 202600 | Val_loss: 0.724 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:16:04 | Epoch: 1 | Step: 202610 | Dataset: 0-1223181 | Loss: 0.725 | 913 ms/step , 6891.97 GFLOP/s , 15278.1 tokens/s 
INFO:__main__:2024-11-06 01:16:13 | Epoch: 1 | Step: 202620 | Dataset: 0-1223501 | Loss: 0.747 | 913 ms/step , 6885.30 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 01:16:22 | Epoch: 1 | Step: 202630 | Dataset: 0-1223821 | Loss: 0.745 | 913 ms/step , 6890.31 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 01:16:31 | Epoch: 1 | Step: 202640 | Dataset: 0-1224141 | Loss: 0.838 | 914 ms/step , 6884.85 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:16:40 | Epoch: 1 | Step: 202650 | Dataset: 0-1224461 | Loss: 0.850 | 913 ms/step , 6887.81 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-06 01:16:50 | Epoch: 1 | Step: 202660 | Dataset: 0-1224781 | Loss: 0.684 | 914 ms/step , 6883.51 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 01:16:59 | Epoch: 1 | Step: 202670 | Dataset: 0-1225101 | Loss: 0.704 | 913 ms/step , 6885.57 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 01:17:08 | Epoch: 1 | Step: 202680 | Dataset: 0-1225421 | Loss: 0.726 | 915 ms/step , 6871.44 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 01:17:17 | Epoch: 1 | Step: 202690 | Dataset: 0-1225741 | Loss: 0.651 | 913 ms/step , 6887.20 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-06 01:17:26 | Epoch: 1 | Step: 202700 | Dataset: 0-1226061 | Loss: 0.623 | 914 ms/step , 6878.79 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 01:17:28 | Validation | Step: 202700 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:17:37 | Epoch: 1 | Step: 202710 | Dataset: 0-1226381 | Loss: 0.813 | 914 ms/step , 6881.04 GFLOP/s , 15290.0 tokens/s INFO:__main__:2024-11-06 01:17:46 | Epoch: 1 | Step: 202720 | Dataset: 0-1226701 | Loss: 0.815 | 913 ms/step , 6886.79 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 01:17:55 | Epoch: 1 | Step: 202730 | Dataset: 0-1227021 | Loss: 0.608 | 912 ms/step , 6898.49 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 01:18:04 | Epoch: 1 | Step: 202740 | Dataset: 0-1227341 | Loss: 0.748 | 914 ms/step , 6879.38 GFLOP/s , 17933.4 
tokens/s INFO:__main__:2024-11-06 01:18:13 | Epoch: 1 | Step: 202750 | Dataset: 0-1227661 | Loss: 0.704 | 913 ms/step , 6890.02 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 01:18:23 | Epoch: 1 | Step: 202760 | Dataset: 0-1227981 | Loss: 0.686 | 913 ms/step , 6885.09 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 01:18:32 | Epoch: 1 | Step: 202770 | Dataset: 0-1228301 | Loss: 0.694 | 914 ms/step , 6882.62 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 01:18:41 | Epoch: 1 | Step: 202780 | Dataset: 0-1228621 | Loss: 0.734 | 914 ms/step , 6883.69 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-06 01:18:50 | Epoch: 1 | Step: 202790 | Dataset: 0-1228941 | Loss: 0.739 | 913 ms/step , 6888.65 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 01:18:59 | Epoch: 1 | Step: 202800 | Dataset: 0-1229261 | Loss: 0.732 | 912 ms/step , 6893.36 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 01:19:01 | Validation | Step: 202800 | Val_loss: 0.730 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:19:10 | Epoch: 1 | Step: 202810 | Dataset: 0-1229581 | Loss: 0.738 | 914 ms/step , 6884.97 GFLOP/s , 15276.4 tokens/s INFO:__main__:2024-11-06 01:19:19 | Epoch: 1 | Step: 202820 | Dataset: 0-1229901 | Loss: 0.755 | 914 ms/step , 6879.96 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-06 01:19:28 | Epoch: 1 | Step: 202830 | Dataset: 0-1230221 | Loss: 0.758 | 915 ms/step , 6870.12 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 01:19:37 | Epoch: 1 | Step: 202840 | Dataset: 0-1230541 | Loss: 0.710 | 914 ms/step , 6884.73 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 01:19:46 | Epoch: 1 | Step: 202850 | Dataset: 0-1230861 | Loss: 0.697 | 913 ms/step , 6889.05 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 01:19:56 | Epoch: 1 | Step: 202860 | Dataset: 0-1231181 | Loss: 0.728 | 914 ms/step , 6882.14 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 01:20:05 | Epoch: 1 | Step: 202870 | Dataset: 0-1231501 | Loss: 0.809 | 914 ms/step , 6884.39 GFLOP/s , 
17921.8 tokens/s INFO:__main__:2024-11-06 01:20:14 | Epoch: 1 | Step: 202880 | Dataset: 0-1231821 | Loss: 0.752 | 914 ms/step , 6883.51 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 01:20:23 | Epoch: 1 | Step: 202890 | Dataset: 0-1232141 | Loss: 0.724 | 914 ms/step , 6880.56 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 01:20:32 | Epoch: 1 | Step: 202900 | Dataset: 0-1232461 | Loss: 0.648 | 914 ms/step , 6882.63 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 01:20:34 | Validation | Step: 202900 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:20:43 | Epoch: 1 | Step: 202910 | Dataset: 0-1232781 | Loss: 0.671 | 913 ms/step , 6889.48 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-06 01:20:52 | Epoch: 1 | Step: 202920 | Dataset: 0-1233101 | Loss: 0.704 | 914 ms/step , 6883.90 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:21:01 | Epoch: 1 | Step: 202930 | Dataset: 0-1233421 | Loss: 0.657 | 913 ms/step , 6886.22 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 01:21:10 | Epoch: 1 | Step: 202940 | Dataset: 0-1233741 | Loss: 0.665 | 914 ms/step , 6877.77 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 01:21:19 | Epoch: 1 | Step: 202950 | Dataset: 0-1234061 | Loss: 0.582 | 913 ms/step , 6889.83 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 01:21:29 | Epoch: 1 | Step: 202960 | Dataset: 0-1234381 | Loss: 0.697 | 913 ms/step , 6891.46 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 01:21:38 | Epoch: 1 | Step: 202970 | Dataset: 0-1234701 | Loss: 0.708 | 913 ms/step , 6888.03 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 01:21:47 | Epoch: 1 | Step: 202980 | Dataset: 0-1235021 | Loss: 0.659 | 915 ms/step , 6876.71 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 01:21:56 | Epoch: 1 | Step: 202990 | Dataset: 0-1235341 | Loss: 0.680 | 914 ms/step , 6880.99 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 01:22:05 | Epoch: 1 | Step: 203000 | Dataset: 0-1235661 | Loss: 0.860 | 913 ms/step , 6891.26 GFLOP/s 
, 17924.4 tokens/s INFO:__main__:2024-11-06 01:22:07 | Validation | Step: 203000 | Val_loss: 0.735 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:22:07 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_012207_step_203000.pt` INFO:__main__:2024-11-06 01:22:17 | Epoch: 1 | Step: 203010 | Dataset: 0-1235981 | Loss: 0.647 | 913 ms/step , 6889.74 GFLOP/s , 13761.8 tokens/s INFO:__main__:2024-11-06 01:22:26 | Epoch: 1 | Step: 203020 | Dataset: 0-1236301 | Loss: 0.714 | 913 ms/step , 6887.62 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 01:22:35 | Epoch: 1 | Step: 203030 | Dataset: 0-1236621 | Loss: 0.805 | 913 ms/step , 6891.38 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 01:22:44 | Epoch: 1 | Step: 203040 | Dataset: 0-1236941 | Loss: 0.611 | 916 ms/step , 6867.17 GFLOP/s , 17881.0 tokens/s INFO:__main__:2024-11-06 01:22:54 | Epoch: 1 | Step: 203050 | Dataset: 0-1237261 | Loss: 0.667 | 913 ms/step , 6890.76 GFLOP/s , 17910.4 tokens/s INFO:__main__:2024-11-06 01:23:03 | Epoch: 1 | Step: 203060 | Dataset: 0-1237581 | Loss: 0.737 | 914 ms/step , 6878.19 GFLOP/s , 17906.3 tokens/s INFO:__main__:2024-11-06 01:23:12 | Epoch: 1 | Step: 203070 | Dataset: 0-1237901 | Loss: 0.723 | 915 ms/step , 6875.98 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 01:23:21 | Epoch: 1 | Step: 203080 | Dataset: 0-1238221 | Loss: 0.777 | 913 ms/step , 6886.57 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 01:23:30 | Epoch: 1 | Step: 203090 | Dataset: 0-1238541 | Loss: 0.628 | 913 ms/step , 6891.57 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 01:23:39 | Epoch: 1 | Step: 203100 | Dataset: 0-1238861 | Loss: 0.602 | 914 ms/step , 6884.51 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 01:23:41 | Validation | Step: 203100 | Val_loss: 0.728 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:23:50 | Epoch: 1 | Step: 203110 | Dataset: 0-1239181 | Loss: 0.777 | 913 ms/step , 6886.76 GFLOP/s , 15278.3 tokens/s 
INFO:__main__:2024-11-06 01:23:59 | Epoch: 1 | Step: 203120 | Dataset: 0-1239501 | Loss: 0.713 | 913 ms/step , 6887.69 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 01:24:08 | Epoch: 1 | Step: 203130 | Dataset: 0-1239821 | Loss: 0.674 | 913 ms/step , 6888.26 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 01:24:17 | Epoch: 1 | Step: 203140 | Dataset: 0-1240141 | Loss: 0.660 | 913 ms/step , 6886.24 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 01:24:27 | Epoch: 1 | Step: 203150 | Dataset: 0-1240461 | Loss: 0.592 | 913 ms/step , 6892.10 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 01:24:36 | Epoch: 1 | Step: 203160 | Dataset: 0-1240781 | Loss: 0.708 | 913 ms/step , 6892.09 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 01:24:45 | Epoch: 1 | Step: 203170 | Dataset: 0-1241101 | Loss: 0.767 | 913 ms/step , 6886.64 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 01:24:54 | Epoch: 1 | Step: 203180 | Dataset: 0-1241421 | Loss: 0.741 | 914 ms/step , 6880.60 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 01:25:03 | Epoch: 1 | Step: 203190 | Dataset: 0-1241741 | Loss: 0.660 | 913 ms/step , 6889.38 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 01:25:12 | Epoch: 1 | Step: 203200 | Dataset: 0-1242061 | Loss: 0.577 | 913 ms/step , 6890.97 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 01:25:14 | Validation | Step: 203200 | Val_loss: 0.716 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:25:23 | Epoch: 1 | Step: 203210 | Dataset: 0-1242381 | Loss: 0.814 | 915 ms/step , 6874.44 GFLOP/s , 15282.5 tokens/s INFO:__main__:2024-11-06 01:25:32 | Epoch: 1 | Step: 203220 | Dataset: 0-1242701 | Loss: 0.646 | 914 ms/step , 6879.63 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 01:25:41 | Epoch: 1 | Step: 203230 | Dataset: 0-1243021 | Loss: 0.696 | 913 ms/step , 6891.15 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-06 01:25:50 | Epoch: 1 | Step: 203240 | Dataset: 0-1243341 | Loss: 0.749 | 914 ms/step , 6879.33 GFLOP/s , 17928.5 
tokens/s INFO:__main__:2024-11-06 01:26:00 | Epoch: 1 | Step: 203250 | Dataset: 0-1243661 | Loss: 0.620 | 913 ms/step , 6888.28 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 01:26:09 | Epoch: 1 | Step: 203260 | Dataset: 0-1243981 | Loss: 0.666 | 914 ms/step , 6884.06 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 01:26:18 | Epoch: 1 | Step: 203270 | Dataset: 0-1244301 | Loss: 0.703 | 914 ms/step , 6882.86 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 01:26:27 | Epoch: 1 | Step: 203280 | Dataset: 0-1244621 | Loss: 0.708 | 914 ms/step , 6879.19 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 01:26:36 | Epoch: 1 | Step: 203290 | Dataset: 0-1244941 | Loss: 0.664 | 913 ms/step , 6891.20 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 01:26:45 | Epoch: 1 | Step: 203300 | Dataset: 0-1245261 | Loss: 0.827 | 914 ms/step , 6879.31 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 01:26:47 | Validation | Step: 203300 | Val_loss: 0.714 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:26:56 | Epoch: 1 | Step: 203310 | Dataset: 0-1245581 | Loss: 0.696 | 914 ms/step , 6880.70 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-06 01:27:05 | Epoch: 1 | Step: 203320 | Dataset: 0-1245901 | Loss: 0.713 | 913 ms/step , 6889.97 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 01:27:14 | Epoch: 1 | Step: 203330 | Dataset: 0-1246221 | Loss: 0.672 | 914 ms/step , 6879.21 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 01:27:23 | Epoch: 1 | Step: 203340 | Dataset: 0-1246541 | Loss: 0.709 | 915 ms/step , 6874.29 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 01:27:32 | Epoch: 1 | Step: 203350 | Dataset: 0-1246861 | Loss: 0.636 | 914 ms/step , 6877.68 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 01:27:42 | Epoch: 1 | Step: 203360 | Dataset: 0-1247181 | Loss: 0.766 | 913 ms/step , 6889.05 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 01:27:51 | Epoch: 1 | Step: 203370 | Dataset: 0-1247501 | Loss: 0.651 | 912 ms/step , 6895.12 GFLOP/s , 
17931.4 tokens/s INFO:__main__:2024-11-06 01:28:00 | Epoch: 1 | Step: 203380 | Dataset: 0-1247821 | Loss: 0.636 | 913 ms/step , 6891.19 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 01:28:09 | Epoch: 1 | Step: 203390 | Dataset: 0-1248141 | Loss: 0.692 | 912 ms/step , 6892.62 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 01:28:18 | Epoch: 1 | Step: 203400 | Dataset: 0-1248461 | Loss: 0.689 | 913 ms/step , 6886.56 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 01:28:20 | Validation | Step: 203400 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:28:29 | Epoch: 1 | Step: 203410 | Dataset: 0-1248781 | Loss: 0.723 | 914 ms/step , 6884.52 GFLOP/s , 15265.6 tokens/s INFO:__main__:2024-11-06 01:28:38 | Epoch: 1 | Step: 203420 | Dataset: 0-1249101 | Loss: 0.727 | 913 ms/step , 6885.83 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 01:28:47 | Epoch: 1 | Step: 203430 | Dataset: 0-1249421 | Loss: 0.773 | 913 ms/step , 6888.65 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-06 01:28:56 | Epoch: 1 | Step: 203440 | Dataset: 0-1249741 | Loss: 0.662 | 914 ms/step , 6882.46 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 01:29:05 | Epoch: 1 | Step: 203450 | Dataset: 0-1250061 | Loss: 0.704 | 915 ms/step , 6875.97 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 01:29:15 | Epoch: 1 | Step: 203460 | Dataset: 0-1250381 | Loss: 0.634 | 912 ms/step , 6895.24 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 01:29:24 | Epoch: 1 | Step: 203470 | Dataset: 0-1250701 | Loss: 0.682 | 913 ms/step , 6885.65 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 01:29:33 | Epoch: 1 | Step: 203480 | Dataset: 0-1251021 | Loss: 0.769 | 914 ms/step , 6882.24 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 01:29:42 | Epoch: 1 | Step: 203490 | Dataset: 0-1251341 | Loss: 0.671 | 913 ms/step , 6886.85 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 01:29:51 | Epoch: 1 | Step: 203500 | Dataset: 0-1251661 | Loss: 0.792 | 914 ms/step , 6882.04 GFLOP/s 
, 17917.0 tokens/s INFO:__main__:2024-11-06 01:29:53 | Validation | Step: 203500 | Val_loss: 0.717 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:30:02 | Epoch: 1 | Step: 203510 | Dataset: 0-1251981 | Loss: 0.774 | 913 ms/step , 6890.13 GFLOP/s , 15279.1 tokens/s INFO:__main__:2024-11-06 01:30:11 | Epoch: 1 | Step: 203520 | Dataset: 0-1252301 | Loss: 0.702 | 915 ms/step , 6876.87 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-06 01:30:20 | Epoch: 1 | Step: 203530 | Dataset: 0-1252621 | Loss: 0.707 | 915 ms/step , 6876.67 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 01:30:29 | Epoch: 1 | Step: 203540 | Dataset: 0-1252941 | Loss: 0.677 | 914 ms/step , 6884.73 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 01:30:38 | Epoch: 1 | Step: 203550 | Dataset: 0-1253261 | Loss: 0.715 | 915 ms/step , 6875.25 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 01:30:48 | Epoch: 1 | Step: 203560 | Dataset: 0-1253581 | Loss: 0.678 | 913 ms/step , 6889.88 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 01:30:57 | Epoch: 1 | Step: 203570 | Dataset: 0-1253901 | Loss: 0.685 | 914 ms/step , 6882.89 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-06 01:31:06 | Epoch: 1 | Step: 203580 | Dataset: 0-1254221 | Loss: 0.703 | 913 ms/step , 6890.74 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 01:31:15 | Epoch: 1 | Step: 203590 | Dataset: 0-1254541 | Loss: 0.714 | 913 ms/step , 6886.99 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 01:31:24 | Epoch: 1 | Step: 203600 | Dataset: 0-1254861 | Loss: 0.643 | 913 ms/step , 6890.02 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 01:31:26 | Validation | Step: 203600 | Val_loss: 0.761 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:31:35 | Epoch: 1 | Step: 203610 | Dataset: 0-1255181 | Loss: 0.683 | 914 ms/step , 6881.51 GFLOP/s , 15277.7 tokens/s INFO:__main__:2024-11-06 01:31:44 | Epoch: 1 | Step: 203620 | Dataset: 0-1255501 | Loss: 0.651 | 913 ms/step , 6889.03 GFLOP/s , 17931.2 tokens/s 
INFO:__main__:2024-11-06 01:31:53 | Epoch: 1 | Step: 203630 | Dataset: 0-1255821 | Loss: 0.599 | 914 ms/step , 6883.05 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-06 01:32:02 | Epoch: 1 | Step: 203640 | Dataset: 0-1256141 | Loss: 0.786 | 915 ms/step , 6871.52 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:32:11 | Epoch: 1 | Step: 203650 | Dataset: 0-1256461 | Loss: 0.670 | 913 ms/step , 6887.73 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 01:32:21 | Epoch: 1 | Step: 203660 | Dataset: 0-1256781 | Loss: 0.694 | 914 ms/step , 6881.80 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 01:32:30 | Epoch: 1 | Step: 203670 | Dataset: 0-1257101 | Loss: 0.698 | 913 ms/step , 6891.03 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 01:32:39 | Epoch: 1 | Step: 203680 | Dataset: 0-1257421 | Loss: 0.735 | 914 ms/step , 6883.04 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 01:32:48 | Epoch: 1 | Step: 203690 | Dataset: 0-1257741 | Loss: 0.632 | 912 ms/step , 6896.56 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 01:32:57 | Epoch: 1 | Step: 203700 | Dataset: 0-1258061 | Loss: 0.712 | 915 ms/step , 6870.06 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 01:32:59 | Validation | Step: 203700 | Val_loss: 0.745 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:33:08 | Epoch: 1 | Step: 203710 | Dataset: 0-1258381 | Loss: 0.701 | 914 ms/step , 6880.44 GFLOP/s , 15272.0 tokens/s INFO:__main__:2024-11-06 01:33:17 | Epoch: 1 | Step: 203720 | Dataset: 0-1258701 | Loss: 0.602 | 912 ms/step , 6900.09 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 01:33:26 | Epoch: 1 | Step: 203730 | Dataset: 0-1259021 | Loss: 0.690 | 913 ms/step , 6891.13 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-06 01:33:35 | Epoch: 1 | Step: 203740 | Dataset: 0-1259341 | Loss: 0.662 | 914 ms/step , 6878.89 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 01:33:44 | Epoch: 1 | Step: 203750 | Dataset: 0-1259661 | Loss: 0.676 | 912 ms/step , 6892.76 GFLOP/s , 17936.1 
tokens/s INFO:__main__:2024-11-06 01:33:54 | Epoch: 1 | Step: 203760 | Dataset: 0-1259981 | Loss: 0.750 | 913 ms/step , 6891.69 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-06 01:34:03 | Epoch: 1 | Step: 203770 | Dataset: 0-1260301 | Loss: 0.653 | 913 ms/step , 6885.31 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 01:34:12 | Epoch: 1 | Step: 203780 | Dataset: 0-1260621 | Loss: 0.611 | 913 ms/step , 6886.04 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 01:34:21 | Epoch: 1 | Step: 203790 | Dataset: 0-1260941 | Loss: 0.726 | 914 ms/step , 6878.40 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 01:34:30 | Epoch: 1 | Step: 203800 | Dataset: 0-1261261 | Loss: 0.711 | 913 ms/step , 6885.78 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 01:34:32 | Validation | Step: 203800 | Val_loss: 0.709 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:34:41 | Epoch: 1 | Step: 203810 | Dataset: 0-1261581 | Loss: 0.801 | 913 ms/step , 6885.31 GFLOP/s , 15274.6 tokens/s INFO:__main__:2024-11-06 01:34:50 | Epoch: 1 | Step: 203820 | Dataset: 0-1261901 | Loss: 0.777 | 915 ms/step , 6875.32 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 01:34:59 | Epoch: 1 | Step: 203830 | Dataset: 0-1262221 | Loss: 0.716 | 913 ms/step , 6887.57 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 01:35:08 | Epoch: 1 | Step: 203840 | Dataset: 0-1262541 | Loss: 0.780 | 913 ms/step , 6890.85 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 01:35:17 | Epoch: 1 | Step: 203850 | Dataset: 0-1262861 | Loss: 0.652 | 913 ms/step , 6885.66 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 01:35:27 | Epoch: 1 | Step: 203860 | Dataset: 0-1263181 | Loss: 0.632 | 913 ms/step , 6892.27 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 01:35:36 | Epoch: 1 | Step: 203870 | Dataset: 0-1263501 | Loss: 0.752 | 914 ms/step , 6883.92 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 01:35:45 | Epoch: 1 | Step: 203880 | Dataset: 0-1263821 | Loss: 0.657 | 913 ms/step , 6885.98 GFLOP/s , 
17930.7 tokens/s INFO:__main__:2024-11-06 01:35:54 | Epoch: 1 | Step: 203890 | Dataset: 0-1264141 | Loss: 0.624 | 913 ms/step , 6889.04 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-06 01:36:03 | Epoch: 1 | Step: 203900 | Dataset: 0-1264461 | Loss: 0.657 | 915 ms/step , 6875.78 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 01:36:05 | Validation | Step: 203900 | Val_loss: 0.715 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:36:14 | Epoch: 1 | Step: 203910 | Dataset: 0-1264781 | Loss: 0.745 | 913 ms/step , 6885.49 GFLOP/s , 15263.1 tokens/s INFO:__main__:2024-11-06 01:36:23 | Epoch: 1 | Step: 203920 | Dataset: 0-1265101 | Loss: 0.703 | 913 ms/step , 6888.97 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 01:36:32 | Epoch: 1 | Step: 203930 | Dataset: 0-1265421 | Loss: 0.670 | 914 ms/step , 6882.64 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 01:36:41 | Epoch: 1 | Step: 203940 | Dataset: 0-1265741 | Loss: 0.714 | 913 ms/step , 6886.01 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:36:50 | Epoch: 1 | Step: 203950 | Dataset: 0-1266061 | Loss: 0.674 | 915 ms/step , 6870.59 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-06 01:37:00 | Epoch: 1 | Step: 203960 | Dataset: 0-1266381 | Loss: 0.648 | 912 ms/step , 6895.87 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:37:09 | Epoch: 1 | Step: 203970 | Dataset: 0-1266701 | Loss: 0.696 | 913 ms/step , 6889.40 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 01:37:18 | Epoch: 1 | Step: 203980 | Dataset: 0-1267021 | Loss: 0.752 | 913 ms/step , 6885.82 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 01:37:27 | Epoch: 1 | Step: 203990 | Dataset: 0-1267341 | Loss: 0.646 | 913 ms/step , 6885.28 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 01:37:36 | Epoch: 1 | Step: 204000 | Dataset: 0-1267661 | Loss: 0.731 | 913 ms/step , 6887.59 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 01:37:38 | Validation | Step: 204000 | Val_loss: 0.710 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 01:37:38 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_013738_step_204000.pt` INFO:__main__:2024-11-06 01:37:48 | Epoch: 1 | Step: 204010 | Dataset: 0-1267981 | Loss: 0.804 | 913 ms/step , 6886.62 GFLOP/s , 13797.8 tokens/s INFO:__main__:2024-11-06 01:37:57 | Epoch: 1 | Step: 204020 | Dataset: 0-1268301 | Loss: 0.656 | 914 ms/step , 6880.04 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-06 01:38:06 | Epoch: 1 | Step: 204030 | Dataset: 0-1268621 | Loss: 0.683 | 913 ms/step , 6886.41 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 01:38:15 | Epoch: 1 | Step: 204040 | Dataset: 0-1268941 | Loss: 0.604 | 912 ms/step , 6894.34 GFLOP/s , 17909.5 tokens/s INFO:__main__:2024-11-06 01:38:25 | Epoch: 1 | Step: 204050 | Dataset: 0-1269261 | Loss: 0.724 | 915 ms/step , 6871.08 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 01:38:34 | Epoch: 1 | Step: 204060 | Dataset: 0-1269581 | Loss: 0.751 | 913 ms/step , 6888.39 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:38:43 | Epoch: 1 | Step: 204070 | Dataset: 0-1269901 | Loss: 0.710 | 913 ms/step , 6887.36 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 01:38:52 | Epoch: 1 | Step: 204080 | Dataset: 0-1270221 | Loss: 0.752 | 914 ms/step , 6884.06 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 01:39:01 | Epoch: 1 | Step: 204090 | Dataset: 0-1270541 | Loss: 0.667 | 914 ms/step , 6882.70 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 01:39:10 | Epoch: 1 | Step: 204100 | Dataset: 0-1270861 | Loss: 0.822 | 915 ms/step , 6873.01 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-06 01:39:12 | Validation | Step: 204100 | Val_loss: 0.712 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:39:21 | Epoch: 1 | Step: 204110 | Dataset: 0-1271181 | Loss: 0.770 | 916 ms/step , 6867.03 GFLOP/s , 15289.2 tokens/s INFO:__main__:2024-11-06 01:39:30 | Epoch: 1 | Step: 204120 | Dataset: 0-1271501 | Loss: 0.616 | 913 ms/step , 6891.93 GFLOP/s , 17943.1 tokens/s 
INFO:__main__:2024-11-06 01:39:39 | Epoch: 1 | Step: 204130 | Dataset: 0-1271821 | Loss: 0.663 | 913 ms/step , 6885.05 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 01:39:48 | Epoch: 1 | Step: 204140 | Dataset: 0-1272141 | Loss: 0.632 | 913 ms/step , 6892.40 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 01:39:57 | Epoch: 1 | Step: 204150 | Dataset: 0-1272461 | Loss: 0.642 | 913 ms/step , 6892.54 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-06 01:40:07 | Epoch: 1 | Step: 204160 | Dataset: 0-1272781 | Loss: 0.708 | 915 ms/step , 6874.46 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 01:40:16 | Epoch: 1 | Step: 204170 | Dataset: 0-1273101 | Loss: 0.717 | 913 ms/step , 6886.63 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 01:40:25 | Epoch: 1 | Step: 204180 | Dataset: 0-1273421 | Loss: 0.610 | 913 ms/step , 6885.37 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 01:40:34 | Epoch: 1 | Step: 204190 | Dataset: 0-1273741 | Loss: 0.703 | 914 ms/step , 6884.05 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 01:40:43 | Epoch: 1 | Step: 204200 | Dataset: 0-1274061 | Loss: 0.697 | 913 ms/step , 6889.30 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 01:40:45 | Validation | Step: 204200 | Val_loss: 0.746 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:40:54 | Epoch: 1 | Step: 204210 | Dataset: 0-1274381 | Loss: 0.690 | 914 ms/step , 6881.36 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-06 01:41:03 | Epoch: 1 | Step: 204220 | Dataset: 0-1274701 | Loss: 0.570 | 913 ms/step , 6888.28 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 01:41:12 | Epoch: 1 | Step: 204230 | Dataset: 0-1275021 | Loss: 0.745 | 913 ms/step , 6889.02 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 01:41:21 | Epoch: 1 | Step: 204240 | Dataset: 0-1275341 | Loss: 0.684 | 915 ms/step , 6877.48 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 01:41:30 | Epoch: 1 | Step: 204250 | Dataset: 0-1275661 | Loss: 0.783 | 913 ms/step , 6891.50 GFLOP/s , 17930.4 
tokens/s INFO:__main__:2024-11-06 01:41:40 | Epoch: 1 | Step: 204260 | Dataset: 0-1275981 | Loss: 0.738 | 913 ms/step , 6886.17 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 01:41:49 | Epoch: 1 | Step: 204270 | Dataset: 0-1276301 | Loss: 0.711 | 914 ms/step , 6879.90 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 01:41:58 | Epoch: 1 | Step: 204280 | Dataset: 0-1276621 | Loss: 0.693 | 913 ms/step , 6885.80 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 01:42:07 | Epoch: 1 | Step: 204290 | Dataset: 0-1276941 | Loss: 0.753 | 914 ms/step , 6883.63 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 01:42:16 | Epoch: 1 | Step: 204300 | Dataset: 0-1277261 | Loss: 0.685 | 914 ms/step , 6883.39 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 01:42:18 | Validation | Step: 204300 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:42:27 | Epoch: 1 | Step: 204310 | Dataset: 0-1277581 | Loss: 0.650 | 914 ms/step , 6881.29 GFLOP/s , 15274.3 tokens/s INFO:__main__:2024-11-06 01:42:36 | Epoch: 1 | Step: 204320 | Dataset: 0-1277901 | Loss: 0.608 | 913 ms/step , 6891.70 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 01:42:45 | Epoch: 1 | Step: 204330 | Dataset: 0-1278221 | Loss: 0.462 | 912 ms/step , 6896.86 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 01:42:54 | Epoch: 1 | Step: 204340 | Dataset: 0-1278541 | Loss: 0.706 | 915 ms/step , 6875.29 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 01:43:03 | Epoch: 1 | Step: 204350 | Dataset: 0-1278861 | Loss: 0.672 | 913 ms/step , 6885.23 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 01:43:13 | Epoch: 1 | Step: 204360 | Dataset: 0-1279181 | Loss: 0.708 | 912 ms/step , 6894.09 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-06 01:43:22 | Epoch: 1 | Step: 204370 | Dataset: 0-1279501 | Loss: 0.700 | 913 ms/step , 6891.07 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 01:43:31 | Epoch: 1 | Step: 204380 | Dataset: 0-1279821 | Loss: 0.722 | 913 ms/step , 6891.13 GFLOP/s , 
17933.8 tokens/s INFO:__main__:2024-11-06 01:43:40 | Epoch: 1 | Step: 204390 | Dataset: 0-1280141 | Loss: 0.707 | 912 ms/step , 6897.65 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-06 01:43:49 | Epoch: 1 | Step: 204400 | Dataset: 0-1280461 | Loss: 0.656 | 912 ms/step , 6897.47 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 01:43:51 | Validation | Step: 204400 | Val_loss: 0.735 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:44:00 | Epoch: 1 | Step: 204410 | Dataset: 0-1280781 | Loss: 0.626 | 913 ms/step , 6890.26 GFLOP/s , 15279.2 tokens/s INFO:__main__:2024-11-06 01:44:09 | Epoch: 1 | Step: 204420 | Dataset: 0-1281101 | Loss: 0.781 | 914 ms/step , 6880.30 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 01:44:18 | Epoch: 1 | Step: 204430 | Dataset: 0-1281421 | Loss: 0.701 | 913 ms/step , 6885.47 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 01:44:27 | Epoch: 1 | Step: 204440 | Dataset: 0-1281741 | Loss: 0.637 | 913 ms/step , 6886.98 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 01:44:36 | Epoch: 1 | Step: 204450 | Dataset: 0-1282061 | Loss: 0.704 | 913 ms/step , 6885.73 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 01:44:45 | Epoch: 1 | Step: 204460 | Dataset: 0-1282381 | Loss: 0.600 | 913 ms/step , 6890.38 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 01:44:55 | Epoch: 1 | Step: 204470 | Dataset: 0-1282701 | Loss: 0.670 | 913 ms/step , 6890.35 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-06 01:45:04 | Epoch: 1 | Step: 204480 | Dataset: 0-1283021 | Loss: 0.735 | 912 ms/step , 6898.30 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 01:45:13 | Epoch: 1 | Step: 204490 | Dataset: 0-1283341 | Loss: 0.684 | 914 ms/step , 6880.72 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 01:45:22 | Epoch: 1 | Step: 204500 | Dataset: 0-1283661 | Loss: 0.676 | 914 ms/step , 6884.32 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 01:45:24 | Validation | Step: 204500 | Val_loss: 0.702 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 01:45:33 | Epoch: 1 | Step: 204510 | Dataset: 0-1283981 | Loss: 0.674 | 913 ms/step , 6887.47 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-06 01:45:42 | Epoch: 1 | Step: 204520 | Dataset: 0-1284301 | Loss: 0.669 | 914 ms/step , 6880.18 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 01:45:51 | Epoch: 1 | Step: 204530 | Dataset: 0-1284621 | Loss: 0.848 | 914 ms/step , 6880.53 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 01:46:00 | Epoch: 1 | Step: 204540 | Dataset: 0-1284941 | Loss: 0.714 | 912 ms/step , 6895.39 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 01:46:09 | Epoch: 1 | Step: 204550 | Dataset: 0-1285261 | Loss: 0.716 | 913 ms/step , 6885.42 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 01:46:18 | Epoch: 1 | Step: 204560 | Dataset: 0-1285581 | Loss: 0.750 | 915 ms/step , 6871.06 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 01:46:28 | Epoch: 1 | Step: 204570 | Dataset: 0-1285901 | Loss: 0.728 | 914 ms/step , 6880.94 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 01:46:37 | Epoch: 1 | Step: 204580 | Dataset: 0-1286221 | Loss: 0.685 | 913 ms/step , 6890.37 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 01:46:46 | Epoch: 1 | Step: 204590 | Dataset: 0-1286541 | Loss: 0.662 | 913 ms/step , 6891.23 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 01:46:55 | Epoch: 1 | Step: 204600 | Dataset: 0-1286861 | Loss: 0.624 | 913 ms/step , 6889.89 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 01:46:57 | Validation | Step: 204600 | Val_loss: 0.714 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:47:06 | Epoch: 1 | Step: 204610 | Dataset: 0-1287181 | Loss: 0.664 | 912 ms/step , 6893.43 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-06 01:47:15 | Epoch: 1 | Step: 204620 | Dataset: 0-1287501 | Loss: 0.813 | 913 ms/step , 6890.79 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 01:47:24 | Epoch: 1 | Step: 204630 | Dataset: 0-1287821 | Loss: 0.753 | 914 ms/step , 6881.16 GFLOP/s , 17923.4 
tokens/s INFO:__main__:2024-11-06 01:47:33 | Epoch: 1 | Step: 204640 | Dataset: 0-1288141 | Loss: 0.702 | 913 ms/step , 6886.26 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 01:47:42 | Epoch: 1 | Step: 204650 | Dataset: 0-1288461 | Loss: 0.615 | 915 ms/step , 6877.13 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 01:47:51 | Epoch: 1 | Step: 204660 | Dataset: 0-1288781 | Loss: 0.681 | 913 ms/step , 6887.07 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 01:48:01 | Epoch: 1 | Step: 204670 | Dataset: 0-1289101 | Loss: 0.759 | 913 ms/step , 6889.78 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 01:48:10 | Epoch: 1 | Step: 204680 | Dataset: 0-1289421 | Loss: 0.725 | 914 ms/step , 6878.35 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 01:48:19 | Epoch: 1 | Step: 204690 | Dataset: 0-1289741 | Loss: 0.658 | 915 ms/step , 6875.21 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 01:48:28 | Epoch: 1 | Step: 204700 | Dataset: 0-1290061 | Loss: 0.725 | 913 ms/step , 6887.12 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 01:48:30 | Validation | Step: 204700 | Val_loss: 0.786 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:48:39 | Epoch: 1 | Step: 204710 | Dataset: 0-1290381 | Loss: 0.664 | 914 ms/step , 6882.33 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-06 01:48:48 | Epoch: 1 | Step: 204720 | Dataset: 0-1290701 | Loss: 0.520 | 912 ms/step , 6895.62 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 01:48:57 | Epoch: 1 | Step: 204730 | Dataset: 0-1291021 | Loss: 0.636 | 912 ms/step , 6896.47 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 01:49:06 | Epoch: 1 | Step: 204740 | Dataset: 0-1291341 | Loss: 0.661 | 913 ms/step , 6885.70 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 01:49:15 | Epoch: 1 | Step: 204750 | Dataset: 0-1291661 | Loss: 0.778 | 913 ms/step , 6890.93 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 01:49:24 | Epoch: 1 | Step: 204760 | Dataset: 0-1291981 | Loss: 0.753 | 913 ms/step , 6889.64 GFLOP/s , 
17932.6 tokens/s INFO:__main__:2024-11-06 01:49:34 | Epoch: 1 | Step: 204770 | Dataset: 0-1292301 | Loss: 0.723 | 913 ms/step , 6889.65 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-06 01:49:43 | Epoch: 1 | Step: 204780 | Dataset: 0-1292621 | Loss: 0.683 | 914 ms/step , 6881.96 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 01:49:52 | Epoch: 1 | Step: 204790 | Dataset: 0-1292941 | Loss: 0.702 | 912 ms/step , 6895.50 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-06 01:50:01 | Epoch: 1 | Step: 204800 | Dataset: 0-1293261 | Loss: 0.735 | 913 ms/step , 6890.09 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 01:50:03 | Validation | Step: 204800 | Val_loss: 0.734 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:50:12 | Epoch: 1 | Step: 204810 | Dataset: 0-1293581 | Loss: 0.633 | 914 ms/step , 6884.30 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-06 01:50:21 | Epoch: 1 | Step: 204820 | Dataset: 0-1293901 | Loss: 0.682 | 914 ms/step , 6881.40 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 01:50:30 | Epoch: 1 | Step: 204830 | Dataset: 0-1294221 | Loss: 0.704 | 914 ms/step , 6884.60 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-06 01:50:39 | Epoch: 1 | Step: 204840 | Dataset: 0-1294541 | Loss: 0.673 | 913 ms/step , 6887.78 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 01:50:48 | Epoch: 1 | Step: 204850 | Dataset: 0-1294861 | Loss: 0.671 | 916 ms/step , 6865.58 GFLOP/s , 17921.2 tokens/s INFO:__main__:2024-11-06 01:50:57 | Epoch: 1 | Step: 204860 | Dataset: 0-1295181 | Loss: 0.682 | 913 ms/step , 6890.62 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 01:51:06 | Epoch: 1 | Step: 204870 | Dataset: 0-1295501 | Loss: 0.780 | 914 ms/step , 6884.94 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 01:51:16 | Epoch: 1 | Step: 204880 | Dataset: 0-1295821 | Loss: 0.708 | 914 ms/step , 6884.20 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 01:51:25 | Epoch: 1 | Step: 204890 | Dataset: 0-1296141 | Loss: 0.744 | 913 ms/step , 6890.94 GFLOP/s 
, 17922.9 tokens/s INFO:__main__:2024-11-06 01:51:34 | Epoch: 1 | Step: 204900 | Dataset: 0-1296461 | Loss: 0.707 | 912 ms/step , 6892.91 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 01:51:35 | Validation | Step: 204900 | Val_loss: 0.708 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:51:45 | Epoch: 1 | Step: 204910 | Dataset: 0-1296781 | Loss: 0.666 | 914 ms/step , 6884.47 GFLOP/s , 15274.2 tokens/s INFO:__main__:2024-11-06 01:51:54 | Epoch: 1 | Step: 204920 | Dataset: 0-1297101 | Loss: 0.752 | 914 ms/step , 6878.70 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 01:52:03 | Epoch: 1 | Step: 204930 | Dataset: 0-1297421 | Loss: 0.692 | 914 ms/step , 6881.14 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 01:52:12 | Epoch: 1 | Step: 204940 | Dataset: 0-1297741 | Loss: 0.647 | 913 ms/step , 6885.40 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 01:52:21 | Epoch: 1 | Step: 204950 | Dataset: 0-1298061 | Loss: 0.643 | 912 ms/step , 6895.53 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 01:52:30 | Epoch: 1 | Step: 204960 | Dataset: 0-1298381 | Loss: 0.656 | 913 ms/step , 6885.65 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 01:52:39 | Epoch: 1 | Step: 204970 | Dataset: 0-1298701 | Loss: 0.663 | 912 ms/step , 6892.89 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 01:52:49 | Epoch: 1 | Step: 204980 | Dataset: 0-1299021 | Loss: 0.720 | 915 ms/step , 6874.80 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 01:52:58 | Epoch: 1 | Step: 204990 | Dataset: 0-1299341 | Loss: 0.752 | 915 ms/step , 6873.86 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 01:53:07 | Epoch: 1 | Step: 205000 | Dataset: 0-1299661 | Loss: 0.704 | 915 ms/step , 6877.46 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 01:53:08 | Validation | Step: 205000 | Val_loss: 0.697 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:53:08 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_015308_step_205000.pt` 
INFO:__main__:2024-11-06 01:53:19 | Epoch: 1 | Step: 205010 | Dataset: 0-1299981 | Loss: 0.709 | 917 ms/step , 6856.16 GFLOP/s , 13783.9 tokens/s INFO:__main__:2024-11-06 01:53:28 | Epoch: 1 | Step: 205020 | Dataset: 0-1300301 | Loss: 0.697 | 914 ms/step , 6879.36 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 01:53:37 | Epoch: 1 | Step: 205030 | Dataset: 0-1300621 | Loss: 0.732 | 915 ms/step , 6872.26 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-06 01:53:46 | Epoch: 1 | Step: 205040 | Dataset: 0-1300941 | Loss: 0.658 | 913 ms/step , 6889.49 GFLOP/s , 17870.1 tokens/s INFO:__main__:2024-11-06 01:53:55 | Epoch: 1 | Step: 205050 | Dataset: 0-1301261 | Loss: 0.589 | 914 ms/step , 6884.25 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 01:54:04 | Epoch: 1 | Step: 205060 | Dataset: 0-1301581 | Loss: 0.686 | 913 ms/step , 6885.47 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-06 01:54:14 | Epoch: 1 | Step: 205070 | Dataset: 0-1301901 | Loss: 0.795 | 914 ms/step , 6884.71 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-06 01:54:23 | Epoch: 1 | Step: 205080 | Dataset: 0-1302221 | Loss: 0.672 | 912 ms/step , 6893.93 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 01:54:32 | Epoch: 1 | Step: 205090 | Dataset: 0-1302541 | Loss: 0.709 | 914 ms/step , 6882.96 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 01:54:41 | Epoch: 1 | Step: 205100 | Dataset: 0-1302861 | Loss: 0.685 | 913 ms/step , 6890.70 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 01:54:43 | Validation | Step: 205100 | Val_loss: 0.698 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:54:52 | Epoch: 1 | Step: 205110 | Dataset: 0-1303181 | Loss: 0.698 | 913 ms/step , 6892.01 GFLOP/s , 15284.8 tokens/s INFO:__main__:2024-11-06 01:55:01 | Epoch: 1 | Step: 205120 | Dataset: 0-1303501 | Loss: 0.797 | 914 ms/step , 6881.96 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 01:55:10 | Epoch: 1 | Step: 205130 | Dataset: 0-1303821 | Loss: 0.713 | 913 ms/step , 6891.99 GFLOP/s , 17937.7 
tokens/s INFO:__main__:2024-11-06 01:55:19 | Epoch: 1 | Step: 205140 | Dataset: 0-1304141 | Loss: 0.653 | 913 ms/step , 6891.25 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-06 01:55:28 | Epoch: 1 | Step: 205150 | Dataset: 0-1304461 | Loss: 0.717 | 913 ms/step , 6885.86 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 01:55:37 | Epoch: 1 | Step: 205160 | Dataset: 0-1304781 | Loss: 0.729 | 916 ms/step , 6865.40 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-06 01:55:47 | Epoch: 1 | Step: 205170 | Dataset: 0-1305101 | Loss: 0.611 | 913 ms/step , 6888.26 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 01:55:56 | Epoch: 1 | Step: 205180 | Dataset: 0-1305421 | Loss: 0.695 | 914 ms/step , 6883.88 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 01:56:05 | Epoch: 1 | Step: 205190 | Dataset: 0-1305741 | Loss: 0.687 | 914 ms/step , 6882.71 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 01:56:14 | Epoch: 1 | Step: 205200 | Dataset: 0-1306061 | Loss: 0.675 | 913 ms/step , 6892.39 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 01:56:16 | Validation | Step: 205200 | Val_loss: 0.753 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:56:25 | Epoch: 1 | Step: 205210 | Dataset: 0-1306381 | Loss: 0.726 | 914 ms/step , 6884.37 GFLOP/s , 15263.4 tokens/s INFO:__main__:2024-11-06 01:56:34 | Epoch: 1 | Step: 205220 | Dataset: 0-1306701 | Loss: 0.750 | 913 ms/step , 6890.58 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 01:56:43 | Epoch: 1 | Step: 205230 | Dataset: 0-1307021 | Loss: 0.700 | 915 ms/step , 6876.81 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 01:56:52 | Epoch: 1 | Step: 205240 | Dataset: 0-1307341 | Loss: 0.695 | 914 ms/step , 6883.67 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 01:57:01 | Epoch: 1 | Step: 205250 | Dataset: 0-1307661 | Loss: 0.593 | 913 ms/step , 6891.77 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 01:57:10 | Epoch: 1 | Step: 205260 | Dataset: 0-1307981 | Loss: 0.735 | 914 ms/step , 6878.61 GFLOP/s , 
17920.3 tokens/s INFO:__main__:2024-11-06 01:57:20 | Epoch: 1 | Step: 205270 | Dataset: 0-1308301 | Loss: 0.551 | 913 ms/step , 6887.52 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 01:57:29 | Epoch: 1 | Step: 205280 | Dataset: 0-1308621 | Loss: 0.802 | 915 ms/step , 6871.47 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-06 01:57:38 | Epoch: 1 | Step: 205290 | Dataset: 0-1308941 | Loss: 0.717 | 914 ms/step , 6879.16 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 01:57:47 | Epoch: 1 | Step: 205300 | Dataset: 0-1309261 | Loss: 0.618 | 912 ms/step , 6894.94 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 01:57:49 | Validation | Step: 205300 | Val_loss: 0.695 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:57:58 | Epoch: 1 | Step: 205310 | Dataset: 0-1309581 | Loss: 0.743 | 913 ms/step , 6886.66 GFLOP/s , 15274.0 tokens/s INFO:__main__:2024-11-06 01:58:07 | Epoch: 1 | Step: 205320 | Dataset: 0-1309901 | Loss: 0.742 | 914 ms/step , 6881.26 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 01:58:16 | Epoch: 1 | Step: 205330 | Dataset: 0-1310221 | Loss: 0.752 | 913 ms/step , 6889.45 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 01:58:25 | Epoch: 1 | Step: 205340 | Dataset: 0-1310541 | Loss: 0.752 | 912 ms/step , 6894.66 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 01:58:34 | Epoch: 1 | Step: 205350 | Dataset: 0-1310861 | Loss: 0.710 | 913 ms/step , 6889.41 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 01:58:43 | Epoch: 1 | Step: 205360 | Dataset: 0-1311181 | Loss: 0.698 | 912 ms/step , 6896.49 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 01:58:53 | Epoch: 1 | Step: 205370 | Dataset: 0-1311501 | Loss: 0.787 | 913 ms/step , 6887.16 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 01:59:02 | Epoch: 1 | Step: 205380 | Dataset: 0-1311821 | Loss: 0.773 | 914 ms/step , 6881.88 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 01:59:11 | Epoch: 1 | Step: 205390 | Dataset: 0-1312141 | Loss: 0.749 | 912 ms/step , 6893.19 GFLOP/s 
, 17933.5 tokens/s INFO:__main__:2024-11-06 01:59:20 | Epoch: 1 | Step: 205400 | Dataset: 0-1312461 | Loss: 0.745 | 913 ms/step , 6889.20 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 01:59:22 | Validation | Step: 205400 | Val_loss: 0.663 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 01:59:31 | Epoch: 1 | Step: 205410 | Dataset: 0-1312781 | Loss: 0.551 | 913 ms/step , 6886.72 GFLOP/s , 15274.8 tokens/s INFO:__main__:2024-11-06 01:59:40 | Epoch: 1 | Step: 205420 | Dataset: 0-1313101 | Loss: 0.589 | 913 ms/step , 6887.21 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 01:59:49 | Epoch: 1 | Step: 205430 | Dataset: 0-1313421 | Loss: 0.714 | 912 ms/step , 6893.52 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 01:59:58 | Epoch: 1 | Step: 205440 | Dataset: 0-1313741 | Loss: 0.767 | 914 ms/step , 6879.97 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 02:00:07 | Epoch: 1 | Step: 205450 | Dataset: 0-1314061 | Loss: 0.686 | 915 ms/step , 6876.25 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 02:00:16 | Epoch: 1 | Step: 205460 | Dataset: 0-1314381 | Loss: 0.731 | 914 ms/step , 6882.12 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 02:00:25 | Epoch: 1 | Step: 205470 | Dataset: 0-1314701 | Loss: 0.781 | 913 ms/step , 6889.70 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 02:00:35 | Epoch: 1 | Step: 205480 | Dataset: 0-1315021 | Loss: 0.670 | 915 ms/step , 6877.24 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 02:00:44 | Epoch: 1 | Step: 205490 | Dataset: 0-1315341 | Loss: 0.774 | 914 ms/step , 6883.99 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 02:00:53 | Epoch: 1 | Step: 205500 | Dataset: 0-1315661 | Loss: 0.791 | 914 ms/step , 6880.36 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 02:00:54 | Validation | Step: 205500 | Val_loss: 0.678 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:01:04 | Epoch: 1 | Step: 205510 | Dataset: 0-1315981 | Loss: 0.745 | 913 ms/step , 6887.66 GFLOP/s , 15278.4 tokens/s 
INFO:__main__:2024-11-06 02:01:13 | Epoch: 1 | Step: 205520 | Dataset: 0-1316301 | Loss: 0.852 | 914 ms/step , 6883.82 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-06 02:01:22 | Epoch: 1 | Step: 205530 | Dataset: 0-1316621 | Loss: 0.779 | 914 ms/step , 6880.26 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 02:01:31 | Epoch: 1 | Step: 205540 | Dataset: 0-1316941 | Loss: 0.805 | 914 ms/step , 6879.45 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 02:01:40 | Epoch: 1 | Step: 205550 | Dataset: 0-1317261 | Loss: 0.807 | 913 ms/step , 6887.07 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 02:01:49 | Epoch: 1 | Step: 205560 | Dataset: 0-1317581 | Loss: 0.692 | 914 ms/step , 6881.38 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 02:01:58 | Epoch: 1 | Step: 205570 | Dataset: 0-1317901 | Loss: 0.738 | 913 ms/step , 6889.32 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 02:02:08 | Epoch: 1 | Step: 205580 | Dataset: 0-1318221 | Loss: 0.756 | 913 ms/step , 6889.35 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 02:02:17 | Epoch: 1 | Step: 205590 | Dataset: 0-1318541 | Loss: 0.722 | 913 ms/step , 6888.12 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 02:02:26 | Epoch: 1 | Step: 205600 | Dataset: 0-1318861 | Loss: 0.693 | 915 ms/step , 6872.40 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 02:02:27 | Validation | Step: 205600 | Val_loss: 0.696 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:02:37 | Epoch: 1 | Step: 205610 | Dataset: 0-1319181 | Loss: 0.592 | 913 ms/step , 6891.51 GFLOP/s , 15267.6 tokens/s INFO:__main__:2024-11-06 02:02:46 | Epoch: 1 | Step: 205620 | Dataset: 0-1319501 | Loss: 0.721 | 914 ms/step , 6880.48 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 02:02:55 | Epoch: 1 | Step: 205630 | Dataset: 0-1319821 | Loss: 0.688 | 914 ms/step , 6880.63 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 02:03:04 | Epoch: 1 | Step: 205640 | Dataset: 0-1320141 | Loss: 0.711 | 915 ms/step , 6876.63 GFLOP/s , 17925.7 
tokens/s INFO:__main__:2024-11-06 02:03:13 | Epoch: 1 | Step: 205650 | Dataset: 0-1320461 | Loss: 0.576 | 913 ms/step , 6887.80 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 02:03:22 | Epoch: 1 | Step: 205660 | Dataset: 0-1320781 | Loss: 0.832 | 914 ms/step , 6882.93 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 02:03:31 | Epoch: 1 | Step: 205670 | Dataset: 0-1321101 | Loss: 0.707 | 913 ms/step , 6891.84 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 02:03:41 | Epoch: 1 | Step: 205680 | Dataset: 0-1321421 | Loss: 0.702 | 913 ms/step , 6891.88 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 02:03:50 | Epoch: 1 | Step: 205690 | Dataset: 0-1321741 | Loss: 0.499 | 912 ms/step , 6892.85 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 02:03:59 | Epoch: 1 | Step: 205700 | Dataset: 0-1322061 | Loss: 0.603 | 913 ms/step , 6891.33 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 02:04:00 | Validation | Step: 205700 | Val_loss: 0.671 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:04:10 | Epoch: 1 | Step: 205710 | Dataset: 0-1322381 | Loss: 0.776 | 913 ms/step , 6889.48 GFLOP/s , 15279.9 tokens/s INFO:__main__:2024-11-06 02:04:19 | Epoch: 1 | Step: 205720 | Dataset: 0-1322701 | Loss: 0.622 | 912 ms/step , 6896.56 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-06 02:04:28 | Epoch: 1 | Step: 205730 | Dataset: 0-1323021 | Loss: 0.650 | 913 ms/step , 6891.15 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-06 02:04:37 | Epoch: 1 | Step: 205740 | Dataset: 0-1323341 | Loss: 0.743 | 913 ms/step , 6888.32 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 02:04:46 | Epoch: 1 | Step: 205750 | Dataset: 0-1323661 | Loss: 0.582 | 913 ms/step , 6889.53 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 02:04:55 | Epoch: 1 | Step: 205760 | Dataset: 0-1323981 | Loss: 0.762 | 914 ms/step , 6883.02 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 02:05:04 | Epoch: 1 | Step: 205770 | Dataset: 0-1324301 | Loss: 0.877 | 912 ms/step , 6893.19 GFLOP/s , 
17930.1 tokens/s INFO:__main__:2024-11-06 02:05:14 | Epoch: 1 | Step: 205780 | Dataset: 0-1324621 | Loss: 0.773 | 914 ms/step , 6878.18 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 02:05:23 | Epoch: 1 | Step: 205790 | Dataset: 0-1324941 | Loss: 0.623 | 913 ms/step , 6890.15 GFLOP/s , 17946.4 tokens/s INFO:__main__:2024-11-06 02:05:32 | Epoch: 1 | Step: 205800 | Dataset: 0-1325261 | Loss: 0.699 | 914 ms/step , 6883.60 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 02:05:33 | Validation | Step: 205800 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:05:43 | Epoch: 1 | Step: 205810 | Dataset: 0-1325581 | Loss: 0.784 | 913 ms/step , 6891.31 GFLOP/s , 15277.5 tokens/s INFO:__main__:2024-11-06 02:05:52 | Epoch: 1 | Step: 205820 | Dataset: 0-1325901 | Loss: 0.811 | 912 ms/step , 6894.08 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-06 02:06:01 | Epoch: 1 | Step: 205830 | Dataset: 0-1326221 | Loss: 0.556 | 913 ms/step , 6886.45 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 02:06:10 | Epoch: 1 | Step: 205840 | Dataset: 0-1326541 | Loss: 0.654 | 913 ms/step , 6887.11 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 02:06:19 | Epoch: 1 | Step: 205850 | Dataset: 0-1326861 | Loss: 0.695 | 913 ms/step , 6887.88 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 02:06:28 | Epoch: 1 | Step: 205860 | Dataset: 0-1327181 | Loss: 0.666 | 912 ms/step , 6892.61 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 02:06:37 | Epoch: 1 | Step: 205870 | Dataset: 0-1327501 | Loss: 0.766 | 913 ms/step , 6887.67 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 02:06:46 | Epoch: 1 | Step: 205880 | Dataset: 0-1327821 | Loss: 0.729 | 915 ms/step , 6870.61 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 02:06:56 | Epoch: 1 | Step: 205890 | Dataset: 0-1328141 | Loss: 0.850 | 915 ms/step , 6877.29 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 02:07:05 | Epoch: 1 | Step: 205900 | Dataset: 0-1328461 | Loss: 0.789 | 913 ms/step , 6887.82 GFLOP/s 
, 17934.7 tokens/s INFO:__main__:2024-11-06 02:07:06 | Validation | Step: 205900 | Val_loss: 0.692 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:07:15 | Epoch: 1 | Step: 205910 | Dataset: 0-1328781 | Loss: 0.782 | 913 ms/step , 6889.19 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-06 02:07:25 | Epoch: 1 | Step: 205920 | Dataset: 0-1329101 | Loss: 0.750 | 914 ms/step , 6884.38 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-06 02:07:34 | Epoch: 1 | Step: 205930 | Dataset: 0-1329421 | Loss: 0.735 | 913 ms/step , 6892.15 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 02:07:43 | Epoch: 1 | Step: 205940 | Dataset: 0-1329741 | Loss: 0.830 | 912 ms/step , 6892.62 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 02:07:52 | Epoch: 1 | Step: 205950 | Dataset: 0-1330061 | Loss: 0.670 | 913 ms/step , 6887.60 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 02:08:01 | Epoch: 1 | Step: 205960 | Dataset: 0-1330381 | Loss: 0.761 | 913 ms/step , 6891.09 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 02:08:10 | Epoch: 1 | Step: 205970 | Dataset: 0-1330701 | Loss: 0.754 | 915 ms/step , 6874.09 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 02:08:19 | Epoch: 1 | Step: 205980 | Dataset: 0-1331021 | Loss: 0.751 | 913 ms/step , 6888.75 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 02:08:29 | Epoch: 1 | Step: 205990 | Dataset: 0-1331341 | Loss: 0.780 | 912 ms/step , 6894.89 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 02:08:38 | Epoch: 1 | Step: 206000 | Dataset: 0-1331661 | Loss: 0.703 | 913 ms/step , 6891.08 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 02:08:39 | Validation | Step: 206000 | Val_loss: 0.709 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:08:39 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_020839_step_206000.pt` INFO:__main__:2024-11-06 02:08:50 | Epoch: 1 | Step: 206010 | Dataset: 0-1331981 | Loss: 0.729 | 913 ms/step , 6889.24 GFLOP/s , 13806.2 tokens/s 
INFO:__main__:2024-11-06 02:08:59 | Epoch: 1 | Step: 206020 | Dataset: 0-1332301 | Loss: 0.789 | 914 ms/step , 6884.00 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 02:09:08 | Epoch: 1 | Step: 206030 | Dataset: 0-1332621 | Loss: 0.781 | 913 ms/step , 6886.32 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 02:09:17 | Epoch: 1 | Step: 206040 | Dataset: 0-1332941 | Loss: 0.648 | 914 ms/step , 6878.96 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 02:09:26 | Epoch: 1 | Step: 206050 | Dataset: 0-1333261 | Loss: 0.830 | 914 ms/step , 6884.45 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 02:09:35 | Epoch: 1 | Step: 206060 | Dataset: 0-1333581 | Loss: 0.703 | 912 ms/step , 6893.13 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 02:09:44 | Epoch: 1 | Step: 206070 | Dataset: 0-1333901 | Loss: 0.820 | 913 ms/step , 6886.51 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 02:09:54 | Epoch: 1 | Step: 206080 | Dataset: 0-1334221 | Loss: 0.752 | 913 ms/step , 6890.46 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-06 02:10:03 | Epoch: 1 | Step: 206090 | Dataset: 0-1334541 | Loss: 0.747 | 914 ms/step , 6882.70 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 02:10:12 | Epoch: 1 | Step: 206100 | Dataset: 0-1334861 | Loss: 0.789 | 915 ms/step , 6876.58 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 02:10:13 | Validation | Step: 206100 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:10:23 | Epoch: 1 | Step: 206110 | Dataset: 0-1335181 | Loss: 0.467 | 912 ms/step , 6895.31 GFLOP/s , 15295.3 tokens/s INFO:__main__:2024-11-06 02:10:32 | Epoch: 1 | Step: 206120 | Dataset: 0-1335501 | Loss: 0.741 | 914 ms/step , 6883.86 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 02:10:41 | Epoch: 1 | Step: 206130 | Dataset: 0-1335821 | Loss: 0.766 | 913 ms/step , 6889.03 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 02:10:50 | Epoch: 1 | Step: 206140 | Dataset: 0-1336141 | Loss: 0.580 | 912 ms/step , 6895.49 GFLOP/s , 17934.5 
tokens/s INFO:__main__:2024-11-06 02:10:59 | Epoch: 1 | Step: 206150 | Dataset: 0-1336461 | Loss: 0.713 | 913 ms/step , 6889.87 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-06 02:11:08 | Epoch: 1 | Step: 206160 | Dataset: 0-1336781 | Loss: 0.809 | 913 ms/step , 6888.18 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-06 02:11:17 | Epoch: 1 | Step: 206170 | Dataset: 0-1337101 | Loss: 0.725 | 913 ms/step , 6889.22 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 02:11:26 | Epoch: 1 | Step: 206180 | Dataset: 0-1337421 | Loss: 0.774 | 913 ms/step , 6888.98 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 02:11:36 | Epoch: 1 | Step: 206190 | Dataset: 0-1337741 | Loss: 0.830 | 913 ms/step , 6889.63 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-06 02:11:45 | Epoch: 1 | Step: 206200 | Dataset: 0-1338061 | Loss: 0.781 | 913 ms/step , 6885.16 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 02:11:46 | Validation | Step: 206200 | Val_loss: 0.732 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:11:55 | Epoch: 1 | Step: 206210 | Dataset: 0-1338381 | Loss: 0.875 | 914 ms/step , 6884.99 GFLOP/s , 15272.2 tokens/s INFO:__main__:2024-11-06 02:12:05 | Epoch: 1 | Step: 206220 | Dataset: 0-1338701 | Loss: 0.792 | 913 ms/step , 6887.42 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 02:12:14 | Epoch: 1 | Step: 206230 | Dataset: 0-1339021 | Loss: 0.669 | 914 ms/step , 6884.44 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 02:12:23 | Epoch: 1 | Step: 206240 | Dataset: 0-1339341 | Loss: 0.617 | 913 ms/step , 6890.86 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 02:12:32 | Epoch: 1 | Step: 206250 | Dataset: 0-1339661 | Loss: 0.733 | 913 ms/step , 6885.30 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 02:12:41 | Epoch: 1 | Step: 206260 | Dataset: 0-1339981 | Loss: 0.759 | 913 ms/step , 6886.85 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-06 02:12:50 | Epoch: 1 | Step: 206270 | Dataset: 0-1340301 | Loss: 0.695 | 913 ms/step , 6892.08 GFLOP/s , 
17935.4 tokens/s INFO:__main__:2024-11-06 02:12:59 | Epoch: 1 | Step: 206280 | Dataset: 0-1340621 | Loss: 0.695 | 915 ms/step , 6873.31 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 02:13:09 | Epoch: 1 | Step: 206290 | Dataset: 0-1340941 | Loss: 0.625 | 914 ms/step , 6880.56 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 02:13:18 | Epoch: 1 | Step: 206300 | Dataset: 0-1341261 | Loss: 0.678 | 914 ms/step , 6878.77 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 02:13:19 | Validation | Step: 206300 | Val_loss: 0.684 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:13:28 | Epoch: 1 | Step: 206310 | Dataset: 0-1341581 | Loss: 0.762 | 913 ms/step , 6888.31 GFLOP/s , 15271.8 tokens/s INFO:__main__:2024-11-06 02:13:38 | Epoch: 1 | Step: 206320 | Dataset: 0-1341901 | Loss: 0.775 | 913 ms/step , 6888.08 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 02:13:47 | Epoch: 1 | Step: 206330 | Dataset: 0-1342221 | Loss: 0.701 | 912 ms/step , 6895.68 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 02:13:56 | Epoch: 1 | Step: 206340 | Dataset: 0-1342541 | Loss: 0.751 | 913 ms/step , 6885.28 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 02:14:05 | Epoch: 1 | Step: 206350 | Dataset: 0-1342861 | Loss: 0.762 | 913 ms/step , 6888.16 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-06 02:14:14 | Epoch: 1 | Step: 206360 | Dataset: 0-1343181 | Loss: 0.809 | 913 ms/step , 6887.92 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 02:14:23 | Epoch: 1 | Step: 206370 | Dataset: 0-1343501 | Loss: 0.801 | 915 ms/step , 6877.18 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-06 02:14:32 | Epoch: 1 | Step: 206380 | Dataset: 0-1343821 | Loss: 0.869 | 913 ms/step , 6892.47 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 02:14:42 | Epoch: 1 | Step: 206390 | Dataset: 0-1344141 | Loss: 0.637 | 914 ms/step , 6884.85 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 02:14:51 | Epoch: 1 | Step: 206400 | Dataset: 0-1344461 | Loss: 0.598 | 914 ms/step , 6883.34 GFLOP/s 
, 17930.9 tokens/s INFO:__main__:2024-11-06 02:14:52 | Validation | Step: 206400 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:15:01 | Epoch: 1 | Step: 206410 | Dataset: 0-1344781 | Loss: 0.639 | 913 ms/step , 6891.63 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-06 02:15:11 | Epoch: 1 | Step: 206420 | Dataset: 0-1345101 | Loss: 0.695 | 914 ms/step , 6882.27 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 02:15:20 | Epoch: 1 | Step: 206430 | Dataset: 0-1345421 | Loss: 0.767 | 914 ms/step , 6881.30 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 02:15:29 | Epoch: 1 | Step: 206440 | Dataset: 0-1345741 | Loss: 0.651 | 915 ms/step , 6874.15 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 02:15:38 | Epoch: 1 | Step: 206450 | Dataset: 0-1346061 | Loss: 0.702 | 913 ms/step , 6886.17 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 02:15:47 | Epoch: 1 | Step: 206460 | Dataset: 0-1346381 | Loss: 0.666 | 913 ms/step , 6889.13 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 02:15:56 | Epoch: 1 | Step: 206470 | Dataset: 0-1346701 | Loss: 0.456 | 912 ms/step , 6896.78 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 02:16:05 | Epoch: 1 | Step: 206480 | Dataset: 0-1347021 | Loss: 0.742 | 914 ms/step , 6884.37 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 02:16:14 | Epoch: 1 | Step: 206490 | Dataset: 0-1347341 | Loss: 0.564 | 913 ms/step , 6891.55 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 02:16:24 | Epoch: 1 | Step: 206500 | Dataset: 0-1347661 | Loss: 0.872 | 913 ms/step , 6885.20 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 02:16:25 | Validation | Step: 206500 | Val_loss: 0.694 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:16:34 | Epoch: 1 | Step: 206510 | Dataset: 0-1347981 | Loss: 0.797 | 914 ms/step , 6879.90 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-06 02:16:43 | Epoch: 1 | Step: 206520 | Dataset: 0-1348301 | Loss: 0.720 | 914 ms/step , 6882.90 GFLOP/s , 17936.6 tokens/s 
INFO:__main__:2024-11-06 02:16:53 | Epoch: 1 | Step: 206530 | Dataset: 0-1348621 | Loss: 0.761 | 913 ms/step , 6886.11 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 02:17:02 | Epoch: 1 | Step: 206540 | Dataset: 0-1348941 | Loss: 0.756 | 913 ms/step , 6891.04 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 02:17:11 | Epoch: 1 | Step: 206550 | Dataset: 0-1349261 | Loss: 0.807 | 914 ms/step , 6881.56 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 02:17:20 | Epoch: 1 | Step: 206560 | Dataset: 0-1349581 | Loss: 0.669 | 913 ms/step , 6888.10 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-06 02:17:29 | Epoch: 1 | Step: 206570 | Dataset: 0-1349901 | Loss: 0.824 | 914 ms/step , 6882.96 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 02:17:38 | Epoch: 1 | Step: 206580 | Dataset: 0-1350221 | Loss: 0.788 | 912 ms/step , 6894.26 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 02:17:47 | Epoch: 1 | Step: 206590 | Dataset: 0-1350541 | Loss: 0.706 | 914 ms/step , 6878.93 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 02:17:57 | Epoch: 1 | Step: 206600 | Dataset: 0-1350861 | Loss: 0.712 | 913 ms/step , 6887.83 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 02:17:58 | Validation | Step: 206600 | Val_loss: 0.746 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:18:07 | Epoch: 1 | Step: 206610 | Dataset: 0-1351181 | Loss: 0.679 | 913 ms/step , 6890.04 GFLOP/s , 15260.4 tokens/s INFO:__main__:2024-11-06 02:18:16 | Epoch: 1 | Step: 206620 | Dataset: 0-1351501 | Loss: 0.819 | 914 ms/step , 6883.81 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 02:18:26 | Epoch: 1 | Step: 206630 | Dataset: 0-1351821 | Loss: 0.435 | 913 ms/step , 6891.96 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 02:18:35 | Epoch: 1 | Step: 206640 | Dataset: 0-1352141 | Loss: 0.700 | 913 ms/step , 6889.36 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 02:18:44 | Epoch: 1 | Step: 206650 | Dataset: 0-1352461 | Loss: 0.822 | 913 ms/step , 6886.36 GFLOP/s , 17921.4 
tokens/s INFO:__main__:2024-11-06 02:18:53 | Epoch: 1 | Step: 206660 | Dataset: 0-1352781 | Loss: 0.688 | 913 ms/step , 6887.62 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 02:19:02 | Epoch: 1 | Step: 206670 | Dataset: 0-1353101 | Loss: 0.788 | 914 ms/step , 6880.47 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 02:19:11 | Epoch: 1 | Step: 206680 | Dataset: 0-1353421 | Loss: 0.673 | 913 ms/step , 6888.18 GFLOP/s , 17946.3 tokens/s INFO:__main__:2024-11-06 02:19:20 | Epoch: 1 | Step: 206690 | Dataset: 0-1353741 | Loss: 0.671 | 912 ms/step , 6894.10 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 02:19:30 | Epoch: 1 | Step: 206700 | Dataset: 0-1354061 | Loss: 0.628 | 913 ms/step , 6887.38 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 02:19:31 | Validation | Step: 206700 | Val_loss: 0.710 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:19:40 | Epoch: 1 | Step: 206710 | Dataset: 0-1354381 | Loss: 0.563 | 913 ms/step , 6888.45 GFLOP/s , 15275.7 tokens/s INFO:__main__:2024-11-06 02:19:49 | Epoch: 1 | Step: 206720 | Dataset: 0-1354701 | Loss: 0.554 | 912 ms/step , 6899.12 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 02:19:59 | Epoch: 1 | Step: 206730 | Dataset: 0-1355021 | Loss: 0.624 | 915 ms/step , 6872.94 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-06 02:20:08 | Epoch: 1 | Step: 206740 | Dataset: 0-1355341 | Loss: 0.734 | 913 ms/step , 6890.37 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 02:20:17 | Epoch: 1 | Step: 206750 | Dataset: 0-1355661 | Loss: 0.766 | 913 ms/step , 6887.41 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 02:20:26 | Epoch: 1 | Step: 206760 | Dataset: 0-1355981 | Loss: 0.718 | 914 ms/step , 6884.30 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 02:20:35 | Epoch: 1 | Step: 206770 | Dataset: 0-1356301 | Loss: 0.774 | 914 ms/step , 6881.95 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-06 02:20:44 | Epoch: 1 | Step: 206780 | Dataset: 0-1356621 | Loss: 0.731 | 915 ms/step , 6872.18 GFLOP/s , 
17910.4 tokens/s INFO:__main__:2024-11-06 02:20:53 | Epoch: 1 | Step: 206790 | Dataset: 0-1356941 | Loss: 0.734 | 914 ms/step , 6881.97 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 02:21:03 | Epoch: 1 | Step: 206800 | Dataset: 0-1357261 | Loss: 0.775 | 913 ms/step , 6889.53 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 02:21:04 | Validation | Step: 206800 | Val_loss: 0.677 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:21:13 | Epoch: 1 | Step: 206810 | Dataset: 0-1357581 | Loss: 0.679 | 913 ms/step , 6885.30 GFLOP/s , 15265.3 tokens/s INFO:__main__:2024-11-06 02:21:22 | Epoch: 1 | Step: 206820 | Dataset: 0-1357901 | Loss: 0.747 | 913 ms/step , 6885.92 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-06 02:21:32 | Epoch: 1 | Step: 206830 | Dataset: 0-1358221 | Loss: 0.752 | 914 ms/step , 6877.58 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 02:21:41 | Epoch: 1 | Step: 206840 | Dataset: 0-1358541 | Loss: 0.688 | 913 ms/step , 6889.55 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 02:21:50 | Epoch: 1 | Step: 206850 | Dataset: 0-1358861 | Loss: 0.667 | 914 ms/step , 6879.17 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 02:21:59 | Epoch: 1 | Step: 206860 | Dataset: 0-1359181 | Loss: 0.702 | 914 ms/step , 6882.52 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-06 02:22:08 | Epoch: 1 | Step: 206870 | Dataset: 0-1359501 | Loss: 0.759 | 914 ms/step , 6879.31 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 02:22:17 | Epoch: 1 | Step: 206880 | Dataset: 0-1359821 | Loss: 0.726 | 914 ms/step , 6879.17 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 02:22:26 | Epoch: 1 | Step: 206890 | Dataset: 0-1360141 | Loss: 0.718 | 914 ms/step , 6880.38 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 02:22:36 | Epoch: 1 | Step: 206900 | Dataset: 0-1360461 | Loss: 0.752 | 914 ms/step , 6881.29 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-06 02:22:37 | Validation | Step: 206900 | Val_loss: 0.690 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 02:22:46 | Epoch: 1 | Step: 206910 | Dataset: 0-1360781 | Loss: 0.679 | 914 ms/step , 6880.96 GFLOP/s , 15262.7 tokens/s INFO:__main__:2024-11-06 02:22:55 | Epoch: 1 | Step: 206920 | Dataset: 0-1361101 | Loss: 0.797 | 914 ms/step , 6881.26 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-06 02:23:05 | Epoch: 1 | Step: 206930 | Dataset: 0-1361421 | Loss: 0.691 | 913 ms/step , 6886.89 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-06 02:23:14 | Epoch: 1 | Step: 206940 | Dataset: 0-1361741 | Loss: 0.697 | 914 ms/step , 6879.30 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-06 02:23:23 | Epoch: 1 | Step: 206950 | Dataset: 0-1362061 | Loss: 0.718 | 913 ms/step , 6886.20 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-06 02:23:32 | Epoch: 1 | Step: 206960 | Dataset: 0-1362381 | Loss: 0.718 | 913 ms/step , 6885.32 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 02:23:41 | Epoch: 1 | Step: 206970 | Dataset: 0-1362701 | Loss: 0.681 | 913 ms/step , 6888.95 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-06 02:23:50 | Epoch: 1 | Step: 206980 | Dataset: 0-1363021 | Loss: 0.715 | 914 ms/step , 6879.03 GFLOP/s , 17906.8 tokens/s INFO:__main__:2024-11-06 02:23:59 | Epoch: 1 | Step: 206990 | Dataset: 0-1363341 | Loss: 0.648 | 914 ms/step , 6883.63 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-06 02:24:09 | Epoch: 1 | Step: 207000 | Dataset: 0-1363661 | Loss: 0.701 | 914 ms/step , 6880.63 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-06 02:24:10 | Validation | Step: 207000 | Val_loss: 0.703 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:24:10 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_022410_step_207000.pt` INFO:__main__:2024-11-06 02:24:20 | Epoch: 1 | Step: 207010 | Dataset: 0-1363981 | Loss: 0.737 | 916 ms/step , 6868.03 GFLOP/s , 13787.8 tokens/s INFO:__main__:2024-11-06 02:24:30 | Epoch: 1 | Step: 207020 | Dataset: 0-1364301 | Loss: 0.679 | 914 ms/step , 6879.96 GFLOP/s , 17909.5 tokens/s 
INFO:__main__:2024-11-06 02:24:39 | Epoch: 1 | Step: 207030 | Dataset: 0-1364621 | Loss: 0.675 | 914 ms/step , 6879.32 GFLOP/s , 17906.1 tokens/s INFO:__main__:2024-11-06 02:24:48 | Epoch: 1 | Step: 207040 | Dataset: 0-1364941 | Loss: 0.673 | 913 ms/step , 6891.61 GFLOP/s , 17901.3 tokens/s INFO:__main__:2024-11-06 02:24:57 | Epoch: 1 | Step: 207050 | Dataset: 0-1365261 | Loss: 0.704 | 914 ms/step , 6884.47 GFLOP/s , 17910.3 tokens/s INFO:__main__:2024-11-06 02:25:06 | Epoch: 1 | Step: 207060 | Dataset: 0-1365581 | Loss: 0.746 | 915 ms/step , 6873.50 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 02:25:15 | Epoch: 1 | Step: 207070 | Dataset: 0-1365901 | Loss: 0.722 | 915 ms/step , 6873.93 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-06 02:25:25 | Epoch: 1 | Step: 207080 | Dataset: 0-1366221 | Loss: 0.684 | 913 ms/step , 6886.16 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-06 02:25:34 | Epoch: 1 | Step: 207090 | Dataset: 0-1366541 | Loss: 0.762 | 915 ms/step , 6876.11 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-06 02:25:43 | Epoch: 1 | Step: 207100 | Dataset: 0-1366861 | Loss: 0.804 | 914 ms/step , 6882.24 GFLOP/s , 17909.7 tokens/s INFO:__main__:2024-11-06 02:25:44 | Validation | Step: 207100 | Val_loss: 0.712 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:25:54 | Epoch: 1 | Step: 207110 | Dataset: 0-1367181 | Loss: 0.764 | 914 ms/step , 6878.89 GFLOP/s , 15259.8 tokens/s INFO:__main__:2024-11-06 02:26:03 | Epoch: 1 | Step: 207120 | Dataset: 0-1367501 | Loss: 0.724 | 914 ms/step , 6878.72 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-06 02:26:12 | Epoch: 1 | Step: 207130 | Dataset: 0-1367821 | Loss: 0.737 | 913 ms/step , 6891.56 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 02:26:21 | Epoch: 1 | Step: 207140 | Dataset: 0-1368141 | Loss: 0.732 | 913 ms/step , 6885.61 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-06 02:26:30 | Epoch: 1 | Step: 207150 | Dataset: 0-1368461 | Loss: 0.777 | 914 ms/step , 6884.54 GFLOP/s , 17923.0 
tokens/s INFO:__main__:2024-11-06 02:26:39 | Epoch: 1 | Step: 207160 | Dataset: 0-1368781 | Loss: 0.748 | 914 ms/step , 6880.68 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 02:26:48 | Epoch: 1 | Step: 207170 | Dataset: 0-1369101 | Loss: 0.751 | 914 ms/step , 6881.17 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-06 02:26:58 | Epoch: 1 | Step: 207180 | Dataset: 0-1369421 | Loss: 0.713 | 914 ms/step , 6883.12 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-06 02:27:07 | Epoch: 1 | Step: 207190 | Dataset: 0-1369741 | Loss: 0.811 | 913 ms/step , 6886.11 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-06 02:27:16 | Epoch: 1 | Step: 207200 | Dataset: 0-1370061 | Loss: 0.762 | 913 ms/step , 6885.38 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-06 02:27:17 | Validation | Step: 207200 | Val_loss: 0.732 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:27:27 | Epoch: 1 | Step: 207210 | Dataset: 0-1370381 | Loss: 0.724 | 915 ms/step , 6876.59 GFLOP/s , 15262.3 tokens/s INFO:__main__:2024-11-06 02:27:36 | Epoch: 1 | Step: 207220 | Dataset: 0-1370701 | Loss: 0.791 | 915 ms/step , 6873.77 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-06 02:27:45 | Epoch: 1 | Step: 207230 | Dataset: 0-1371021 | Loss: 0.742 | 914 ms/step , 6884.20 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-06 02:27:54 | Epoch: 1 | Step: 207240 | Dataset: 0-1371341 | Loss: 0.766 | 914 ms/step , 6878.35 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-06 02:28:03 | Epoch: 1 | Step: 207250 | Dataset: 0-1371661 | Loss: 0.686 | 914 ms/step , 6884.26 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-06 02:28:12 | Epoch: 1 | Step: 207260 | Dataset: 0-1371981 | Loss: 0.720 | 915 ms/step , 6873.83 GFLOP/s , 17909.8 tokens/s INFO:__main__:2024-11-06 02:28:21 | Epoch: 1 | Step: 207270 | Dataset: 0-1372301 | Loss: 0.772 | 917 ms/step , 6857.43 GFLOP/s , 17903.5 tokens/s INFO:__main__:2024-11-06 02:28:31 | Epoch: 1 | Step: 207280 | Dataset: 0-1372621 | Loss: 0.701 | 913 ms/step , 6888.40 GFLOP/s , 
17917.3 tokens/s INFO:__main__:2024-11-06 02:28:40 | Epoch: 1 | Step: 207290 | Dataset: 0-1372941 | Loss: 0.699 | 915 ms/step , 6876.09 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 02:28:49 | Epoch: 1 | Step: 207300 | Dataset: 0-1373261 | Loss: 0.707 | 915 ms/step , 6876.72 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-06 02:28:50 | Validation | Step: 207300 | Val_loss: 0.675 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:29:00 | Epoch: 1 | Step: 207310 | Dataset: 0-1373581 | Loss: 0.731 | 913 ms/step , 6888.58 GFLOP/s , 15249.3 tokens/s INFO:__main__:2024-11-06 02:29:09 | Epoch: 1 | Step: 207320 | Dataset: 0-1373901 | Loss: 0.725 | 915 ms/step , 6875.89 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-06 02:29:18 | Epoch: 1 | Step: 207330 | Dataset: 0-1374221 | Loss: 0.682 | 912 ms/step , 6892.91 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-06 02:29:27 | Epoch: 1 | Step: 207340 | Dataset: 0-1374541 | Loss: 0.748 | 915 ms/step , 6870.57 GFLOP/s , 17914.3 tokens/s INFO:__main__:2024-11-06 02:29:36 | Epoch: 1 | Step: 207350 | Dataset: 0-1374861 | Loss: 0.732 | 914 ms/step , 6884.27 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 02:29:45 | Epoch: 1 | Step: 207360 | Dataset: 0-1375181 | Loss: 0.791 | 914 ms/step , 6884.59 GFLOP/s , 17914.1 tokens/s INFO:__main__:2024-11-06 02:29:54 | Epoch: 1 | Step: 207370 | Dataset: 0-1375501 | Loss: 0.749 | 915 ms/step , 6871.26 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-06 02:30:04 | Epoch: 1 | Step: 207380 | Dataset: 0-1375821 | Loss: 0.774 | 913 ms/step , 6890.72 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-06 02:30:13 | Epoch: 1 | Step: 207390 | Dataset: 0-1376141 | Loss: 0.806 | 913 ms/step , 6885.46 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 02:30:22 | Epoch: 1 | Step: 207400 | Dataset: 0-1376461 | Loss: 0.744 | 915 ms/step , 6877.25 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 02:30:24 | Validation | Step: 207400 | Val_loss: 0.692 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 02:30:33 | Epoch: 1 | Step: 207410 | Dataset: 0-1376781 | Loss: 0.746 | 913 ms/step , 6885.41 GFLOP/s , 15270.9 tokens/s INFO:__main__:2024-11-06 02:30:42 | Epoch: 1 | Step: 207420 | Dataset: 0-1377101 | Loss: 0.761 | 915 ms/step , 6874.61 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-06 02:30:51 | Epoch: 1 | Step: 207430 | Dataset: 0-1377421 | Loss: 0.654 | 913 ms/step , 6885.88 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 02:31:00 | Epoch: 1 | Step: 207440 | Dataset: 0-1377741 | Loss: 0.684 | 914 ms/step , 6882.08 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-06 02:31:09 | Epoch: 1 | Step: 207450 | Dataset: 0-1378061 | Loss: 0.699 | 914 ms/step , 6878.84 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 02:31:18 | Epoch: 1 | Step: 207460 | Dataset: 0-1378381 | Loss: 0.657 | 914 ms/step , 6883.25 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-06 02:31:28 | Epoch: 1 | Step: 207470 | Dataset: 0-1378701 | Loss: 0.743 | 914 ms/step , 6883.13 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-06 02:31:37 | Epoch: 1 | Step: 207480 | Dataset: 0-1379021 | Loss: 0.687 | 914 ms/step , 6879.66 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 02:31:46 | Epoch: 1 | Step: 207490 | Dataset: 0-1379341 | Loss: 0.798 | 914 ms/step , 6882.96 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-06 02:31:55 | Epoch: 1 | Step: 207500 | Dataset: 0-1379661 | Loss: 0.773 | 914 ms/step , 6879.60 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-06 02:31:57 | Validation | Step: 207500 | Val_loss: 0.680 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:32:06 | Epoch: 1 | Step: 207510 | Dataset: 0-1379981 | Loss: 0.722 | 913 ms/step , 6885.78 GFLOP/s , 15270.8 tokens/s INFO:__main__:2024-11-06 02:32:15 | Epoch: 1 | Step: 207520 | Dataset: 0-1380301 | Loss: 0.699 | 913 ms/step , 6890.11 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-06 02:32:24 | Epoch: 1 | Step: 207530 | Dataset: 0-1380621 | Loss: 0.712 | 914 ms/step , 6883.70 GFLOP/s , 17915.3 
tokens/s INFO:__main__:2024-11-06 02:32:33 | Epoch: 1 | Step: 207540 | Dataset: 0-1380941 | Loss: 0.812 | 913 ms/step , 6885.40 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 02:32:42 | Epoch: 1 | Step: 207550 | Dataset: 0-1381261 | Loss: 0.656 | 913 ms/step , 6886.59 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 02:32:51 | Epoch: 1 | Step: 207560 | Dataset: 0-1381581 | Loss: 0.664 | 914 ms/step , 6884.08 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 02:33:01 | Epoch: 1 | Step: 207570 | Dataset: 0-1381901 | Loss: 0.710 | 914 ms/step , 6884.46 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 02:33:10 | Epoch: 1 | Step: 207580 | Dataset: 0-1382221 | Loss: 0.712 | 915 ms/step , 6875.12 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 02:33:19 | Epoch: 1 | Step: 207590 | Dataset: 0-1382541 | Loss: 0.696 | 913 ms/step , 6891.85 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 02:33:28 | Epoch: 1 | Step: 207600 | Dataset: 0-1382861 | Loss: 0.799 | 914 ms/step , 6880.12 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-06 02:33:30 | Validation | Step: 207600 | Val_loss: 0.690 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:33:39 | Epoch: 1 | Step: 207610 | Dataset: 0-1383181 | Loss: 0.634 | 915 ms/step , 6871.55 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-06 02:33:48 | Epoch: 1 | Step: 207620 | Dataset: 0-1383501 | Loss: 0.674 | 915 ms/step , 6874.25 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-06 02:33:57 | Epoch: 1 | Step: 207630 | Dataset: 0-1383821 | Loss: 0.810 | 914 ms/step , 6882.65 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-06 02:34:06 | Epoch: 1 | Step: 207640 | Dataset: 0-1384141 | Loss: 0.666 | 913 ms/step , 6885.20 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 02:34:15 | Epoch: 1 | Step: 207650 | Dataset: 0-1384461 | Loss: 0.733 | 917 ms/step , 6856.34 GFLOP/s , 17906.9 tokens/s INFO:__main__:2024-11-06 02:34:24 | Epoch: 1 | Step: 207660 | Dataset: 0-1384781 | Loss: 0.731 | 916 ms/step , 6868.71 GFLOP/s , 
17912.1 tokens/s INFO:__main__:2024-11-06 02:34:34 | Epoch: 1 | Step: 207670 | Dataset: 0-1385101 | Loss: 0.750 | 914 ms/step , 6882.52 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-06 02:34:43 | Epoch: 1 | Step: 207680 | Dataset: 0-1385421 | Loss: 0.762 | 915 ms/step , 6872.54 GFLOP/s , 17906.0 tokens/s INFO:__main__:2024-11-06 02:34:52 | Epoch: 1 | Step: 207690 | Dataset: 0-1385741 | Loss: 0.780 | 913 ms/step , 6886.05 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 02:35:01 | Epoch: 1 | Step: 207700 | Dataset: 0-1386061 | Loss: 0.753 | 915 ms/step , 6870.13 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-06 02:35:03 | Validation | Step: 207700 | Val_loss: 0.759 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:35:12 | Epoch: 1 | Step: 207710 | Dataset: 0-1386381 | Loss: 0.706 | 915 ms/step , 6877.03 GFLOP/s , 15261.2 tokens/s INFO:__main__:2024-11-06 02:35:21 | Epoch: 1 | Step: 207720 | Dataset: 0-1386701 | Loss: 0.678 | 915 ms/step , 6871.87 GFLOP/s , 17904.2 tokens/s INFO:__main__:2024-11-06 02:35:30 | Epoch: 1 | Step: 207730 | Dataset: 0-1387021 | Loss: 0.690 | 915 ms/step , 6872.86 GFLOP/s , 17906.4 tokens/s INFO:__main__:2024-11-06 02:35:39 | Epoch: 1 | Step: 207740 | Dataset: 0-1387341 | Loss: 0.761 | 915 ms/step , 6876.22 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-06 02:35:48 | Epoch: 1 | Step: 207750 | Dataset: 0-1387661 | Loss: 0.721 | 913 ms/step , 6888.75 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-06 02:35:57 | Epoch: 1 | Step: 207760 | Dataset: 0-1387981 | Loss: 0.784 | 914 ms/step , 6878.36 GFLOP/s , 17904.5 tokens/s INFO:__main__:2024-11-06 02:36:07 | Epoch: 1 | Step: 207770 | Dataset: 0-1388301 | Loss: 0.760 | 914 ms/step , 6882.90 GFLOP/s , 17910.3 tokens/s INFO:__main__:2024-11-06 02:36:16 | Epoch: 1 | Step: 207780 | Dataset: 0-1388621 | Loss: 0.754 | 914 ms/step , 6882.14 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-06 02:36:25 | Epoch: 1 | Step: 207790 | Dataset: 0-1388941 | Loss: 0.665 | 914 ms/step , 6878.85 GFLOP/s 
, 17916.2 tokens/s INFO:__main__:2024-11-06 02:36:34 | Epoch: 1 | Step: 207800 | Dataset: 0-1389261 | Loss: 0.729 | 914 ms/step , 6878.74 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-06 02:36:36 | Validation | Step: 207800 | Val_loss: 0.725 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:36:45 | Epoch: 1 | Step: 207810 | Dataset: 0-1389581 | Loss: 0.720 | 914 ms/step , 6881.50 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-06 02:36:54 | Epoch: 1 | Step: 207820 | Dataset: 0-1389901 | Loss: 0.719 | 915 ms/step , 6874.81 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-06 02:37:03 | Epoch: 1 | Step: 207830 | Dataset: 0-1390221 | Loss: 0.704 | 914 ms/step , 6878.10 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-06 02:37:12 | Epoch: 1 | Step: 207840 | Dataset: 0-1390541 | Loss: 0.718 | 913 ms/step , 6886.17 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-06 02:37:21 | Epoch: 1 | Step: 207850 | Dataset: 0-1390861 | Loss: 0.736 | 915 ms/step , 6874.85 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-06 02:37:31 | Epoch: 1 | Step: 207860 | Dataset: 0-1391181 | Loss: 0.745 | 914 ms/step , 6881.10 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-06 02:37:40 | Epoch: 1 | Step: 207870 | Dataset: 0-1391501 | Loss: 0.711 | 913 ms/step , 6887.17 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 02:37:49 | Epoch: 1 | Step: 207880 | Dataset: 0-1391821 | Loss: 0.797 | 916 ms/step , 6868.86 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-06 02:37:58 | Epoch: 1 | Step: 207890 | Dataset: 0-1392141 | Loss: 0.718 | 915 ms/step , 6874.72 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-06 02:38:07 | Epoch: 1 | Step: 207900 | Dataset: 0-1392461 | Loss: 0.746 | 915 ms/step , 6875.86 GFLOP/s , 17913.5 tokens/s INFO:__main__:2024-11-06 02:38:09 | Validation | Step: 207900 | Val_loss: 0.744 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:38:18 | Epoch: 1 | Step: 207910 | Dataset: 0-1392781 | Loss: 0.664 | 914 ms/step , 6881.73 GFLOP/s , 15263.8 tokens/s 
INFO:__main__:2024-11-06 02:38:27 | Epoch: 1 | Step: 207920 | Dataset: 0-1393101 | Loss: 0.697 | 913 ms/step , 6886.60 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-06 02:38:36 | Epoch: 1 | Step: 207930 | Dataset: 0-1393421 | Loss: 0.690 | 915 ms/step , 6874.99 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-06 02:38:45 | Epoch: 1 | Step: 207940 | Dataset: 0-1393741 | Loss: 0.670 | 913 ms/step , 6886.40 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 02:38:54 | Epoch: 1 | Step: 207950 | Dataset: 0-1394061 | Loss: 0.809 | 915 ms/step , 6875.51 GFLOP/s , 17909.3 tokens/s INFO:__main__:2024-11-06 02:39:04 | Epoch: 1 | Step: 207960 | Dataset: 0-1394381 | Loss: 0.679 | 913 ms/step , 6885.15 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 02:39:13 | Epoch: 1 | Step: 207970 | Dataset: 0-1394701 | Loss: 0.775 | 914 ms/step , 6879.01 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 02:39:22 | Epoch: 1 | Step: 207980 | Dataset: 0-1395021 | Loss: 0.708 | 914 ms/step , 6880.32 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-06 02:39:31 | Epoch: 1 | Step: 207990 | Dataset: 0-1395341 | Loss: 0.761 | 914 ms/step , 6879.65 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-06 02:39:40 | Epoch: 1 | Step: 208000 | Dataset: 0-1395661 | Loss: 0.757 | 913 ms/step , 6885.76 GFLOP/s , 17907.3 tokens/s INFO:__main__:2024-11-06 02:39:42 | Validation | Step: 208000 | Val_loss: 0.757 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:39:42 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_023942_step_208000.pt` INFO:__main__:2024-11-06 02:39:52 | Epoch: 1 | Step: 208010 | Dataset: 0-1395981 | Loss: 0.651 | 914 ms/step , 6882.76 GFLOP/s , 13788.0 tokens/s INFO:__main__:2024-11-06 02:40:01 | Epoch: 1 | Step: 208020 | Dataset: 0-1396301 | Loss: 0.683 | 914 ms/step , 6882.47 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-06 02:40:10 | Epoch: 1 | Step: 208030 | Dataset: 0-1396621 | Loss: 0.741 | 915 ms/step , 6873.55 GFLOP/s , 17911.9 tokens/s 
INFO:__main__:2024-11-06 02:40:19 | Epoch: 1 | Step: 208040 | Dataset: 0-1396941 | Loss: 0.772 | 914 ms/step , 6879.36 GFLOP/s , 17895.9 tokens/s INFO:__main__:2024-11-06 02:40:29 | Epoch: 1 | Step: 208050 | Dataset: 0-1397261 | Loss: 0.710 | 914 ms/step , 6884.06 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-06 02:40:38 | Epoch: 1 | Step: 208060 | Dataset: 0-1397581 | Loss: 0.771 | 914 ms/step , 6881.38 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 02:40:47 | Epoch: 1 | Step: 208070 | Dataset: 0-1397901 | Loss: 0.679 | 915 ms/step , 6876.72 GFLOP/s , 17905.3 tokens/s INFO:__main__:2024-11-06 02:40:56 | Epoch: 1 | Step: 208080 | Dataset: 0-1398221 | Loss: 0.664 | 915 ms/step , 6877.21 GFLOP/s , 17910.6 tokens/s INFO:__main__:2024-11-06 02:41:05 | Epoch: 1 | Step: 208090 | Dataset: 0-1398541 | Loss: 0.715 | 914 ms/step , 6884.96 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 02:41:14 | Epoch: 1 | Step: 208100 | Dataset: 0-1398861 | Loss: 0.706 | 915 ms/step , 6873.39 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 02:41:16 | Validation | Step: 208100 | Val_loss: 0.736 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:41:25 | Epoch: 1 | Step: 208110 | Dataset: 0-1399181 | Loss: 0.730 | 915 ms/step , 6876.74 GFLOP/s , 15280.0 tokens/s INFO:__main__:2024-11-06 02:41:34 | Epoch: 1 | Step: 208120 | Dataset: 0-1399501 | Loss: 0.702 | 913 ms/step , 6889.80 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 02:41:43 | Epoch: 1 | Step: 208130 | Dataset: 0-1399821 | Loss: 0.715 | 915 ms/step , 6876.47 GFLOP/s , 17918.7 tokens/s INFO:__main__:2024-11-06 02:41:53 | Epoch: 1 | Step: 208140 | Dataset: 0-1400141 | Loss: 0.732 | 914 ms/step , 6880.53 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-06 02:42:02 | Epoch: 1 | Step: 208150 | Dataset: 0-1400461 | Loss: 0.741 | 914 ms/step , 6881.61 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-06 02:42:11 | Epoch: 1 | Step: 208160 | Dataset: 0-1400781 | Loss: 0.709 | 913 ms/step , 6892.48 GFLOP/s , 17919.7 
tokens/s INFO:__main__:2024-11-06 02:42:20 | Epoch: 1 | Step: 208170 | Dataset: 0-1401101 | Loss: 0.761 | 912 ms/step , 6892.85 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 02:42:29 | Epoch: 1 | Step: 208180 | Dataset: 0-1401421 | Loss: 0.733 | 913 ms/step , 6887.48 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-06 02:42:38 | Epoch: 1 | Step: 208190 | Dataset: 0-1401741 | Loss: 0.775 | 916 ms/step , 6866.72 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-06 02:42:47 | Epoch: 1 | Step: 208200 | Dataset: 0-1402061 | Loss: 0.683 | 914 ms/step , 6882.06 GFLOP/s , 17920.7 tokens/s INFO:__main__:2024-11-06 02:42:49 | Validation | Step: 208200 | Val_loss: 0.775 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:42:58 | Epoch: 1 | Step: 208210 | Dataset: 0-1402381 | Loss: 0.677 | 915 ms/step , 6876.17 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-06 02:43:07 | Epoch: 1 | Step: 208220 | Dataset: 0-1402701 | Loss: 0.727 | 915 ms/step , 6873.19 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-06 02:43:16 | Epoch: 1 | Step: 208230 | Dataset: 0-1403021 | Loss: 0.775 | 914 ms/step , 6881.77 GFLOP/s , 17916.4 tokens/s INFO:__main__:2024-11-06 02:43:26 | Epoch: 1 | Step: 208240 | Dataset: 0-1403341 | Loss: 0.682 | 914 ms/step , 6878.24 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-06 02:43:35 | Epoch: 1 | Step: 208250 | Dataset: 0-1403661 | Loss: 0.707 | 914 ms/step , 6880.12 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-06 02:43:44 | Epoch: 1 | Step: 208260 | Dataset: 0-1403981 | Loss: 0.676 | 915 ms/step , 6875.52 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-06 02:43:53 | Epoch: 1 | Step: 208270 | Dataset: 0-1404301 | Loss: 0.687 | 915 ms/step , 6871.62 GFLOP/s , 17903.9 tokens/s INFO:__main__:2024-11-06 02:44:02 | Epoch: 1 | Step: 208280 | Dataset: 0-1404621 | Loss: 0.717 | 914 ms/step , 6885.00 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-06 02:44:11 | Epoch: 1 | Step: 208290 | Dataset: 0-1404941 | Loss: 0.691 | 914 ms/step , 6878.26 GFLOP/s , 
17913.1 tokens/s INFO:__main__:2024-11-06 02:44:20 | Epoch: 1 | Step: 208300 | Dataset: 0-1405261 | Loss: 0.733 | 913 ms/step , 6887.00 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-06 02:44:22 | Validation | Step: 208300 | Val_loss: 0.759 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:44:31 | Epoch: 1 | Step: 208310 | Dataset: 0-1405581 | Loss: 0.717 | 913 ms/step , 6890.53 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-06 02:44:40 | Epoch: 1 | Step: 208320 | Dataset: 0-1405901 | Loss: 0.708 | 915 ms/step , 6873.51 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-06 02:44:49 | Epoch: 1 | Step: 208330 | Dataset: 0-1406221 | Loss: 0.725 | 914 ms/step , 6879.36 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 02:44:59 | Epoch: 1 | Step: 208340 | Dataset: 0-1406541 | Loss: 0.716 | 914 ms/step , 6884.75 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 02:45:08 | Epoch: 1 | Step: 208350 | Dataset: 0-1406861 | Loss: 0.655 | 912 ms/step , 6894.11 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 02:45:17 | Epoch: 1 | Step: 208360 | Dataset: 0-1407181 | Loss: 0.745 | 916 ms/step , 6864.85 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-06 02:45:26 | Epoch: 1 | Step: 208370 | Dataset: 0-1407501 | Loss: 0.757 | 915 ms/step , 6875.74 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 02:45:35 | Epoch: 1 | Step: 208380 | Dataset: 0-1407821 | Loss: 0.740 | 915 ms/step , 6876.89 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-06 02:45:44 | Epoch: 1 | Step: 208390 | Dataset: 0-1408141 | Loss: 0.805 | 914 ms/step , 6881.91 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 02:45:53 | Epoch: 1 | Step: 208400 | Dataset: 0-1408461 | Loss: 0.702 | 912 ms/step , 6893.63 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 02:45:55 | Validation | Step: 208400 | Val_loss: 0.780 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:46:04 | Epoch: 1 | Step: 208410 | Dataset: 0-1408781 | Loss: 0.764 | 915 ms/step , 6875.81 GFLOP/s , 15268.9 tokens/s 
INFO:__main__:2024-11-06 02:46:13 | Epoch: 1 | Step: 208420 | Dataset: 0-1409101 | Loss: 0.701 | 914 ms/step , 6880.13 GFLOP/s , 17905.1 tokens/s INFO:__main__:2024-11-06 02:46:22 | Epoch: 1 | Step: 208430 | Dataset: 0-1409421 | Loss: 0.735 | 914 ms/step , 6883.63 GFLOP/s , 17914.8 tokens/s INFO:__main__:2024-11-06 02:46:32 | Epoch: 1 | Step: 208440 | Dataset: 0-1409741 | Loss: 0.663 | 916 ms/step , 6869.57 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 02:46:41 | Epoch: 1 | Step: 208450 | Dataset: 0-1410061 | Loss: 0.673 | 912 ms/step , 6892.72 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 02:46:50 | Epoch: 1 | Step: 208460 | Dataset: 0-1410381 | Loss: 0.739 | 914 ms/step , 6884.00 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 02:46:59 | Epoch: 1 | Step: 208470 | Dataset: 0-1410701 | Loss: 0.643 | 914 ms/step , 6878.71 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-06 02:47:08 | Epoch: 1 | Step: 208480 | Dataset: 0-1411021 | Loss: 0.702 | 913 ms/step , 6885.55 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 02:47:17 | Epoch: 1 | Step: 208490 | Dataset: 0-1411341 | Loss: 0.760 | 915 ms/step , 6874.48 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 02:47:26 | Epoch: 1 | Step: 208500 | Dataset: 0-1411661 | Loss: 0.772 | 914 ms/step , 6882.67 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-06 02:47:28 | Validation | Step: 208500 | Val_loss: 0.754 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:47:37 | Epoch: 1 | Step: 208510 | Dataset: 0-1411981 | Loss: 0.744 | 913 ms/step , 6892.38 GFLOP/s , 15265.0 tokens/s INFO:__main__:2024-11-06 02:47:46 | Epoch: 1 | Step: 208520 | Dataset: 0-1412301 | Loss: 0.735 | 913 ms/step , 6890.59 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 02:47:55 | Epoch: 1 | Step: 208530 | Dataset: 0-1412621 | Loss: 0.730 | 914 ms/step , 6883.45 GFLOP/s , 17914.9 tokens/s INFO:__main__:2024-11-06 02:48:05 | Epoch: 1 | Step: 208540 | Dataset: 0-1412941 | Loss: 0.718 | 915 ms/step , 6874.03 GFLOP/s , 17915.2 
tokens/s INFO:__main__:2024-11-06 02:48:14 | Epoch: 1 | Step: 208550 | Dataset: 0-1413261 | Loss: 0.744 | 914 ms/step , 6883.53 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-06 02:48:23 | Epoch: 1 | Step: 208560 | Dataset: 0-1413581 | Loss: 0.747 | 913 ms/step , 6886.86 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 02:48:32 | Epoch: 1 | Step: 208570 | Dataset: 0-1413901 | Loss: 0.749 | 915 ms/step , 6871.61 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 02:48:41 | Epoch: 1 | Step: 208580 | Dataset: 0-1414221 | Loss: 0.689 | 915 ms/step , 6876.84 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 02:48:50 | Epoch: 1 | Step: 208590 | Dataset: 0-1414541 | Loss: 0.695 | 913 ms/step , 6885.85 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 02:48:59 | Epoch: 1 | Step: 208600 | Dataset: 0-1414861 | Loss: 0.679 | 914 ms/step , 6882.29 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 02:49:01 | Validation | Step: 208600 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:49:10 | Epoch: 1 | Step: 208610 | Dataset: 0-1415181 | Loss: 0.784 | 914 ms/step , 6882.74 GFLOP/s , 15277.5 tokens/s INFO:__main__:2024-11-06 02:49:19 | Epoch: 1 | Step: 208620 | Dataset: 0-1415501 | Loss: 0.688 | 914 ms/step , 6880.30 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-06 02:49:28 | Epoch: 1 | Step: 208630 | Dataset: 0-1415821 | Loss: 0.671 | 913 ms/step , 6885.89 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 02:49:38 | Epoch: 1 | Step: 208640 | Dataset: 0-1416141 | Loss: 0.684 | 913 ms/step , 6886.77 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 02:49:47 | Epoch: 1 | Step: 208650 | Dataset: 0-1416461 | Loss: 0.791 | 915 ms/step , 6874.79 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 02:49:56 | Epoch: 1 | Step: 208660 | Dataset: 0-1416781 | Loss: 0.717 | 913 ms/step , 6886.28 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-06 02:50:05 | Epoch: 1 | Step: 208670 | Dataset: 0-1417101 | Loss: 0.646 | 914 ms/step , 6881.59 GFLOP/s , 
17923.3 tokens/s INFO:__main__:2024-11-06 02:50:14 | Epoch: 1 | Step: 208680 | Dataset: 0-1417421 | Loss: 0.795 | 914 ms/step , 6879.98 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-06 02:50:23 | Epoch: 1 | Step: 208690 | Dataset: 0-1417741 | Loss: 0.733 | 913 ms/step , 6888.12 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 02:50:32 | Epoch: 1 | Step: 208700 | Dataset: 0-1418061 | Loss: 0.729 | 914 ms/step , 6880.96 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-06 02:50:34 | Validation | Step: 208700 | Val_loss: 0.716 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:50:43 | Epoch: 1 | Step: 208710 | Dataset: 0-1418381 | Loss: 0.734 | 913 ms/step , 6889.43 GFLOP/s , 15264.1 tokens/s INFO:__main__:2024-11-06 02:50:52 | Epoch: 1 | Step: 208720 | Dataset: 0-1418701 | Loss: 0.667 | 914 ms/step , 6878.45 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-06 02:51:02 | Epoch: 1 | Step: 208730 | Dataset: 0-1419021 | Loss: 0.679 | 914 ms/step , 6882.29 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 02:51:11 | Epoch: 1 | Step: 208740 | Dataset: 0-1419341 | Loss: 0.662 | 915 ms/step , 6874.97 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-06 02:51:20 | Epoch: 1 | Step: 208750 | Dataset: 0-1419661 | Loss: 0.650 | 913 ms/step , 6891.91 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-06 02:51:29 | Epoch: 1 | Step: 208760 | Dataset: 0-1419981 | Loss: 0.709 | 914 ms/step , 6879.62 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 02:51:38 | Epoch: 1 | Step: 208770 | Dataset: 0-1420301 | Loss: 0.770 | 915 ms/step , 6876.64 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-06 02:51:47 | Epoch: 1 | Step: 208780 | Dataset: 0-1420621 | Loss: 0.621 | 915 ms/step , 6876.10 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 02:51:56 | Epoch: 1 | Step: 208790 | Dataset: 0-1420941 | Loss: 0.681 | 914 ms/step , 6883.97 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 02:52:06 | Epoch: 1 | Step: 208800 | Dataset: 0-1421261 | Loss: 0.709 | 914 ms/step , 6881.06 GFLOP/s 
, 17920.8 tokens/s INFO:__main__:2024-11-06 02:52:07 | Validation | Step: 208800 | Val_loss: 0.741 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:52:16 | Epoch: 1 | Step: 208810 | Dataset: 0-1421581 | Loss: 0.748 | 914 ms/step , 6883.96 GFLOP/s , 15265.1 tokens/s INFO:__main__:2024-11-06 02:52:25 | Epoch: 1 | Step: 208820 | Dataset: 0-1421901 | Loss: 0.673 | 915 ms/step , 6877.39 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-06 02:52:35 | Epoch: 1 | Step: 208830 | Dataset: 0-1422221 | Loss: 0.724 | 915 ms/step , 6871.92 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-06 02:52:44 | Epoch: 1 | Step: 208840 | Dataset: 0-1422541 | Loss: 0.746 | 914 ms/step , 6881.26 GFLOP/s , 17917.1 tokens/s INFO:__main__:2024-11-06 02:52:53 | Epoch: 1 | Step: 208850 | Dataset: 0-1422861 | Loss: 0.657 | 914 ms/step , 6881.47 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-06 02:53:02 | Epoch: 1 | Step: 208860 | Dataset: 0-1423181 | Loss: 0.760 | 913 ms/step , 6885.62 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 02:53:11 | Epoch: 1 | Step: 208870 | Dataset: 0-1423501 | Loss: 0.688 | 912 ms/step , 6895.84 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-06 02:53:20 | Epoch: 1 | Step: 208880 | Dataset: 0-1423821 | Loss: 0.718 | 915 ms/step , 6870.43 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-06 02:53:29 | Epoch: 1 | Step: 208890 | Dataset: 0-1424141 | Loss: 0.751 | 915 ms/step , 6872.15 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 02:53:39 | Epoch: 1 | Step: 208900 | Dataset: 0-1424461 | Loss: 0.683 | 913 ms/step , 6887.06 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-06 02:53:40 | Validation | Step: 208900 | Val_loss: 0.721 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:53:49 | Epoch: 1 | Step: 208910 | Dataset: 0-1424781 | Loss: 0.751 | 914 ms/step , 6883.33 GFLOP/s , 15269.4 tokens/s INFO:__main__:2024-11-06 02:53:58 | Epoch: 1 | Step: 208920 | Dataset: 0-1425101 | Loss: 0.790 | 915 ms/step , 6876.23 GFLOP/s , 17916.6 tokens/s 
INFO:__main__:2024-11-06 02:54:08 | Epoch: 1 | Step: 208930 | Dataset: 0-1425421 | Loss: 0.697 | 915 ms/step , 6871.47 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-06 02:54:17 | Epoch: 1 | Step: 208940 | Dataset: 0-1425741 | Loss: 0.779 | 913 ms/step , 6888.24 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 02:54:26 | Epoch: 1 | Step: 208950 | Dataset: 0-1426061 | Loss: 0.672 | 916 ms/step , 6867.90 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 02:54:35 | Epoch: 1 | Step: 208960 | Dataset: 0-1426381 | Loss: 0.725 | 912 ms/step , 6892.96 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 02:54:44 | Epoch: 1 | Step: 208970 | Dataset: 0-1426701 | Loss: 0.765 | 914 ms/step , 6881.98 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-06 02:54:53 | Epoch: 1 | Step: 208980 | Dataset: 0-1427021 | Loss: 0.662 | 912 ms/step , 6892.62 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 02:55:02 | Epoch: 1 | Step: 208990 | Dataset: 0-1427341 | Loss: 0.696 | 914 ms/step , 6878.87 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 02:55:12 | Epoch: 1 | Step: 209000 | Dataset: 0-1427661 | Loss: 0.735 | 914 ms/step , 6883.35 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-06 02:55:13 | Validation | Step: 209000 | Val_loss: 0.741 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:55:13 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_025513_step_209000.pt` INFO:__main__:2024-11-06 02:55:23 | Epoch: 1 | Step: 209010 | Dataset: 0-1427981 | Loss: 0.760 | 915 ms/step , 6872.32 GFLOP/s , 13772.9 tokens/s INFO:__main__:2024-11-06 02:55:33 | Epoch: 1 | Step: 209020 | Dataset: 0-1428301 | Loss: 0.630 | 913 ms/step , 6887.95 GFLOP/s , 17915.1 tokens/s INFO:__main__:2024-11-06 02:55:42 | Epoch: 1 | Step: 209030 | Dataset: 0-1428621 | Loss: 0.762 | 915 ms/step , 6874.81 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-06 02:55:51 | Epoch: 1 | Step: 209040 | Dataset: 0-1428941 | Loss: 0.707 | 916 ms/step , 6869.66 GFLOP/s , 17872.2 tokens/s 
INFO:__main__:2024-11-06 02:56:00 | Epoch: 1 | Step: 209050 | Dataset: 0-1429261 | Loss: 0.676 | 916 ms/step , 6866.21 GFLOP/s , 17893.6 tokens/s INFO:__main__:2024-11-06 02:56:09 | Epoch: 1 | Step: 209060 | Dataset: 0-1429581 | Loss: 0.700 | 913 ms/step , 6886.61 GFLOP/s , 17896.4 tokens/s INFO:__main__:2024-11-06 02:56:18 | Epoch: 1 | Step: 209070 | Dataset: 0-1429901 | Loss: 0.671 | 914 ms/step , 6881.83 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-06 02:56:28 | Epoch: 1 | Step: 209080 | Dataset: 0-1430221 | Loss: 0.698 | 913 ms/step , 6887.28 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 02:56:37 | Epoch: 1 | Step: 209090 | Dataset: 0-1430541 | Loss: 0.658 | 912 ms/step , 6892.68 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-06 02:56:46 | Epoch: 1 | Step: 209100 | Dataset: 0-1430861 | Loss: 0.697 | 914 ms/step , 6877.74 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-06 02:56:47 | Validation | Step: 209100 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:56:57 | Epoch: 1 | Step: 209110 | Dataset: 0-1431181 | Loss: 0.809 | 914 ms/step , 6883.11 GFLOP/s , 15273.1 tokens/s INFO:__main__:2024-11-06 02:57:06 | Epoch: 1 | Step: 209120 | Dataset: 0-1431501 | Loss: 0.613 | 914 ms/step , 6883.37 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-06 02:57:15 | Epoch: 1 | Step: 209130 | Dataset: 0-1431821 | Loss: 0.658 | 913 ms/step , 6886.36 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 02:57:24 | Epoch: 1 | Step: 209140 | Dataset: 0-1432141 | Loss: 0.707 | 913 ms/step , 6889.28 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 02:57:33 | Epoch: 1 | Step: 209150 | Dataset: 0-1432461 | Loss: 0.703 | 914 ms/step , 6883.27 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 02:57:42 | Epoch: 1 | Step: 209160 | Dataset: 0-1432781 | Loss: 0.683 | 914 ms/step , 6883.37 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 02:57:51 | Epoch: 1 | Step: 209170 | Dataset: 0-1433101 | Loss: 0.747 | 913 ms/step , 6886.04 GFLOP/s , 17924.7 
tokens/s INFO:__main__:2024-11-06 02:58:01 | Epoch: 1 | Step: 209180 | Dataset: 0-1433421 | Loss: 0.764 | 914 ms/step , 6883.91 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 02:58:10 | Epoch: 1 | Step: 209190 | Dataset: 0-1433741 | Loss: 0.684 | 913 ms/step , 6892.28 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 02:58:19 | Epoch: 1 | Step: 209200 | Dataset: 0-1434061 | Loss: 0.638 | 914 ms/step , 6881.88 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 02:58:20 | Validation | Step: 209200 | Val_loss: 0.720 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 02:58:30 | Epoch: 1 | Step: 209210 | Dataset: 0-1434381 | Loss: 0.730 | 914 ms/step , 6881.25 GFLOP/s , 15266.8 tokens/s INFO:__main__:2024-11-06 02:58:39 | Epoch: 1 | Step: 209220 | Dataset: 0-1434701 | Loss: 0.667 | 912 ms/step , 6892.65 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-06 02:58:48 | Epoch: 1 | Step: 209230 | Dataset: 0-1435021 | Loss: 0.726 | 913 ms/step , 6886.53 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 02:58:57 | Epoch: 1 | Step: 209240 | Dataset: 0-1435341 | Loss: 0.789 | 914 ms/step , 6878.29 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 02:59:06 | Epoch: 1 | Step: 209250 | Dataset: 0-1435661 | Loss: 0.736 | 914 ms/step , 6884.89 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 02:59:15 | Epoch: 1 | Step: 209260 | Dataset: 0-1435981 | Loss: 0.672 | 914 ms/step , 6878.40 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 02:59:24 | Epoch: 1 | Step: 209270 | Dataset: 0-1436301 | Loss: 0.650 | 914 ms/step , 6881.62 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 02:59:34 | Epoch: 1 | Step: 209280 | Dataset: 0-1436621 | Loss: 0.739 | 914 ms/step , 6884.28 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-06 02:59:43 | Epoch: 1 | Step: 209290 | Dataset: 0-1436941 | Loss: 0.642 | 915 ms/step , 6876.38 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 02:59:52 | Epoch: 1 | Step: 209300 | Dataset: 0-1437261 | Loss: 0.682 | 914 ms/step , 6883.50 GFLOP/s , 
17920.6 tokens/s INFO:__main__:2024-11-06 02:59:53 | Validation | Step: 209300 | Val_loss: 0.685 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:00:03 | Epoch: 1 | Step: 209310 | Dataset: 0-1437581 | Loss: 0.744 | 912 ms/step , 6897.84 GFLOP/s , 15266.0 tokens/s INFO:__main__:2024-11-06 03:00:12 | Epoch: 1 | Step: 209320 | Dataset: 0-1437901 | Loss: 0.589 | 913 ms/step , 6890.00 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 03:00:21 | Epoch: 1 | Step: 209330 | Dataset: 0-1438221 | Loss: 0.756 | 914 ms/step , 6883.11 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 03:00:30 | Epoch: 1 | Step: 209340 | Dataset: 0-1438541 | Loss: 0.710 | 913 ms/step , 6890.00 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 03:00:39 | Epoch: 1 | Step: 209350 | Dataset: 0-1438861 | Loss: 0.765 | 913 ms/step , 6886.82 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 03:00:48 | Epoch: 1 | Step: 209360 | Dataset: 0-1439181 | Loss: 0.667 | 913 ms/step , 6890.73 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 03:00:57 | Epoch: 1 | Step: 209370 | Dataset: 0-1439501 | Loss: 0.762 | 913 ms/step , 6889.28 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 03:01:07 | Epoch: 1 | Step: 209380 | Dataset: 0-1439821 | Loss: 0.630 | 912 ms/step , 6897.50 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 03:01:16 | Epoch: 1 | Step: 209390 | Dataset: 0-1440141 | Loss: 0.690 | 913 ms/step , 6885.29 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-06 03:01:25 | Epoch: 1 | Step: 209400 | Dataset: 0-1440461 | Loss: 0.692 | 913 ms/step , 6890.04 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 03:01:26 | Validation | Step: 209400 | Val_loss: 0.740 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:01:36 | Epoch: 1 | Step: 209410 | Dataset: 0-1440781 | Loss: 0.664 | 912 ms/step , 6895.32 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-06 03:01:45 | Epoch: 1 | Step: 209420 | Dataset: 0-1441101 | Loss: 0.710 | 913 ms/step , 6888.30 GFLOP/s , 17928.0 tokens/s 
INFO:__main__:2024-11-06 03:01:54 | Epoch: 1 | Step: 209430 | Dataset: 0-1441421 | Loss: 0.640 | 914 ms/step , 6879.94 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 03:02:03 | Epoch: 1 | Step: 209440 | Dataset: 0-1441741 | Loss: 0.621 | 913 ms/step , 6889.00 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 03:02:12 | Epoch: 1 | Step: 209450 | Dataset: 0-1442061 | Loss: 0.598 | 915 ms/step , 6872.48 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 03:02:21 | Epoch: 1 | Step: 209460 | Dataset: 0-1442381 | Loss: 0.788 | 913 ms/step , 6888.40 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 03:02:30 | Epoch: 1 | Step: 209470 | Dataset: 0-1442701 | Loss: 0.761 | 914 ms/step , 6880.53 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-06 03:02:39 | Epoch: 1 | Step: 209480 | Dataset: 0-1443021 | Loss: 0.744 | 914 ms/step , 6879.41 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 03:02:49 | Epoch: 1 | Step: 209490 | Dataset: 0-1443341 | Loss: 0.641 | 913 ms/step , 6891.79 GFLOP/s , 17944.1 tokens/s INFO:__main__:2024-11-06 03:02:58 | Epoch: 1 | Step: 209500 | Dataset: 0-1443661 | Loss: 0.747 | 913 ms/step , 6887.82 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 03:02:59 | Validation | Step: 209500 | Val_loss: 0.699 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:03:08 | Epoch: 1 | Step: 209510 | Dataset: 0-1443981 | Loss: 0.789 | 914 ms/step , 6884.06 GFLOP/s , 15275.9 tokens/s INFO:__main__:2024-11-06 03:03:18 | Epoch: 1 | Step: 209520 | Dataset: 0-1444301 | Loss: 0.495 | 912 ms/step , 6894.57 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 03:03:27 | Epoch: 1 | Step: 209530 | Dataset: 0-1444621 | Loss: 0.630 | 912 ms/step , 6894.22 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 03:03:36 | Epoch: 1 | Step: 209540 | Dataset: 0-1444941 | Loss: 0.666 | 914 ms/step , 6883.86 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 03:03:45 | Epoch: 1 | Step: 209550 | Dataset: 0-1445261 | Loss: 0.758 | 913 ms/step , 6891.54 GFLOP/s , 17933.0 
tokens/s INFO:__main__:2024-11-06 03:03:54 | Epoch: 1 | Step: 209560 | Dataset: 0-1445581 | Loss: 0.671 | 914 ms/step , 6878.69 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 03:04:03 | Epoch: 1 | Step: 209570 | Dataset: 0-1445901 | Loss: 0.673 | 912 ms/step , 6896.07 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-06 03:04:12 | Epoch: 1 | Step: 209580 | Dataset: 0-1446221 | Loss: 0.809 | 912 ms/step , 6893.09 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 03:04:22 | Epoch: 1 | Step: 209590 | Dataset: 0-1446541 | Loss: 0.627 | 914 ms/step , 6883.52 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 03:04:31 | Epoch: 1 | Step: 209600 | Dataset: 0-1446861 | Loss: 0.740 | 911 ms/step , 6901.28 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 03:04:32 | Validation | Step: 209600 | Val_loss: 0.772 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:04:41 | Epoch: 1 | Step: 209610 | Dataset: 0-1447181 | Loss: 0.669 | 912 ms/step , 6892.70 GFLOP/s , 15270.2 tokens/s INFO:__main__:2024-11-06 03:04:51 | Epoch: 1 | Step: 209620 | Dataset: 0-1447501 | Loss: 0.709 | 912 ms/step , 6897.78 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 03:05:00 | Epoch: 1 | Step: 209630 | Dataset: 0-1447821 | Loss: 0.692 | 912 ms/step , 6898.75 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-06 03:05:09 | Epoch: 1 | Step: 209640 | Dataset: 0-1448141 | Loss: 0.748 | 915 ms/step , 6875.87 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 03:05:18 | Epoch: 1 | Step: 209650 | Dataset: 0-1448461 | Loss: 0.772 | 913 ms/step , 6888.35 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-06 03:05:27 | Epoch: 1 | Step: 209660 | Dataset: 0-1448781 | Loss: 0.703 | 913 ms/step , 6892.03 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 03:05:36 | Epoch: 1 | Step: 209670 | Dataset: 0-1449101 | Loss: 0.606 | 913 ms/step , 6890.15 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-06 03:05:45 | Epoch: 1 | Step: 209680 | Dataset: 0-1449421 | Loss: 0.756 | 913 ms/step , 6889.72 GFLOP/s , 
17934.5 tokens/s INFO:__main__:2024-11-06 03:05:54 | Epoch: 1 | Step: 209690 | Dataset: 0-1449741 | Loss: 0.803 | 913 ms/step , 6887.88 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-06 03:06:04 | Epoch: 1 | Step: 209700 | Dataset: 0-1450061 | Loss: 0.632 | 913 ms/step , 6889.31 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-06 03:06:05 | Validation | Step: 209700 | Val_loss: 0.762 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:06:14 | Epoch: 1 | Step: 209710 | Dataset: 0-1450381 | Loss: 0.630 | 912 ms/step , 6897.08 GFLOP/s , 15281.6 tokens/s INFO:__main__:2024-11-06 03:06:23 | Epoch: 1 | Step: 209720 | Dataset: 0-1450701 | Loss: 0.808 | 914 ms/step , 6884.58 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-06 03:06:33 | Epoch: 1 | Step: 209730 | Dataset: 0-1451021 | Loss: 0.809 | 913 ms/step , 6887.12 GFLOP/s , 17942.6 tokens/s INFO:__main__:2024-11-06 03:06:42 | Epoch: 1 | Step: 209740 | Dataset: 0-1451341 | Loss: 0.721 | 914 ms/step , 6882.91 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-06 03:06:51 | Epoch: 1 | Step: 209750 | Dataset: 0-1451661 | Loss: 0.689 | 912 ms/step , 6897.78 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-06 03:07:00 | Epoch: 1 | Step: 209760 | Dataset: 0-1451981 | Loss: 0.732 | 913 ms/step , 6890.11 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 03:07:09 | Epoch: 1 | Step: 209770 | Dataset: 0-1452301 | Loss: 0.716 | 913 ms/step , 6888.25 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 03:07:18 | Epoch: 1 | Step: 209780 | Dataset: 0-1452621 | Loss: 0.788 | 914 ms/step , 6879.84 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 03:07:27 | Epoch: 1 | Step: 209790 | Dataset: 0-1452941 | Loss: 0.679 | 912 ms/step , 6896.91 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 03:07:37 | Epoch: 1 | Step: 209800 | Dataset: 0-1453261 | Loss: 0.631 | 913 ms/step , 6890.18 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 03:07:38 | Validation | Step: 209800 | Val_loss: 0.743 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 03:07:47 | Epoch: 1 | Step: 209810 | Dataset: 0-1453581 | Loss: 0.717 | 912 ms/step , 6893.41 GFLOP/s , 15274.7 tokens/s INFO:__main__:2024-11-06 03:07:56 | Epoch: 1 | Step: 209820 | Dataset: 0-1453901 | Loss: 0.755 | 913 ms/step , 6886.48 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-06 03:08:06 | Epoch: 1 | Step: 209830 | Dataset: 0-1454221 | Loss: 0.739 | 913 ms/step , 6890.42 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 03:08:15 | Epoch: 1 | Step: 209840 | Dataset: 0-1454541 | Loss: 0.663 | 912 ms/step , 6894.58 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-06 03:08:24 | Epoch: 1 | Step: 209850 | Dataset: 0-1454861 | Loss: 0.694 | 912 ms/step , 6892.91 GFLOP/s , 17947.8 tokens/s INFO:__main__:2024-11-06 03:08:33 | Epoch: 1 | Step: 209860 | Dataset: 0-1455181 | Loss: 0.769 | 913 ms/step , 6892.46 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-06 03:08:42 | Epoch: 1 | Step: 209870 | Dataset: 0-1455501 | Loss: 0.762 | 913 ms/step , 6888.04 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-06 03:08:51 | Epoch: 1 | Step: 209880 | Dataset: 0-1455821 | Loss: 0.677 | 913 ms/step , 6892.27 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-06 03:09:00 | Epoch: 1 | Step: 209890 | Dataset: 0-1456141 | Loss: 0.883 | 913 ms/step , 6887.33 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-06 03:09:09 | Epoch: 1 | Step: 209900 | Dataset: 0-1456461 | Loss: 0.664 | 915 ms/step , 6876.42 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 03:09:11 | Validation | Step: 209900 | Val_loss: 0.754 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:09:20 | Epoch: 1 | Step: 209910 | Dataset: 0-1456781 | Loss: 0.738 | 913 ms/step , 6892.36 GFLOP/s , 15282.7 tokens/s INFO:__main__:2024-11-06 03:09:29 | Epoch: 1 | Step: 209920 | Dataset: 0-1457101 | Loss: 0.674 | 912 ms/step , 6896.50 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-06 03:09:38 | Epoch: 1 | Step: 209930 | Dataset: 0-1457421 | Loss: 0.703 | 912 ms/step , 6893.49 GFLOP/s , 17940.1 
tokens/s INFO:__main__:2024-11-06 03:09:48 | Epoch: 1 | Step: 209940 | Dataset: 0-1457741 | Loss: 0.742 | 913 ms/step , 6890.78 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-06 03:09:57 | Epoch: 1 | Step: 209950 | Dataset: 0-1458061 | Loss: 0.600 | 912 ms/step , 6895.14 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-06 03:10:06 | Epoch: 1 | Step: 209960 | Dataset: 0-1458381 | Loss: 0.759 | 912 ms/step , 6896.95 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 03:10:15 | Epoch: 1 | Step: 209970 | Dataset: 0-1458701 | Loss: 0.716 | 912 ms/step , 6895.65 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 03:10:24 | Epoch: 1 | Step: 209980 | Dataset: 0-1459021 | Loss: 0.720 | 913 ms/step , 6891.46 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 03:10:33 | Epoch: 1 | Step: 209990 | Dataset: 0-1459341 | Loss: 0.693 | 912 ms/step , 6892.71 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-06 03:10:42 | Epoch: 1 | Step: 210000 | Dataset: 0-1459661 | Loss: 0.663 | 913 ms/step , 6886.19 GFLOP/s , 17941.8 tokens/s INFO:__main__:2024-11-06 03:10:44 | Validation | Step: 210000 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:10:44 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_031044_step_210000.pt` INFO:__main__:2024-11-06 03:10:54 | Epoch: 1 | Step: 210010 | Dataset: 0-1459981 | Loss: 0.654 | 913 ms/step , 6891.79 GFLOP/s , 13794.4 tokens/s INFO:__main__:2024-11-06 03:11:03 | Epoch: 1 | Step: 210020 | Dataset: 0-1460301 | Loss: 0.644 | 914 ms/step , 6883.80 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-06 03:11:13 | Epoch: 1 | Step: 210030 | Dataset: 0-1460621 | Loss: 0.687 | 913 ms/step , 6888.59 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-06 03:11:22 | Epoch: 1 | Step: 210040 | Dataset: 0-1460941 | Loss: 0.810 | 914 ms/step , 6879.21 GFLOP/s , 17898.2 tokens/s INFO:__main__:2024-11-06 03:11:31 | Epoch: 1 | Step: 210050 | Dataset: 0-1461261 | Loss: 0.743 | 913 ms/step , 6889.93 GFLOP/s , 17936.3 
tokens/s INFO:__main__:2024-11-06 03:11:40 | Epoch: 1 | Step: 210060 | Dataset: 0-1461581 | Loss: 0.719 | 913 ms/step , 6889.81 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-06 03:11:49 | Epoch: 1 | Step: 210070 | Dataset: 0-1461901 | Loss: 0.718 | 912 ms/step , 6894.31 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-06 03:11:58 | Epoch: 1 | Step: 210080 | Dataset: 0-1462221 | Loss: 0.654 | 913 ms/step , 6892.29 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-06 03:12:07 | Epoch: 1 | Step: 210090 | Dataset: 0-1462541 | Loss: 0.712 | 912 ms/step , 6893.23 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-06 03:12:16 | Epoch: 1 | Step: 210100 | Dataset: 0-1462861 | Loss: 0.735 | 913 ms/step , 6887.99 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-06 03:12:18 | Validation | Step: 210100 | Val_loss: 0.752 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:12:27 | Epoch: 1 | Step: 210110 | Dataset: 0-1463181 | Loss: 0.723 | 914 ms/step , 6877.98 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-06 03:12:36 | Epoch: 1 | Step: 210120 | Dataset: 0-1463501 | Loss: 0.762 | 912 ms/step , 6892.63 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-06 03:12:45 | Epoch: 1 | Step: 210130 | Dataset: 0-1463821 | Loss: 0.731 | 912 ms/step , 6892.99 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-06 03:12:55 | Epoch: 1 | Step: 210140 | Dataset: 0-1464141 | Loss: 0.666 | 913 ms/step , 6885.58 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 03:13:04 | Epoch: 1 | Step: 210150 | Dataset: 0-1464461 | Loss: 0.685 | 914 ms/step , 6881.01 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 03:13:13 | Epoch: 1 | Step: 210160 | Dataset: 0-1464781 | Loss: 0.676 | 913 ms/step , 6890.47 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-06 03:13:22 | Epoch: 1 | Step: 210170 | Dataset: 0-1465101 | Loss: 0.672 | 913 ms/step , 6889.47 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-06 03:13:31 | Epoch: 1 | Step: 210180 | Dataset: 0-1465421 | Loss: 0.648 | 912 ms/step , 6895.34 GFLOP/s , 
17947.4 tokens/s INFO:__main__:2024-11-06 03:13:40 | Epoch: 1 | Step: 210190 | Dataset: 0-1465741 | Loss: 0.794 | 913 ms/step , 6889.24 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 03:13:49 | Epoch: 1 | Step: 210200 | Dataset: 0-1466061 | Loss: 0.773 | 913 ms/step , 6890.21 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 03:13:51 | Validation | Step: 210200 | Val_loss: 0.741 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:14:00 | Epoch: 1 | Step: 210210 | Dataset: 0-1466381 | Loss: 0.623 | 914 ms/step , 6884.46 GFLOP/s , 15278.4 tokens/s INFO:__main__:2024-11-06 03:14:09 | Epoch: 1 | Step: 210220 | Dataset: 0-1466701 | Loss: 0.688 | 912 ms/step , 6898.62 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-06 03:14:18 | Epoch: 1 | Step: 210230 | Dataset: 0-1467021 | Loss: 0.699 | 912 ms/step , 6893.32 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 03:14:28 | Epoch: 1 | Step: 210240 | Dataset: 0-1467341 | Loss: 0.704 | 913 ms/step , 6890.91 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-06 03:14:37 | Epoch: 1 | Step: 210250 | Dataset: 0-1467661 | Loss: 0.763 | 912 ms/step , 6892.74 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 03:14:46 | Epoch: 1 | Step: 210260 | Dataset: 0-1467981 | Loss: 0.772 | 914 ms/step , 6881.91 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 03:14:55 | Epoch: 1 | Step: 210270 | Dataset: 0-1468301 | Loss: 0.820 | 913 ms/step , 6892.56 GFLOP/s , 17945.8 tokens/s INFO:__main__:2024-11-06 03:15:04 | Epoch: 1 | Step: 210280 | Dataset: 0-1468621 | Loss: 0.681 | 913 ms/step , 6888.56 GFLOP/s , 17946.2 tokens/s INFO:__main__:2024-11-06 03:15:13 | Epoch: 1 | Step: 210290 | Dataset: 0-1468941 | Loss: 0.675 | 912 ms/step , 6895.12 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-06 03:15:22 | Epoch: 1 | Step: 210300 | Dataset: 0-1469261 | Loss: 0.641 | 913 ms/step , 6891.79 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 03:15:24 | Validation | Step: 210300 | Val_loss: 0.756 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 03:15:33 | Epoch: 1 | Step: 210310 | Dataset: 0-1469581 | Loss: 0.663 | 912 ms/step , 6892.64 GFLOP/s , 15272.1 tokens/s INFO:__main__:2024-11-06 03:15:42 | Epoch: 1 | Step: 210320 | Dataset: 0-1469901 | Loss: 0.931 | 913 ms/step , 6891.89 GFLOP/s , 17940.2 tokens/s INFO:__main__:2024-11-06 03:15:51 | Epoch: 1 | Step: 210330 | Dataset: 0-1470221 | Loss: 0.704 | 914 ms/step , 6883.86 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-06 03:16:00 | Epoch: 1 | Step: 210340 | Dataset: 0-1470541 | Loss: 0.726 | 914 ms/step , 6884.84 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 03:16:10 | Epoch: 1 | Step: 210350 | Dataset: 0-1470861 | Loss: 0.634 | 912 ms/step , 6895.99 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 03:16:19 | Epoch: 1 | Step: 210360 | Dataset: 0-1471181 | Loss: 0.621 | 913 ms/step , 6889.84 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 03:16:28 | Epoch: 1 | Step: 210370 | Dataset: 0-1471501 | Loss: 0.794 | 913 ms/step , 6885.06 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-06 03:16:37 | Epoch: 1 | Step: 210380 | Dataset: 0-1471821 | Loss: 0.720 | 913 ms/step , 6885.24 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 03:16:46 | Epoch: 1 | Step: 210390 | Dataset: 0-1472141 | Loss: 0.699 | 914 ms/step , 6882.19 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 03:16:55 | Epoch: 1 | Step: 210400 | Dataset: 0-1472461 | Loss: 0.812 | 912 ms/step , 6892.67 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 03:16:57 | Validation | Step: 210400 | Val_loss: 0.694 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:17:06 | Epoch: 1 | Step: 210410 | Dataset: 0-1472781 | Loss: 0.624 | 912 ms/step , 6895.75 GFLOP/s , 15273.4 tokens/s INFO:__main__:2024-11-06 03:17:15 | Epoch: 1 | Step: 210420 | Dataset: 0-1473101 | Loss: 0.780 | 914 ms/step , 6883.68 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 03:17:24 | Epoch: 1 | Step: 210430 | Dataset: 0-1473421 | Loss: 0.782 | 914 ms/step , 6878.06 GFLOP/s , 17935.3 
tokens/s INFO:__main__:2024-11-06 03:17:33 | Epoch: 1 | Step: 210440 | Dataset: 0-1473741 | Loss: 0.815 | 913 ms/step , 6889.75 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 03:17:43 | Epoch: 1 | Step: 210450 | Dataset: 0-1474061 | Loss: 0.705 | 914 ms/step , 6880.97 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-06 03:17:52 | Epoch: 1 | Step: 210460 | Dataset: 0-1474381 | Loss: 0.688 | 914 ms/step , 6884.43 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 03:18:01 | Epoch: 1 | Step: 210470 | Dataset: 0-1474701 | Loss: 0.784 | 913 ms/step , 6891.75 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 03:18:10 | Epoch: 1 | Step: 210480 | Dataset: 0-1475021 | Loss: 0.718 | 914 ms/step , 6879.86 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-06 03:18:19 | Epoch: 1 | Step: 210490 | Dataset: 0-1475341 | Loss: 0.675 | 913 ms/step , 6887.95 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 03:18:28 | Epoch: 1 | Step: 210500 | Dataset: 0-1475661 | Loss: 0.874 | 913 ms/step , 6885.64 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 03:18:30 | Validation | Step: 210500 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:18:39 | Epoch: 1 | Step: 210510 | Dataset: 0-1475981 | Loss: 0.808 | 914 ms/step , 6883.87 GFLOP/s , 15270.6 tokens/s INFO:__main__:2024-11-06 03:18:48 | Epoch: 1 | Step: 210520 | Dataset: 0-1476301 | Loss: 0.754 | 914 ms/step , 6880.96 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 03:18:57 | Epoch: 1 | Step: 210530 | Dataset: 0-1476621 | Loss: 0.858 | 913 ms/step , 6889.87 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 03:19:06 | Epoch: 1 | Step: 210540 | Dataset: 0-1476941 | Loss: 0.808 | 913 ms/step , 6890.66 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 03:19:16 | Epoch: 1 | Step: 210550 | Dataset: 0-1477261 | Loss: 0.854 | 914 ms/step , 6880.89 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 03:19:25 | Epoch: 1 | Step: 210560 | Dataset: 0-1477581 | Loss: 0.702 | 914 ms/step , 6884.03 GFLOP/s , 
17933.8 tokens/s INFO:__main__:2024-11-06 03:19:34 | Epoch: 1 | Step: 210570 | Dataset: 0-1477901 | Loss: 0.470 | 912 ms/step , 6894.57 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 03:19:43 | Epoch: 1 | Step: 210580 | Dataset: 0-1478221 | Loss: 0.629 | 914 ms/step , 6882.13 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 03:19:52 | Epoch: 1 | Step: 210590 | Dataset: 0-1478541 | Loss: 0.675 | 914 ms/step , 6882.17 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 03:20:01 | Epoch: 1 | Step: 210600 | Dataset: 0-1478861 | Loss: 0.709 | 913 ms/step , 6889.34 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 03:20:03 | Validation | Step: 210600 | Val_loss: 0.739 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:20:12 | Epoch: 1 | Step: 210610 | Dataset: 0-1479181 | Loss: 0.638 | 914 ms/step , 6884.55 GFLOP/s , 15257.3 tokens/s INFO:__main__:2024-11-06 03:20:21 | Epoch: 1 | Step: 210620 | Dataset: 0-1479501 | Loss: 0.883 | 914 ms/step , 6878.32 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 03:20:30 | Epoch: 1 | Step: 210630 | Dataset: 0-1479821 | Loss: 0.650 | 912 ms/step , 6895.35 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 03:20:39 | Epoch: 1 | Step: 210640 | Dataset: 0-1480141 | Loss: 0.744 | 913 ms/step , 6889.18 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 03:20:48 | Epoch: 1 | Step: 210650 | Dataset: 0-1480461 | Loss: 0.678 | 912 ms/step , 6892.96 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-06 03:20:58 | Epoch: 1 | Step: 210660 | Dataset: 0-1480781 | Loss: 0.615 | 914 ms/step , 6884.79 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-06 03:21:07 | Epoch: 1 | Step: 210670 | Dataset: 0-1481101 | Loss: 0.807 | 915 ms/step , 6877.36 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 03:21:16 | Epoch: 1 | Step: 210680 | Dataset: 0-1481421 | Loss: 0.802 | 913 ms/step , 6886.76 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 03:21:25 | Epoch: 1 | Step: 210690 | Dataset: 0-1481741 | Loss: 0.787 | 915 ms/step , 6877.12 GFLOP/s 
, 17922.7 tokens/s INFO:__main__:2024-11-06 03:21:34 | Epoch: 1 | Step: 210700 | Dataset: 0-1482061 | Loss: 0.598 | 915 ms/step , 6876.77 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 03:21:36 | Validation | Step: 210700 | Val_loss: 0.754 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:21:45 | Epoch: 1 | Step: 210710 | Dataset: 0-1482381 | Loss: 0.694 | 915 ms/step , 6871.02 GFLOP/s , 15273.5 tokens/s INFO:__main__:2024-11-06 03:21:54 | Epoch: 1 | Step: 210720 | Dataset: 0-1482701 | Loss: 0.647 | 914 ms/step , 6884.09 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 03:22:03 | Epoch: 1 | Step: 210730 | Dataset: 0-1483021 | Loss: 0.657 | 913 ms/step , 6889.29 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 03:22:12 | Epoch: 1 | Step: 210740 | Dataset: 0-1483341 | Loss: 0.569 | 913 ms/step , 6891.25 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 03:22:21 | Epoch: 1 | Step: 210750 | Dataset: 0-1483661 | Loss: 0.732 | 913 ms/step , 6890.50 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 03:22:31 | Epoch: 1 | Step: 210760 | Dataset: 0-1483981 | Loss: 0.782 | 914 ms/step , 6884.83 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-06 03:22:40 | Epoch: 1 | Step: 210770 | Dataset: 0-1484301 | Loss: 0.738 | 915 ms/step , 6874.58 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 03:22:49 | Epoch: 1 | Step: 210780 | Dataset: 0-1484621 | Loss: 0.860 | 913 ms/step , 6887.22 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 03:22:58 | Epoch: 1 | Step: 210790 | Dataset: 0-1484941 | Loss: 0.693 | 912 ms/step , 6893.98 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-06 03:23:07 | Epoch: 1 | Step: 210800 | Dataset: 0-1485261 | Loss: 0.651 | 913 ms/step , 6889.58 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-06 03:23:09 | Validation | Step: 210800 | Val_loss: 0.755 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:23:18 | Epoch: 1 | Step: 210810 | Dataset: 0-1485581 | Loss: 0.678 | 913 ms/step , 6885.07 GFLOP/s , 15274.4 tokens/s 
INFO:__main__:2024-11-06 03:23:27 | Epoch: 1 | Step: 210820 | Dataset: 0-1485901 | Loss: 0.684 | 914 ms/step , 6881.20 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 03:23:36 | Epoch: 1 | Step: 210830 | Dataset: 0-1486221 | Loss: 0.706 | 912 ms/step , 6894.74 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 03:23:45 | Epoch: 1 | Step: 210840 | Dataset: 0-1486541 | Loss: 0.785 | 914 ms/step , 6884.27 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-06 03:23:54 | Epoch: 1 | Step: 210850 | Dataset: 0-1486861 | Loss: 0.826 | 913 ms/step , 6885.19 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 03:24:04 | Epoch: 1 | Step: 210860 | Dataset: 0-1487181 | Loss: 0.760 | 913 ms/step , 6892.21 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 03:24:13 | Epoch: 1 | Step: 210870 | Dataset: 0-1487501 | Loss: 0.776 | 913 ms/step , 6885.96 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 03:24:22 | Epoch: 1 | Step: 210880 | Dataset: 0-1487821 | Loss: 0.708 | 913 ms/step , 6890.91 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-06 03:24:31 | Epoch: 1 | Step: 210890 | Dataset: 0-1488141 | Loss: 0.804 | 914 ms/step , 6885.05 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 03:24:40 | Epoch: 1 | Step: 210900 | Dataset: 0-1488461 | Loss: 0.743 | 913 ms/step , 6887.00 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 03:24:42 | Validation | Step: 210900 | Val_loss: 0.769 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:24:51 | Epoch: 1 | Step: 210910 | Dataset: 0-1488781 | Loss: 0.748 | 913 ms/step , 6891.46 GFLOP/s , 15277.5 tokens/s INFO:__main__:2024-11-06 03:25:00 | Epoch: 1 | Step: 210920 | Dataset: 0-1489101 | Loss: 0.723 | 913 ms/step , 6886.52 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 03:25:09 | Epoch: 1 | Step: 210930 | Dataset: 0-1489421 | Loss: 0.587 | 915 ms/step , 6877.23 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 03:25:18 | Epoch: 1 | Step: 210940 | Dataset: 0-1489741 | Loss: 0.762 | 914 ms/step , 6882.63 GFLOP/s , 17932.9 
tokens/s INFO:__main__:2024-11-06 03:25:27 | Epoch: 1 | Step: 210950 | Dataset: 0-1490061 | Loss: 0.822 | 913 ms/step , 6886.60 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 03:25:37 | Epoch: 1 | Step: 210960 | Dataset: 0-1490381 | Loss: 0.655 | 914 ms/step , 6883.20 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 03:25:46 | Epoch: 1 | Step: 210970 | Dataset: 0-1490701 | Loss: 0.562 | 914 ms/step , 6880.64 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 03:25:55 | Epoch: 1 | Step: 210980 | Dataset: 0-1491021 | Loss: 0.756 | 913 ms/step , 6886.93 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 03:26:04 | Epoch: 1 | Step: 210990 | Dataset: 0-1491341 | Loss: 0.799 | 915 ms/step , 6877.41 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 03:26:13 | Epoch: 1 | Step: 211000 | Dataset: 0-1491661 | Loss: 0.755 | 914 ms/step , 6881.78 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-06 03:26:15 | Validation | Step: 211000 | Val_loss: 0.691 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:26:15 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_032615_step_211000.pt` INFO:__main__:2024-11-06 03:26:25 | Epoch: 1 | Step: 211010 | Dataset: 0-1491981 | Loss: 0.742 | 913 ms/step , 6888.55 GFLOP/s , 13782.6 tokens/s INFO:__main__:2024-11-06 03:26:34 | Epoch: 1 | Step: 211020 | Dataset: 0-1492301 | Loss: 0.801 | 914 ms/step , 6884.66 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 03:26:43 | Epoch: 1 | Step: 211030 | Dataset: 0-1492621 | Loss: 0.762 | 913 ms/step , 6886.05 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 03:26:52 | Epoch: 1 | Step: 211040 | Dataset: 0-1492941 | Loss: 0.750 | 916 ms/step , 6869.14 GFLOP/s , 17884.5 tokens/s INFO:__main__:2024-11-06 03:27:02 | Epoch: 1 | Step: 211050 | Dataset: 0-1493261 | Loss: 0.738 | 915 ms/step , 6873.91 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 03:27:11 | Epoch: 1 | Step: 211060 | Dataset: 0-1493581 | Loss: 0.782 | 914 ms/step , 6880.73 GFLOP/s , 17930.6 
tokens/s INFO:__main__:2024-11-06 03:27:20 | Epoch: 1 | Step: 211070 | Dataset: 0-1493901 | Loss: 0.586 | 913 ms/step , 6888.36 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 03:27:29 | Epoch: 1 | Step: 211080 | Dataset: 0-1494221 | Loss: 0.690 | 914 ms/step , 6879.64 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 03:27:38 | Epoch: 1 | Step: 211090 | Dataset: 0-1494541 | Loss: 0.434 | 912 ms/step , 6895.00 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 03:27:47 | Epoch: 1 | Step: 211100 | Dataset: 0-1494861 | Loss: 0.711 | 913 ms/step , 6887.95 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 03:27:49 | Validation | Step: 211100 | Val_loss: 0.762 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:27:58 | Epoch: 1 | Step: 211110 | Dataset: 0-1495181 | Loss: 0.664 | 912 ms/step , 6894.06 GFLOP/s , 15272.1 tokens/s INFO:__main__:2024-11-06 03:28:07 | Epoch: 1 | Step: 211120 | Dataset: 0-1495501 | Loss: 0.656 | 913 ms/step , 6888.52 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 03:28:16 | Epoch: 1 | Step: 211130 | Dataset: 0-1495821 | Loss: 0.687 | 913 ms/step , 6891.21 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 03:28:25 | Epoch: 1 | Step: 211140 | Dataset: 0-1496141 | Loss: 0.700 | 913 ms/step , 6887.21 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 03:28:35 | Epoch: 1 | Step: 211150 | Dataset: 0-1496461 | Loss: 0.701 | 913 ms/step , 6885.12 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 03:28:44 | Epoch: 1 | Step: 211160 | Dataset: 0-1496781 | Loss: 0.660 | 912 ms/step , 6897.57 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 03:28:53 | Epoch: 1 | Step: 211170 | Dataset: 0-1497101 | Loss: 0.621 | 912 ms/step , 6893.98 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 03:29:02 | Epoch: 1 | Step: 211180 | Dataset: 0-1497421 | Loss: 0.641 | 913 ms/step , 6888.06 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 03:29:11 | Epoch: 1 | Step: 211190 | Dataset: 0-1497741 | Loss: 0.630 | 912 ms/step , 6894.83 GFLOP/s , 
17928.7 tokens/s INFO:__main__:2024-11-06 03:29:20 | Epoch: 1 | Step: 211200 | Dataset: 0-1498061 | Loss: 0.772 | 915 ms/step , 6875.06 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-06 03:29:22 | Validation | Step: 211200 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:29:31 | Epoch: 1 | Step: 211210 | Dataset: 0-1498381 | Loss: 0.880 | 913 ms/step , 6887.88 GFLOP/s , 15273.9 tokens/s INFO:__main__:2024-11-06 03:29:40 | Epoch: 1 | Step: 211220 | Dataset: 0-1498701 | Loss: 0.810 | 914 ms/step , 6884.71 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 03:29:49 | Epoch: 1 | Step: 211230 | Dataset: 0-1499021 | Loss: 0.754 | 914 ms/step , 6884.94 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 03:29:58 | Epoch: 1 | Step: 211240 | Dataset: 0-1499341 | Loss: 0.889 | 913 ms/step , 6891.65 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 03:30:08 | Epoch: 1 | Step: 211250 | Dataset: 0-1499661 | Loss: 0.556 | 914 ms/step , 6879.35 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 03:30:17 | Epoch: 1 | Step: 211260 | Dataset: 0-1499981 | Loss: 0.741 | 915 ms/step , 6874.80 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-06 03:30:26 | Epoch: 1 | Step: 211270 | Dataset: 0-1500301 | Loss: 0.707 | 913 ms/step , 6887.96 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 03:30:35 | Epoch: 1 | Step: 211280 | Dataset: 0-1500621 | Loss: 0.765 | 912 ms/step , 6896.23 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 03:30:44 | Epoch: 1 | Step: 211290 | Dataset: 0-1500941 | Loss: 0.785 | 913 ms/step , 6886.43 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 03:30:53 | Epoch: 1 | Step: 211300 | Dataset: 0-1501261 | Loss: 0.779 | 913 ms/step , 6887.28 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 03:30:55 | Validation | Step: 211300 | Val_loss: 0.706 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:31:04 | Epoch: 1 | Step: 211310 | Dataset: 0-1501581 | Loss: 0.523 | 913 ms/step , 6890.88 GFLOP/s , 15275.3 tokens/s 
INFO:__main__:2024-11-06 03:31:13 | Epoch: 1 | Step: 211320 | Dataset: 0-1501901 | Loss: 0.681 | 913 ms/step , 6890.09 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 03:31:22 | Epoch: 1 | Step: 211330 | Dataset: 0-1502221 | Loss: 0.754 | 913 ms/step , 6891.92 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 03:31:31 | Epoch: 1 | Step: 211340 | Dataset: 0-1502541 | Loss: 0.705 | 914 ms/step , 6882.93 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 03:31:40 | Epoch: 1 | Step: 211350 | Dataset: 0-1502861 | Loss: 0.731 | 913 ms/step , 6890.72 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 03:31:50 | Epoch: 1 | Step: 211360 | Dataset: 0-1503181 | Loss: 0.660 | 913 ms/step , 6888.48 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 03:31:59 | Epoch: 1 | Step: 211370 | Dataset: 0-1503501 | Loss: 0.569 | 913 ms/step , 6886.35 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 03:32:08 | Epoch: 1 | Step: 211380 | Dataset: 0-1503821 | Loss: 0.673 | 914 ms/step , 6882.89 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 03:32:17 | Epoch: 1 | Step: 211390 | Dataset: 0-1504141 | Loss: 0.777 | 913 ms/step , 6888.18 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 03:32:26 | Epoch: 1 | Step: 211400 | Dataset: 0-1504461 | Loss: 0.309 | 912 ms/step , 6897.68 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 03:32:28 | Validation | Step: 211400 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:32:37 | Epoch: 1 | Step: 211410 | Dataset: 0-1504781 | Loss: 0.526 | 913 ms/step , 6891.05 GFLOP/s , 15277.7 tokens/s INFO:__main__:2024-11-06 03:32:46 | Epoch: 1 | Step: 211420 | Dataset: 0-1505101 | Loss: 0.829 | 913 ms/step , 6891.40 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 03:32:55 | Epoch: 1 | Step: 211430 | Dataset: 0-1505421 | Loss: 0.691 | 913 ms/step , 6887.43 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 03:33:04 | Epoch: 1 | Step: 211440 | Dataset: 0-1505741 | Loss: 0.667 | 913 ms/step , 6886.33 GFLOP/s , 17929.8 
tokens/s INFO:__main__:2024-11-06 03:33:13 | Epoch: 1 | Step: 211450 | Dataset: 0-1506061 | Loss: 0.771 | 913 ms/step , 6892.20 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 03:33:23 | Epoch: 1 | Step: 211460 | Dataset: 0-1506381 | Loss: 0.664 | 913 ms/step , 6892.33 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 03:33:32 | Epoch: 1 | Step: 211470 | Dataset: 0-1506701 | Loss: 0.739 | 913 ms/step , 6888.93 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 03:33:41 | Epoch: 1 | Step: 211480 | Dataset: 0-1507021 | Loss: 0.822 | 914 ms/step , 6882.46 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 03:33:50 | Epoch: 1 | Step: 211490 | Dataset: 0-1507341 | Loss: 0.747 | 913 ms/step , 6887.70 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-06 03:33:59 | Epoch: 1 | Step: 211500 | Dataset: 0-1507661 | Loss: 0.717 | 915 ms/step , 6874.83 GFLOP/s , 17942.2 tokens/s INFO:__main__:2024-11-06 03:34:01 | Validation | Step: 211500 | Val_loss: 0.627 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:34:10 | Epoch: 1 | Step: 211510 | Dataset: 0-1507981 | Loss: 0.670 | 912 ms/step , 6892.83 GFLOP/s , 15276.6 tokens/s INFO:__main__:2024-11-06 03:34:19 | Epoch: 1 | Step: 211520 | Dataset: 0-1508301 | Loss: 0.777 | 914 ms/step , 6884.01 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 03:34:28 | Epoch: 1 | Step: 211530 | Dataset: 0-1508621 | Loss: 0.810 | 913 ms/step , 6886.01 GFLOP/s , 17943.6 tokens/s INFO:__main__:2024-11-06 03:34:37 | Epoch: 1 | Step: 211540 | Dataset: 0-1508941 | Loss: 0.755 | 914 ms/step , 6882.85 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-06 03:34:46 | Epoch: 1 | Step: 211550 | Dataset: 0-1509261 | Loss: 0.618 | 913 ms/step , 6888.87 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-06 03:34:56 | Epoch: 1 | Step: 211560 | Dataset: 0-1509581 | Loss: 0.749 | 913 ms/step , 6886.77 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 03:35:05 | Epoch: 1 | Step: 211570 | Dataset: 0-1509901 | Loss: 0.652 | 913 ms/step , 6885.37 GFLOP/s , 
17927.7 tokens/s INFO:__main__:2024-11-06 03:35:14 | Epoch: 1 | Step: 211580 | Dataset: 0-1510221 | Loss: 0.706 | 912 ms/step , 6898.26 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 03:35:23 | Epoch: 1 | Step: 211590 | Dataset: 0-1510541 | Loss: 0.616 | 914 ms/step , 6881.14 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 03:35:32 | Epoch: 1 | Step: 211600 | Dataset: 0-1510861 | Loss: 0.592 | 913 ms/step , 6891.60 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 03:35:34 | Validation | Step: 211600 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:35:43 | Epoch: 1 | Step: 211610 | Dataset: 0-1511181 | Loss: 0.702 | 912 ms/step , 6894.94 GFLOP/s , 15288.4 tokens/s INFO:__main__:2024-11-06 03:35:52 | Epoch: 1 | Step: 211620 | Dataset: 0-1511501 | Loss: 0.822 | 913 ms/step , 6891.25 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 03:36:01 | Epoch: 1 | Step: 211630 | Dataset: 0-1511821 | Loss: 0.638 | 914 ms/step , 6884.36 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 03:36:10 | Epoch: 1 | Step: 211640 | Dataset: 0-1512141 | Loss: 0.711 | 914 ms/step , 6881.58 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 03:36:19 | Epoch: 1 | Step: 211650 | Dataset: 0-1512461 | Loss: 0.742 | 915 ms/step , 6876.71 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 03:36:28 | Epoch: 1 | Step: 211660 | Dataset: 0-1512781 | Loss: 0.712 | 913 ms/step , 6886.92 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 03:36:38 | Epoch: 1 | Step: 211670 | Dataset: 0-1513101 | Loss: 0.787 | 913 ms/step , 6887.22 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 03:36:47 | Epoch: 1 | Step: 211680 | Dataset: 0-1513421 | Loss: 0.643 | 914 ms/step , 6881.13 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 03:36:56 | Epoch: 1 | Step: 211690 | Dataset: 0-1513741 | Loss: 0.611 | 912 ms/step , 6892.83 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 03:37:05 | Epoch: 1 | Step: 211700 | Dataset: 0-1514061 | Loss: 0.705 | 912 ms/step , 6899.23 GFLOP/s 
, 17937.7 tokens/s INFO:__main__:2024-11-06 03:37:07 | Validation | Step: 211700 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:37:16 | Epoch: 1 | Step: 211710 | Dataset: 0-1514381 | Loss: 0.649 | 913 ms/step , 6891.01 GFLOP/s , 15274.4 tokens/s INFO:__main__:2024-11-06 03:37:25 | Epoch: 1 | Step: 211720 | Dataset: 0-1514701 | Loss: 0.768 | 913 ms/step , 6885.54 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 03:37:34 | Epoch: 1 | Step: 211730 | Dataset: 0-1515021 | Loss: 0.688 | 914 ms/step , 6883.85 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 03:37:43 | Epoch: 1 | Step: 211740 | Dataset: 0-1515341 | Loss: 0.696 | 914 ms/step , 6881.85 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 03:37:52 | Epoch: 1 | Step: 211750 | Dataset: 0-1515661 | Loss: 0.754 | 912 ms/step , 6892.68 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-06 03:38:01 | Epoch: 1 | Step: 211760 | Dataset: 0-1515981 | Loss: 0.787 | 912 ms/step , 6893.51 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 03:38:11 | Epoch: 1 | Step: 211770 | Dataset: 0-1516301 | Loss: 0.817 | 912 ms/step , 6892.67 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-06 03:38:20 | Epoch: 1 | Step: 211780 | Dataset: 0-1516621 | Loss: 0.721 | 912 ms/step , 6894.77 GFLOP/s , 17940.7 tokens/s INFO:__main__:2024-11-06 03:38:29 | Epoch: 1 | Step: 211790 | Dataset: 0-1516941 | Loss: 0.849 | 912 ms/step , 6892.92 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 03:38:38 | Epoch: 1 | Step: 211800 | Dataset: 0-1517261 | Loss: 0.652 | 915 ms/step , 6871.32 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 03:38:40 | Validation | Step: 211800 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:38:49 | Epoch: 1 | Step: 211810 | Dataset: 0-1517581 | Loss: 0.619 | 913 ms/step , 6885.70 GFLOP/s , 15286.2 tokens/s INFO:__main__:2024-11-06 03:38:58 | Epoch: 1 | Step: 211820 | Dataset: 0-1517901 | Loss: 0.722 | 913 ms/step , 6885.47 GFLOP/s , 17936.1 tokens/s 
INFO:__main__:2024-11-06 03:39:07 | Epoch: 1 | Step: 211830 | Dataset: 0-1518221 | Loss: 0.705 | 912 ms/step , 6895.00 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 03:39:16 | Epoch: 1 | Step: 211840 | Dataset: 0-1518541 | Loss: 0.721 | 914 ms/step , 6881.29 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 03:39:25 | Epoch: 1 | Step: 211850 | Dataset: 0-1518861 | Loss: 0.665 | 912 ms/step , 6895.68 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 03:39:34 | Epoch: 1 | Step: 211860 | Dataset: 0-1519181 | Loss: 0.692 | 913 ms/step , 6885.89 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 03:39:43 | Epoch: 1 | Step: 211870 | Dataset: 0-1519501 | Loss: 0.811 | 914 ms/step , 6882.54 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 03:39:53 | Epoch: 1 | Step: 211880 | Dataset: 0-1519821 | Loss: 0.718 | 913 ms/step , 6890.86 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 03:40:02 | Epoch: 1 | Step: 211890 | Dataset: 0-1520141 | Loss: 0.636 | 912 ms/step , 6893.70 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 03:40:11 | Epoch: 1 | Step: 211900 | Dataset: 0-1520461 | Loss: 0.941 | 912 ms/step , 6895.40 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 03:40:12 | Validation | Step: 211900 | Val_loss: 0.749 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:40:22 | Epoch: 1 | Step: 211910 | Dataset: 0-1520781 | Loss: 0.690 | 913 ms/step , 6888.20 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-06 03:40:31 | Epoch: 1 | Step: 211920 | Dataset: 0-1521101 | Loss: 0.854 | 913 ms/step , 6891.37 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 03:40:40 | Epoch: 1 | Step: 211930 | Dataset: 0-1521421 | Loss: 0.665 | 912 ms/step , 6895.28 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 03:40:49 | Epoch: 1 | Step: 211940 | Dataset: 0-1521741 | Loss: 0.663 | 913 ms/step , 6888.77 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 03:40:58 | Epoch: 1 | Step: 211950 | Dataset: 0-1522061 | Loss: 0.697 | 914 ms/step , 6882.76 GFLOP/s , 17939.9 
tokens/s INFO:__main__:2024-11-06 03:41:07 | Epoch: 1 | Step: 211960 | Dataset: 0-1522381 | Loss: 0.683 | 913 ms/step , 6888.21 GFLOP/s , 17941.9 tokens/s INFO:__main__:2024-11-06 03:41:16 | Epoch: 1 | Step: 211970 | Dataset: 0-1522701 | Loss: 0.627 | 913 ms/step , 6892.23 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 03:41:26 | Epoch: 1 | Step: 211980 | Dataset: 0-1523021 | Loss: 0.649 | 913 ms/step , 6892.47 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-06 03:41:35 | Epoch: 1 | Step: 211990 | Dataset: 0-1523341 | Loss: 0.704 | 913 ms/step , 6885.73 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 03:41:44 | Epoch: 1 | Step: 212000 | Dataset: 0-1523661 | Loss: 0.723 | 912 ms/step , 6898.76 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 03:41:45 | Validation | Step: 212000 | Val_loss: 0.675 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:41:45 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_034145_step_212000.pt` INFO:__main__:2024-11-06 03:41:56 | Epoch: 1 | Step: 212010 | Dataset: 0-1523981 | Loss: 0.734 | 913 ms/step , 6887.78 GFLOP/s , 13765.8 tokens/s INFO:__main__:2024-11-06 03:42:05 | Epoch: 1 | Step: 212020 | Dataset: 0-1524301 | Loss: 0.553 | 913 ms/step , 6891.60 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-06 03:42:14 | Epoch: 1 | Step: 212030 | Dataset: 0-1524621 | Loss: 0.665 | 912 ms/step , 6894.20 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-06 03:42:23 | Epoch: 1 | Step: 212040 | Dataset: 0-1524941 | Loss: 0.554 | 913 ms/step , 6889.97 GFLOP/s , 17906.6 tokens/s INFO:__main__:2024-11-06 03:42:32 | Epoch: 1 | Step: 212050 | Dataset: 0-1525261 | Loss: 0.883 | 913 ms/step , 6891.04 GFLOP/s , 17946.0 tokens/s INFO:__main__:2024-11-06 03:42:41 | Epoch: 1 | Step: 212060 | Dataset: 0-1525581 | Loss: 0.624 | 913 ms/step , 6892.56 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 03:42:51 | Epoch: 1 | Step: 212070 | Dataset: 0-1525901 | Loss: 0.782 | 912 ms/step , 6893.48 GFLOP/s , 17942.2 
tokens/s INFO:__main__:2024-11-06 03:43:00 | Epoch: 1 | Step: 212080 | Dataset: 0-1526221 | Loss: 0.669 | 911 ms/step , 6900.85 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 03:43:09 | Epoch: 1 | Step: 212090 | Dataset: 0-1526541 | Loss: 0.651 | 912 ms/step , 6892.79 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-06 03:43:18 | Epoch: 1 | Step: 212100 | Dataset: 0-1526861 | Loss: 0.717 | 914 ms/step , 6881.11 GFLOP/s , 17942.9 tokens/s INFO:__main__:2024-11-06 03:43:20 | Validation | Step: 212100 | Val_loss: 0.735 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:43:29 | Epoch: 1 | Step: 212110 | Dataset: 0-1527181 | Loss: 0.732 | 911 ms/step , 6901.94 GFLOP/s , 15282.6 tokens/s INFO:__main__:2024-11-06 03:43:38 | Epoch: 1 | Step: 212120 | Dataset: 0-1527501 | Loss: 0.829 | 914 ms/step , 6877.65 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 03:43:47 | Epoch: 1 | Step: 212130 | Dataset: 0-1527821 | Loss: 0.743 | 913 ms/step , 6888.08 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 03:43:56 | Epoch: 1 | Step: 212140 | Dataset: 0-1528141 | Loss: 0.732 | 912 ms/step , 6893.82 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 03:44:05 | Epoch: 1 | Step: 212150 | Dataset: 0-1528461 | Loss: 0.713 | 913 ms/step , 6891.81 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 03:44:14 | Epoch: 1 | Step: 212160 | Dataset: 0-1528781 | Loss: 0.653 | 912 ms/step , 6895.16 GFLOP/s , 17947.5 tokens/s INFO:__main__:2024-11-06 03:44:23 | Epoch: 1 | Step: 212170 | Dataset: 0-1529101 | Loss: 0.628 | 913 ms/step , 6886.07 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 03:44:33 | Epoch: 1 | Step: 212180 | Dataset: 0-1529421 | Loss: 0.643 | 913 ms/step , 6888.15 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-06 03:44:42 | Epoch: 1 | Step: 212190 | Dataset: 0-1529741 | Loss: 0.725 | 914 ms/step , 6884.34 GFLOP/s , 17948.1 tokens/s INFO:__main__:2024-11-06 03:44:51 | Epoch: 1 | Step: 212200 | Dataset: 0-1530061 | Loss: 0.734 | 913 ms/step , 6892.55 GFLOP/s , 
17939.2 tokens/s INFO:__main__:2024-11-06 03:44:52 | Validation | Step: 212200 | Val_loss: 0.727 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:45:02 | Epoch: 1 | Step: 212210 | Dataset: 0-1530381 | Loss: 0.726 | 913 ms/step , 6887.88 GFLOP/s , 15286.0 tokens/s INFO:__main__:2024-11-06 03:45:11 | Epoch: 1 | Step: 212220 | Dataset: 0-1530701 | Loss: 0.618 | 913 ms/step , 6891.33 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 03:45:20 | Epoch: 1 | Step: 212230 | Dataset: 0-1531021 | Loss: 0.655 | 912 ms/step , 6895.94 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 03:45:29 | Epoch: 1 | Step: 212240 | Dataset: 0-1531341 | Loss: 0.706 | 913 ms/step , 6892.57 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 03:45:38 | Epoch: 1 | Step: 212250 | Dataset: 0-1531661 | Loss: 0.716 | 912 ms/step , 6894.24 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-06 03:45:47 | Epoch: 1 | Step: 212260 | Dataset: 0-1531981 | Loss: 0.659 | 913 ms/step , 6889.04 GFLOP/s , 17945.4 tokens/s INFO:__main__:2024-11-06 03:45:56 | Epoch: 1 | Step: 212270 | Dataset: 0-1532301 | Loss: 0.669 | 914 ms/step , 6881.72 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 03:46:06 | Epoch: 1 | Step: 212280 | Dataset: 0-1532621 | Loss: 0.623 | 912 ms/step , 6898.17 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-06 03:46:15 | Epoch: 1 | Step: 212290 | Dataset: 0-1532941 | Loss: 0.761 | 913 ms/step , 6888.93 GFLOP/s , 17939.9 tokens/s INFO:__main__:2024-11-06 03:46:24 | Epoch: 1 | Step: 212300 | Dataset: 0-1533261 | Loss: 0.727 | 913 ms/step , 6890.48 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 03:46:25 | Validation | Step: 212300 | Val_loss: 0.728 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:46:35 | Epoch: 1 | Step: 212310 | Dataset: 0-1533581 | Loss: 0.698 | 912 ms/step , 6895.65 GFLOP/s , 15278.6 tokens/s INFO:__main__:2024-11-06 03:46:44 | Epoch: 1 | Step: 212320 | Dataset: 0-1533901 | Loss: 0.831 | 913 ms/step , 6886.17 GFLOP/s , 17940.4 tokens/s 
INFO:__main__:2024-11-06 03:46:53 | Epoch: 1 | Step: 212330 | Dataset: 0-1534221 | Loss: 0.742 | 912 ms/step , 6894.81 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-06 03:47:02 | Epoch: 1 | Step: 212340 | Dataset: 0-1534541 | Loss: 0.807 | 913 ms/step , 6888.85 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-06 03:47:11 | Epoch: 1 | Step: 212350 | Dataset: 0-1534861 | Loss: 0.798 | 913 ms/step , 6891.34 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-06 03:47:20 | Epoch: 1 | Step: 212360 | Dataset: 0-1535181 | Loss: 0.658 | 913 ms/step , 6891.22 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 03:47:29 | Epoch: 1 | Step: 212370 | Dataset: 0-1535501 | Loss: 0.689 | 913 ms/step , 6886.29 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 03:47:38 | Epoch: 1 | Step: 212380 | Dataset: 0-1535821 | Loss: 0.687 | 913 ms/step , 6889.22 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-06 03:47:48 | Epoch: 1 | Step: 212390 | Dataset: 0-1536141 | Loss: 0.711 | 912 ms/step , 6897.83 GFLOP/s , 17946.1 tokens/s INFO:__main__:2024-11-06 03:47:57 | Epoch: 1 | Step: 212400 | Dataset: 0-1536461 | Loss: 0.615 | 913 ms/step , 6890.17 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-06 03:47:58 | Validation | Step: 212400 | Val_loss: 0.740 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:48:07 | Epoch: 1 | Step: 212410 | Dataset: 0-1536781 | Loss: 0.663 | 914 ms/step , 6884.28 GFLOP/s , 15285.9 tokens/s INFO:__main__:2024-11-06 03:48:17 | Epoch: 1 | Step: 212420 | Dataset: 0-1537101 | Loss: 0.683 | 913 ms/step , 6891.15 GFLOP/s , 17945.1 tokens/s INFO:__main__:2024-11-06 03:48:26 | Epoch: 1 | Step: 212430 | Dataset: 0-1537421 | Loss: 0.714 | 913 ms/step , 6888.93 GFLOP/s , 17942.7 tokens/s INFO:__main__:2024-11-06 03:48:35 | Epoch: 1 | Step: 212440 | Dataset: 0-1537741 | Loss: 0.759 | 913 ms/step , 6886.89 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 03:48:44 | Epoch: 1 | Step: 212450 | Dataset: 0-1538061 | Loss: 0.627 | 913 ms/step , 6887.79 GFLOP/s , 17936.0 
tokens/s INFO:__main__:2024-11-06 03:48:53 | Epoch: 1 | Step: 212460 | Dataset: 0-1538381 | Loss: 0.573 | 913 ms/step , 6885.93 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 03:49:02 | Epoch: 1 | Step: 212470 | Dataset: 0-1538701 | Loss: 0.787 | 914 ms/step , 6878.57 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-06 03:49:11 | Epoch: 1 | Step: 212480 | Dataset: 0-1539021 | Loss: 0.671 | 913 ms/step , 6892.30 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 03:49:20 | Epoch: 1 | Step: 212490 | Dataset: 0-1539341 | Loss: 0.749 | 914 ms/step , 6880.84 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 03:49:30 | Epoch: 1 | Step: 212500 | Dataset: 0-1539661 | Loss: 0.614 | 911 ms/step , 6900.43 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 03:49:31 | Validation | Step: 212500 | Val_loss: 0.715 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:49:40 | Epoch: 1 | Step: 212510 | Dataset: 0-1539981 | Loss: 0.675 | 913 ms/step , 6891.14 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-06 03:49:49 | Epoch: 1 | Step: 212520 | Dataset: 0-1540301 | Loss: 0.700 | 913 ms/step , 6892.08 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 03:49:59 | Epoch: 1 | Step: 212530 | Dataset: 0-1540621 | Loss: 0.784 | 912 ms/step , 6894.90 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 03:50:08 | Epoch: 1 | Step: 212540 | Dataset: 0-1540941 | Loss: 0.554 | 912 ms/step , 6897.94 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 03:50:17 | Epoch: 1 | Step: 212550 | Dataset: 0-1541261 | Loss: 0.707 | 914 ms/step , 6883.17 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 03:50:26 | Epoch: 1 | Step: 212560 | Dataset: 0-1541581 | Loss: 0.779 | 914 ms/step , 6881.02 GFLOP/s , 17944.5 tokens/s INFO:__main__:2024-11-06 03:50:35 | Epoch: 1 | Step: 212570 | Dataset: 0-1541901 | Loss: 0.699 | 913 ms/step , 6890.80 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 03:50:44 | Epoch: 1 | Step: 212580 | Dataset: 0-1542221 | Loss: 0.588 | 912 ms/step , 6897.15 GFLOP/s , 
17939.6 tokens/s INFO:__main__:2024-11-06 03:50:53 | Epoch: 1 | Step: 212590 | Dataset: 0-1542541 | Loss: 0.738 | 913 ms/step , 6890.81 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 03:51:03 | Epoch: 1 | Step: 212600 | Dataset: 0-1542861 | Loss: 0.541 | 912 ms/step , 6896.27 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-06 03:51:04 | Validation | Step: 212600 | Val_loss: 0.776 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:51:13 | Epoch: 1 | Step: 212610 | Dataset: 0-1543181 | Loss: 0.338 | 912 ms/step , 6899.98 GFLOP/s , 15294.5 tokens/s INFO:__main__:2024-11-06 03:51:22 | Epoch: 1 | Step: 212620 | Dataset: 0-1543501 | Loss: 0.356 | 912 ms/step , 6892.68 GFLOP/s , 17956.2 tokens/s INFO:__main__:2024-11-06 03:51:32 | Epoch: 1 | Step: 212630 | Dataset: 0-1543821 | Loss: 0.382 | 912 ms/step , 6892.71 GFLOP/s , 17955.0 tokens/s INFO:__main__:2024-11-06 03:51:41 | Epoch: 1 | Step: 212640 | Dataset: 0-1544141 | Loss: 0.447 | 914 ms/step , 6884.41 GFLOP/s , 17952.3 tokens/s INFO:__main__:2024-11-06 03:51:50 | Epoch: 1 | Step: 212650 | Dataset: 0-1544461 | Loss: 0.455 | 912 ms/step , 6895.76 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-06 03:51:59 | Epoch: 1 | Step: 212660 | Dataset: 0-1544781 | Loss: 0.380 | 912 ms/step , 6894.93 GFLOP/s , 17955.4 tokens/s INFO:__main__:2024-11-06 03:52:08 | Epoch: 1 | Step: 212670 | Dataset: 0-1545101 | Loss: 0.276 | 911 ms/step , 6900.52 GFLOP/s , 17956.9 tokens/s INFO:__main__:2024-11-06 03:52:17 | Epoch: 1 | Step: 212680 | Dataset: 0-1545421 | Loss: 0.472 | 912 ms/step , 6899.70 GFLOP/s , 17954.7 tokens/s INFO:__main__:2024-11-06 03:52:26 | Epoch: 1 | Step: 212690 | Dataset: 0-1545741 | Loss: 0.422 | 913 ms/step , 6885.35 GFLOP/s , 17953.0 tokens/s INFO:__main__:2024-11-06 03:52:35 | Epoch: 1 | Step: 212700 | Dataset: 0-1546061 | Loss: 0.366 | 911 ms/step , 6900.92 GFLOP/s , 17963.0 tokens/s INFO:__main__:2024-11-06 03:52:37 | Validation | Step: 212700 | Val_loss: 0.724 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 03:52:46 | Epoch: 1 | Step: 212710 | Dataset: 0-1546381 | Loss: 0.546 | 912 ms/step , 6893.82 GFLOP/s , 15293.5 tokens/s INFO:__main__:2024-11-06 03:52:55 | Epoch: 1 | Step: 212720 | Dataset: 0-1546701 | Loss: 0.444 | 911 ms/step , 6902.29 GFLOP/s , 17953.5 tokens/s INFO:__main__:2024-11-06 03:53:04 | Epoch: 1 | Step: 212730 | Dataset: 0-1547021 | Loss: 0.458 | 913 ms/step , 6891.38 GFLOP/s , 17952.3 tokens/s INFO:__main__:2024-11-06 03:53:13 | Epoch: 1 | Step: 212740 | Dataset: 0-1547341 | Loss: 0.454 | 911 ms/step , 6900.35 GFLOP/s , 17956.5 tokens/s INFO:__main__:2024-11-06 03:53:23 | Epoch: 1 | Step: 212750 | Dataset: 0-1547661 | Loss: 0.344 | 911 ms/step , 6900.43 GFLOP/s , 17958.6 tokens/s INFO:__main__:2024-11-06 03:53:32 | Epoch: 1 | Step: 212760 | Dataset: 0-1547981 | Loss: 0.388 | 912 ms/step , 6897.90 GFLOP/s , 17958.3 tokens/s INFO:__main__:2024-11-06 03:53:41 | Epoch: 1 | Step: 212770 | Dataset: 0-1548301 | Loss: 0.423 | 912 ms/step , 6894.94 GFLOP/s , 17964.5 tokens/s INFO:__main__:2024-11-06 03:53:50 | Epoch: 1 | Step: 212780 | Dataset: 0-1548621 | Loss: 0.261 | 912 ms/step , 6898.81 GFLOP/s , 17964.5 tokens/s INFO:__main__:2024-11-06 03:53:59 | Epoch: 1 | Step: 212790 | Dataset: 0-1548941 | Loss: 0.403 | 912 ms/step , 6898.20 GFLOP/s , 17964.5 tokens/s INFO:__main__:2024-11-06 03:54:08 | Epoch: 1 | Step: 212800 | Dataset: 0-1549261 | Loss: 0.370 | 910 ms/step , 6908.02 GFLOP/s , 17962.7 tokens/s INFO:__main__:2024-11-06 03:54:10 | Validation | Step: 212800 | Val_loss: 0.722 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:54:19 | Epoch: 1 | Step: 212810 | Dataset: 0-1549581 | Loss: 0.334 | 911 ms/step , 6901.82 GFLOP/s , 15291.5 tokens/s INFO:__main__:2024-11-06 03:54:28 | Epoch: 1 | Step: 212820 | Dataset: 0-1549901 | Loss: 0.329 | 912 ms/step , 6894.47 GFLOP/s , 17956.4 tokens/s INFO:__main__:2024-11-06 03:54:37 | Epoch: 1 | Step: 212830 | Dataset: 0-1550221 | Loss: 0.395 | 911 ms/step , 6901.83 GFLOP/s , 17962.8 
tokens/s INFO:__main__:2024-11-06 03:54:46 | Epoch: 1 | Step: 212840 | Dataset: 0-1550541 | Loss: 0.388 | 911 ms/step , 6901.88 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-06 03:54:55 | Epoch: 1 | Step: 212850 | Dataset: 0-1550861 | Loss: 0.402 | 913 ms/step , 6891.14 GFLOP/s , 17947.6 tokens/s INFO:__main__:2024-11-06 03:55:05 | Epoch: 1 | Step: 212860 | Dataset: 0-1551181 | Loss: 0.341 | 911 ms/step , 6903.38 GFLOP/s , 17961.6 tokens/s INFO:__main__:2024-11-06 03:55:14 | Epoch: 1 | Step: 212870 | Dataset: 0-1551501 | Loss: 0.465 | 912 ms/step , 6898.86 GFLOP/s , 17956.0 tokens/s INFO:__main__:2024-11-06 03:55:23 | Epoch: 1 | Step: 212880 | Dataset: 0-1551821 | Loss: 0.382 | 912 ms/step , 6893.38 GFLOP/s , 17955.3 tokens/s INFO:__main__:2024-11-06 03:55:32 | Epoch: 1 | Step: 212890 | Dataset: 0-1552141 | Loss: 0.385 | 913 ms/step , 6889.43 GFLOP/s , 17959.0 tokens/s INFO:__main__:2024-11-06 03:55:41 | Epoch: 1 | Step: 212900 | Dataset: 0-1552461 | Loss: 0.449 | 911 ms/step , 6901.50 GFLOP/s , 17957.8 tokens/s INFO:__main__:2024-11-06 03:55:43 | Validation | Step: 212900 | Val_loss: 0.720 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:55:52 | Epoch: 1 | Step: 212910 | Dataset: 0-1552781 | Loss: 0.723 | 913 ms/step , 6886.96 GFLOP/s , 15273.8 tokens/s INFO:__main__:2024-11-06 03:56:01 | Epoch: 1 | Step: 212920 | Dataset: 0-1553101 | Loss: 0.623 | 914 ms/step , 6880.78 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 03:56:10 | Epoch: 1 | Step: 212930 | Dataset: 0-1553421 | Loss: 0.772 | 913 ms/step , 6889.68 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 03:56:19 | Epoch: 1 | Step: 212940 | Dataset: 0-1553741 | Loss: 0.828 | 913 ms/step , 6887.46 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 03:56:28 | Epoch: 1 | Step: 212950 | Dataset: 0-1554061 | Loss: 0.655 | 914 ms/step , 6884.89 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 03:56:37 | Epoch: 1 | Step: 212960 | Dataset: 0-1554381 | Loss: 0.812 | 913 ms/step , 6891.72 GFLOP/s , 
17939.1 tokens/s INFO:__main__:2024-11-06 03:56:47 | Epoch: 1 | Step: 212970 | Dataset: 0-1554701 | Loss: 0.758 | 913 ms/step , 6887.37 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 03:56:56 | Epoch: 1 | Step: 212980 | Dataset: 0-1555021 | Loss: 0.667 | 914 ms/step , 6884.50 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 03:57:05 | Epoch: 1 | Step: 212990 | Dataset: 0-1555341 | Loss: 0.729 | 913 ms/step , 6886.03 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 03:57:14 | Epoch: 1 | Step: 213000 | Dataset: 0-1555661 | Loss: 0.713 | 913 ms/step , 6885.83 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 03:57:16 | Validation | Step: 213000 | Val_loss: 0.701 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:57:16 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_035716_step_213000.pt` INFO:__main__:2024-11-06 03:57:26 | Epoch: 1 | Step: 213010 | Dataset: 0-1555981 | Loss: 0.689 | 915 ms/step , 6873.48 GFLOP/s , 13769.1 tokens/s INFO:__main__:2024-11-06 03:57:35 | Epoch: 1 | Step: 213020 | Dataset: 0-1556301 | Loss: 0.634 | 914 ms/step , 6880.61 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 03:57:44 | Epoch: 1 | Step: 213030 | Dataset: 0-1556621 | Loss: 0.666 | 913 ms/step , 6889.69 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 03:57:53 | Epoch: 1 | Step: 213040 | Dataset: 0-1556941 | Loss: 0.728 | 913 ms/step , 6888.51 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 03:58:02 | Epoch: 1 | Step: 213050 | Dataset: 0-1557261 | Loss: 0.690 | 912 ms/step , 6898.43 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 03:58:12 | Epoch: 1 | Step: 213060 | Dataset: 0-1557581 | Loss: 0.625 | 913 ms/step , 6886.13 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-06 03:58:21 | Epoch: 1 | Step: 213070 | Dataset: 0-1557901 | Loss: 0.655 | 914 ms/step , 6882.28 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 03:58:30 | Epoch: 1 | Step: 213080 | Dataset: 0-1558221 | Loss: 0.660 | 913 ms/step , 6888.48 GFLOP/s , 
17924.3 tokens/s INFO:__main__:2024-11-06 03:58:39 | Epoch: 1 | Step: 213090 | Dataset: 0-1558541 | Loss: 0.708 | 915 ms/step , 6874.93 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 03:58:48 | Epoch: 1 | Step: 213100 | Dataset: 0-1558861 | Loss: 0.728 | 914 ms/step , 6883.51 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 03:58:50 | Validation | Step: 213100 | Val_loss: 0.710 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 03:58:59 | Epoch: 1 | Step: 213110 | Dataset: 0-1559181 | Loss: 0.635 | 913 ms/step , 6890.20 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-06 03:59:08 | Epoch: 1 | Step: 213120 | Dataset: 0-1559501 | Loss: 0.675 | 914 ms/step , 6884.66 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 03:59:17 | Epoch: 1 | Step: 213130 | Dataset: 0-1559821 | Loss: 0.753 | 914 ms/step , 6880.62 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 03:59:26 | Epoch: 1 | Step: 213140 | Dataset: 0-1560141 | Loss: 0.689 | 913 ms/step , 6889.42 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-06 03:59:35 | Epoch: 1 | Step: 213150 | Dataset: 0-1560461 | Loss: 0.648 | 914 ms/step , 6882.36 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 03:59:45 | Epoch: 1 | Step: 213160 | Dataset: 0-1560781 | Loss: 0.760 | 913 ms/step , 6890.91 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 03:59:54 | Epoch: 1 | Step: 213170 | Dataset: 0-1561101 | Loss: 0.734 | 913 ms/step , 6890.42 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 04:00:03 | Epoch: 1 | Step: 213180 | Dataset: 0-1561421 | Loss: 0.662 | 912 ms/step , 6897.35 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 04:00:12 | Epoch: 1 | Step: 213190 | Dataset: 0-1561741 | Loss: 0.688 | 914 ms/step , 6884.72 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 04:00:21 | Epoch: 1 | Step: 213200 | Dataset: 0-1562061 | Loss: 0.683 | 913 ms/step , 6886.78 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-06 04:00:23 | Validation | Step: 213200 | Val_loss: 0.708 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 04:00:32 | Epoch: 1 | Step: 213210 | Dataset: 0-1562381 | Loss: 0.668 | 913 ms/step , 6887.44 GFLOP/s , 15280.1 tokens/s INFO:__main__:2024-11-06 04:00:41 | Epoch: 1 | Step: 213220 | Dataset: 0-1562701 | Loss: 0.625 | 914 ms/step , 6882.53 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 04:00:50 | Epoch: 1 | Step: 213230 | Dataset: 0-1563021 | Loss: 0.662 | 915 ms/step , 6877.49 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 04:00:59 | Epoch: 1 | Step: 213240 | Dataset: 0-1563341 | Loss: 0.654 | 913 ms/step , 6887.58 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-06 04:01:08 | Epoch: 1 | Step: 213250 | Dataset: 0-1563661 | Loss: 0.689 | 912 ms/step , 6895.17 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 04:01:18 | Epoch: 1 | Step: 213260 | Dataset: 0-1563981 | Loss: 0.742 | 913 ms/step , 6888.32 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-06 04:01:27 | Epoch: 1 | Step: 213270 | Dataset: 0-1564301 | Loss: 0.648 | 913 ms/step , 6886.59 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 04:01:36 | Epoch: 1 | Step: 213280 | Dataset: 0-1564621 | Loss: 0.719 | 915 ms/step , 6875.39 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 04:01:45 | Epoch: 1 | Step: 213290 | Dataset: 0-1564941 | Loss: 0.711 | 913 ms/step , 6885.40 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 04:01:54 | Epoch: 1 | Step: 213300 | Dataset: 0-1565261 | Loss: 0.749 | 914 ms/step , 6877.60 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 04:01:56 | Validation | Step: 213300 | Val_loss: 0.717 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:02:05 | Epoch: 1 | Step: 213310 | Dataset: 0-1565581 | Loss: 0.660 | 912 ms/step , 6892.73 GFLOP/s , 15281.2 tokens/s INFO:__main__:2024-11-06 04:02:14 | Epoch: 1 | Step: 213320 | Dataset: 0-1565901 | Loss: 0.649 | 914 ms/step , 6883.73 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 04:02:23 | Epoch: 1 | Step: 213330 | Dataset: 0-1566221 | Loss: 0.635 | 912 ms/step , 6893.20 GFLOP/s , 17931.7 
tokens/s INFO:__main__:2024-11-06 04:02:32 | Epoch: 1 | Step: 213340 | Dataset: 0-1566541 | Loss: 0.705 | 914 ms/step , 6881.20 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 04:02:41 | Epoch: 1 | Step: 213350 | Dataset: 0-1566861 | Loss: 0.605 | 913 ms/step , 6888.03 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 04:02:50 | Epoch: 1 | Step: 213360 | Dataset: 0-1567181 | Loss: 0.716 | 915 ms/step , 6875.47 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 04:03:00 | Epoch: 1 | Step: 213370 | Dataset: 0-1567501 | Loss: 0.711 | 912 ms/step , 6895.26 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 04:03:09 | Epoch: 1 | Step: 213380 | Dataset: 0-1567821 | Loss: 0.719 | 915 ms/step , 6876.93 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 04:03:18 | Epoch: 1 | Step: 213390 | Dataset: 0-1568141 | Loss: 0.681 | 914 ms/step , 6883.98 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 04:03:27 | Epoch: 1 | Step: 213400 | Dataset: 0-1568461 | Loss: 0.765 | 914 ms/step , 6881.55 GFLOP/s , 17921.7 tokens/s INFO:__main__:2024-11-06 04:03:29 | Validation | Step: 213400 | Val_loss: 0.665 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:03:38 | Epoch: 1 | Step: 213410 | Dataset: 0-1568781 | Loss: 0.672 | 912 ms/step , 6898.56 GFLOP/s , 15279.7 tokens/s INFO:__main__:2024-11-06 04:03:47 | Epoch: 1 | Step: 213420 | Dataset: 0-1569101 | Loss: 0.600 | 914 ms/step , 6880.25 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 04:03:56 | Epoch: 1 | Step: 213430 | Dataset: 0-1569421 | Loss: 0.640 | 913 ms/step , 6890.88 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 04:04:05 | Epoch: 1 | Step: 213440 | Dataset: 0-1569741 | Loss: 0.705 | 913 ms/step , 6891.44 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-06 04:04:14 | Epoch: 1 | Step: 213450 | Dataset: 0-1570061 | Loss: 0.676 | 914 ms/step , 6879.90 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 04:04:23 | Epoch: 1 | Step: 213460 | Dataset: 0-1570381 | Loss: 0.626 | 913 ms/step , 6888.16 GFLOP/s , 
17930.6 tokens/s INFO:__main__:2024-11-06 04:04:33 | Epoch: 1 | Step: 213470 | Dataset: 0-1570701 | Loss: 0.719 | 913 ms/step , 6887.85 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 04:04:42 | Epoch: 1 | Step: 213480 | Dataset: 0-1571021 | Loss: 0.782 | 913 ms/step , 6888.43 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 04:04:51 | Epoch: 1 | Step: 213490 | Dataset: 0-1571341 | Loss: 0.733 | 913 ms/step , 6889.46 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 04:05:00 | Epoch: 1 | Step: 213500 | Dataset: 0-1571661 | Loss: 0.804 | 913 ms/step , 6885.60 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-06 04:05:02 | Validation | Step: 213500 | Val_loss: 0.711 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:05:11 | Epoch: 1 | Step: 213510 | Dataset: 0-1571981 | Loss: 0.689 | 914 ms/step , 6884.37 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-06 04:05:20 | Epoch: 1 | Step: 213520 | Dataset: 0-1572301 | Loss: 0.760 | 914 ms/step , 6878.53 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 04:05:29 | Epoch: 1 | Step: 213530 | Dataset: 0-1572621 | Loss: 0.725 | 914 ms/step , 6883.18 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 04:05:38 | Epoch: 1 | Step: 213540 | Dataset: 0-1572941 | Loss: 0.672 | 913 ms/step , 6889.48 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 04:05:47 | Epoch: 1 | Step: 213550 | Dataset: 0-1573261 | Loss: 0.675 | 913 ms/step , 6887.62 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 04:05:56 | Epoch: 1 | Step: 213560 | Dataset: 0-1573581 | Loss: 0.664 | 913 ms/step , 6889.07 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-06 04:06:06 | Epoch: 1 | Step: 213570 | Dataset: 0-1573901 | Loss: 0.677 | 913 ms/step , 6891.62 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 04:06:15 | Epoch: 1 | Step: 213580 | Dataset: 0-1574221 | Loss: 0.655 | 914 ms/step , 6883.85 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 04:06:24 | Epoch: 1 | Step: 213590 | Dataset: 0-1574541 | Loss: 0.736 | 912 ms/step , 6893.76 GFLOP/s 
, 17942.5 tokens/s INFO:__main__:2024-11-06 04:06:33 | Epoch: 1 | Step: 213600 | Dataset: 0-1574861 | Loss: 0.707 | 912 ms/step , 6893.37 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 04:06:35 | Validation | Step: 213600 | Val_loss: 0.678 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:06:44 | Epoch: 1 | Step: 213610 | Dataset: 0-1575181 | Loss: 0.703 | 914 ms/step , 6883.35 GFLOP/s , 15275.7 tokens/s INFO:__main__:2024-11-06 04:06:53 | Epoch: 1 | Step: 213620 | Dataset: 0-1575501 | Loss: 0.683 | 912 ms/step , 6894.50 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 04:07:02 | Epoch: 1 | Step: 213630 | Dataset: 0-1575821 | Loss: 0.637 | 913 ms/step , 6885.40 GFLOP/s , 17921.3 tokens/s INFO:__main__:2024-11-06 04:07:11 | Epoch: 1 | Step: 213640 | Dataset: 0-1576141 | Loss: 0.563 | 912 ms/step , 6897.93 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 04:07:20 | Epoch: 1 | Step: 213650 | Dataset: 0-1576461 | Loss: 0.670 | 914 ms/step , 6882.91 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-06 04:07:29 | Epoch: 1 | Step: 213660 | Dataset: 0-1576781 | Loss: 0.651 | 913 ms/step , 6885.11 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-06 04:07:39 | Epoch: 1 | Step: 213670 | Dataset: 0-1577101 | Loss: 0.690 | 916 ms/step , 6866.40 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 04:07:48 | Epoch: 1 | Step: 213680 | Dataset: 0-1577421 | Loss: 0.753 | 913 ms/step , 6889.97 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 04:07:57 | Epoch: 1 | Step: 213690 | Dataset: 0-1577741 | Loss: 0.706 | 914 ms/step , 6884.66 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 04:08:06 | Epoch: 1 | Step: 213700 | Dataset: 0-1578061 | Loss: 0.782 | 914 ms/step , 6879.45 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 04:08:08 | Validation | Step: 213700 | Val_loss: 0.673 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:08:17 | Epoch: 1 | Step: 213710 | Dataset: 0-1578381 | Loss: 0.675 | 912 ms/step , 6895.37 GFLOP/s , 15275.0 tokens/s 
INFO:__main__:2024-11-06 04:08:26 | Epoch: 1 | Step: 213720 | Dataset: 0-1578701 | Loss: 0.718 | 914 ms/step , 6884.95 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 04:08:35 | Epoch: 1 | Step: 213730 | Dataset: 0-1579021 | Loss: 0.688 | 913 ms/step , 6887.43 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 04:08:44 | Epoch: 1 | Step: 213740 | Dataset: 0-1579341 | Loss: 0.646 | 913 ms/step , 6885.36 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 04:08:53 | Epoch: 1 | Step: 213750 | Dataset: 0-1579661 | Loss: 0.666 | 914 ms/step , 6879.10 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 04:09:02 | Epoch: 1 | Step: 213760 | Dataset: 0-1579981 | Loss: 0.714 | 913 ms/step , 6887.04 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 04:09:11 | Epoch: 1 | Step: 213770 | Dataset: 0-1580301 | Loss: 0.713 | 912 ms/step , 6893.38 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-06 04:09:21 | Epoch: 1 | Step: 213780 | Dataset: 0-1580621 | Loss: 0.695 | 915 ms/step , 6877.10 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 04:09:30 | Epoch: 1 | Step: 213790 | Dataset: 0-1580941 | Loss: 0.711 | 913 ms/step , 6889.72 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 04:09:39 | Epoch: 1 | Step: 213800 | Dataset: 0-1581261 | Loss: 0.672 | 913 ms/step , 6890.96 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 04:09:40 | Validation | Step: 213800 | Val_loss: 0.679 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:09:50 | Epoch: 1 | Step: 213810 | Dataset: 0-1581581 | Loss: 0.722 | 913 ms/step , 6888.38 GFLOP/s , 15279.2 tokens/s INFO:__main__:2024-11-06 04:09:59 | Epoch: 1 | Step: 213820 | Dataset: 0-1581901 | Loss: 0.658 | 912 ms/step , 6896.64 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 04:10:08 | Epoch: 1 | Step: 213830 | Dataset: 0-1582221 | Loss: 0.727 | 913 ms/step , 6888.60 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 04:10:17 | Epoch: 1 | Step: 213840 | Dataset: 0-1582541 | Loss: 0.699 | 913 ms/step , 6889.90 GFLOP/s , 17931.1 
tokens/s INFO:__main__:2024-11-06 04:10:26 | Epoch: 1 | Step: 213850 | Dataset: 0-1582861 | Loss: 0.731 | 914 ms/step , 6884.75 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 04:10:35 | Epoch: 1 | Step: 213860 | Dataset: 0-1583181 | Loss: 0.624 | 913 ms/step , 6891.16 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 04:10:44 | Epoch: 1 | Step: 213870 | Dataset: 0-1583501 | Loss: 0.667 | 913 ms/step , 6891.05 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 04:10:54 | Epoch: 1 | Step: 213880 | Dataset: 0-1583821 | Loss: 0.575 | 914 ms/step , 6883.31 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 04:11:03 | Epoch: 1 | Step: 213890 | Dataset: 0-1584141 | Loss: 0.663 | 914 ms/step , 6879.95 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 04:11:12 | Epoch: 1 | Step: 213900 | Dataset: 0-1584461 | Loss: 0.773 | 916 ms/step , 6868.92 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 04:11:13 | Validation | Step: 213900 | Val_loss: 0.682 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:11:23 | Epoch: 1 | Step: 213910 | Dataset: 0-1584781 | Loss: 0.755 | 913 ms/step , 6885.96 GFLOP/s , 15271.4 tokens/s INFO:__main__:2024-11-06 04:11:32 | Epoch: 1 | Step: 213920 | Dataset: 0-1585101 | Loss: 0.659 | 913 ms/step , 6891.15 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 04:11:41 | Epoch: 1 | Step: 213930 | Dataset: 0-1585421 | Loss: 0.676 | 913 ms/step , 6888.93 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-06 04:11:50 | Epoch: 1 | Step: 213940 | Dataset: 0-1585741 | Loss: 0.765 | 915 ms/step , 6874.35 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 04:11:59 | Epoch: 1 | Step: 213950 | Dataset: 0-1586061 | Loss: 0.833 | 914 ms/step , 6882.69 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-06 04:12:08 | Epoch: 1 | Step: 213960 | Dataset: 0-1586381 | Loss: 0.714 | 913 ms/step , 6887.77 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 04:12:17 | Epoch: 1 | Step: 213970 | Dataset: 0-1586701 | Loss: 0.810 | 914 ms/step , 6883.26 GFLOP/s , 
17932.8 tokens/s INFO:__main__:2024-11-06 04:12:27 | Epoch: 1 | Step: 213980 | Dataset: 0-1587021 | Loss: 0.628 | 912 ms/step , 6896.77 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 04:12:36 | Epoch: 1 | Step: 213990 | Dataset: 0-1587341 | Loss: 0.759 | 914 ms/step , 6884.16 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-06 04:12:45 | Epoch: 1 | Step: 214000 | Dataset: 0-1587661 | Loss: 0.670 | 913 ms/step , 6887.88 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 04:12:46 | Validation | Step: 214000 | Val_loss: 0.750 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:12:46 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_041246_step_214000.pt` INFO:__main__:2024-11-06 04:12:57 | Epoch: 1 | Step: 214010 | Dataset: 0-1587981 | Loss: 0.622 | 914 ms/step , 6877.64 GFLOP/s , 13671.5 tokens/s INFO:__main__:2024-11-06 04:13:06 | Epoch: 1 | Step: 214020 | Dataset: 0-1588301 | Loss: 0.711 | 913 ms/step , 6887.85 GFLOP/s , 17894.4 tokens/s INFO:__main__:2024-11-06 04:13:15 | Epoch: 1 | Step: 214030 | Dataset: 0-1588621 | Loss: 0.745 | 914 ms/step , 6882.48 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 04:13:24 | Epoch: 1 | Step: 214040 | Dataset: 0-1588941 | Loss: 0.752 | 912 ms/step , 6892.77 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 04:13:33 | Epoch: 1 | Step: 214050 | Dataset: 0-1589261 | Loss: 0.571 | 913 ms/step , 6892.11 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 04:13:43 | Epoch: 1 | Step: 214060 | Dataset: 0-1589581 | Loss: 0.723 | 913 ms/step , 6887.01 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 04:13:52 | Epoch: 1 | Step: 214070 | Dataset: 0-1589901 | Loss: 0.746 | 914 ms/step , 6881.15 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-06 04:14:01 | Epoch: 1 | Step: 214080 | Dataset: 0-1590221 | Loss: 0.652 | 915 ms/step , 6877.11 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 04:14:10 | Epoch: 1 | Step: 214090 | Dataset: 0-1590541 | Loss: 0.601 | 913 ms/step , 6887.43 GFLOP/s , 
17925.8 tokens/s INFO:__main__:2024-11-06 04:14:19 | Epoch: 1 | Step: 214100 | Dataset: 0-1590861 | Loss: 0.716 | 915 ms/step , 6874.35 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-06 04:14:21 | Validation | Step: 214100 | Val_loss: 0.656 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:14:30 | Epoch: 1 | Step: 214110 | Dataset: 0-1591181 | Loss: 0.677 | 913 ms/step , 6891.88 GFLOP/s , 15296.8 tokens/s INFO:__main__:2024-11-06 04:14:39 | Epoch: 1 | Step: 214120 | Dataset: 0-1591501 | Loss: 0.714 | 914 ms/step , 6882.99 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 04:14:48 | Epoch: 1 | Step: 214130 | Dataset: 0-1591821 | Loss: 0.765 | 913 ms/step , 6885.60 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 04:14:57 | Epoch: 1 | Step: 214140 | Dataset: 0-1592141 | Loss: 0.704 | 913 ms/step , 6888.89 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 04:15:06 | Epoch: 1 | Step: 214150 | Dataset: 0-1592461 | Loss: 0.704 | 913 ms/step , 6890.48 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 04:15:15 | Epoch: 1 | Step: 214160 | Dataset: 0-1592781 | Loss: 0.729 | 914 ms/step , 6883.30 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 04:15:25 | Epoch: 1 | Step: 214170 | Dataset: 0-1593101 | Loss: 0.652 | 912 ms/step , 6895.43 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 04:15:34 | Epoch: 1 | Step: 214180 | Dataset: 0-1593421 | Loss: 0.684 | 913 ms/step , 6889.79 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 04:15:43 | Epoch: 1 | Step: 214190 | Dataset: 0-1593741 | Loss: 0.638 | 913 ms/step , 6885.75 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-06 04:15:52 | Epoch: 1 | Step: 214200 | Dataset: 0-1594061 | Loss: 0.685 | 912 ms/step , 6897.70 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 04:15:54 | Validation | Step: 214200 | Val_loss: 0.695 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:16:03 | Epoch: 1 | Step: 214210 | Dataset: 0-1594381 | Loss: 0.697 | 913 ms/step , 6886.73 GFLOP/s , 15267.4 tokens/s 
INFO:__main__:2024-11-06 04:16:12 | Epoch: 1 | Step: 214220 | Dataset: 0-1594701 | Loss: 0.854 | 914 ms/step , 6878.44 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 04:16:21 | Epoch: 1 | Step: 214230 | Dataset: 0-1595021 | Loss: 0.652 | 914 ms/step , 6881.65 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 04:16:30 | Epoch: 1 | Step: 214240 | Dataset: 0-1595341 | Loss: 0.672 | 914 ms/step , 6881.25 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 04:16:39 | Epoch: 1 | Step: 214250 | Dataset: 0-1595661 | Loss: 0.778 | 913 ms/step , 6885.89 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 04:16:48 | Epoch: 1 | Step: 214260 | Dataset: 0-1595981 | Loss: 0.766 | 915 ms/step , 6876.89 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-06 04:16:58 | Epoch: 1 | Step: 214270 | Dataset: 0-1596301 | Loss: 0.717 | 914 ms/step , 6882.23 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 04:17:07 | Epoch: 1 | Step: 214280 | Dataset: 0-1596621 | Loss: 0.801 | 915 ms/step , 6877.29 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-06 04:17:16 | Epoch: 1 | Step: 214290 | Dataset: 0-1596941 | Loss: 0.661 | 913 ms/step , 6890.94 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 04:17:25 | Epoch: 1 | Step: 214300 | Dataset: 0-1597261 | Loss: 0.630 | 914 ms/step , 6881.34 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 04:17:27 | Validation | Step: 214300 | Val_loss: 0.695 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:17:36 | Epoch: 1 | Step: 214310 | Dataset: 0-1597581 | Loss: 0.763 | 914 ms/step , 6882.06 GFLOP/s , 15270.9 tokens/s INFO:__main__:2024-11-06 04:17:45 | Epoch: 1 | Step: 214320 | Dataset: 0-1597901 | Loss: 0.738 | 914 ms/step , 6878.35 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 04:17:54 | Epoch: 1 | Step: 214330 | Dataset: 0-1598221 | Loss: 0.713 | 913 ms/step , 6888.06 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 04:18:03 | Epoch: 1 | Step: 214340 | Dataset: 0-1598541 | Loss: 0.738 | 912 ms/step , 6894.10 GFLOP/s , 17932.1 
tokens/s INFO:__main__:2024-11-06 04:18:12 | Epoch: 1 | Step: 214350 | Dataset: 0-1598861 | Loss: 0.715 | 914 ms/step , 6877.99 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 04:18:21 | Epoch: 1 | Step: 214360 | Dataset: 0-1599181 | Loss: 0.759 | 912 ms/step , 6892.78 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 04:18:31 | Epoch: 1 | Step: 214370 | Dataset: 0-1599501 | Loss: 0.723 | 913 ms/step , 6886.09 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 04:18:40 | Epoch: 1 | Step: 214380 | Dataset: 0-1599821 | Loss: 0.715 | 913 ms/step , 6886.31 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 04:18:49 | Epoch: 1 | Step: 214390 | Dataset: 0-1600141 | Loss: 0.648 | 913 ms/step , 6891.53 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 04:18:58 | Epoch: 1 | Step: 214400 | Dataset: 0-1600461 | Loss: 0.794 | 914 ms/step , 6879.37 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-06 04:19:00 | Validation | Step: 214400 | Val_loss: 0.715 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:19:09 | Epoch: 1 | Step: 214410 | Dataset: 0-1600781 | Loss: 0.725 | 914 ms/step , 6885.00 GFLOP/s , 15272.3 tokens/s INFO:__main__:2024-11-06 04:19:18 | Epoch: 1 | Step: 214420 | Dataset: 0-1601101 | Loss: 0.675 | 913 ms/step , 6885.54 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 04:19:27 | Epoch: 1 | Step: 214430 | Dataset: 0-1601421 | Loss: 0.656 | 913 ms/step , 6885.64 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 04:19:36 | Epoch: 1 | Step: 214440 | Dataset: 0-1601741 | Loss: 0.663 | 912 ms/step , 6895.66 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 04:19:45 | Epoch: 1 | Step: 214450 | Dataset: 0-1602061 | Loss: 0.723 | 913 ms/step , 6891.19 GFLOP/s , 17923.1 tokens/s INFO:__main__:2024-11-06 04:19:54 | Epoch: 1 | Step: 214460 | Dataset: 0-1602381 | Loss: 0.753 | 913 ms/step , 6886.40 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 04:20:04 | Epoch: 1 | Step: 214470 | Dataset: 0-1602701 | Loss: 0.698 | 914 ms/step , 6884.86 GFLOP/s , 
17929.5 tokens/s INFO:__main__:2024-11-06 04:20:13 | Epoch: 1 | Step: 214480 | Dataset: 0-1603021 | Loss: 0.752 | 915 ms/step , 6872.77 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-06 04:20:22 | Epoch: 1 | Step: 214490 | Dataset: 0-1603341 | Loss: 0.688 | 915 ms/step , 6875.93 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-06 04:20:31 | Epoch: 1 | Step: 214500 | Dataset: 0-1603661 | Loss: 0.704 | 913 ms/step , 6886.80 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 04:20:33 | Validation | Step: 214500 | Val_loss: 0.703 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:20:42 | Epoch: 1 | Step: 214510 | Dataset: 0-1603981 | Loss: 0.664 | 914 ms/step , 6879.75 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-06 04:20:51 | Epoch: 1 | Step: 214520 | Dataset: 0-1604301 | Loss: 0.683 | 913 ms/step , 6887.46 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 04:21:00 | Epoch: 1 | Step: 214530 | Dataset: 0-1604621 | Loss: 0.714 | 913 ms/step , 6892.51 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-06 04:21:09 | Epoch: 1 | Step: 214540 | Dataset: 0-1604941 | Loss: 0.741 | 914 ms/step , 6883.51 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 04:21:18 | Epoch: 1 | Step: 214550 | Dataset: 0-1605261 | Loss: 0.763 | 914 ms/step , 6881.36 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-06 04:21:27 | Epoch: 1 | Step: 214560 | Dataset: 0-1605581 | Loss: 0.687 | 913 ms/step , 6886.32 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 04:21:37 | Epoch: 1 | Step: 214570 | Dataset: 0-1605901 | Loss: 0.687 | 914 ms/step , 6882.54 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 04:21:46 | Epoch: 1 | Step: 214580 | Dataset: 0-1606221 | Loss: 0.800 | 912 ms/step , 6893.62 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 04:21:55 | Epoch: 1 | Step: 214590 | Dataset: 0-1606541 | Loss: 0.645 | 915 ms/step , 6874.79 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 04:22:04 | Epoch: 1 | Step: 214600 | Dataset: 0-1606861 | Loss: 0.699 | 913 ms/step , 6887.66 GFLOP/s 
, 17936.1 tokens/s INFO:__main__:2024-11-06 04:22:06 | Validation | Step: 214600 | Val_loss: 0.694 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:22:15 | Epoch: 1 | Step: 214610 | Dataset: 0-1607181 | Loss: 0.875 | 915 ms/step , 6871.55 GFLOP/s , 15259.0 tokens/s INFO:__main__:2024-11-06 04:22:24 | Epoch: 1 | Step: 214620 | Dataset: 0-1607501 | Loss: 0.671 | 913 ms/step , 6891.93 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-06 04:22:33 | Epoch: 1 | Step: 214630 | Dataset: 0-1607821 | Loss: 0.683 | 914 ms/step , 6882.07 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 04:22:42 | Epoch: 1 | Step: 214640 | Dataset: 0-1608141 | Loss: 0.701 | 914 ms/step , 6884.90 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-06 04:22:51 | Epoch: 1 | Step: 214650 | Dataset: 0-1608461 | Loss: 0.748 | 915 ms/step , 6876.63 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-06 04:23:00 | Epoch: 1 | Step: 214660 | Dataset: 0-1608781 | Loss: 0.646 | 912 ms/step , 6896.98 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 04:23:10 | Epoch: 1 | Step: 214670 | Dataset: 0-1609101 | Loss: 0.703 | 914 ms/step , 6879.41 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-06 04:23:19 | Epoch: 1 | Step: 214680 | Dataset: 0-1609421 | Loss: 0.830 | 913 ms/step , 6890.14 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 04:23:28 | Epoch: 1 | Step: 214690 | Dataset: 0-1609741 | Loss: 0.685 | 914 ms/step , 6881.33 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 04:23:37 | Epoch: 1 | Step: 214700 | Dataset: 0-1610061 | Loss: 0.686 | 914 ms/step , 6884.83 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 04:23:39 | Validation | Step: 214700 | Val_loss: 0.728 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:23:48 | Epoch: 1 | Step: 214710 | Dataset: 0-1610381 | Loss: 0.626 | 912 ms/step , 6898.92 GFLOP/s , 15268.8 tokens/s INFO:__main__:2024-11-06 04:23:57 | Epoch: 1 | Step: 214720 | Dataset: 0-1610701 | Loss: 0.682 | 913 ms/step , 6892.49 GFLOP/s , 17927.1 tokens/s 
INFO:__main__:2024-11-06 04:24:06 | Epoch: 1 | Step: 214730 | Dataset: 0-1611021 | Loss: 0.683 | 914 ms/step , 6883.94 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 04:24:15 | Epoch: 1 | Step: 214740 | Dataset: 0-1611341 | Loss: 0.695 | 913 ms/step , 6891.72 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 04:24:24 | Epoch: 1 | Step: 214750 | Dataset: 0-1611661 | Loss: 0.706 | 913 ms/step , 6890.16 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 04:24:33 | Epoch: 1 | Step: 214760 | Dataset: 0-1611981 | Loss: 0.502 | 912 ms/step , 6895.22 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 04:24:42 | Epoch: 1 | Step: 214770 | Dataset: 0-1612301 | Loss: 0.687 | 912 ms/step , 6894.83 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-06 04:24:52 | Epoch: 1 | Step: 214780 | Dataset: 0-1612621 | Loss: 0.689 | 913 ms/step , 6885.42 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 04:25:01 | Epoch: 1 | Step: 214790 | Dataset: 0-1612941 | Loss: 0.680 | 913 ms/step , 6891.93 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 04:25:10 | Epoch: 1 | Step: 214800 | Dataset: 0-1613261 | Loss: 0.694 | 915 ms/step , 6877.49 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 04:25:11 | Validation | Step: 214800 | Val_loss: 0.647 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:25:21 | Epoch: 1 | Step: 214810 | Dataset: 0-1613581 | Loss: 0.711 | 915 ms/step , 6876.38 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-06 04:25:30 | Epoch: 1 | Step: 214820 | Dataset: 0-1613901 | Loss: 0.748 | 914 ms/step , 6882.84 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 04:25:39 | Epoch: 1 | Step: 214830 | Dataset: 0-1614221 | Loss: 0.682 | 914 ms/step , 6881.94 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 04:25:48 | Epoch: 1 | Step: 214840 | Dataset: 0-1614541 | Loss: 0.764 | 914 ms/step , 6877.77 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 04:25:57 | Epoch: 1 | Step: 214850 | Dataset: 0-1614861 | Loss: 0.716 | 913 ms/step , 6885.77 GFLOP/s , 17933.1 
tokens/s INFO:__main__:2024-11-06 04:26:06 | Epoch: 1 | Step: 214860 | Dataset: 0-1615181 | Loss: 0.635 | 914 ms/step , 6883.01 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 04:26:15 | Epoch: 1 | Step: 214870 | Dataset: 0-1615501 | Loss: 0.686 | 913 ms/step , 6886.04 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 04:26:25 | Epoch: 1 | Step: 214880 | Dataset: 0-1615821 | Loss: 0.771 | 914 ms/step , 6884.09 GFLOP/s , 17924.9 tokens/s INFO:__main__:2024-11-06 04:26:34 | Epoch: 1 | Step: 214890 | Dataset: 0-1616141 | Loss: 0.679 | 915 ms/step , 6871.92 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 04:26:43 | Epoch: 1 | Step: 214900 | Dataset: 0-1616461 | Loss: 0.738 | 914 ms/step , 6883.24 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 04:26:44 | Validation | Step: 214900 | Val_loss: 0.743 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:26:54 | Epoch: 1 | Step: 214910 | Dataset: 0-1616781 | Loss: 0.726 | 913 ms/step , 6890.09 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-06 04:27:03 | Epoch: 1 | Step: 214920 | Dataset: 0-1617101 | Loss: 0.738 | 914 ms/step , 6882.05 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 04:27:12 | Epoch: 1 | Step: 214930 | Dataset: 0-1617421 | Loss: 0.675 | 913 ms/step , 6889.61 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 04:27:21 | Epoch: 1 | Step: 214940 | Dataset: 0-1617741 | Loss: 0.461 | 912 ms/step , 6897.48 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 04:27:30 | Epoch: 1 | Step: 214950 | Dataset: 0-1618061 | Loss: 0.745 | 913 ms/step , 6886.01 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 04:27:39 | Epoch: 1 | Step: 214960 | Dataset: 0-1618381 | Loss: 0.774 | 913 ms/step , 6888.33 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 04:27:48 | Epoch: 1 | Step: 214970 | Dataset: 0-1618701 | Loss: 0.752 | 914 ms/step , 6880.33 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 04:27:58 | Epoch: 1 | Step: 214980 | Dataset: 0-1619021 | Loss: 0.711 | 913 ms/step , 6888.66 GFLOP/s , 
17930.2 tokens/s INFO:__main__:2024-11-06 04:28:07 | Epoch: 1 | Step: 214990 | Dataset: 0-1619341 | Loss: 0.626 | 913 ms/step , 6885.78 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-06 04:28:16 | Epoch: 1 | Step: 215000 | Dataset: 0-1619661 | Loss: 0.765 | 913 ms/step , 6889.96 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 04:28:17 | Validation | Step: 215000 | Val_loss: 0.701 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:28:17 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_042817_step_215000.pt` INFO:__main__:2024-11-06 04:28:28 | Epoch: 1 | Step: 215010 | Dataset: 0-1619981 | Loss: 0.716 | 914 ms/step , 6884.77 GFLOP/s , 13793.9 tokens/s INFO:__main__:2024-11-06 04:28:37 | Epoch: 1 | Step: 215020 | Dataset: 0-1620301 | Loss: 0.642 | 913 ms/step , 6892.20 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 04:28:46 | Epoch: 1 | Step: 215030 | Dataset: 0-1620621 | Loss: 0.653 | 913 ms/step , 6888.31 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 04:28:55 | Epoch: 1 | Step: 215040 | Dataset: 0-1620941 | Loss: 0.633 | 913 ms/step , 6891.78 GFLOP/s , 17898.7 tokens/s INFO:__main__:2024-11-06 04:29:04 | Epoch: 1 | Step: 215050 | Dataset: 0-1621261 | Loss: 0.659 | 913 ms/step , 6888.85 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 04:29:13 | Epoch: 1 | Step: 215060 | Dataset: 0-1621581 | Loss: 0.726 | 913 ms/step , 6889.59 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-06 04:29:23 | Epoch: 1 | Step: 215070 | Dataset: 0-1621901 | Loss: 0.666 | 913 ms/step , 6891.21 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 04:29:32 | Epoch: 1 | Step: 215080 | Dataset: 0-1622221 | Loss: 0.706 | 913 ms/step , 6885.65 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 04:29:41 | Epoch: 1 | Step: 215090 | Dataset: 0-1622541 | Loss: 0.732 | 915 ms/step , 6872.87 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 04:29:50 | Epoch: 1 | Step: 215100 | Dataset: 0-1622861 | Loss: 0.735 | 914 ms/step , 6881.79 GFLOP/s , 
17934.1 tokens/s INFO:__main__:2024-11-06 04:29:52 | Validation | Step: 215100 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:30:01 | Epoch: 1 | Step: 215110 | Dataset: 0-1623181 | Loss: 0.567 | 912 ms/step , 6892.75 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-06 04:30:10 | Epoch: 1 | Step: 215120 | Dataset: 0-1623501 | Loss: 0.707 | 913 ms/step , 6888.39 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 04:30:19 | Epoch: 1 | Step: 215130 | Dataset: 0-1623821 | Loss: 0.703 | 912 ms/step , 6897.46 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 04:30:28 | Epoch: 1 | Step: 215140 | Dataset: 0-1624141 | Loss: 0.680 | 913 ms/step , 6888.39 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 04:30:37 | Epoch: 1 | Step: 215150 | Dataset: 0-1624461 | Loss: 0.593 | 913 ms/step , 6891.51 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 04:30:46 | Epoch: 1 | Step: 215160 | Dataset: 0-1624781 | Loss: 0.638 | 913 ms/step , 6886.75 GFLOP/s , 17926.2 tokens/s INFO:__main__:2024-11-06 04:30:56 | Epoch: 1 | Step: 215170 | Dataset: 0-1625101 | Loss: 0.735 | 914 ms/step , 6883.38 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-06 04:31:05 | Epoch: 1 | Step: 215180 | Dataset: 0-1625421 | Loss: 0.663 | 914 ms/step , 6883.88 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 04:31:14 | Epoch: 1 | Step: 215190 | Dataset: 0-1625741 | Loss: 0.681 | 912 ms/step , 6894.28 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-06 04:31:23 | Epoch: 1 | Step: 215200 | Dataset: 0-1626061 | Loss: 0.630 | 913 ms/step , 6892.42 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-06 04:31:25 | Validation | Step: 215200 | Val_loss: 0.714 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:31:34 | Epoch: 1 | Step: 215210 | Dataset: 0-1626381 | Loss: 0.795 | 913 ms/step , 6887.10 GFLOP/s , 15270.8 tokens/s INFO:__main__:2024-11-06 04:31:43 | Epoch: 1 | Step: 215220 | Dataset: 0-1626701 | Loss: 0.641 | 912 ms/step , 6896.80 GFLOP/s , 17936.2 tokens/s 
INFO:__main__:2024-11-06 04:31:52 | Epoch: 1 | Step: 215230 | Dataset: 0-1627021 | Loss: 0.691 | 913 ms/step , 6888.30 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 04:32:01 | Epoch: 1 | Step: 215240 | Dataset: 0-1627341 | Loss: 0.683 | 913 ms/step , 6890.99 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 04:32:10 | Epoch: 1 | Step: 215250 | Dataset: 0-1627661 | Loss: 0.722 | 914 ms/step , 6883.95 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 04:32:19 | Epoch: 1 | Step: 215260 | Dataset: 0-1627981 | Loss: 0.664 | 913 ms/step , 6886.09 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 04:32:28 | Epoch: 1 | Step: 215270 | Dataset: 0-1628301 | Loss: 0.716 | 913 ms/step , 6886.99 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 04:32:38 | Epoch: 1 | Step: 215280 | Dataset: 0-1628621 | Loss: 0.681 | 914 ms/step , 6880.81 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 04:32:47 | Epoch: 1 | Step: 215290 | Dataset: 0-1628941 | Loss: 0.587 | 914 ms/step , 6877.99 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 04:32:56 | Epoch: 1 | Step: 215300 | Dataset: 0-1629261 | Loss: 0.662 | 913 ms/step , 6885.67 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 04:32:57 | Validation | Step: 215300 | Val_loss: 0.678 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:33:07 | Epoch: 1 | Step: 215310 | Dataset: 0-1629581 | Loss: 0.770 | 913 ms/step , 6887.61 GFLOP/s , 15275.8 tokens/s INFO:__main__:2024-11-06 04:33:16 | Epoch: 1 | Step: 215320 | Dataset: 0-1629901 | Loss: 0.637 | 914 ms/step , 6883.98 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 04:33:25 | Epoch: 1 | Step: 215330 | Dataset: 0-1630221 | Loss: 0.633 | 913 ms/step , 6887.99 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 04:33:34 | Epoch: 1 | Step: 215340 | Dataset: 0-1630541 | Loss: 0.749 | 914 ms/step , 6884.09 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 04:33:43 | Epoch: 1 | Step: 215350 | Dataset: 0-1630861 | Loss: 0.750 | 914 ms/step , 6880.34 GFLOP/s , 17935.8 
tokens/s INFO:__main__:2024-11-06 04:33:52 | Epoch: 1 | Step: 215360 | Dataset: 0-1631181 | Loss: 0.692 | 913 ms/step , 6891.30 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 04:34:01 | Epoch: 1 | Step: 215370 | Dataset: 0-1631501 | Loss: 0.646 | 912 ms/step , 6895.84 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 04:34:11 | Epoch: 1 | Step: 215380 | Dataset: 0-1631821 | Loss: 0.668 | 913 ms/step , 6886.94 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 04:34:20 | Epoch: 1 | Step: 215390 | Dataset: 0-1632141 | Loss: 0.593 | 913 ms/step , 6886.70 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 04:34:29 | Epoch: 1 | Step: 215400 | Dataset: 0-1632461 | Loss: 0.587 | 912 ms/step , 6893.41 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 04:34:30 | Validation | Step: 215400 | Val_loss: 0.704 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:34:40 | Epoch: 1 | Step: 215410 | Dataset: 0-1632781 | Loss: 0.772 | 914 ms/step , 6884.78 GFLOP/s , 15264.1 tokens/s INFO:__main__:2024-11-06 04:34:49 | Epoch: 1 | Step: 215420 | Dataset: 0-1633101 | Loss: 0.612 | 913 ms/step , 6885.44 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-06 04:34:58 | Epoch: 1 | Step: 215430 | Dataset: 0-1633421 | Loss: 0.697 | 916 ms/step , 6864.59 GFLOP/s , 17924.7 tokens/s INFO:__main__:2024-11-06 04:35:07 | Epoch: 1 | Step: 215440 | Dataset: 0-1633741 | Loss: 0.660 | 912 ms/step , 6894.69 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 04:35:16 | Epoch: 1 | Step: 215450 | Dataset: 0-1634061 | Loss: 0.790 | 913 ms/step , 6890.75 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 04:35:25 | Epoch: 1 | Step: 215460 | Dataset: 0-1634381 | Loss: 0.794 | 913 ms/step , 6888.23 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 04:35:34 | Epoch: 1 | Step: 215470 | Dataset: 0-1634701 | Loss: 0.666 | 915 ms/step , 6876.27 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 04:35:44 | Epoch: 1 | Step: 215480 | Dataset: 0-1635021 | Loss: 0.773 | 915 ms/step , 6876.04 GFLOP/s , 
17874.1 tokens/s INFO:__main__:2024-11-06 04:35:53 | Epoch: 1 | Step: 215490 | Dataset: 0-1635341 | Loss: 0.575 | 913 ms/step , 6887.50 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-06 04:36:02 | Epoch: 1 | Step: 215500 | Dataset: 0-1635661 | Loss: 0.660 | 912 ms/step , 6893.04 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 04:36:03 | Validation | Step: 215500 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:36:13 | Epoch: 1 | Step: 215510 | Dataset: 0-1635981 | Loss: 0.603 | 913 ms/step , 6887.13 GFLOP/s , 15279.4 tokens/s INFO:__main__:2024-11-06 04:36:22 | Epoch: 1 | Step: 215520 | Dataset: 0-1636301 | Loss: 0.768 | 912 ms/step , 6895.17 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-06 04:36:31 | Epoch: 1 | Step: 215530 | Dataset: 0-1636621 | Loss: 0.516 | 913 ms/step , 6890.70 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-06 04:36:40 | Epoch: 1 | Step: 215540 | Dataset: 0-1636941 | Loss: 0.815 | 915 ms/step , 6873.33 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 04:36:49 | Epoch: 1 | Step: 215550 | Dataset: 0-1637261 | Loss: 0.747 | 914 ms/step , 6884.68 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 04:36:58 | Epoch: 1 | Step: 215560 | Dataset: 0-1637581 | Loss: 0.677 | 913 ms/step , 6888.71 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 04:37:07 | Epoch: 1 | Step: 215570 | Dataset: 0-1637901 | Loss: 0.778 | 915 ms/step , 6875.21 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 04:37:17 | Epoch: 1 | Step: 215580 | Dataset: 0-1638221 | Loss: 0.757 | 914 ms/step , 6877.73 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 04:37:26 | Epoch: 1 | Step: 215590 | Dataset: 0-1638541 | Loss: 0.751 | 914 ms/step , 6880.93 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 04:37:35 | Epoch: 1 | Step: 215600 | Dataset: 0-1638861 | Loss: 0.760 | 914 ms/step , 6883.00 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 04:37:36 | Validation | Step: 215600 | Val_loss: 0.719 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 04:37:46 | Epoch: 1 | Step: 215610 | Dataset: 0-1639181 | Loss: 0.635 | 913 ms/step , 6891.32 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-06 04:37:55 | Epoch: 1 | Step: 215620 | Dataset: 0-1639501 | Loss: 0.704 | 912 ms/step , 6894.62 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 04:38:04 | Epoch: 1 | Step: 215630 | Dataset: 0-1639821 | Loss: 0.687 | 913 ms/step , 6887.29 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 04:38:13 | Epoch: 1 | Step: 215640 | Dataset: 0-1640141 | Loss: 0.639 | 914 ms/step , 6885.03 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 04:38:22 | Epoch: 1 | Step: 215650 | Dataset: 0-1640461 | Loss: 0.770 | 913 ms/step , 6886.80 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 04:38:31 | Epoch: 1 | Step: 215660 | Dataset: 0-1640781 | Loss: 0.725 | 914 ms/step , 6878.06 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 04:38:40 | Epoch: 1 | Step: 215670 | Dataset: 0-1641101 | Loss: 0.744 | 912 ms/step , 6895.95 GFLOP/s , 17945.2 tokens/s INFO:__main__:2024-11-06 04:38:50 | Epoch: 1 | Step: 215680 | Dataset: 0-1641421 | Loss: 0.704 | 913 ms/step , 6887.94 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 04:38:59 | Epoch: 1 | Step: 215690 | Dataset: 0-1641741 | Loss: 0.770 | 913 ms/step , 6891.58 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 04:39:08 | Epoch: 1 | Step: 215700 | Dataset: 0-1642061 | Loss: 0.690 | 914 ms/step , 6880.42 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 04:39:09 | Validation | Step: 215700 | Val_loss: 0.716 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:39:19 | Epoch: 1 | Step: 215710 | Dataset: 0-1642381 | Loss: 0.769 | 913 ms/step , 6888.58 GFLOP/s , 15279.3 tokens/s INFO:__main__:2024-11-06 04:39:28 | Epoch: 1 | Step: 215720 | Dataset: 0-1642701 | Loss: 0.612 | 912 ms/step , 6896.53 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 04:39:37 | Epoch: 1 | Step: 215730 | Dataset: 0-1643021 | Loss: 0.659 | 913 ms/step , 6889.16 GFLOP/s , 17932.1 
tokens/s INFO:__main__:2024-11-06 04:39:46 | Epoch: 1 | Step: 215740 | Dataset: 0-1643341 | Loss: 0.808 | 913 ms/step , 6890.54 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 04:39:55 | Epoch: 1 | Step: 215750 | Dataset: 0-1643661 | Loss: 0.742 | 914 ms/step , 6884.46 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 04:40:04 | Epoch: 1 | Step: 215760 | Dataset: 0-1643981 | Loss: 0.583 | 913 ms/step , 6890.62 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 04:40:13 | Epoch: 1 | Step: 215770 | Dataset: 0-1644301 | Loss: 0.754 | 913 ms/step , 6887.45 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 04:40:22 | Epoch: 1 | Step: 215780 | Dataset: 0-1644621 | Loss: 0.722 | 912 ms/step , 6893.50 GFLOP/s , 17937.6 tokens/s INFO:__main__:2024-11-06 04:40:32 | Epoch: 1 | Step: 215790 | Dataset: 0-1644941 | Loss: 0.661 | 913 ms/step , 6892.56 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 04:40:41 | Epoch: 1 | Step: 215800 | Dataset: 0-1645261 | Loss: 0.787 | 913 ms/step , 6886.36 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 04:40:42 | Validation | Step: 215800 | Val_loss: 0.711 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:40:51 | Epoch: 1 | Step: 215810 | Dataset: 0-1645581 | Loss: 0.714 | 914 ms/step , 6880.33 GFLOP/s , 15280.4 tokens/s INFO:__main__:2024-11-06 04:41:01 | Epoch: 1 | Step: 215820 | Dataset: 0-1645901 | Loss: 0.723 | 914 ms/step , 6881.80 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 04:41:10 | Epoch: 1 | Step: 215830 | Dataset: 0-1646221 | Loss: 0.731 | 913 ms/step , 6886.35 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-06 04:41:19 | Epoch: 1 | Step: 215840 | Dataset: 0-1646541 | Loss: 0.745 | 914 ms/step , 6882.02 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 04:41:28 | Epoch: 1 | Step: 215850 | Dataset: 0-1646861 | Loss: 0.716 | 913 ms/step , 6888.89 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 04:41:37 | Epoch: 1 | Step: 215860 | Dataset: 0-1647181 | Loss: 0.635 | 914 ms/step , 6882.00 GFLOP/s , 
17933.3 tokens/s INFO:__main__:2024-11-06 04:41:46 | Epoch: 1 | Step: 215870 | Dataset: 0-1647501 | Loss: 0.600 | 914 ms/step , 6883.65 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 04:41:55 | Epoch: 1 | Step: 215880 | Dataset: 0-1647821 | Loss: 0.763 | 916 ms/step , 6864.10 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 04:42:05 | Epoch: 1 | Step: 215890 | Dataset: 0-1648141 | Loss: 0.734 | 914 ms/step , 6881.52 GFLOP/s , 17926.4 tokens/s INFO:__main__:2024-11-06 04:42:14 | Epoch: 1 | Step: 215900 | Dataset: 0-1648461 | Loss: 0.753 | 914 ms/step , 6882.03 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 04:42:15 | Validation | Step: 215900 | Val_loss: 0.725 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:42:24 | Epoch: 1 | Step: 215910 | Dataset: 0-1648781 | Loss: 0.682 | 912 ms/step , 6895.91 GFLOP/s , 15284.9 tokens/s INFO:__main__:2024-11-06 04:42:34 | Epoch: 1 | Step: 215920 | Dataset: 0-1649101 | Loss: 0.841 | 912 ms/step , 6894.36 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 04:42:43 | Epoch: 1 | Step: 215930 | Dataset: 0-1649421 | Loss: 0.537 | 911 ms/step , 6903.21 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 04:42:52 | Epoch: 1 | Step: 215940 | Dataset: 0-1649741 | Loss: 0.684 | 912 ms/step , 6893.48 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 04:43:01 | Epoch: 1 | Step: 215950 | Dataset: 0-1650061 | Loss: 0.751 | 914 ms/step , 6882.43 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 04:43:10 | Epoch: 1 | Step: 215960 | Dataset: 0-1650381 | Loss: 0.735 | 913 ms/step , 6887.73 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 04:43:19 | Epoch: 1 | Step: 215970 | Dataset: 0-1650701 | Loss: 0.730 | 913 ms/step , 6890.13 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 04:43:28 | Epoch: 1 | Step: 215980 | Dataset: 0-1651021 | Loss: 0.728 | 913 ms/step , 6885.42 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 04:43:38 | Epoch: 1 | Step: 215990 | Dataset: 0-1651341 | Loss: 0.644 | 912 ms/step , 6893.89 GFLOP/s 
, 17938.1 tokens/s INFO:__main__:2024-11-06 04:43:47 | Epoch: 1 | Step: 216000 | Dataset: 0-1651661 | Loss: 0.865 | 913 ms/step , 6887.17 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 04:43:48 | Validation | Step: 216000 | Val_loss: 0.696 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:43:48 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_044348_step_216000.pt` INFO:__main__:2024-11-06 04:43:59 | Epoch: 1 | Step: 216010 | Dataset: 0-1651981 | Loss: 0.878 | 915 ms/step , 6873.43 GFLOP/s , 13783.2 tokens/s INFO:__main__:2024-11-06 04:44:08 | Epoch: 1 | Step: 216020 | Dataset: 0-1652301 | Loss: 0.621 | 913 ms/step , 6885.32 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 04:44:17 | Epoch: 1 | Step: 216030 | Dataset: 0-1652621 | Loss: 0.673 | 912 ms/step , 6894.24 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 04:44:26 | Epoch: 1 | Step: 216040 | Dataset: 0-1652941 | Loss: 0.799 | 914 ms/step , 6884.99 GFLOP/s , 17898.5 tokens/s INFO:__main__:2024-11-06 04:44:35 | Epoch: 1 | Step: 216050 | Dataset: 0-1653261 | Loss: 0.716 | 914 ms/step , 6878.97 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 04:44:44 | Epoch: 1 | Step: 216060 | Dataset: 0-1653581 | Loss: 0.728 | 913 ms/step , 6887.88 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 04:44:53 | Epoch: 1 | Step: 216070 | Dataset: 0-1653901 | Loss: 0.650 | 913 ms/step , 6890.56 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 04:45:03 | Epoch: 1 | Step: 216080 | Dataset: 0-1654221 | Loss: 0.732 | 912 ms/step , 6898.96 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 04:45:12 | Epoch: 1 | Step: 216090 | Dataset: 0-1654541 | Loss: 0.741 | 913 ms/step , 6885.91 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 04:45:21 | Epoch: 1 | Step: 216100 | Dataset: 0-1654861 | Loss: 0.528 | 912 ms/step , 6895.65 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 04:45:22 | Validation | Step: 216100 | Val_loss: 0.733 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 04:45:32 | Epoch: 1 | Step: 216110 | Dataset: 0-1655181 | Loss: 0.812 | 913 ms/step , 6890.37 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-06 04:45:41 | Epoch: 1 | Step: 216120 | Dataset: 0-1655501 | Loss: 0.606 | 915 ms/step , 6874.46 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 04:45:50 | Epoch: 1 | Step: 216130 | Dataset: 0-1655821 | Loss: 0.649 | 912 ms/step , 6896.18 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 04:45:59 | Epoch: 1 | Step: 216140 | Dataset: 0-1656141 | Loss: 0.751 | 913 ms/step , 6886.19 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 04:46:08 | Epoch: 1 | Step: 216150 | Dataset: 0-1656461 | Loss: 0.731 | 912 ms/step , 6894.33 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 04:46:17 | Epoch: 1 | Step: 216160 | Dataset: 0-1656781 | Loss: 0.702 | 914 ms/step , 6884.37 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 04:46:26 | Epoch: 1 | Step: 216170 | Dataset: 0-1657101 | Loss: 0.744 | 912 ms/step , 6895.05 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 04:46:35 | Epoch: 1 | Step: 216180 | Dataset: 0-1657421 | Loss: 0.778 | 914 ms/step , 6878.08 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 04:46:45 | Epoch: 1 | Step: 216190 | Dataset: 0-1657741 | Loss: 0.749 | 912 ms/step , 6893.20 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 04:46:54 | Epoch: 1 | Step: 216200 | Dataset: 0-1658061 | Loss: 0.720 | 913 ms/step , 6888.26 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 04:46:55 | Validation | Step: 216200 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:47:04 | Epoch: 1 | Step: 216210 | Dataset: 0-1658381 | Loss: 0.691 | 914 ms/step , 6881.87 GFLOP/s , 15291.1 tokens/s INFO:__main__:2024-11-06 04:47:14 | Epoch: 1 | Step: 216220 | Dataset: 0-1658701 | Loss: 0.785 | 915 ms/step , 6877.17 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 04:47:23 | Epoch: 1 | Step: 216230 | Dataset: 0-1659021 | Loss: 0.781 | 913 ms/step , 6890.86 GFLOP/s , 17932.3 
tokens/s INFO:__main__:2024-11-06 04:47:32 | Epoch: 1 | Step: 216240 | Dataset: 0-1659341 | Loss: 0.823 | 913 ms/step , 6891.25 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 04:47:41 | Epoch: 1 | Step: 216250 | Dataset: 0-1659661 | Loss: 0.724 | 912 ms/step , 6893.82 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 04:47:50 | Epoch: 1 | Step: 216260 | Dataset: 0-1659981 | Loss: 0.786 | 914 ms/step , 6882.31 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 04:47:59 | Epoch: 1 | Step: 216270 | Dataset: 0-1660301 | Loss: 0.757 | 912 ms/step , 6898.39 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 04:48:08 | Epoch: 1 | Step: 216280 | Dataset: 0-1660621 | Loss: 0.756 | 913 ms/step , 6888.36 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 04:48:18 | Epoch: 1 | Step: 216290 | Dataset: 0-1660941 | Loss: 0.694 | 914 ms/step , 6881.58 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 04:48:27 | Epoch: 1 | Step: 216300 | Dataset: 0-1661261 | Loss: 0.746 | 913 ms/step , 6886.48 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 04:48:28 | Validation | Step: 216300 | Val_loss: 0.730 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:48:37 | Epoch: 1 | Step: 216310 | Dataset: 0-1661581 | Loss: 0.725 | 913 ms/step , 6892.15 GFLOP/s , 15277.8 tokens/s INFO:__main__:2024-11-06 04:48:47 | Epoch: 1 | Step: 216320 | Dataset: 0-1661901 | Loss: 0.829 | 914 ms/step , 6881.24 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 04:48:56 | Epoch: 1 | Step: 216330 | Dataset: 0-1662221 | Loss: 0.876 | 915 ms/step , 6874.87 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 04:49:05 | Epoch: 1 | Step: 216340 | Dataset: 0-1662541 | Loss: 0.852 | 914 ms/step , 6878.06 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 04:49:14 | Epoch: 1 | Step: 216350 | Dataset: 0-1662861 | Loss: 0.768 | 913 ms/step , 6892.15 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 04:49:23 | Epoch: 1 | Step: 216360 | Dataset: 0-1663181 | Loss: 0.723 | 913 ms/step , 6889.48 GFLOP/s , 
17929.5 tokens/s INFO:__main__:2024-11-06 04:49:32 | Epoch: 1 | Step: 216370 | Dataset: 0-1663501 | Loss: 0.790 | 914 ms/step , 6881.63 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 04:49:41 | Epoch: 1 | Step: 216380 | Dataset: 0-1663821 | Loss: 0.816 | 914 ms/step , 6882.50 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 04:49:51 | Epoch: 1 | Step: 216390 | Dataset: 0-1664141 | Loss: 0.694 | 914 ms/step , 6878.39 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 04:50:00 | Epoch: 1 | Step: 216400 | Dataset: 0-1664461 | Loss: 0.843 | 915 ms/step , 6875.92 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 04:50:01 | Validation | Step: 216400 | Val_loss: 0.708 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:50:10 | Epoch: 1 | Step: 216410 | Dataset: 0-1664781 | Loss: 0.776 | 914 ms/step , 6883.95 GFLOP/s , 15271.2 tokens/s INFO:__main__:2024-11-06 04:50:20 | Epoch: 1 | Step: 216420 | Dataset: 0-1665101 | Loss: 0.703 | 913 ms/step , 6891.73 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 04:50:29 | Epoch: 1 | Step: 216430 | Dataset: 0-1665421 | Loss: 0.571 | 913 ms/step , 6888.06 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 04:50:38 | Epoch: 1 | Step: 216440 | Dataset: 0-1665741 | Loss: 0.776 | 913 ms/step , 6888.38 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 04:50:47 | Epoch: 1 | Step: 216450 | Dataset: 0-1666061 | Loss: 0.710 | 913 ms/step , 6885.63 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 04:50:56 | Epoch: 1 | Step: 216460 | Dataset: 0-1666381 | Loss: 0.665 | 912 ms/step , 6893.30 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 04:51:05 | Epoch: 1 | Step: 216470 | Dataset: 0-1666701 | Loss: 0.660 | 914 ms/step , 6878.06 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 04:51:14 | Epoch: 1 | Step: 216480 | Dataset: 0-1667021 | Loss: 0.773 | 914 ms/step , 6884.84 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 04:51:23 | Epoch: 1 | Step: 216490 | Dataset: 0-1667341 | Loss: 0.821 | 913 ms/step , 6885.23 GFLOP/s 
, 17929.6 tokens/s INFO:__main__:2024-11-06 04:51:33 | Epoch: 1 | Step: 216500 | Dataset: 0-1667661 | Loss: 0.744 | 914 ms/step , 6883.62 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 04:51:34 | Validation | Step: 216500 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:51:43 | Epoch: 1 | Step: 216510 | Dataset: 0-1667981 | Loss: 0.734 | 913 ms/step , 6885.50 GFLOP/s , 15284.8 tokens/s INFO:__main__:2024-11-06 04:51:52 | Epoch: 1 | Step: 216520 | Dataset: 0-1668301 | Loss: 0.818 | 913 ms/step , 6889.72 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 04:52:02 | Epoch: 1 | Step: 216530 | Dataset: 0-1668621 | Loss: 0.671 | 913 ms/step , 6891.94 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 04:52:11 | Epoch: 1 | Step: 216540 | Dataset: 0-1668941 | Loss: 0.744 | 913 ms/step , 6886.05 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 04:52:20 | Epoch: 1 | Step: 216550 | Dataset: 0-1669261 | Loss: 0.655 | 913 ms/step , 6892.26 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-06 04:52:29 | Epoch: 1 | Step: 216560 | Dataset: 0-1669581 | Loss: 0.687 | 915 ms/step , 6875.87 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 04:52:38 | Epoch: 1 | Step: 216570 | Dataset: 0-1669901 | Loss: 0.712 | 912 ms/step , 6898.19 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 04:52:47 | Epoch: 1 | Step: 216580 | Dataset: 0-1670221 | Loss: 0.827 | 913 ms/step , 6890.12 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 04:52:56 | Epoch: 1 | Step: 216590 | Dataset: 0-1670541 | Loss: 0.693 | 912 ms/step , 6893.18 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 04:53:06 | Epoch: 1 | Step: 216600 | Dataset: 0-1670861 | Loss: 0.639 | 912 ms/step , 6892.71 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 04:53:07 | Validation | Step: 216600 | Val_loss: 0.719 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:53:16 | Epoch: 1 | Step: 216610 | Dataset: 0-1671181 | Loss: 0.722 | 913 ms/step , 6887.96 GFLOP/s , 15268.4 tokens/s 
INFO:__main__:2024-11-06 04:53:25 | Epoch: 1 | Step: 216620 | Dataset: 0-1671501 | Loss: 0.809 | 915 ms/step , 6872.85 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 04:53:35 | Epoch: 1 | Step: 216630 | Dataset: 0-1671821 | Loss: 0.726 | 913 ms/step , 6886.53 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 04:53:44 | Epoch: 1 | Step: 216640 | Dataset: 0-1672141 | Loss: 0.580 | 912 ms/step , 6895.66 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-06 04:53:53 | Epoch: 1 | Step: 216650 | Dataset: 0-1672461 | Loss: 0.709 | 912 ms/step , 6896.75 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 04:54:02 | Epoch: 1 | Step: 216660 | Dataset: 0-1672781 | Loss: 0.706 | 912 ms/step , 6896.13 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 04:54:11 | Epoch: 1 | Step: 216670 | Dataset: 0-1673101 | Loss: 0.710 | 914 ms/step , 6881.07 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 04:54:20 | Epoch: 1 | Step: 216680 | Dataset: 0-1673421 | Loss: 0.895 | 914 ms/step , 6879.26 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 04:54:29 | Epoch: 1 | Step: 216690 | Dataset: 0-1673741 | Loss: 0.770 | 915 ms/step , 6873.52 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 04:54:39 | Epoch: 1 | Step: 216700 | Dataset: 0-1674061 | Loss: 0.755 | 915 ms/step , 6874.70 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 04:54:40 | Validation | Step: 216700 | Val_loss: 0.660 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:54:49 | Epoch: 1 | Step: 216710 | Dataset: 0-1674381 | Loss: 0.728 | 913 ms/step , 6886.18 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-06 04:54:58 | Epoch: 1 | Step: 216720 | Dataset: 0-1674701 | Loss: 0.811 | 914 ms/step , 6881.44 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 04:55:08 | Epoch: 1 | Step: 216730 | Dataset: 0-1675021 | Loss: 0.746 | 913 ms/step , 6887.79 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 04:55:17 | Epoch: 1 | Step: 216740 | Dataset: 0-1675341 | Loss: 0.714 | 913 ms/step , 6889.02 GFLOP/s , 17934.4 
tokens/s INFO:__main__:2024-11-06 04:55:26 | Epoch: 1 | Step: 216750 | Dataset: 0-1675661 | Loss: 0.821 | 913 ms/step , 6885.86 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 04:55:35 | Epoch: 1 | Step: 216760 | Dataset: 0-1675981 | Loss: 0.559 | 913 ms/step , 6890.96 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 04:55:44 | Epoch: 1 | Step: 216770 | Dataset: 0-1676301 | Loss: 0.766 | 913 ms/step , 6891.28 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-06 04:55:53 | Epoch: 1 | Step: 216780 | Dataset: 0-1676621 | Loss: 0.650 | 913 ms/step , 6889.68 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 04:56:02 | Epoch: 1 | Step: 216790 | Dataset: 0-1676941 | Loss: 0.735 | 913 ms/step , 6892.09 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 04:56:11 | Epoch: 1 | Step: 216800 | Dataset: 0-1677261 | Loss: 0.687 | 913 ms/step , 6889.66 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-06 04:56:13 | Validation | Step: 216800 | Val_loss: 0.673 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:56:22 | Epoch: 1 | Step: 216810 | Dataset: 0-1677581 | Loss: 0.725 | 914 ms/step , 6884.57 GFLOP/s , 15270.1 tokens/s INFO:__main__:2024-11-06 04:56:31 | Epoch: 1 | Step: 216820 | Dataset: 0-1677901 | Loss: 0.643 | 913 ms/step , 6888.29 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 04:56:40 | Epoch: 1 | Step: 216830 | Dataset: 0-1678221 | Loss: 0.681 | 912 ms/step , 6899.68 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 04:56:50 | Epoch: 1 | Step: 216840 | Dataset: 0-1678541 | Loss: 0.680 | 914 ms/step , 6881.83 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 04:56:59 | Epoch: 1 | Step: 216850 | Dataset: 0-1678861 | Loss: 0.641 | 913 ms/step , 6886.73 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 04:57:08 | Epoch: 1 | Step: 216860 | Dataset: 0-1679181 | Loss: 0.668 | 913 ms/step , 6891.43 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 04:57:17 | Epoch: 1 | Step: 216870 | Dataset: 0-1679501 | Loss: 0.817 | 914 ms/step , 6880.89 GFLOP/s , 
17923.0 tokens/s INFO:__main__:2024-11-06 04:57:26 | Epoch: 1 | Step: 216880 | Dataset: 0-1679821 | Loss: 0.664 | 912 ms/step , 6894.10 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 04:57:35 | Epoch: 1 | Step: 216890 | Dataset: 0-1680141 | Loss: 0.615 | 914 ms/step , 6881.38 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 04:57:44 | Epoch: 1 | Step: 216900 | Dataset: 0-1680461 | Loss: 0.614 | 911 ms/step , 6901.36 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 04:57:46 | Validation | Step: 216900 | Val_loss: 0.703 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:57:55 | Epoch: 1 | Step: 216910 | Dataset: 0-1680781 | Loss: 0.610 | 913 ms/step , 6890.27 GFLOP/s , 15293.2 tokens/s INFO:__main__:2024-11-06 04:58:04 | Epoch: 1 | Step: 216920 | Dataset: 0-1681101 | Loss: 0.717 | 914 ms/step , 6878.69 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 04:58:13 | Epoch: 1 | Step: 216930 | Dataset: 0-1681421 | Loss: 0.829 | 914 ms/step , 6879.51 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 04:58:23 | Epoch: 1 | Step: 216940 | Dataset: 0-1681741 | Loss: 0.736 | 912 ms/step , 6895.65 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 04:58:32 | Epoch: 1 | Step: 216950 | Dataset: 0-1682061 | Loss: 0.646 | 914 ms/step , 6883.24 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 04:58:41 | Epoch: 1 | Step: 216960 | Dataset: 0-1682381 | Loss: 0.529 | 913 ms/step , 6885.87 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 04:58:50 | Epoch: 1 | Step: 216970 | Dataset: 0-1682701 | Loss: 0.623 | 912 ms/step , 6894.92 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 04:58:59 | Epoch: 1 | Step: 216980 | Dataset: 0-1683021 | Loss: 0.670 | 913 ms/step , 6890.83 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-06 04:59:08 | Epoch: 1 | Step: 216990 | Dataset: 0-1683341 | Loss: 0.745 | 913 ms/step , 6888.78 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 04:59:17 | Epoch: 1 | Step: 217000 | Dataset: 0-1683661 | Loss: 0.752 | 913 ms/step , 6886.24 GFLOP/s 
, 17928.8 tokens/s INFO:__main__:2024-11-06 04:59:19 | Validation | Step: 217000 | Val_loss: 0.743 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 04:59:19 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_045919_step_217000.pt` INFO:__main__:2024-11-06 04:59:29 | Epoch: 1 | Step: 217010 | Dataset: 0-1683981 | Loss: 0.569 | 919 ms/step , 6846.16 GFLOP/s , 13778.9 tokens/s INFO:__main__:2024-11-06 04:59:38 | Epoch: 1 | Step: 217020 | Dataset: 0-1684301 | Loss: 0.728 | 914 ms/step , 6879.51 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 04:59:48 | Epoch: 1 | Step: 217030 | Dataset: 0-1684621 | Loss: 0.679 | 912 ms/step , 6898.39 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 04:59:57 | Epoch: 1 | Step: 217040 | Dataset: 0-1684941 | Loss: 0.563 | 912 ms/step , 6895.12 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 05:00:06 | Epoch: 1 | Step: 217050 | Dataset: 0-1685261 | Loss: 0.637 | 913 ms/step , 6887.04 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 05:00:15 | Epoch: 1 | Step: 217060 | Dataset: 0-1685581 | Loss: 0.686 | 913 ms/step , 6892.21 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 05:00:24 | Epoch: 1 | Step: 217070 | Dataset: 0-1685901 | Loss: 0.628 | 914 ms/step , 6883.26 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 05:00:33 | Epoch: 1 | Step: 217080 | Dataset: 0-1686221 | Loss: 0.611 | 913 ms/step , 6891.89 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 05:00:42 | Epoch: 1 | Step: 217090 | Dataset: 0-1686541 | Loss: 0.593 | 913 ms/step , 6888.38 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 05:00:51 | Epoch: 1 | Step: 217100 | Dataset: 0-1686861 | Loss: 0.693 | 914 ms/step , 6881.48 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 05:00:53 | Validation | Step: 217100 | Val_loss: 0.763 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:01:02 | Epoch: 1 | Step: 217110 | Dataset: 0-1687181 | Loss: 0.694 | 913 ms/step , 6890.22 GFLOP/s , 15284.5 tokens/s 
INFO:__main__:2024-11-06 05:01:11 | Epoch: 1 | Step: 217120 | Dataset: 0-1687501 | Loss: 0.771 | 913 ms/step , 6891.29 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 05:01:20 | Epoch: 1 | Step: 217130 | Dataset: 0-1687821 | Loss: 0.728 | 914 ms/step , 6878.67 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 05:01:30 | Epoch: 1 | Step: 217140 | Dataset: 0-1688141 | Loss: 0.678 | 913 ms/step , 6885.57 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 05:01:39 | Epoch: 1 | Step: 217150 | Dataset: 0-1688461 | Loss: 0.694 | 913 ms/step , 6891.36 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 05:01:48 | Epoch: 1 | Step: 217160 | Dataset: 0-1688781 | Loss: 0.708 | 913 ms/step , 6887.98 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 05:01:57 | Epoch: 1 | Step: 217170 | Dataset: 0-1689101 | Loss: 0.741 | 913 ms/step , 6888.41 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 05:02:06 | Epoch: 1 | Step: 217180 | Dataset: 0-1689421 | Loss: 0.679 | 915 ms/step , 6876.92 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 05:02:15 | Epoch: 1 | Step: 217190 | Dataset: 0-1689741 | Loss: 0.808 | 914 ms/step , 6879.76 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 05:02:24 | Epoch: 1 | Step: 217200 | Dataset: 0-1690061 | Loss: 0.704 | 912 ms/step , 6897.97 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 05:02:26 | Validation | Step: 217200 | Val_loss: 0.747 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:02:35 | Epoch: 1 | Step: 217210 | Dataset: 0-1690381 | Loss: 0.591 | 913 ms/step , 6889.34 GFLOP/s , 15272.2 tokens/s INFO:__main__:2024-11-06 05:02:44 | Epoch: 1 | Step: 217220 | Dataset: 0-1690701 | Loss: 0.667 | 914 ms/step , 6877.82 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 05:02:53 | Epoch: 1 | Step: 217230 | Dataset: 0-1691021 | Loss: 0.668 | 914 ms/step , 6881.31 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 05:03:03 | Epoch: 1 | Step: 217240 | Dataset: 0-1691341 | Loss: 0.743 | 913 ms/step , 6885.22 GFLOP/s , 17930.2 
tokens/s INFO:__main__:2024-11-06 05:03:12 | Epoch: 1 | Step: 217250 | Dataset: 0-1691661 | Loss: 0.670 | 915 ms/step , 6871.56 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-06 05:03:21 | Epoch: 1 | Step: 217260 | Dataset: 0-1691981 | Loss: 0.772 | 914 ms/step , 6878.55 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 05:03:30 | Epoch: 1 | Step: 217270 | Dataset: 0-1692301 | Loss: 0.701 | 913 ms/step , 6888.82 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 05:03:39 | Epoch: 1 | Step: 217280 | Dataset: 0-1692621 | Loss: 0.779 | 913 ms/step , 6886.49 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 05:03:48 | Epoch: 1 | Step: 217290 | Dataset: 0-1692941 | Loss: 0.707 | 914 ms/step , 6882.93 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 05:03:57 | Epoch: 1 | Step: 217300 | Dataset: 0-1693261 | Loss: 0.730 | 914 ms/step , 6884.27 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 05:03:59 | Validation | Step: 217300 | Val_loss: 0.740 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:04:08 | Epoch: 1 | Step: 217310 | Dataset: 0-1693581 | Loss: 0.695 | 914 ms/step , 6878.93 GFLOP/s , 15269.6 tokens/s INFO:__main__:2024-11-06 05:04:17 | Epoch: 1 | Step: 217320 | Dataset: 0-1693901 | Loss: 0.736 | 913 ms/step , 6886.87 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-06 05:04:26 | Epoch: 1 | Step: 217330 | Dataset: 0-1694221 | Loss: 0.859 | 914 ms/step , 6883.93 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 05:04:36 | Epoch: 1 | Step: 217340 | Dataset: 0-1694541 | Loss: 0.672 | 912 ms/step , 6893.12 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 05:04:45 | Epoch: 1 | Step: 217350 | Dataset: 0-1694861 | Loss: 0.620 | 913 ms/step , 6890.90 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 05:04:54 | Epoch: 1 | Step: 217360 | Dataset: 0-1695181 | Loss: 0.692 | 911 ms/step , 6900.29 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 05:05:03 | Epoch: 1 | Step: 217370 | Dataset: 0-1695501 | Loss: 0.642 | 913 ms/step , 6890.57 GFLOP/s , 
17925.5 tokens/s INFO:__main__:2024-11-06 05:05:12 | Epoch: 1 | Step: 217380 | Dataset: 0-1695821 | Loss: 0.638 | 913 ms/step , 6890.91 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 05:05:21 | Epoch: 1 | Step: 217390 | Dataset: 0-1696141 | Loss: 0.618 | 913 ms/step , 6890.32 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 05:05:30 | Epoch: 1 | Step: 217400 | Dataset: 0-1696461 | Loss: 0.812 | 914 ms/step , 6884.58 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 05:05:32 | Validation | Step: 217400 | Val_loss: 0.730 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:05:41 | Epoch: 1 | Step: 217410 | Dataset: 0-1696781 | Loss: 0.608 | 914 ms/step , 6882.01 GFLOP/s , 15269.7 tokens/s INFO:__main__:2024-11-06 05:05:50 | Epoch: 1 | Step: 217420 | Dataset: 0-1697101 | Loss: 0.629 | 912 ms/step , 6893.52 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 05:05:59 | Epoch: 1 | Step: 217430 | Dataset: 0-1697421 | Loss: 0.725 | 914 ms/step , 6880.26 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 05:06:08 | Epoch: 1 | Step: 217440 | Dataset: 0-1697741 | Loss: 0.709 | 914 ms/step , 6882.37 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 05:06:18 | Epoch: 1 | Step: 217450 | Dataset: 0-1698061 | Loss: 0.744 | 913 ms/step , 6889.64 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 05:06:27 | Epoch: 1 | Step: 217460 | Dataset: 0-1698381 | Loss: 0.651 | 914 ms/step , 6883.33 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 05:06:36 | Epoch: 1 | Step: 217470 | Dataset: 0-1698701 | Loss: 0.710 | 913 ms/step , 6886.88 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 05:06:45 | Epoch: 1 | Step: 217480 | Dataset: 0-1699021 | Loss: 0.685 | 913 ms/step , 6889.08 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-06 05:06:54 | Epoch: 1 | Step: 217490 | Dataset: 0-1699341 | Loss: 0.735 | 914 ms/step , 6885.03 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 05:07:03 | Epoch: 1 | Step: 217500 | Dataset: 0-1699661 | Loss: 0.745 | 914 ms/step , 6882.31 GFLOP/s 
, 17927.9 tokens/s INFO:__main__:2024-11-06 05:07:05 | Validation | Step: 217500 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:07:14 | Epoch: 1 | Step: 217510 | Dataset: 0-1699981 | Loss: 0.750 | 913 ms/step , 6889.46 GFLOP/s , 15278.6 tokens/s INFO:__main__:2024-11-06 05:07:23 | Epoch: 1 | Step: 217520 | Dataset: 0-1700301 | Loss: 0.632 | 913 ms/step , 6891.09 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 05:07:32 | Epoch: 1 | Step: 217530 | Dataset: 0-1700621 | Loss: 0.709 | 914 ms/step , 6884.94 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 05:07:41 | Epoch: 1 | Step: 217540 | Dataset: 0-1700941 | Loss: 0.676 | 912 ms/step , 6894.81 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 05:07:51 | Epoch: 1 | Step: 217550 | Dataset: 0-1701261 | Loss: 0.704 | 914 ms/step , 6879.30 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 05:08:00 | Epoch: 1 | Step: 217560 | Dataset: 0-1701581 | Loss: 0.744 | 914 ms/step , 6880.18 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 05:08:09 | Epoch: 1 | Step: 217570 | Dataset: 0-1701901 | Loss: 0.620 | 912 ms/step , 6896.26 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 05:08:18 | Epoch: 1 | Step: 217580 | Dataset: 0-1702221 | Loss: 0.716 | 912 ms/step , 6895.85 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 05:08:27 | Epoch: 1 | Step: 217590 | Dataset: 0-1702541 | Loss: 0.712 | 914 ms/step , 6880.60 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 05:08:36 | Epoch: 1 | Step: 217600 | Dataset: 0-1702861 | Loss: 0.675 | 914 ms/step , 6880.72 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 05:08:38 | Validation | Step: 217600 | Val_loss: 0.632 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:08:47 | Epoch: 1 | Step: 217610 | Dataset: 0-1703181 | Loss: 0.753 | 913 ms/step , 6890.50 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-06 05:08:56 | Epoch: 1 | Step: 217620 | Dataset: 0-1703501 | Loss: 0.615 | 915 ms/step , 6875.23 GFLOP/s , 17933.9 tokens/s 
INFO:__main__:2024-11-06 05:09:05 | Epoch: 1 | Step: 217630 | Dataset: 0-1703821 | Loss: 0.549 | 912 ms/step , 6895.17 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 05:09:14 | Epoch: 1 | Step: 217640 | Dataset: 0-1704141 | Loss: 0.664 | 913 ms/step , 6891.01 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 05:09:24 | Epoch: 1 | Step: 217650 | Dataset: 0-1704461 | Loss: 0.698 | 913 ms/step , 6886.38 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 05:09:33 | Epoch: 1 | Step: 217660 | Dataset: 0-1704781 | Loss: 0.742 | 913 ms/step , 6891.58 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 05:09:42 | Epoch: 1 | Step: 217670 | Dataset: 0-1705101 | Loss: 0.792 | 914 ms/step , 6880.25 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 05:09:51 | Epoch: 1 | Step: 217680 | Dataset: 0-1705421 | Loss: 0.720 | 914 ms/step , 6880.68 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 05:10:00 | Epoch: 1 | Step: 217690 | Dataset: 0-1705741 | Loss: 0.793 | 913 ms/step , 6889.90 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 05:10:09 | Epoch: 1 | Step: 217700 | Dataset: 0-1706061 | Loss: 0.622 | 914 ms/step , 6880.51 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-06 05:10:11 | Validation | Step: 217700 | Val_loss: 0.742 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:10:20 | Epoch: 1 | Step: 217710 | Dataset: 0-1706381 | Loss: 0.666 | 913 ms/step , 6888.73 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-06 05:10:29 | Epoch: 1 | Step: 217720 | Dataset: 0-1706701 | Loss: 0.686 | 913 ms/step , 6889.77 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 05:10:38 | Epoch: 1 | Step: 217730 | Dataset: 0-1707021 | Loss: 0.723 | 913 ms/step , 6890.16 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 05:10:47 | Epoch: 1 | Step: 217740 | Dataset: 0-1707341 | Loss: 0.690 | 912 ms/step , 6897.62 GFLOP/s , 17940.0 tokens/s INFO:__main__:2024-11-06 05:10:57 | Epoch: 1 | Step: 217750 | Dataset: 0-1707661 | Loss: 0.770 | 914 ms/step , 6884.89 GFLOP/s , 17934.9 
tokens/s INFO:__main__:2024-11-06 05:11:06 | Epoch: 1 | Step: 217760 | Dataset: 0-1707981 | Loss: 0.691 | 914 ms/step , 6882.88 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-06 05:11:15 | Epoch: 1 | Step: 217770 | Dataset: 0-1708301 | Loss: 0.763 | 913 ms/step , 6885.73 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 05:11:24 | Epoch: 1 | Step: 217780 | Dataset: 0-1708621 | Loss: 0.704 | 913 ms/step , 6890.66 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-06 05:11:33 | Epoch: 1 | Step: 217790 | Dataset: 0-1708941 | Loss: 0.746 | 912 ms/step , 6893.07 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 05:11:42 | Epoch: 1 | Step: 217800 | Dataset: 0-1709261 | Loss: 0.653 | 914 ms/step , 6878.41 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 05:11:44 | Validation | Step: 217800 | Val_loss: 0.721 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:11:53 | Epoch: 1 | Step: 217810 | Dataset: 0-1709581 | Loss: 0.676 | 913 ms/step , 6885.16 GFLOP/s , 15274.6 tokens/s INFO:__main__:2024-11-06 05:12:02 | Epoch: 1 | Step: 217820 | Dataset: 0-1709901 | Loss: 0.676 | 913 ms/step , 6886.13 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 05:12:11 | Epoch: 1 | Step: 217830 | Dataset: 0-1710221 | Loss: 0.719 | 913 ms/step , 6891.30 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 05:12:20 | Epoch: 1 | Step: 217840 | Dataset: 0-1710541 | Loss: 0.677 | 913 ms/step , 6886.77 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 05:12:29 | Epoch: 1 | Step: 217850 | Dataset: 0-1710861 | Loss: 0.636 | 913 ms/step , 6887.86 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 05:12:39 | Epoch: 1 | Step: 217860 | Dataset: 0-1711181 | Loss: 0.640 | 913 ms/step , 6888.04 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 05:12:48 | Epoch: 1 | Step: 217870 | Dataset: 0-1711501 | Loss: 0.704 | 915 ms/step , 6876.79 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 05:12:57 | Epoch: 1 | Step: 217880 | Dataset: 0-1711821 | Loss: 0.615 | 912 ms/step , 6898.67 GFLOP/s , 
17936.5 tokens/s INFO:__main__:2024-11-06 05:13:06 | Epoch: 1 | Step: 217890 | Dataset: 0-1712141 | Loss: 0.632 | 913 ms/step , 6888.95 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-06 05:13:15 | Epoch: 1 | Step: 217900 | Dataset: 0-1712461 | Loss: 0.638 | 913 ms/step , 6887.50 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 05:13:17 | Validation | Step: 217900 | Val_loss: 0.760 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:13:26 | Epoch: 1 | Step: 217910 | Dataset: 0-1712781 | Loss: 0.756 | 912 ms/step , 6894.42 GFLOP/s , 15275.5 tokens/s INFO:__main__:2024-11-06 05:13:35 | Epoch: 1 | Step: 217920 | Dataset: 0-1713101 | Loss: 0.669 | 914 ms/step , 6881.54 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 05:13:44 | Epoch: 1 | Step: 217930 | Dataset: 0-1713421 | Loss: 0.701 | 914 ms/step , 6884.19 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 05:13:53 | Epoch: 1 | Step: 217940 | Dataset: 0-1713741 | Loss: 0.622 | 913 ms/step , 6891.85 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 05:14:02 | Epoch: 1 | Step: 217950 | Dataset: 0-1714061 | Loss: 0.742 | 912 ms/step , 6897.00 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 05:14:12 | Epoch: 1 | Step: 217960 | Dataset: 0-1714381 | Loss: 0.758 | 913 ms/step , 6889.49 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 05:14:21 | Epoch: 1 | Step: 217970 | Dataset: 0-1714701 | Loss: 0.643 | 913 ms/step , 6891.70 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 05:14:30 | Epoch: 1 | Step: 217980 | Dataset: 0-1715021 | Loss: 0.717 | 914 ms/step , 6881.62 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 05:14:39 | Epoch: 1 | Step: 217990 | Dataset: 0-1715341 | Loss: 0.670 | 913 ms/step , 6890.33 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 05:14:48 | Epoch: 1 | Step: 218000 | Dataset: 0-1715661 | Loss: 0.679 | 912 ms/step , 6894.22 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 05:14:50 | Validation | Step: 218000 | Val_loss: 0.699 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 05:14:50 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_051450_step_218000.pt` INFO:__main__:2024-11-06 05:15:00 | Epoch: 1 | Step: 218010 | Dataset: 0-1715981 | Loss: 0.620 | 914 ms/step , 6878.65 GFLOP/s , 13813.8 tokens/s INFO:__main__:2024-11-06 05:15:09 | Epoch: 1 | Step: 218020 | Dataset: 0-1716301 | Loss: 0.748 | 914 ms/step , 6883.69 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 05:15:18 | Epoch: 1 | Step: 218030 | Dataset: 0-1716621 | Loss: 0.741 | 913 ms/step , 6889.39 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 05:15:27 | Epoch: 1 | Step: 218040 | Dataset: 0-1716941 | Loss: 0.736 | 913 ms/step , 6890.94 GFLOP/s , 17905.3 tokens/s INFO:__main__:2024-11-06 05:15:37 | Epoch: 1 | Step: 218050 | Dataset: 0-1717261 | Loss: 0.732 | 913 ms/step , 6888.11 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 05:15:46 | Epoch: 1 | Step: 218060 | Dataset: 0-1717581 | Loss: 0.760 | 915 ms/step , 6877.05 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-06 05:15:55 | Epoch: 1 | Step: 218070 | Dataset: 0-1717901 | Loss: 0.748 | 914 ms/step , 6881.12 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 05:16:04 | Epoch: 1 | Step: 218080 | Dataset: 0-1718221 | Loss: 0.654 | 913 ms/step , 6891.56 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 05:16:13 | Epoch: 1 | Step: 218090 | Dataset: 0-1718541 | Loss: 0.685 | 914 ms/step , 6880.87 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 05:16:22 | Epoch: 1 | Step: 218100 | Dataset: 0-1718861 | Loss: 0.636 | 914 ms/step , 6884.88 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 05:16:24 | Validation | Step: 218100 | Val_loss: 0.737 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:16:33 | Epoch: 1 | Step: 218110 | Dataset: 0-1719181 | Loss: 0.765 | 913 ms/step , 6889.54 GFLOP/s , 15288.3 tokens/s INFO:__main__:2024-11-06 05:16:42 | Epoch: 1 | Step: 218120 | Dataset: 0-1719501 | Loss: 0.759 | 913 ms/step , 6885.88 GFLOP/s , 17928.3 tokens/s 
INFO:__main__:2024-11-06 05:16:51 | Epoch: 1 | Step: 218130 | Dataset: 0-1719821 | Loss: 0.677 | 913 ms/step , 6885.12 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 05:17:00 | Epoch: 1 | Step: 218140 | Dataset: 0-1720141 | Loss: 0.555 | 912 ms/step , 6893.18 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-06 05:17:09 | Epoch: 1 | Step: 218150 | Dataset: 0-1720461 | Loss: 0.626 | 913 ms/step , 6888.03 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 05:17:19 | Epoch: 1 | Step: 218160 | Dataset: 0-1720781 | Loss: 0.648 | 913 ms/step , 6888.17 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 05:17:28 | Epoch: 1 | Step: 218170 | Dataset: 0-1721101 | Loss: 0.655 | 913 ms/step , 6891.69 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 05:17:37 | Epoch: 1 | Step: 218180 | Dataset: 0-1721421 | Loss: 0.686 | 914 ms/step , 6880.36 GFLOP/s , 17927.0 tokens/s INFO:__main__:2024-11-06 05:17:46 | Epoch: 1 | Step: 218190 | Dataset: 0-1721741 | Loss: 0.648 | 914 ms/step , 6882.85 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 05:17:55 | Epoch: 1 | Step: 218200 | Dataset: 0-1722061 | Loss: 0.595 | 912 ms/step , 6896.41 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 05:17:57 | Validation | Step: 218200 | Val_loss: 0.649 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:18:06 | Epoch: 1 | Step: 218210 | Dataset: 0-1722381 | Loss: 0.625 | 913 ms/step , 6886.25 GFLOP/s , 15280.7 tokens/s INFO:__main__:2024-11-06 05:18:15 | Epoch: 1 | Step: 218220 | Dataset: 0-1722701 | Loss: 0.670 | 913 ms/step , 6889.37 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 05:18:24 | Epoch: 1 | Step: 218230 | Dataset: 0-1723021 | Loss: 0.751 | 914 ms/step , 6884.44 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 05:18:33 | Epoch: 1 | Step: 218240 | Dataset: 0-1723341 | Loss: 0.753 | 914 ms/step , 6883.67 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 05:18:42 | Epoch: 1 | Step: 218250 | Dataset: 0-1723661 | Loss: 0.732 | 913 ms/step , 6892.08 GFLOP/s , 17933.7 
tokens/s INFO:__main__:2024-11-06 05:18:52 | Epoch: 1 | Step: 218260 | Dataset: 0-1723981 | Loss: 0.739 | 913 ms/step , 6887.92 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 05:19:01 | Epoch: 1 | Step: 218270 | Dataset: 0-1724301 | Loss: 0.755 | 913 ms/step , 6885.15 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 05:19:10 | Epoch: 1 | Step: 218280 | Dataset: 0-1724621 | Loss: 0.731 | 913 ms/step , 6887.34 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 05:19:19 | Epoch: 1 | Step: 218290 | Dataset: 0-1724941 | Loss: 0.748 | 914 ms/step , 6879.58 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-06 05:19:28 | Epoch: 1 | Step: 218300 | Dataset: 0-1725261 | Loss: 0.744 | 913 ms/step , 6888.20 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 05:19:30 | Validation | Step: 218300 | Val_loss: 0.691 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:19:39 | Epoch: 1 | Step: 218310 | Dataset: 0-1725581 | Loss: 0.657 | 914 ms/step , 6884.35 GFLOP/s , 15286.8 tokens/s INFO:__main__:2024-11-06 05:19:48 | Epoch: 1 | Step: 218320 | Dataset: 0-1725901 | Loss: 0.683 | 913 ms/step , 6889.01 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 05:19:57 | Epoch: 1 | Step: 218330 | Dataset: 0-1726221 | Loss: 0.835 | 912 ms/step , 6895.81 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 05:20:06 | Epoch: 1 | Step: 218340 | Dataset: 0-1726541 | Loss: 0.776 | 912 ms/step , 6894.88 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 05:20:15 | Epoch: 1 | Step: 218350 | Dataset: 0-1726861 | Loss: 0.558 | 912 ms/step , 6893.88 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 05:20:25 | Epoch: 1 | Step: 218360 | Dataset: 0-1727181 | Loss: 0.655 | 913 ms/step , 6891.11 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 05:20:34 | Epoch: 1 | Step: 218370 | Dataset: 0-1727501 | Loss: 0.774 | 914 ms/step , 6884.44 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 05:20:43 | Epoch: 1 | Step: 218380 | Dataset: 0-1727821 | Loss: 0.792 | 915 ms/step , 6870.79 GFLOP/s , 
17931.9 tokens/s INFO:__main__:2024-11-06 05:20:52 | Epoch: 1 | Step: 218390 | Dataset: 0-1728141 | Loss: 0.483 | 913 ms/step , 6888.88 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 05:21:01 | Epoch: 1 | Step: 218400 | Dataset: 0-1728461 | Loss: 0.738 | 913 ms/step , 6891.99 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 05:21:03 | Validation | Step: 218400 | Val_loss: 0.693 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:21:12 | Epoch: 1 | Step: 218410 | Dataset: 0-1728781 | Loss: 0.630 | 913 ms/step , 6888.50 GFLOP/s , 15277.0 tokens/s INFO:__main__:2024-11-06 05:21:21 | Epoch: 1 | Step: 218420 | Dataset: 0-1729101 | Loss: 0.735 | 914 ms/step , 6878.84 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 05:21:30 | Epoch: 1 | Step: 218430 | Dataset: 0-1729421 | Loss: 0.711 | 912 ms/step , 6896.34 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 05:21:39 | Epoch: 1 | Step: 218440 | Dataset: 0-1729741 | Loss: 0.671 | 913 ms/step , 6890.96 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-06 05:21:48 | Epoch: 1 | Step: 218450 | Dataset: 0-1730061 | Loss: 0.849 | 912 ms/step , 6893.62 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-06 05:21:57 | Epoch: 1 | Step: 218460 | Dataset: 0-1730381 | Loss: 0.819 | 913 ms/step , 6890.40 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 05:22:07 | Epoch: 1 | Step: 218470 | Dataset: 0-1730701 | Loss: 0.594 | 913 ms/step , 6891.77 GFLOP/s , 17938.8 tokens/s INFO:__main__:2024-11-06 05:22:16 | Epoch: 1 | Step: 218480 | Dataset: 0-1731021 | Loss: 0.770 | 913 ms/step , 6888.63 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 05:22:25 | Epoch: 1 | Step: 218490 | Dataset: 0-1731341 | Loss: 0.752 | 913 ms/step , 6889.70 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 05:22:34 | Epoch: 1 | Step: 218500 | Dataset: 0-1731661 | Loss: 0.630 | 912 ms/step , 6894.25 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 05:22:36 | Validation | Step: 218500 | Val_loss: 0.695 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 05:22:45 | Epoch: 1 | Step: 218510 | Dataset: 0-1731981 | Loss: 0.773 | 913 ms/step , 6889.45 GFLOP/s , 15278.2 tokens/s INFO:__main__:2024-11-06 05:22:54 | Epoch: 1 | Step: 218520 | Dataset: 0-1732301 | Loss: 0.606 | 913 ms/step , 6889.77 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-06 05:23:03 | Epoch: 1 | Step: 218530 | Dataset: 0-1732621 | Loss: 0.747 | 913 ms/step , 6892.34 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 05:23:12 | Epoch: 1 | Step: 218540 | Dataset: 0-1732941 | Loss: 0.611 | 914 ms/step , 6882.04 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 05:23:21 | Epoch: 1 | Step: 218550 | Dataset: 0-1733261 | Loss: 0.727 | 914 ms/step , 6880.85 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 05:23:30 | Epoch: 1 | Step: 218560 | Dataset: 0-1733581 | Loss: 0.663 | 913 ms/step , 6891.22 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 05:23:40 | Epoch: 1 | Step: 218570 | Dataset: 0-1733901 | Loss: 0.629 | 912 ms/step , 6896.32 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-06 05:23:49 | Epoch: 1 | Step: 218580 | Dataset: 0-1734221 | Loss: 0.636 | 913 ms/step , 6886.87 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 05:23:58 | Epoch: 1 | Step: 218590 | Dataset: 0-1734541 | Loss: 0.743 | 912 ms/step , 6896.23 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 05:24:07 | Epoch: 1 | Step: 218600 | Dataset: 0-1734861 | Loss: 0.667 | 914 ms/step , 6878.36 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 05:24:09 | Validation | Step: 218600 | Val_loss: 0.703 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:24:18 | Epoch: 1 | Step: 218610 | Dataset: 0-1735181 | Loss: 0.724 | 914 ms/step , 6879.70 GFLOP/s , 15278.0 tokens/s INFO:__main__:2024-11-06 05:24:27 | Epoch: 1 | Step: 218620 | Dataset: 0-1735501 | Loss: 0.726 | 914 ms/step , 6881.01 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 05:24:36 | Epoch: 1 | Step: 218630 | Dataset: 0-1735821 | Loss: 0.602 | 912 ms/step , 6893.70 GFLOP/s , 17933.7 
tokens/s INFO:__main__:2024-11-06 05:24:45 | Epoch: 1 | Step: 218640 | Dataset: 0-1736141 | Loss: 0.825 | 912 ms/step , 6894.30 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 05:24:54 | Epoch: 1 | Step: 218650 | Dataset: 0-1736461 | Loss: 0.628 | 912 ms/step , 6893.78 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 05:25:03 | Epoch: 1 | Step: 218660 | Dataset: 0-1736781 | Loss: 0.632 | 913 ms/step , 6892.37 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-06 05:25:12 | Epoch: 1 | Step: 218670 | Dataset: 0-1737101 | Loss: 0.650 | 912 ms/step , 6897.53 GFLOP/s , 17943.4 tokens/s INFO:__main__:2024-11-06 05:25:22 | Epoch: 1 | Step: 218680 | Dataset: 0-1737421 | Loss: 0.770 | 913 ms/step , 6886.15 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-06 05:25:31 | Epoch: 1 | Step: 218690 | Dataset: 0-1737741 | Loss: 0.682 | 913 ms/step , 6887.92 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 05:25:40 | Epoch: 1 | Step: 218700 | Dataset: 0-1738061 | Loss: 0.721 | 914 ms/step , 6882.05 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 05:25:41 | Validation | Step: 218700 | Val_loss: 0.739 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:25:51 | Epoch: 1 | Step: 218710 | Dataset: 0-1738381 | Loss: 0.792 | 913 ms/step , 6890.70 GFLOP/s , 15280.8 tokens/s INFO:__main__:2024-11-06 05:26:00 | Epoch: 1 | Step: 218720 | Dataset: 0-1738701 | Loss: 0.716 | 913 ms/step , 6886.26 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 05:26:09 | Epoch: 1 | Step: 218730 | Dataset: 0-1739021 | Loss: 0.636 | 912 ms/step , 6894.51 GFLOP/s , 17944.6 tokens/s INFO:__main__:2024-11-06 05:26:18 | Epoch: 1 | Step: 218740 | Dataset: 0-1739341 | Loss: 0.643 | 913 ms/step , 6892.15 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 05:26:27 | Epoch: 1 | Step: 218750 | Dataset: 0-1739661 | Loss: 0.740 | 913 ms/step , 6885.59 GFLOP/s , 17943.3 tokens/s INFO:__main__:2024-11-06 05:26:36 | Epoch: 1 | Step: 218760 | Dataset: 0-1739981 | Loss: 0.665 | 912 ms/step , 6894.75 GFLOP/s , 
17941.5 tokens/s INFO:__main__:2024-11-06 05:26:45 | Epoch: 1 | Step: 218770 | Dataset: 0-1740301 | Loss: 0.775 | 912 ms/step , 6895.69 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 05:26:55 | Epoch: 1 | Step: 218780 | Dataset: 0-1740621 | Loss: 0.653 | 912 ms/step , 6892.77 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-06 05:27:04 | Epoch: 1 | Step: 218790 | Dataset: 0-1740941 | Loss: 0.774 | 913 ms/step , 6885.22 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 05:27:13 | Epoch: 1 | Step: 218800 | Dataset: 0-1741261 | Loss: 0.674 | 913 ms/step , 6885.11 GFLOP/s , 17940.3 tokens/s INFO:__main__:2024-11-06 05:27:14 | Validation | Step: 218800 | Val_loss: 0.636 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:27:24 | Epoch: 1 | Step: 218810 | Dataset: 0-1741581 | Loss: 0.652 | 913 ms/step , 6890.41 GFLOP/s , 15277.0 tokens/s INFO:__main__:2024-11-06 05:27:33 | Epoch: 1 | Step: 218820 | Dataset: 0-1741901 | Loss: 0.598 | 913 ms/step , 6887.43 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 05:27:42 | Epoch: 1 | Step: 218830 | Dataset: 0-1742221 | Loss: 0.711 | 912 ms/step , 6893.28 GFLOP/s , 17943.7 tokens/s INFO:__main__:2024-11-06 05:27:51 | Epoch: 1 | Step: 218840 | Dataset: 0-1742541 | Loss: 0.661 | 913 ms/step , 6889.46 GFLOP/s , 17943.0 tokens/s INFO:__main__:2024-11-06 05:28:00 | Epoch: 1 | Step: 218850 | Dataset: 0-1742861 | Loss: 0.668 | 913 ms/step , 6892.43 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 05:28:09 | Epoch: 1 | Step: 218860 | Dataset: 0-1743181 | Loss: 0.664 | 913 ms/step , 6888.42 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 05:28:18 | Epoch: 1 | Step: 218870 | Dataset: 0-1743501 | Loss: 0.704 | 913 ms/step , 6886.65 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-06 05:28:27 | Epoch: 1 | Step: 218880 | Dataset: 0-1743821 | Loss: 0.619 | 915 ms/step , 6872.96 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-06 05:28:37 | Epoch: 1 | Step: 218890 | Dataset: 0-1744141 | Loss: 0.758 | 913 ms/step , 6888.25 GFLOP/s 
, 17943.5 tokens/s INFO:__main__:2024-11-06 05:28:46 | Epoch: 1 | Step: 218900 | Dataset: 0-1744461 | Loss: 0.763 | 912 ms/step , 6894.81 GFLOP/s , 17940.5 tokens/s INFO:__main__:2024-11-06 05:28:47 | Validation | Step: 218900 | Val_loss: 0.737 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:28:56 | Epoch: 1 | Step: 218910 | Dataset: 0-1744781 | Loss: 0.611 | 912 ms/step , 6894.92 GFLOP/s , 15284.2 tokens/s INFO:__main__:2024-11-06 05:29:06 | Epoch: 1 | Step: 218920 | Dataset: 0-1745101 | Loss: 0.721 | 913 ms/step , 6890.87 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 05:29:15 | Epoch: 1 | Step: 218930 | Dataset: 0-1745421 | Loss: 0.839 | 913 ms/step , 6885.30 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 05:29:24 | Epoch: 1 | Step: 218940 | Dataset: 0-1745741 | Loss: 0.706 | 913 ms/step , 6889.97 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 05:29:33 | Epoch: 1 | Step: 218950 | Dataset: 0-1746061 | Loss: 0.696 | 912 ms/step , 6895.31 GFLOP/s , 17948.0 tokens/s INFO:__main__:2024-11-06 05:29:42 | Epoch: 1 | Step: 218960 | Dataset: 0-1746381 | Loss: 0.614 | 912 ms/step , 6897.42 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-06 05:29:51 | Epoch: 1 | Step: 218970 | Dataset: 0-1746701 | Loss: 0.681 | 913 ms/step , 6892.38 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-06 05:30:00 | Epoch: 1 | Step: 218980 | Dataset: 0-1747021 | Loss: 0.658 | 913 ms/step , 6889.40 GFLOP/s , 17951.0 tokens/s INFO:__main__:2024-11-06 05:30:09 | Epoch: 1 | Step: 218990 | Dataset: 0-1747341 | Loss: 0.727 | 912 ms/step , 6895.83 GFLOP/s , 17945.0 tokens/s INFO:__main__:2024-11-06 05:30:19 | Epoch: 1 | Step: 219000 | Dataset: 0-1747661 | Loss: 0.703 | 913 ms/step , 6891.56 GFLOP/s , 17942.5 tokens/s INFO:__main__:2024-11-06 05:30:20 | Validation | Step: 219000 | Val_loss: 0.698 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:30:20 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_053020_step_219000.pt` 
INFO:__main__:2024-11-06 05:30:30 | Epoch: 1 | Step: 219010 | Dataset: 0-1747981 | Loss: 0.827 | 913 ms/step , 6889.19 GFLOP/s , 13823.3 tokens/s INFO:__main__:2024-11-06 05:30:40 | Epoch: 1 | Step: 219020 | Dataset: 0-1748301 | Loss: 0.765 | 913 ms/step , 6887.71 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-06 05:30:49 | Epoch: 1 | Step: 219030 | Dataset: 0-1748621 | Loss: 0.728 | 913 ms/step , 6889.77 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 05:30:58 | Epoch: 1 | Step: 219040 | Dataset: 0-1748941 | Loss: 0.610 | 912 ms/step , 6893.55 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-06 05:31:07 | Epoch: 1 | Step: 219050 | Dataset: 0-1749261 | Loss: 0.591 | 913 ms/step , 6886.27 GFLOP/s , 17948.8 tokens/s INFO:__main__:2024-11-06 05:31:16 | Epoch: 1 | Step: 219060 | Dataset: 0-1749581 | Loss: 0.633 | 912 ms/step , 6898.12 GFLOP/s , 17946.6 tokens/s INFO:__main__:2024-11-06 05:31:25 | Epoch: 1 | Step: 219070 | Dataset: 0-1749901 | Loss: 0.734 | 913 ms/step , 6886.76 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 05:31:34 | Epoch: 1 | Step: 219080 | Dataset: 0-1750221 | Loss: 0.653 | 913 ms/step , 6889.36 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-06 05:31:44 | Epoch: 1 | Step: 219090 | Dataset: 0-1750541 | Loss: 0.747 | 913 ms/step , 6889.79 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 05:31:53 | Epoch: 1 | Step: 219100 | Dataset: 0-1750861 | Loss: 0.893 | 914 ms/step , 6881.92 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 05:31:54 | Validation | Step: 219100 | Val_loss: 0.746 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:32:03 | Epoch: 1 | Step: 219110 | Dataset: 0-1751181 | Loss: 0.686 | 913 ms/step , 6888.06 GFLOP/s , 15276.7 tokens/s INFO:__main__:2024-11-06 05:32:13 | Epoch: 1 | Step: 219120 | Dataset: 0-1751501 | Loss: 0.721 | 912 ms/step , 6894.65 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 05:32:22 | Epoch: 1 | Step: 219130 | Dataset: 0-1751821 | Loss: 0.546 | 912 ms/step , 6894.03 GFLOP/s , 17941.5 
tokens/s INFO:__main__:2024-11-06 05:32:31 | Epoch: 1 | Step: 219140 | Dataset: 0-1752141 | Loss: 0.854 | 915 ms/step , 6872.59 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 05:32:40 | Epoch: 1 | Step: 219150 | Dataset: 0-1752461 | Loss: 0.448 | 911 ms/step , 6904.90 GFLOP/s , 17965.4 tokens/s INFO:__main__:2024-11-06 05:32:49 | Epoch: 1 | Step: 219160 | Dataset: 0-1752781 | Loss: 0.489 | 911 ms/step , 6904.11 GFLOP/s , 17956.7 tokens/s INFO:__main__:2024-11-06 05:32:58 | Epoch: 1 | Step: 219170 | Dataset: 0-1753101 | Loss: 0.467 | 911 ms/step , 6905.54 GFLOP/s , 17959.6 tokens/s INFO:__main__:2024-11-06 05:33:07 | Epoch: 1 | Step: 219180 | Dataset: 0-1753421 | Loss: 0.331 | 912 ms/step , 6898.41 GFLOP/s , 17958.9 tokens/s INFO:__main__:2024-11-06 05:33:16 | Epoch: 1 | Step: 219190 | Dataset: 0-1753741 | Loss: 0.484 | 912 ms/step , 6898.85 GFLOP/s , 17960.8 tokens/s INFO:__main__:2024-11-06 05:33:26 | Epoch: 1 | Step: 219200 | Dataset: 0-1754061 | Loss: 0.497 | 911 ms/step , 6900.80 GFLOP/s , 17960.5 tokens/s INFO:__main__:2024-11-06 05:33:27 | Validation | Step: 219200 | Val_loss: 0.747 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:33:36 | Epoch: 1 | Step: 219210 | Dataset: 0-1754381 | Loss: 0.507 | 912 ms/step , 6897.38 GFLOP/s , 15292.4 tokens/s INFO:__main__:2024-11-06 05:33:45 | Epoch: 1 | Step: 219220 | Dataset: 0-1754701 | Loss: 0.314 | 912 ms/step , 6896.77 GFLOP/s , 17948.4 tokens/s INFO:__main__:2024-11-06 05:33:55 | Epoch: 1 | Step: 219230 | Dataset: 0-1755021 | Loss: 0.466 | 911 ms/step , 6901.14 GFLOP/s , 17961.3 tokens/s INFO:__main__:2024-11-06 05:34:04 | Epoch: 1 | Step: 219240 | Dataset: 0-1755341 | Loss: 0.386 | 912 ms/step , 6898.54 GFLOP/s , 17960.9 tokens/s INFO:__main__:2024-11-06 05:34:13 | Epoch: 1 | Step: 219250 | Dataset: 0-1755661 | Loss: 0.496 | 912 ms/step , 6893.93 GFLOP/s , 17953.1 tokens/s INFO:__main__:2024-11-06 05:34:22 | Epoch: 1 | Step: 219260 | Dataset: 0-1755981 | Loss: 0.371 | 912 ms/step , 6898.91 GFLOP/s , 
17955.6 tokens/s INFO:__main__:2024-11-06 05:34:31 | Epoch: 1 | Step: 219270 | Dataset: 0-1756301 | Loss: 0.315 | 912 ms/step , 6895.98 GFLOP/s , 17960.8 tokens/s INFO:__main__:2024-11-06 05:34:40 | Epoch: 1 | Step: 219280 | Dataset: 0-1756621 | Loss: 0.394 | 914 ms/step , 6883.74 GFLOP/s , 17954.3 tokens/s INFO:__main__:2024-11-06 05:34:49 | Epoch: 1 | Step: 219290 | Dataset: 0-1756941 | Loss: 0.489 | 911 ms/step , 6900.27 GFLOP/s , 17966.9 tokens/s INFO:__main__:2024-11-06 05:34:58 | Epoch: 1 | Step: 219300 | Dataset: 0-1757261 | Loss: 0.342 | 912 ms/step , 6896.78 GFLOP/s , 17954.3 tokens/s INFO:__main__:2024-11-06 05:35:00 | Validation | Step: 219300 | Val_loss: 0.633 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:35:09 | Epoch: 1 | Step: 219310 | Dataset: 0-1757581 | Loss: 0.448 | 912 ms/step , 6898.99 GFLOP/s , 15302.9 tokens/s INFO:__main__:2024-11-06 05:35:18 | Epoch: 1 | Step: 219320 | Dataset: 0-1757901 | Loss: 0.400 | 911 ms/step , 6902.88 GFLOP/s , 17961.4 tokens/s INFO:__main__:2024-11-06 05:35:27 | Epoch: 1 | Step: 219330 | Dataset: 0-1758221 | Loss: 0.279 | 911 ms/step , 6902.51 GFLOP/s , 17952.1 tokens/s INFO:__main__:2024-11-06 05:35:36 | Epoch: 1 | Step: 219340 | Dataset: 0-1758541 | Loss: 0.474 | 912 ms/step , 6898.13 GFLOP/s , 17957.2 tokens/s INFO:__main__:2024-11-06 05:35:46 | Epoch: 1 | Step: 219350 | Dataset: 0-1758861 | Loss: 0.477 | 912 ms/step , 6893.31 GFLOP/s , 17951.2 tokens/s INFO:__main__:2024-11-06 05:35:55 | Epoch: 1 | Step: 219360 | Dataset: 0-1759181 | Loss: 0.544 | 913 ms/step , 6887.89 GFLOP/s , 17952.3 tokens/s INFO:__main__:2024-11-06 05:36:04 | Epoch: 1 | Step: 219370 | Dataset: 0-1759501 | Loss: 0.360 | 911 ms/step , 6902.07 GFLOP/s , 17965.6 tokens/s INFO:__main__:2024-11-06 05:36:13 | Epoch: 1 | Step: 219380 | Dataset: 0-1759821 | Loss: 0.296 | 911 ms/step , 6903.66 GFLOP/s , 17963.1 tokens/s INFO:__main__:2024-11-06 05:36:22 | Epoch: 1 | Step: 219390 | Dataset: 0-1760141 | Loss: 0.437 | 912 ms/step , 6896.93 GFLOP/s 
, 17955.1 tokens/s INFO:__main__:2024-11-06 05:36:31 | Epoch: 1 | Step: 219400 | Dataset: 0-1760461 | Loss: 0.375 | 912 ms/step , 6896.23 GFLOP/s , 17958.0 tokens/s INFO:__main__:2024-11-06 05:36:33 | Validation | Step: 219400 | Val_loss: 0.740 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:36:42 | Epoch: 1 | Step: 219410 | Dataset: 0-1760781 | Loss: 0.434 | 912 ms/step , 6897.04 GFLOP/s , 15295.4 tokens/s INFO:__main__:2024-11-06 05:36:51 | Epoch: 1 | Step: 219420 | Dataset: 0-1761101 | Loss: 0.422 | 912 ms/step , 6897.84 GFLOP/s , 17957.4 tokens/s INFO:__main__:2024-11-06 05:37:00 | Epoch: 1 | Step: 219430 | Dataset: 0-1761421 | Loss: 0.492 | 912 ms/step , 6893.14 GFLOP/s , 17949.8 tokens/s INFO:__main__:2024-11-06 05:37:09 | Epoch: 1 | Step: 219440 | Dataset: 0-1761741 | Loss: 0.667 | 914 ms/step , 6884.75 GFLOP/s , 17952.8 tokens/s INFO:__main__:2024-11-06 05:37:18 | Epoch: 1 | Step: 219450 | Dataset: 0-1762061 | Loss: 0.666 | 913 ms/step , 6888.17 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 05:37:28 | Epoch: 1 | Step: 219460 | Dataset: 0-1762381 | Loss: 0.692 | 913 ms/step , 6889.66 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 05:37:37 | Epoch: 1 | Step: 219470 | Dataset: 0-1762701 | Loss: 0.705 | 913 ms/step , 6891.73 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 05:37:46 | Epoch: 1 | Step: 219480 | Dataset: 0-1763021 | Loss: 0.775 | 913 ms/step , 6886.83 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 05:37:55 | Epoch: 1 | Step: 219490 | Dataset: 0-1763341 | Loss: 0.703 | 914 ms/step , 6884.59 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 05:38:04 | Epoch: 1 | Step: 219500 | Dataset: 0-1763661 | Loss: 0.799 | 913 ms/step , 6891.37 GFLOP/s , 17929.1 tokens/s INFO:__main__:2024-11-06 05:38:06 | Validation | Step: 219500 | Val_loss: 0.671 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:38:15 | Epoch: 1 | Step: 219510 | Dataset: 0-1763981 | Loss: 0.602 | 912 ms/step , 6893.51 GFLOP/s , 15274.4 tokens/s 
INFO:__main__:2024-11-06 05:38:24 | Epoch: 1 | Step: 219520 | Dataset: 0-1764301 | Loss: 0.735 | 914 ms/step , 6884.18 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 05:38:33 | Epoch: 1 | Step: 219530 | Dataset: 0-1764621 | Loss: 0.720 | 914 ms/step , 6880.62 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 05:38:42 | Epoch: 1 | Step: 219540 | Dataset: 0-1764941 | Loss: 0.609 | 912 ms/step , 6893.20 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 05:38:51 | Epoch: 1 | Step: 219550 | Dataset: 0-1765261 | Loss: 0.689 | 915 ms/step , 6874.18 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 05:39:01 | Epoch: 1 | Step: 219560 | Dataset: 0-1765581 | Loss: 0.547 | 911 ms/step , 6901.61 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 05:39:10 | Epoch: 1 | Step: 219570 | Dataset: 0-1765901 | Loss: 0.762 | 915 ms/step , 6877.48 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 05:39:19 | Epoch: 1 | Step: 219580 | Dataset: 0-1766221 | Loss: 0.588 | 913 ms/step , 6892.40 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 05:39:28 | Epoch: 1 | Step: 219590 | Dataset: 0-1766541 | Loss: 0.734 | 913 ms/step , 6886.09 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 05:39:37 | Epoch: 1 | Step: 219600 | Dataset: 0-1766861 | Loss: 0.628 | 914 ms/step , 6884.54 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 05:39:39 | Validation | Step: 219600 | Val_loss: 0.603 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:39:48 | Epoch: 1 | Step: 219610 | Dataset: 0-1767181 | Loss: 0.696 | 914 ms/step , 6884.30 GFLOP/s , 15273.6 tokens/s INFO:__main__:2024-11-06 05:39:57 | Epoch: 1 | Step: 219620 | Dataset: 0-1767501 | Loss: 0.658 | 913 ms/step , 6888.00 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 05:40:06 | Epoch: 1 | Step: 219630 | Dataset: 0-1767821 | Loss: 0.766 | 914 ms/step , 6882.12 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 05:40:15 | Epoch: 1 | Step: 219640 | Dataset: 0-1768141 | Loss: 0.495 | 914 ms/step , 6879.42 GFLOP/s , 17930.1 
tokens/s INFO:__main__:2024-11-06 05:40:24 | Epoch: 1 | Step: 219650 | Dataset: 0-1768461 | Loss: 0.694 | 912 ms/step , 6894.46 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 05:40:33 | Epoch: 1 | Step: 219660 | Dataset: 0-1768781 | Loss: 0.667 | 913 ms/step , 6890.87 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 05:40:43 | Epoch: 1 | Step: 219670 | Dataset: 0-1769101 | Loss: 0.787 | 913 ms/step , 6886.55 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 05:40:52 | Epoch: 1 | Step: 219680 | Dataset: 0-1769421 | Loss: 0.641 | 914 ms/step , 6879.84 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 05:41:01 | Epoch: 1 | Step: 219690 | Dataset: 0-1769741 | Loss: 0.670 | 914 ms/step , 6882.10 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 05:41:10 | Epoch: 1 | Step: 219700 | Dataset: 0-1770061 | Loss: 0.591 | 914 ms/step , 6878.71 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 05:41:12 | Validation | Step: 219700 | Val_loss: 0.665 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:41:21 | Epoch: 1 | Step: 219710 | Dataset: 0-1770381 | Loss: 0.775 | 912 ms/step , 6892.97 GFLOP/s , 15286.7 tokens/s INFO:__main__:2024-11-06 05:41:30 | Epoch: 1 | Step: 219720 | Dataset: 0-1770701 | Loss: 0.674 | 914 ms/step , 6880.37 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-06 05:41:39 | Epoch: 1 | Step: 219730 | Dataset: 0-1771021 | Loss: 0.351 | 912 ms/step , 6897.96 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 05:41:48 | Epoch: 1 | Step: 219740 | Dataset: 0-1771341 | Loss: 0.818 | 914 ms/step , 6879.70 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 05:41:57 | Epoch: 1 | Step: 219750 | Dataset: 0-1771661 | Loss: 0.728 | 915 ms/step , 6872.12 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 05:42:06 | Epoch: 1 | Step: 219760 | Dataset: 0-1771981 | Loss: 0.685 | 913 ms/step , 6891.28 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 05:42:16 | Epoch: 1 | Step: 219770 | Dataset: 0-1772301 | Loss: 0.814 | 913 ms/step , 6892.34 GFLOP/s , 
17931.7 tokens/s INFO:__main__:2024-11-06 05:42:25 | Epoch: 1 | Step: 219780 | Dataset: 0-1772621 | Loss: 0.702 | 913 ms/step , 6888.86 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 05:42:34 | Epoch: 1 | Step: 219790 | Dataset: 0-1772941 | Loss: 0.731 | 913 ms/step , 6886.16 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 05:42:43 | Epoch: 1 | Step: 219800 | Dataset: 0-1773261 | Loss: 0.669 | 913 ms/step , 6890.98 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 05:42:45 | Validation | Step: 219800 | Val_loss: 0.672 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:42:54 | Epoch: 1 | Step: 219810 | Dataset: 0-1773581 | Loss: 0.716 | 913 ms/step , 6887.69 GFLOP/s , 15275.4 tokens/s INFO:__main__:2024-11-06 05:43:03 | Epoch: 1 | Step: 219820 | Dataset: 0-1773901 | Loss: 0.766 | 913 ms/step , 6888.61 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 05:43:12 | Epoch: 1 | Step: 219830 | Dataset: 0-1774221 | Loss: 0.575 | 913 ms/step , 6885.16 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 05:43:21 | Epoch: 1 | Step: 219840 | Dataset: 0-1774541 | Loss: 0.615 | 913 ms/step , 6891.26 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 05:43:30 | Epoch: 1 | Step: 219850 | Dataset: 0-1774861 | Loss: 0.716 | 915 ms/step , 6874.75 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 05:43:39 | Epoch: 1 | Step: 219860 | Dataset: 0-1775181 | Loss: 0.702 | 914 ms/step , 6881.19 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 05:43:49 | Epoch: 1 | Step: 219870 | Dataset: 0-1775501 | Loss: 0.612 | 912 ms/step , 6892.72 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 05:43:58 | Epoch: 1 | Step: 219880 | Dataset: 0-1775821 | Loss: 0.626 | 913 ms/step , 6887.56 GFLOP/s , 17920.5 tokens/s INFO:__main__:2024-11-06 05:44:07 | Epoch: 1 | Step: 219890 | Dataset: 0-1776141 | Loss: 0.699 | 913 ms/step , 6889.78 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 05:44:16 | Epoch: 1 | Step: 219900 | Dataset: 0-1776461 | Loss: 0.738 | 913 ms/step , 6885.52 GFLOP/s 
, 17925.3 tokens/s INFO:__main__:2024-11-06 05:44:18 | Validation | Step: 219900 | Val_loss: 0.752 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:44:27 | Epoch: 1 | Step: 219910 | Dataset: 0-1776781 | Loss: 0.614 | 913 ms/step , 6887.91 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-06 05:44:36 | Epoch: 1 | Step: 219920 | Dataset: 0-1777101 | Loss: 0.696 | 913 ms/step , 6890.94 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 05:44:45 | Epoch: 1 | Step: 219930 | Dataset: 0-1777421 | Loss: 0.733 | 914 ms/step , 6884.06 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 05:44:54 | Epoch: 1 | Step: 219940 | Dataset: 0-1777741 | Loss: 0.616 | 915 ms/step , 6870.03 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 05:45:03 | Epoch: 1 | Step: 219950 | Dataset: 0-1778061 | Loss: 0.697 | 912 ms/step , 6894.40 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 05:45:12 | Epoch: 1 | Step: 219960 | Dataset: 0-1778381 | Loss: 0.684 | 912 ms/step , 6893.79 GFLOP/s , 17942.4 tokens/s INFO:__main__:2024-11-06 05:45:22 | Epoch: 1 | Step: 219970 | Dataset: 0-1778701 | Loss: 0.707 | 913 ms/step , 6890.17 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 05:45:31 | Epoch: 1 | Step: 219980 | Dataset: 0-1779021 | Loss: 0.781 | 913 ms/step , 6892.43 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 05:45:40 | Epoch: 1 | Step: 219990 | Dataset: 0-1779341 | Loss: 0.602 | 913 ms/step , 6887.58 GFLOP/s , 17937.3 tokens/s INFO:__main__:2024-11-06 05:45:49 | Epoch: 1 | Step: 220000 | Dataset: 0-1779661 | Loss: 0.716 | 915 ms/step , 6874.17 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 05:45:50 | Validation | Step: 220000 | Val_loss: 0.675 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:45:50 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_054550_step_220000.pt` INFO:__main__:2024-11-06 05:46:01 | Epoch: 1 | Step: 220010 | Dataset: 0-1779981 | Loss: 0.628 | 912 ms/step , 6898.43 GFLOP/s , 13775.2 tokens/s 
INFO:__main__:2024-11-06 05:46:10 | Epoch: 1 | Step: 220020 | Dataset: 0-1780301 | Loss: 0.641 | 914 ms/step , 6881.50 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 05:46:19 | Epoch: 1 | Step: 220030 | Dataset: 0-1780621 | Loss: 0.679 | 913 ms/step , 6886.62 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 05:46:28 | Epoch: 1 | Step: 220040 | Dataset: 0-1780941 | Loss: 0.677 | 914 ms/step , 6878.14 GFLOP/s , 17922.9 tokens/s INFO:__main__:2024-11-06 05:46:37 | Epoch: 1 | Step: 220050 | Dataset: 0-1781261 | Loss: 0.776 | 913 ms/step , 6889.78 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 05:46:46 | Epoch: 1 | Step: 220060 | Dataset: 0-1781581 | Loss: 0.709 | 914 ms/step , 6880.18 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 05:46:56 | Epoch: 1 | Step: 220070 | Dataset: 0-1781901 | Loss: 0.766 | 913 ms/step , 6888.73 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 05:47:05 | Epoch: 1 | Step: 220080 | Dataset: 0-1782221 | Loss: 0.603 | 914 ms/step , 6880.56 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 05:47:14 | Epoch: 1 | Step: 220090 | Dataset: 0-1782541 | Loss: 0.748 | 914 ms/step , 6884.06 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 05:47:23 | Epoch: 1 | Step: 220100 | Dataset: 0-1782861 | Loss: 0.637 | 913 ms/step , 6887.07 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 05:47:25 | Validation | Step: 220100 | Val_loss: 0.647 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:47:34 | Epoch: 1 | Step: 220110 | Dataset: 0-1783181 | Loss: 0.698 | 913 ms/step , 6887.61 GFLOP/s , 15282.4 tokens/s INFO:__main__:2024-11-06 05:47:43 | Epoch: 1 | Step: 220120 | Dataset: 0-1783501 | Loss: 0.655 | 912 ms/step , 6895.06 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 05:47:52 | Epoch: 1 | Step: 220130 | Dataset: 0-1783821 | Loss: 0.802 | 914 ms/step , 6882.50 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 05:48:01 | Epoch: 1 | Step: 220140 | Dataset: 0-1784141 | Loss: 0.693 | 912 ms/step , 6894.55 GFLOP/s , 17924.2 
tokens/s INFO:__main__:2024-11-06 05:48:10 | Epoch: 1 | Step: 220150 | Dataset: 0-1784461 | Loss: 0.692 | 914 ms/step , 6880.04 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 05:48:19 | Epoch: 1 | Step: 220160 | Dataset: 0-1784781 | Loss: 0.754 | 913 ms/step , 6889.27 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 05:48:29 | Epoch: 1 | Step: 220170 | Dataset: 0-1785101 | Loss: 0.728 | 913 ms/step , 6889.93 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 05:48:38 | Epoch: 1 | Step: 220180 | Dataset: 0-1785421 | Loss: 0.694 | 915 ms/step , 6877.46 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 05:48:47 | Epoch: 1 | Step: 220190 | Dataset: 0-1785741 | Loss: 0.601 | 914 ms/step , 6884.90 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 05:48:56 | Epoch: 1 | Step: 220200 | Dataset: 0-1786061 | Loss: 0.750 | 913 ms/step , 6890.97 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 05:48:58 | Validation | Step: 220200 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:49:07 | Epoch: 1 | Step: 220210 | Dataset: 0-1786381 | Loss: 0.742 | 913 ms/step , 6886.31 GFLOP/s , 15273.7 tokens/s INFO:__main__:2024-11-06 05:49:16 | Epoch: 1 | Step: 220220 | Dataset: 0-1786701 | Loss: 0.505 | 912 ms/step , 6894.83 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 05:49:25 | Epoch: 1 | Step: 220230 | Dataset: 0-1787021 | Loss: 0.695 | 913 ms/step , 6890.57 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 05:49:34 | Epoch: 1 | Step: 220240 | Dataset: 0-1787341 | Loss: 0.631 | 913 ms/step , 6888.30 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 05:49:43 | Epoch: 1 | Step: 220250 | Dataset: 0-1787661 | Loss: 0.712 | 915 ms/step , 6875.16 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 05:49:52 | Epoch: 1 | Step: 220260 | Dataset: 0-1787981 | Loss: 0.684 | 913 ms/step , 6892.20 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 05:50:02 | Epoch: 1 | Step: 220270 | Dataset: 0-1788301 | Loss: 0.593 | 914 ms/step , 6878.99 GFLOP/s , 
17933.4 tokens/s INFO:__main__:2024-11-06 05:50:11 | Epoch: 1 | Step: 220280 | Dataset: 0-1788621 | Loss: 0.593 | 913 ms/step , 6888.13 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 05:50:20 | Epoch: 1 | Step: 220290 | Dataset: 0-1788941 | Loss: 0.724 | 912 ms/step , 6894.48 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 05:50:29 | Epoch: 1 | Step: 220300 | Dataset: 0-1789261 | Loss: 0.661 | 914 ms/step , 6884.61 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 05:50:31 | Validation | Step: 220300 | Val_loss: 0.675 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:50:40 | Epoch: 1 | Step: 220310 | Dataset: 0-1789581 | Loss: 0.723 | 913 ms/step , 6891.00 GFLOP/s , 15288.1 tokens/s INFO:__main__:2024-11-06 05:50:49 | Epoch: 1 | Step: 220320 | Dataset: 0-1789901 | Loss: 0.551 | 912 ms/step , 6896.32 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 05:50:58 | Epoch: 1 | Step: 220330 | Dataset: 0-1790221 | Loss: 0.821 | 913 ms/step , 6886.95 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 05:51:07 | Epoch: 1 | Step: 220340 | Dataset: 0-1790541 | Loss: 0.680 | 914 ms/step , 6882.35 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 05:51:16 | Epoch: 1 | Step: 220350 | Dataset: 0-1790861 | Loss: 0.693 | 913 ms/step , 6891.59 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 05:51:25 | Epoch: 1 | Step: 220360 | Dataset: 0-1791181 | Loss: 0.606 | 912 ms/step , 6898.79 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-06 05:51:35 | Epoch: 1 | Step: 220370 | Dataset: 0-1791501 | Loss: 0.750 | 914 ms/step , 6883.50 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 05:51:44 | Epoch: 1 | Step: 220380 | Dataset: 0-1791821 | Loss: 0.665 | 912 ms/step , 6895.92 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 05:51:53 | Epoch: 1 | Step: 220390 | Dataset: 0-1792141 | Loss: 0.690 | 913 ms/step , 6887.04 GFLOP/s , 17942.3 tokens/s INFO:__main__:2024-11-06 05:52:02 | Epoch: 1 | Step: 220400 | Dataset: 0-1792461 | Loss: 0.606 | 913 ms/step , 6888.03 GFLOP/s 
, 17942.3 tokens/s INFO:__main__:2024-11-06 05:52:03 | Validation | Step: 220400 | Val_loss: 0.694 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:52:13 | Epoch: 1 | Step: 220410 | Dataset: 0-1792781 | Loss: 0.771 | 914 ms/step , 6884.84 GFLOP/s , 15275.7 tokens/s INFO:__main__:2024-11-06 05:52:22 | Epoch: 1 | Step: 220420 | Dataset: 0-1793101 | Loss: 0.652 | 913 ms/step , 6885.89 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 05:52:31 | Epoch: 1 | Step: 220430 | Dataset: 0-1793421 | Loss: 0.712 | 914 ms/step , 6882.10 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 05:52:40 | Epoch: 1 | Step: 220440 | Dataset: 0-1793741 | Loss: 0.586 | 913 ms/step , 6892.48 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-06 05:52:49 | Epoch: 1 | Step: 220450 | Dataset: 0-1794061 | Loss: 0.653 | 913 ms/step , 6892.07 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-06 05:52:58 | Epoch: 1 | Step: 220460 | Dataset: 0-1794381 | Loss: 0.725 | 913 ms/step , 6890.14 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 05:53:07 | Epoch: 1 | Step: 220470 | Dataset: 0-1794701 | Loss: 0.724 | 914 ms/step , 6879.64 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 05:53:17 | Epoch: 1 | Step: 220480 | Dataset: 0-1795021 | Loss: 0.671 | 913 ms/step , 6890.24 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 05:53:26 | Epoch: 1 | Step: 220490 | Dataset: 0-1795341 | Loss: 0.716 | 914 ms/step , 6884.78 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 05:53:35 | Epoch: 1 | Step: 220500 | Dataset: 0-1795661 | Loss: 0.668 | 913 ms/step , 6887.61 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-06 05:53:36 | Validation | Step: 220500 | Val_loss: 0.706 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:53:46 | Epoch: 1 | Step: 220510 | Dataset: 0-1795981 | Loss: 0.740 | 914 ms/step , 6882.83 GFLOP/s , 15276.6 tokens/s INFO:__main__:2024-11-06 05:53:55 | Epoch: 1 | Step: 220520 | Dataset: 0-1796301 | Loss: 0.699 | 914 ms/step , 6878.66 GFLOP/s , 17928.2 tokens/s 
INFO:__main__:2024-11-06 05:54:04 | Epoch: 1 | Step: 220530 | Dataset: 0-1796621 | Loss: 0.722 | 913 ms/step , 6891.15 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 05:54:13 | Epoch: 1 | Step: 220540 | Dataset: 0-1796941 | Loss: 0.714 | 913 ms/step , 6889.57 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 05:54:22 | Epoch: 1 | Step: 220550 | Dataset: 0-1797261 | Loss: 0.666 | 911 ms/step , 6901.09 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 05:54:31 | Epoch: 1 | Step: 220560 | Dataset: 0-1797581 | Loss: 0.710 | 914 ms/step , 6883.12 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 05:54:40 | Epoch: 1 | Step: 220570 | Dataset: 0-1797901 | Loss: 0.656 | 913 ms/step , 6885.32 GFLOP/s , 17930.7 tokens/s INFO:__main__:2024-11-06 05:54:50 | Epoch: 1 | Step: 220580 | Dataset: 0-1798221 | Loss: 0.717 | 914 ms/step , 6881.95 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 05:54:59 | Epoch: 1 | Step: 220590 | Dataset: 0-1798541 | Loss: 0.725 | 913 ms/step , 6885.92 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 05:55:08 | Epoch: 1 | Step: 220600 | Dataset: 0-1798861 | Loss: 0.711 | 914 ms/step , 6879.28 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 05:55:09 | Validation | Step: 220600 | Val_loss: 0.651 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:55:19 | Epoch: 1 | Step: 220610 | Dataset: 0-1799181 | Loss: 0.688 | 913 ms/step , 6886.32 GFLOP/s , 15276.2 tokens/s INFO:__main__:2024-11-06 05:55:28 | Epoch: 1 | Step: 220620 | Dataset: 0-1799501 | Loss: 0.686 | 914 ms/step , 6879.38 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 05:55:37 | Epoch: 1 | Step: 220630 | Dataset: 0-1799821 | Loss: 0.704 | 915 ms/step , 6874.45 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 05:55:46 | Epoch: 1 | Step: 220640 | Dataset: 0-1800141 | Loss: 0.810 | 913 ms/step , 6886.65 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-06 05:55:55 | Epoch: 1 | Step: 220650 | Dataset: 0-1800461 | Loss: 0.630 | 914 ms/step , 6879.73 GFLOP/s , 17924.3 
tokens/s INFO:__main__:2024-11-06 05:56:04 | Epoch: 1 | Step: 220660 | Dataset: 0-1800781 | Loss: 0.672 | 915 ms/step , 6874.79 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 05:56:13 | Epoch: 1 | Step: 220670 | Dataset: 0-1801101 | Loss: 0.793 | 912 ms/step , 6894.51 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 05:56:22 | Epoch: 1 | Step: 220680 | Dataset: 0-1801421 | Loss: 0.692 | 913 ms/step , 6888.02 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 05:56:32 | Epoch: 1 | Step: 220690 | Dataset: 0-1801741 | Loss: 0.639 | 913 ms/step , 6887.37 GFLOP/s , 17933.6 tokens/s INFO:__main__:2024-11-06 05:56:41 | Epoch: 1 | Step: 220700 | Dataset: 0-1802061 | Loss: 0.658 | 913 ms/step , 6886.08 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-06 05:56:42 | Validation | Step: 220700 | Val_loss: 0.686 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:56:51 | Epoch: 1 | Step: 220710 | Dataset: 0-1802381 | Loss: 0.637 | 912 ms/step , 6896.27 GFLOP/s , 15269.3 tokens/s INFO:__main__:2024-11-06 05:57:01 | Epoch: 1 | Step: 220720 | Dataset: 0-1802701 | Loss: 0.698 | 913 ms/step , 6888.05 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 05:57:10 | Epoch: 1 | Step: 220730 | Dataset: 0-1803021 | Loss: 0.774 | 913 ms/step , 6889.58 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-06 05:57:19 | Epoch: 1 | Step: 220740 | Dataset: 0-1803341 | Loss: 0.825 | 914 ms/step , 6881.21 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 05:57:28 | Epoch: 1 | Step: 220750 | Dataset: 0-1803661 | Loss: 0.756 | 912 ms/step , 6893.16 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-06 05:57:37 | Epoch: 1 | Step: 220760 | Dataset: 0-1803981 | Loss: 0.726 | 912 ms/step , 6894.85 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 05:57:46 | Epoch: 1 | Step: 220770 | Dataset: 0-1804301 | Loss: 0.738 | 913 ms/step , 6891.93 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 05:57:55 | Epoch: 1 | Step: 220780 | Dataset: 0-1804621 | Loss: 0.710 | 912 ms/step , 6896.80 GFLOP/s , 
17939.9 tokens/s INFO:__main__:2024-11-06 05:58:05 | Epoch: 1 | Step: 220790 | Dataset: 0-1804941 | Loss: 0.697 | 913 ms/step , 6891.68 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 05:58:14 | Epoch: 1 | Step: 220800 | Dataset: 0-1805261 | Loss: 0.601 | 912 ms/step , 6896.55 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 05:58:15 | Validation | Step: 220800 | Val_loss: 0.730 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 05:58:24 | Epoch: 1 | Step: 220810 | Dataset: 0-1805581 | Loss: 0.643 | 913 ms/step , 6891.26 GFLOP/s , 15275.4 tokens/s INFO:__main__:2024-11-06 05:58:34 | Epoch: 1 | Step: 220820 | Dataset: 0-1805901 | Loss: 0.767 | 914 ms/step , 6881.51 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 05:58:43 | Epoch: 1 | Step: 220830 | Dataset: 0-1806221 | Loss: 0.654 | 914 ms/step , 6883.77 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 05:58:52 | Epoch: 1 | Step: 220840 | Dataset: 0-1806541 | Loss: 0.811 | 913 ms/step , 6887.32 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 05:59:01 | Epoch: 1 | Step: 220850 | Dataset: 0-1806861 | Loss: 0.880 | 915 ms/step , 6875.22 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 05:59:10 | Epoch: 1 | Step: 220860 | Dataset: 0-1807181 | Loss: 0.709 | 914 ms/step , 6878.60 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 05:59:19 | Epoch: 1 | Step: 220870 | Dataset: 0-1807501 | Loss: 0.632 | 914 ms/step , 6883.62 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 05:59:28 | Epoch: 1 | Step: 220880 | Dataset: 0-1807821 | Loss: 0.617 | 913 ms/step , 6886.25 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 05:59:38 | Epoch: 1 | Step: 220890 | Dataset: 0-1808141 | Loss: 0.800 | 912 ms/step , 6894.11 GFLOP/s , 17931.0 tokens/s INFO:__main__:2024-11-06 05:59:47 | Epoch: 1 | Step: 220900 | Dataset: 0-1808461 | Loss: 0.718 | 913 ms/step , 6892.01 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 05:59:48 | Validation | Step: 220900 | Val_loss: 0.676 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 05:59:57 | Epoch: 1 | Step: 220910 | Dataset: 0-1808781 | Loss: 0.786 | 913 ms/step , 6891.88 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-06 06:00:07 | Epoch: 1 | Step: 220920 | Dataset: 0-1809101 | Loss: 0.756 | 913 ms/step , 6887.56 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 06:00:16 | Epoch: 1 | Step: 220930 | Dataset: 0-1809421 | Loss: 0.770 | 915 ms/step , 6872.72 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 06:00:25 | Epoch: 1 | Step: 220940 | Dataset: 0-1809741 | Loss: 0.580 | 913 ms/step , 6886.55 GFLOP/s , 17939.8 tokens/s INFO:__main__:2024-11-06 06:00:34 | Epoch: 1 | Step: 220950 | Dataset: 0-1810061 | Loss: 0.641 | 912 ms/step , 6894.42 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:00:43 | Epoch: 1 | Step: 220960 | Dataset: 0-1810381 | Loss: 0.693 | 912 ms/step , 6895.59 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 06:00:52 | Epoch: 1 | Step: 220970 | Dataset: 0-1810701 | Loss: 0.670 | 913 ms/step , 6888.32 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 06:01:01 | Epoch: 1 | Step: 220980 | Dataset: 0-1811021 | Loss: 0.737 | 913 ms/step , 6888.18 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 06:01:10 | Epoch: 1 | Step: 220990 | Dataset: 0-1811341 | Loss: 0.708 | 914 ms/step , 6883.08 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 06:01:20 | Epoch: 1 | Step: 221000 | Dataset: 0-1811661 | Loss: 0.731 | 913 ms/step , 6891.36 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 06:01:21 | Validation | Step: 221000 | Val_loss: 0.702 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:01:21 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_060121_step_221000.pt` INFO:__main__:2024-11-06 06:01:31 | Epoch: 1 | Step: 221010 | Dataset: 0-1811981 | Loss: 0.736 | 914 ms/step , 6882.41 GFLOP/s , 13821.9 tokens/s INFO:__main__:2024-11-06 06:01:41 | Epoch: 1 | Step: 221020 | Dataset: 0-1812301 | Loss: 0.647 | 912 ms/step , 6898.97 GFLOP/s , 17935.8 tokens/s 
INFO:__main__:2024-11-06 06:01:50 | Epoch: 1 | Step: 221030 | Dataset: 0-1812621 | Loss: 0.693 | 914 ms/step , 6880.96 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-06 06:01:59 | Epoch: 1 | Step: 221040 | Dataset: 0-1812941 | Loss: 0.809 | 913 ms/step , 6886.14 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-06 06:02:08 | Epoch: 1 | Step: 221050 | Dataset: 0-1813261 | Loss: 0.654 | 913 ms/step , 6891.31 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:02:17 | Epoch: 1 | Step: 221060 | Dataset: 0-1813581 | Loss: 0.756 | 914 ms/step , 6882.41 GFLOP/s , 17922.5 tokens/s INFO:__main__:2024-11-06 06:02:26 | Epoch: 1 | Step: 221070 | Dataset: 0-1813901 | Loss: 0.685 | 914 ms/step , 6883.37 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 06:02:35 | Epoch: 1 | Step: 221080 | Dataset: 0-1814221 | Loss: 0.623 | 913 ms/step , 6890.53 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 06:02:45 | Epoch: 1 | Step: 221090 | Dataset: 0-1814541 | Loss: 0.664 | 912 ms/step , 6892.71 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 06:02:54 | Epoch: 1 | Step: 221100 | Dataset: 0-1814861 | Loss: 0.740 | 913 ms/step , 6888.81 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 06:02:55 | Validation | Step: 221100 | Val_loss: 0.649 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:03:04 | Epoch: 1 | Step: 221110 | Dataset: 0-1815181 | Loss: 0.681 | 912 ms/step , 6896.22 GFLOP/s , 15276.5 tokens/s INFO:__main__:2024-11-06 06:03:14 | Epoch: 1 | Step: 221120 | Dataset: 0-1815501 | Loss: 0.695 | 913 ms/step , 6892.03 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 06:03:23 | Epoch: 1 | Step: 221130 | Dataset: 0-1815821 | Loss: 0.877 | 914 ms/step , 6883.70 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 06:03:32 | Epoch: 1 | Step: 221140 | Dataset: 0-1816141 | Loss: 0.758 | 913 ms/step , 6885.81 GFLOP/s , 17927.7 tokens/s INFO:__main__:2024-11-06 06:03:41 | Epoch: 1 | Step: 221150 | Dataset: 0-1816461 | Loss: 0.642 | 914 ms/step , 6882.22 GFLOP/s , 17928.7 
tokens/s INFO:__main__:2024-11-06 06:03:50 | Epoch: 1 | Step: 221160 | Dataset: 0-1816781 | Loss: 0.785 | 914 ms/step , 6878.94 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 06:03:59 | Epoch: 1 | Step: 221170 | Dataset: 0-1817101 | Loss: 0.672 | 914 ms/step , 6882.54 GFLOP/s , 17938.9 tokens/s INFO:__main__:2024-11-06 06:04:08 | Epoch: 1 | Step: 221180 | Dataset: 0-1817421 | Loss: 0.709 | 913 ms/step , 6885.65 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-06 06:04:18 | Epoch: 1 | Step: 221190 | Dataset: 0-1817741 | Loss: 0.610 | 912 ms/step , 6892.77 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-06 06:04:27 | Epoch: 1 | Step: 221200 | Dataset: 0-1818061 | Loss: 0.865 | 914 ms/step , 6881.26 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 06:04:28 | Validation | Step: 221200 | Val_loss: 0.714 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:04:37 | Epoch: 1 | Step: 221210 | Dataset: 0-1818381 | Loss: 0.712 | 914 ms/step , 6884.59 GFLOP/s , 15299.3 tokens/s INFO:__main__:2024-11-06 06:04:47 | Epoch: 1 | Step: 221220 | Dataset: 0-1818701 | Loss: 0.794 | 915 ms/step , 6874.33 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 06:04:56 | Epoch: 1 | Step: 221230 | Dataset: 0-1819021 | Loss: 0.679 | 913 ms/step , 6890.72 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 06:05:05 | Epoch: 1 | Step: 221240 | Dataset: 0-1819341 | Loss: 0.643 | 913 ms/step , 6890.96 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 06:05:14 | Epoch: 1 | Step: 221250 | Dataset: 0-1819661 | Loss: 0.846 | 914 ms/step , 6884.91 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 06:05:23 | Epoch: 1 | Step: 221260 | Dataset: 0-1819981 | Loss: 0.759 | 914 ms/step , 6880.14 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 06:05:32 | Epoch: 1 | Step: 221270 | Dataset: 0-1820301 | Loss: 0.757 | 913 ms/step , 6891.05 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 06:05:41 | Epoch: 1 | Step: 221280 | Dataset: 0-1820621 | Loss: 0.803 | 914 ms/step , 6881.22 GFLOP/s , 
17931.3 tokens/s INFO:__main__:2024-11-06 06:05:50 | Epoch: 1 | Step: 221290 | Dataset: 0-1820941 | Loss: 0.438 | 913 ms/step , 6888.29 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 06:06:00 | Epoch: 1 | Step: 221300 | Dataset: 0-1821261 | Loss: 0.786 | 913 ms/step , 6890.51 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 06:06:01 | Validation | Step: 221300 | Val_loss: 0.708 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:06:10 | Epoch: 1 | Step: 221310 | Dataset: 0-1821581 | Loss: 0.605 | 913 ms/step , 6892.37 GFLOP/s , 15279.3 tokens/s INFO:__main__:2024-11-06 06:06:19 | Epoch: 1 | Step: 221320 | Dataset: 0-1821901 | Loss: 0.762 | 914 ms/step , 6883.67 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 06:06:29 | Epoch: 1 | Step: 221330 | Dataset: 0-1822221 | Loss: 0.691 | 913 ms/step , 6889.30 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 06:06:38 | Epoch: 1 | Step: 221340 | Dataset: 0-1822541 | Loss: 0.738 | 913 ms/step , 6890.04 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 06:06:47 | Epoch: 1 | Step: 221350 | Dataset: 0-1822861 | Loss: 0.820 | 913 ms/step , 6889.24 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 06:06:56 | Epoch: 1 | Step: 221360 | Dataset: 0-1823181 | Loss: 0.796 | 913 ms/step , 6888.54 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 06:07:05 | Epoch: 1 | Step: 221370 | Dataset: 0-1823501 | Loss: 0.713 | 914 ms/step , 6883.96 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 06:07:14 | Epoch: 1 | Step: 221380 | Dataset: 0-1823821 | Loss: 0.631 | 912 ms/step , 6893.11 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 06:07:23 | Epoch: 1 | Step: 221390 | Dataset: 0-1824141 | Loss: 0.613 | 913 ms/step , 6889.34 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 06:07:33 | Epoch: 1 | Step: 221400 | Dataset: 0-1824461 | Loss: 0.760 | 916 ms/step , 6868.99 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 06:07:34 | Validation | Step: 221400 | Val_loss: 0.684 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 06:07:43 | Epoch: 1 | Step: 221410 | Dataset: 0-1824781 | Loss: 0.770 | 914 ms/step , 6882.05 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-06 06:07:52 | Epoch: 1 | Step: 221420 | Dataset: 0-1825101 | Loss: 0.780 | 913 ms/step , 6888.92 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 06:08:02 | Epoch: 1 | Step: 221430 | Dataset: 0-1825421 | Loss: 0.783 | 914 ms/step , 6880.42 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 06:08:11 | Epoch: 1 | Step: 221440 | Dataset: 0-1825741 | Loss: 0.782 | 914 ms/step , 6882.21 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 06:08:20 | Epoch: 1 | Step: 221450 | Dataset: 0-1826061 | Loss: 0.648 | 914 ms/step , 6884.76 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 06:08:29 | Epoch: 1 | Step: 221460 | Dataset: 0-1826381 | Loss: 0.799 | 912 ms/step , 6893.26 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 06:08:38 | Epoch: 1 | Step: 221470 | Dataset: 0-1826701 | Loss: 0.867 | 915 ms/step , 6870.47 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 06:08:47 | Epoch: 1 | Step: 221480 | Dataset: 0-1827021 | Loss: 0.802 | 914 ms/step , 6877.54 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-06 06:08:56 | Epoch: 1 | Step: 221490 | Dataset: 0-1827341 | Loss: 0.771 | 913 ms/step , 6885.72 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 06:09:06 | Epoch: 1 | Step: 221500 | Dataset: 0-1827661 | Loss: 0.682 | 913 ms/step , 6891.75 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 06:09:07 | Validation | Step: 221500 | Val_loss: 0.706 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:09:16 | Epoch: 1 | Step: 221510 | Dataset: 0-1827981 | Loss: 0.684 | 913 ms/step , 6891.94 GFLOP/s , 15292.9 tokens/s INFO:__main__:2024-11-06 06:09:25 | Epoch: 1 | Step: 221520 | Dataset: 0-1828301 | Loss: 0.712 | 913 ms/step , 6887.61 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 06:09:35 | Epoch: 1 | Step: 221530 | Dataset: 0-1828621 | Loss: 0.774 | 913 ms/step , 6890.69 GFLOP/s , 17930.0 
tokens/s INFO:__main__:2024-11-06 06:09:44 | Epoch: 1 | Step: 221540 | Dataset: 0-1828941 | Loss: 0.690 | 913 ms/step , 6889.07 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 06:09:53 | Epoch: 1 | Step: 221550 | Dataset: 0-1829261 | Loss: 0.799 | 913 ms/step , 6891.42 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:10:02 | Epoch: 1 | Step: 221560 | Dataset: 0-1829581 | Loss: 0.831 | 912 ms/step , 6896.25 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-06 06:10:11 | Epoch: 1 | Step: 221570 | Dataset: 0-1829901 | Loss: 0.807 | 913 ms/step , 6891.47 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 06:10:20 | Epoch: 1 | Step: 221580 | Dataset: 0-1830221 | Loss: 0.772 | 914 ms/step , 6883.99 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 06:10:29 | Epoch: 1 | Step: 221590 | Dataset: 0-1830541 | Loss: 0.726 | 912 ms/step , 6896.07 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 06:10:38 | Epoch: 1 | Step: 221600 | Dataset: 0-1830861 | Loss: 0.733 | 913 ms/step , 6888.09 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 06:10:40 | Validation | Step: 221600 | Val_loss: 0.716 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:10:49 | Epoch: 1 | Step: 221610 | Dataset: 0-1831181 | Loss: 0.588 | 912 ms/step , 6899.61 GFLOP/s , 15278.7 tokens/s INFO:__main__:2024-11-06 06:10:58 | Epoch: 1 | Step: 221620 | Dataset: 0-1831501 | Loss: 0.682 | 913 ms/step , 6888.57 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 06:11:07 | Epoch: 1 | Step: 221630 | Dataset: 0-1831821 | Loss: 0.683 | 913 ms/step , 6888.85 GFLOP/s , 17929.4 tokens/s INFO:__main__:2024-11-06 06:11:17 | Epoch: 1 | Step: 221640 | Dataset: 0-1832141 | Loss: 0.774 | 912 ms/step , 6892.94 GFLOP/s , 17942.0 tokens/s INFO:__main__:2024-11-06 06:11:26 | Epoch: 1 | Step: 221650 | Dataset: 0-1832461 | Loss: 0.468 | 912 ms/step , 6896.67 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 06:11:35 | Epoch: 1 | Step: 221660 | Dataset: 0-1832781 | Loss: 0.690 | 912 ms/step , 6894.83 GFLOP/s , 
17934.7 tokens/s INFO:__main__:2024-11-06 06:11:44 | Epoch: 1 | Step: 221670 | Dataset: 0-1833101 | Loss: 0.670 | 913 ms/step , 6891.11 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 06:11:53 | Epoch: 1 | Step: 221680 | Dataset: 0-1833421 | Loss: 0.793 | 914 ms/step , 6880.85 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 06:12:02 | Epoch: 1 | Step: 221690 | Dataset: 0-1833741 | Loss: 0.765 | 913 ms/step , 6887.93 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-06 06:12:11 | Epoch: 1 | Step: 221700 | Dataset: 0-1834061 | Loss: 0.737 | 913 ms/step , 6888.55 GFLOP/s , 17941.7 tokens/s INFO:__main__:2024-11-06 06:12:13 | Validation | Step: 221700 | Val_loss: 0.645 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:12:22 | Epoch: 1 | Step: 221710 | Dataset: 0-1834381 | Loss: 0.626 | 912 ms/step , 6899.59 GFLOP/s , 15281.5 tokens/s INFO:__main__:2024-11-06 06:12:31 | Epoch: 1 | Step: 221720 | Dataset: 0-1834701 | Loss: 0.792 | 912 ms/step , 6898.02 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 06:12:40 | Epoch: 1 | Step: 221730 | Dataset: 0-1835021 | Loss: 0.760 | 913 ms/step , 6891.87 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 06:12:50 | Epoch: 1 | Step: 221740 | Dataset: 0-1835341 | Loss: 0.772 | 913 ms/step , 6889.93 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 06:12:59 | Epoch: 1 | Step: 221750 | Dataset: 0-1835661 | Loss: 0.706 | 914 ms/step , 6883.38 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 06:13:08 | Epoch: 1 | Step: 221760 | Dataset: 0-1835981 | Loss: 0.514 | 914 ms/step , 6882.89 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 06:13:17 | Epoch: 1 | Step: 221770 | Dataset: 0-1836301 | Loss: 0.666 | 913 ms/step , 6889.74 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 06:13:26 | Epoch: 1 | Step: 221780 | Dataset: 0-1836621 | Loss: 0.746 | 913 ms/step , 6887.07 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-06 06:13:35 | Epoch: 1 | Step: 221790 | Dataset: 0-1836941 | Loss: 0.757 | 913 ms/step , 6887.67 GFLOP/s 
, 17939.1 tokens/s INFO:__main__:2024-11-06 06:13:44 | Epoch: 1 | Step: 221800 | Dataset: 0-1837261 | Loss: 0.760 | 914 ms/step , 6881.12 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 06:13:46 | Validation | Step: 221800 | Val_loss: 0.722 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:13:55 | Epoch: 1 | Step: 221810 | Dataset: 0-1837581 | Loss: 0.527 | 912 ms/step , 6895.55 GFLOP/s , 15280.6 tokens/s INFO:__main__:2024-11-06 06:14:04 | Epoch: 1 | Step: 221820 | Dataset: 0-1837901 | Loss: 0.776 | 913 ms/step , 6889.17 GFLOP/s , 17934.9 tokens/s INFO:__main__:2024-11-06 06:14:13 | Epoch: 1 | Step: 221830 | Dataset: 0-1838221 | Loss: 0.667 | 913 ms/step , 6888.49 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 06:14:22 | Epoch: 1 | Step: 221840 | Dataset: 0-1838541 | Loss: 0.646 | 914 ms/step , 6881.02 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-06 06:14:32 | Epoch: 1 | Step: 221850 | Dataset: 0-1838861 | Loss: 0.913 | 915 ms/step , 6871.95 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 06:14:41 | Epoch: 1 | Step: 221860 | Dataset: 0-1839181 | Loss: 0.684 | 914 ms/step , 6879.59 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 06:14:50 | Epoch: 1 | Step: 221870 | Dataset: 0-1839501 | Loss: 0.715 | 913 ms/step , 6892.39 GFLOP/s , 17934.2 tokens/s INFO:__main__:2024-11-06 06:14:59 | Epoch: 1 | Step: 221880 | Dataset: 0-1839821 | Loss: 0.826 | 914 ms/step , 6884.84 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 06:15:08 | Epoch: 1 | Step: 221890 | Dataset: 0-1840141 | Loss: 0.463 | 912 ms/step , 6893.45 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 06:15:17 | Epoch: 1 | Step: 221900 | Dataset: 0-1840461 | Loss: 0.706 | 912 ms/step , 6897.64 GFLOP/s , 17945.5 tokens/s INFO:__main__:2024-11-06 06:15:19 | Validation | Step: 221900 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:15:28 | Epoch: 1 | Step: 221910 | Dataset: 0-1840781 | Loss: 0.727 | 913 ms/step , 6888.01 GFLOP/s , 15273.3 tokens/s 
INFO:__main__:2024-11-06 06:15:37 | Epoch: 1 | Step: 221920 | Dataset: 0-1841101 | Loss: 0.868 | 914 ms/step , 6881.27 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 06:15:46 | Epoch: 1 | Step: 221930 | Dataset: 0-1841421 | Loss: 0.530 | 911 ms/step , 6903.36 GFLOP/s , 17942.1 tokens/s INFO:__main__:2024-11-06 06:15:55 | Epoch: 1 | Step: 221940 | Dataset: 0-1841741 | Loss: 0.727 | 912 ms/step , 6895.57 GFLOP/s , 17941.2 tokens/s INFO:__main__:2024-11-06 06:16:05 | Epoch: 1 | Step: 221950 | Dataset: 0-1842061 | Loss: 0.812 | 913 ms/step , 6885.27 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 06:16:14 | Epoch: 1 | Step: 221960 | Dataset: 0-1842381 | Loss: 0.604 | 913 ms/step , 6892.26 GFLOP/s , 17940.6 tokens/s INFO:__main__:2024-11-06 06:16:23 | Epoch: 1 | Step: 221970 | Dataset: 0-1842701 | Loss: 0.655 | 911 ms/step , 6900.74 GFLOP/s , 17938.5 tokens/s INFO:__main__:2024-11-06 06:16:32 | Epoch: 1 | Step: 221980 | Dataset: 0-1843021 | Loss: 0.617 | 913 ms/step , 6889.79 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 06:16:41 | Epoch: 1 | Step: 221990 | Dataset: 0-1843341 | Loss: 0.695 | 914 ms/step , 6882.23 GFLOP/s , 17942.8 tokens/s INFO:__main__:2024-11-06 06:16:50 | Epoch: 1 | Step: 222000 | Dataset: 0-1843661 | Loss: 0.756 | 913 ms/step , 6891.11 GFLOP/s , 17938.3 tokens/s INFO:__main__:2024-11-06 06:16:52 | Validation | Step: 222000 | Val_loss: 0.770 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:16:52 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_061652_step_222000.pt` INFO:__main__:2024-11-06 06:17:02 | Epoch: 1 | Step: 222010 | Dataset: 0-1843981 | Loss: 0.808 | 914 ms/step , 6879.89 GFLOP/s , 13773.5 tokens/s INFO:__main__:2024-11-06 06:17:11 | Epoch: 1 | Step: 222020 | Dataset: 0-1844301 | Loss: 0.663 | 913 ms/step , 6890.71 GFLOP/s , 17938.0 tokens/s INFO:__main__:2024-11-06 06:17:20 | Epoch: 1 | Step: 222030 | Dataset: 0-1844621 | Loss: 0.756 | 914 ms/step , 6881.47 GFLOP/s , 17936.4 tokens/s 
INFO:__main__:2024-11-06 06:17:30 | Epoch: 1 | Step: 222040 | Dataset: 0-1844941 | Loss: 0.679 | 914 ms/step , 6879.95 GFLOP/s , 17863.4 tokens/s INFO:__main__:2024-11-06 06:17:39 | Epoch: 1 | Step: 222050 | Dataset: 0-1845261 | Loss: 0.784 | 913 ms/step , 6890.05 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-06 06:17:48 | Epoch: 1 | Step: 222060 | Dataset: 0-1845581 | Loss: 0.737 | 913 ms/step , 6890.65 GFLOP/s , 17899.0 tokens/s INFO:__main__:2024-11-06 06:17:57 | Epoch: 1 | Step: 222070 | Dataset: 0-1845901 | Loss: 0.736 | 914 ms/step , 6881.00 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 06:18:06 | Epoch: 1 | Step: 222080 | Dataset: 0-1846221 | Loss: 0.654 | 913 ms/step , 6885.58 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 06:18:15 | Epoch: 1 | Step: 222090 | Dataset: 0-1846541 | Loss: 0.727 | 913 ms/step , 6889.95 GFLOP/s , 17895.2 tokens/s INFO:__main__:2024-11-06 06:18:24 | Epoch: 1 | Step: 222100 | Dataset: 0-1846861 | Loss: 0.762 | 914 ms/step , 6884.63 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-06 06:18:26 | Validation | Step: 222100 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:18:35 | Epoch: 1 | Step: 222110 | Dataset: 0-1847181 | Loss: 0.678 | 913 ms/step , 6886.01 GFLOP/s , 15278.2 tokens/s INFO:__main__:2024-11-06 06:18:44 | Epoch: 1 | Step: 222120 | Dataset: 0-1847501 | Loss: 0.695 | 913 ms/step , 6887.52 GFLOP/s , 17939.1 tokens/s INFO:__main__:2024-11-06 06:18:53 | Epoch: 1 | Step: 222130 | Dataset: 0-1847821 | Loss: 0.690 | 913 ms/step , 6887.18 GFLOP/s , 17941.6 tokens/s INFO:__main__:2024-11-06 06:19:03 | Epoch: 1 | Step: 222140 | Dataset: 0-1848141 | Loss: 0.686 | 913 ms/step , 6887.76 GFLOP/s , 17934.4 tokens/s INFO:__main__:2024-11-06 06:19:12 | Epoch: 1 | Step: 222150 | Dataset: 0-1848461 | Loss: 0.679 | 913 ms/step , 6888.17 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 06:19:21 | Epoch: 1 | Step: 222160 | Dataset: 0-1848781 | Loss: 0.744 | 912 ms/step , 6897.08 GFLOP/s , 17933.4 
tokens/s INFO:__main__:2024-11-06 06:19:30 | Epoch: 1 | Step: 222170 | Dataset: 0-1849101 | Loss: 0.709 | 913 ms/step , 6891.69 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 06:19:39 | Epoch: 1 | Step: 222180 | Dataset: 0-1849421 | Loss: 0.734 | 913 ms/step , 6885.66 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 06:19:48 | Epoch: 1 | Step: 222190 | Dataset: 0-1849741 | Loss: 0.760 | 913 ms/step , 6888.55 GFLOP/s , 17941.0 tokens/s INFO:__main__:2024-11-06 06:19:57 | Epoch: 1 | Step: 222200 | Dataset: 0-1850061 | Loss: 0.828 | 913 ms/step , 6886.85 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 06:19:59 | Validation | Step: 222200 | Val_loss: 0.660 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:20:08 | Epoch: 1 | Step: 222210 | Dataset: 0-1850381 | Loss: 0.676 | 913 ms/step , 6886.82 GFLOP/s , 15266.9 tokens/s INFO:__main__:2024-11-06 06:20:17 | Epoch: 1 | Step: 222220 | Dataset: 0-1850701 | Loss: 0.843 | 913 ms/step , 6886.88 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 06:20:26 | Epoch: 1 | Step: 222230 | Dataset: 0-1851021 | Loss: 0.470 | 912 ms/step , 6897.85 GFLOP/s , 17940.4 tokens/s INFO:__main__:2024-11-06 06:20:35 | Epoch: 1 | Step: 222240 | Dataset: 0-1851341 | Loss: 0.728 | 912 ms/step , 6892.98 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-06 06:20:45 | Epoch: 1 | Step: 222250 | Dataset: 0-1851661 | Loss: 0.811 | 912 ms/step , 6896.13 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 06:20:54 | Epoch: 1 | Step: 222260 | Dataset: 0-1851981 | Loss: 0.777 | 912 ms/step , 6892.91 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 06:21:03 | Epoch: 1 | Step: 222270 | Dataset: 0-1852301 | Loss: 0.750 | 914 ms/step , 6879.90 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 06:21:12 | Epoch: 1 | Step: 222280 | Dataset: 0-1852621 | Loss: 0.662 | 914 ms/step , 6884.28 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 06:21:21 | Epoch: 1 | Step: 222290 | Dataset: 0-1852941 | Loss: 0.711 | 914 ms/step , 6878.30 GFLOP/s , 
17929.1 tokens/s INFO:__main__:2024-11-06 06:21:30 | Epoch: 1 | Step: 222300 | Dataset: 0-1853261 | Loss: 0.640 | 913 ms/step , 6888.75 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 06:21:32 | Validation | Step: 222300 | Val_loss: 0.738 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:21:41 | Epoch: 1 | Step: 222310 | Dataset: 0-1853581 | Loss: 0.741 | 913 ms/step , 6891.59 GFLOP/s , 15270.0 tokens/s INFO:__main__:2024-11-06 06:21:50 | Epoch: 1 | Step: 222320 | Dataset: 0-1853901 | Loss: 0.824 | 914 ms/step , 6882.51 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 06:21:59 | Epoch: 1 | Step: 222330 | Dataset: 0-1854221 | Loss: 0.788 | 913 ms/step , 6885.39 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 06:22:08 | Epoch: 1 | Step: 222340 | Dataset: 0-1854541 | Loss: 0.770 | 913 ms/step , 6888.74 GFLOP/s , 17936.3 tokens/s INFO:__main__:2024-11-06 06:22:18 | Epoch: 1 | Step: 222350 | Dataset: 0-1854861 | Loss: 0.732 | 913 ms/step , 6890.41 GFLOP/s , 17936.4 tokens/s INFO:__main__:2024-11-06 06:22:27 | Epoch: 1 | Step: 222360 | Dataset: 0-1855181 | Loss: 0.832 | 912 ms/step , 6897.01 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-06 06:22:36 | Epoch: 1 | Step: 222370 | Dataset: 0-1855501 | Loss: 0.657 | 912 ms/step , 6896.44 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:22:45 | Epoch: 1 | Step: 222380 | Dataset: 0-1855821 | Loss: 0.690 | 912 ms/step , 6893.31 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 06:22:54 | Epoch: 1 | Step: 222390 | Dataset: 0-1856141 | Loss: 0.778 | 913 ms/step , 6892.33 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 06:23:03 | Epoch: 1 | Step: 222400 | Dataset: 0-1856461 | Loss: 0.611 | 913 ms/step , 6887.68 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 06:23:05 | Validation | Step: 222400 | Val_loss: 0.735 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:23:14 | Epoch: 1 | Step: 222410 | Dataset: 0-1856781 | Loss: 0.792 | 914 ms/step , 6883.66 GFLOP/s , 15274.6 tokens/s 
INFO:__main__:2024-11-06 06:23:23 | Epoch: 1 | Step: 222420 | Dataset: 0-1857101 | Loss: 0.597 | 914 ms/step , 6882.19 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 06:23:32 | Epoch: 1 | Step: 222430 | Dataset: 0-1857421 | Loss: 0.682 | 913 ms/step , 6889.29 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 06:23:41 | Epoch: 1 | Step: 222440 | Dataset: 0-1857741 | Loss: 0.582 | 912 ms/step , 6897.27 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:23:50 | Epoch: 1 | Step: 222450 | Dataset: 0-1858061 | Loss: 0.749 | 913 ms/step , 6889.14 GFLOP/s , 17931.3 tokens/s INFO:__main__:2024-11-06 06:24:00 | Epoch: 1 | Step: 222460 | Dataset: 0-1858381 | Loss: 0.811 | 915 ms/step , 6874.52 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 06:24:09 | Epoch: 1 | Step: 222470 | Dataset: 0-1858701 | Loss: 0.776 | 912 ms/step , 6895.90 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 06:24:18 | Epoch: 1 | Step: 222480 | Dataset: 0-1859021 | Loss: 0.725 | 913 ms/step , 6887.30 GFLOP/s , 17927.4 tokens/s INFO:__main__:2024-11-06 06:24:27 | Epoch: 1 | Step: 222490 | Dataset: 0-1859341 | Loss: 0.792 | 913 ms/step , 6891.72 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 06:24:36 | Epoch: 1 | Step: 222500 | Dataset: 0-1859661 | Loss: 0.742 | 913 ms/step , 6891.17 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 06:24:38 | Validation | Step: 222500 | Val_loss: 0.717 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:24:47 | Epoch: 1 | Step: 222510 | Dataset: 0-1859981 | Loss: 0.851 | 914 ms/step , 6883.15 GFLOP/s , 15275.1 tokens/s INFO:__main__:2024-11-06 06:24:56 | Epoch: 1 | Step: 222520 | Dataset: 0-1860301 | Loss: 0.677 | 912 ms/step , 6895.32 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 06:25:05 | Epoch: 1 | Step: 222530 | Dataset: 0-1860621 | Loss: 0.726 | 915 ms/step , 6873.09 GFLOP/s , 17936.7 tokens/s INFO:__main__:2024-11-06 06:25:14 | Epoch: 1 | Step: 222540 | Dataset: 0-1860941 | Loss: 0.743 | 914 ms/step , 6883.01 GFLOP/s , 17937.5 
tokens/s INFO:__main__:2024-11-06 06:25:23 | Epoch: 1 | Step: 222550 | Dataset: 0-1861261 | Loss: 0.762 | 913 ms/step , 6891.50 GFLOP/s , 17936.0 tokens/s INFO:__main__:2024-11-06 06:25:33 | Epoch: 1 | Step: 222560 | Dataset: 0-1861581 | Loss: 0.773 | 913 ms/step , 6891.02 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 06:25:42 | Epoch: 1 | Step: 222570 | Dataset: 0-1861901 | Loss: 0.907 | 913 ms/step , 6889.08 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-06 06:25:51 | Epoch: 1 | Step: 222580 | Dataset: 0-1862221 | Loss: 0.644 | 913 ms/step , 6891.49 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 06:26:00 | Epoch: 1 | Step: 222590 | Dataset: 0-1862541 | Loss: 0.695 | 913 ms/step , 6885.13 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 06:26:09 | Epoch: 1 | Step: 222600 | Dataset: 0-1862861 | Loss: 0.628 | 913 ms/step , 6890.28 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 06:26:11 | Validation | Step: 222600 | Val_loss: 0.682 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:26:20 | Epoch: 1 | Step: 222610 | Dataset: 0-1863181 | Loss: 0.682 | 913 ms/step , 6891.42 GFLOP/s , 15270.1 tokens/s INFO:__main__:2024-11-06 06:26:29 | Epoch: 1 | Step: 222620 | Dataset: 0-1863501 | Loss: 0.570 | 913 ms/step , 6891.88 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 06:26:38 | Epoch: 1 | Step: 222630 | Dataset: 0-1863821 | Loss: 0.852 | 913 ms/step , 6887.17 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 06:26:47 | Epoch: 1 | Step: 222640 | Dataset: 0-1864141 | Loss: 0.753 | 914 ms/step , 6884.21 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 06:26:56 | Epoch: 1 | Step: 222650 | Dataset: 0-1864461 | Loss: 0.818 | 913 ms/step , 6887.62 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 06:27:06 | Epoch: 1 | Step: 222660 | Dataset: 0-1864781 | Loss: 0.853 | 913 ms/step , 6885.14 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 06:27:15 | Epoch: 1 | Step: 222670 | Dataset: 0-1865101 | Loss: 0.707 | 914 ms/step , 6882.30 GFLOP/s , 
17932.1 tokens/s INFO:__main__:2024-11-06 06:27:24 | Epoch: 1 | Step: 222680 | Dataset: 0-1865421 | Loss: 0.853 | 915 ms/step , 6875.11 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 06:27:33 | Epoch: 1 | Step: 222690 | Dataset: 0-1865741 | Loss: 0.709 | 913 ms/step , 6891.68 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-06 06:27:42 | Epoch: 1 | Step: 222700 | Dataset: 0-1866061 | Loss: 0.726 | 913 ms/step , 6889.45 GFLOP/s , 17937.7 tokens/s INFO:__main__:2024-11-06 06:27:44 | Validation | Step: 222700 | Val_loss: 0.692 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:27:53 | Epoch: 1 | Step: 222710 | Dataset: 0-1866381 | Loss: 0.787 | 913 ms/step , 6890.48 GFLOP/s , 15287.2 tokens/s INFO:__main__:2024-11-06 06:28:02 | Epoch: 1 | Step: 222720 | Dataset: 0-1866701 | Loss: 0.680 | 913 ms/step , 6892.14 GFLOP/s , 17937.9 tokens/s INFO:__main__:2024-11-06 06:28:11 | Epoch: 1 | Step: 222730 | Dataset: 0-1867021 | Loss: 0.652 | 913 ms/step , 6886.88 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 06:28:20 | Epoch: 1 | Step: 222740 | Dataset: 0-1867341 | Loss: 0.847 | 913 ms/step , 6890.31 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 06:28:29 | Epoch: 1 | Step: 222750 | Dataset: 0-1867661 | Loss: 0.763 | 915 ms/step , 6875.53 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 06:28:38 | Epoch: 1 | Step: 222760 | Dataset: 0-1867981 | Loss: 0.690 | 912 ms/step , 6896.22 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 06:28:48 | Epoch: 1 | Step: 222770 | Dataset: 0-1868301 | Loss: 0.690 | 914 ms/step , 6883.81 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 06:28:57 | Epoch: 1 | Step: 222780 | Dataset: 0-1868621 | Loss: 0.770 | 912 ms/step , 6898.62 GFLOP/s , 17936.1 tokens/s INFO:__main__:2024-11-06 06:29:06 | Epoch: 1 | Step: 222790 | Dataset: 0-1868941 | Loss: 0.826 | 914 ms/step , 6880.51 GFLOP/s , 17930.5 tokens/s INFO:__main__:2024-11-06 06:29:15 | Epoch: 1 | Step: 222800 | Dataset: 0-1869261 | Loss: 0.559 | 912 ms/step , 6895.76 GFLOP/s 
, 17934.4 tokens/s INFO:__main__:2024-11-06 06:29:17 | Validation | Step: 222800 | Val_loss: 0.709 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:29:26 | Epoch: 1 | Step: 222810 | Dataset: 0-1869581 | Loss: 0.622 | 913 ms/step , 6886.44 GFLOP/s , 15267.3 tokens/s INFO:__main__:2024-11-06 06:29:35 | Epoch: 1 | Step: 222820 | Dataset: 0-1869901 | Loss: 0.603 | 913 ms/step , 6889.54 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 06:29:44 | Epoch: 1 | Step: 222830 | Dataset: 0-1870221 | Loss: 0.631 | 912 ms/step , 6899.34 GFLOP/s , 17940.1 tokens/s INFO:__main__:2024-11-06 06:29:53 | Epoch: 1 | Step: 222840 | Dataset: 0-1870541 | Loss: 0.877 | 914 ms/step , 6884.77 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 06:30:02 | Epoch: 1 | Step: 222850 | Dataset: 0-1870861 | Loss: 0.785 | 912 ms/step , 6892.80 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 06:30:11 | Epoch: 1 | Step: 222860 | Dataset: 0-1871181 | Loss: 0.878 | 914 ms/step , 6878.25 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 06:30:21 | Epoch: 1 | Step: 222870 | Dataset: 0-1871501 | Loss: 0.601 | 914 ms/step , 6883.44 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 06:30:30 | Epoch: 1 | Step: 222880 | Dataset: 0-1871821 | Loss: 0.736 | 912 ms/step , 6893.94 GFLOP/s , 17933.4 tokens/s INFO:__main__:2024-11-06 06:30:39 | Epoch: 1 | Step: 222890 | Dataset: 0-1872141 | Loss: 0.581 | 912 ms/step , 6894.52 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 06:30:48 | Epoch: 1 | Step: 222900 | Dataset: 0-1872461 | Loss: 0.684 | 913 ms/step , 6890.93 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 06:30:50 | Validation | Step: 222900 | Val_loss: 0.715 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:30:59 | Epoch: 1 | Step: 222910 | Dataset: 0-1872781 | Loss: 0.642 | 913 ms/step , 6887.61 GFLOP/s , 15287.0 tokens/s INFO:__main__:2024-11-06 06:31:08 | Epoch: 1 | Step: 222920 | Dataset: 0-1873101 | Loss: 0.676 | 913 ms/step , 6889.24 GFLOP/s , 17933.9 tokens/s 
INFO:__main__:2024-11-06 06:31:17 | Epoch: 1 | Step: 222930 | Dataset: 0-1873421 | Loss: 0.644 | 912 ms/step , 6892.98 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 06:31:26 | Epoch: 1 | Step: 222940 | Dataset: 0-1873741 | Loss: 0.827 | 913 ms/step , 6891.76 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-06 06:31:35 | Epoch: 1 | Step: 222950 | Dataset: 0-1874061 | Loss: 0.724 | 912 ms/step , 6893.30 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 06:31:44 | Epoch: 1 | Step: 222960 | Dataset: 0-1874381 | Loss: 0.689 | 913 ms/step , 6892.49 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 06:31:54 | Epoch: 1 | Step: 222970 | Dataset: 0-1874701 | Loss: 0.424 | 919 ms/step , 6843.52 GFLOP/s , 17859.9 tokens/s INFO:__main__:2024-11-06 06:32:03 | Epoch: 1 | Step: 222980 | Dataset: 0-1875021 | Loss: 0.732 | 931 ms/step , 6756.37 GFLOP/s , 17693.0 tokens/s INFO:__main__:2024-11-06 06:32:12 | Epoch: 1 | Step: 222990 | Dataset: 0-1875341 | Loss: 0.671 | 912 ms/step , 6893.59 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 06:32:21 | Epoch: 1 | Step: 223000 | Dataset: 0-1875661 | Loss: 0.666 | 912 ms/step , 6896.04 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 06:32:23 | Validation | Step: 223000 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:32:23 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_063223_step_223000.pt` INFO:__main__:2024-11-06 06:32:33 | Epoch: 1 | Step: 223010 | Dataset: 0-1875981 | Loss: 0.617 | 913 ms/step , 6890.69 GFLOP/s , 13797.0 tokens/s INFO:__main__:2024-11-06 06:32:42 | Epoch: 1 | Step: 223020 | Dataset: 0-1876301 | Loss: 0.677 | 912 ms/step , 6896.76 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 06:32:51 | Epoch: 1 | Step: 223030 | Dataset: 0-1876621 | Loss: 0.701 | 913 ms/step , 6887.93 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 06:33:00 | Epoch: 1 | Step: 223040 | Dataset: 0-1876941 | Loss: 0.712 | 914 ms/step , 6884.08 GFLOP/s , 17859.8 tokens/s 
INFO:__main__:2024-11-06 06:33:10 | Epoch: 1 | Step: 223050 | Dataset: 0-1877261 | Loss: 0.823 | 913 ms/step , 6885.93 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:33:19 | Epoch: 1 | Step: 223060 | Dataset: 0-1877581 | Loss: 0.634 | 912 ms/step , 6893.43 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 06:33:28 | Epoch: 1 | Step: 223070 | Dataset: 0-1877901 | Loss: 0.687 | 917 ms/step , 6861.87 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 06:33:37 | Epoch: 1 | Step: 223080 | Dataset: 0-1878221 | Loss: 0.705 | 913 ms/step , 6891.97 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 06:33:46 | Epoch: 1 | Step: 223090 | Dataset: 0-1878541 | Loss: 0.544 | 912 ms/step , 6895.92 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 06:33:55 | Epoch: 1 | Step: 223100 | Dataset: 0-1878861 | Loss: 0.736 | 912 ms/step , 6894.48 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 06:33:57 | Validation | Step: 223100 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:34:06 | Epoch: 1 | Step: 223110 | Dataset: 0-1879181 | Loss: 0.615 | 913 ms/step , 6892.56 GFLOP/s , 15269.9 tokens/s INFO:__main__:2024-11-06 06:34:15 | Epoch: 1 | Step: 223120 | Dataset: 0-1879501 | Loss: 0.642 | 913 ms/step , 6889.04 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 06:34:24 | Epoch: 1 | Step: 223130 | Dataset: 0-1879821 | Loss: 0.700 | 912 ms/step , 6895.82 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 06:34:33 | Epoch: 1 | Step: 223140 | Dataset: 0-1880141 | Loss: 0.713 | 913 ms/step , 6889.56 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 06:34:43 | Epoch: 1 | Step: 223150 | Dataset: 0-1880461 | Loss: 0.739 | 913 ms/step , 6891.33 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 06:34:52 | Epoch: 1 | Step: 223160 | Dataset: 0-1880781 | Loss: 0.795 | 913 ms/step , 6890.11 GFLOP/s , 17941.5 tokens/s INFO:__main__:2024-11-06 06:35:01 | Epoch: 1 | Step: 223170 | Dataset: 0-1881101 | Loss: 0.624 | 913 ms/step , 6892.15 GFLOP/s , 17930.4 
tokens/s INFO:__main__:2024-11-06 06:35:10 | Epoch: 1 | Step: 223180 | Dataset: 0-1881421 | Loss: 0.747 | 912 ms/step , 6898.30 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 06:35:19 | Epoch: 1 | Step: 223190 | Dataset: 0-1881741 | Loss: 0.638 | 913 ms/step , 6888.98 GFLOP/s , 17929.9 tokens/s INFO:__main__:2024-11-06 06:35:28 | Epoch: 1 | Step: 223200 | Dataset: 0-1882061 | Loss: 0.756 | 915 ms/step , 6872.07 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 06:35:30 | Validation | Step: 223200 | Val_loss: 0.749 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:35:39 | Epoch: 1 | Step: 223210 | Dataset: 0-1882381 | Loss: 0.434 | 913 ms/step , 6890.39 GFLOP/s , 15287.7 tokens/s INFO:__main__:2024-11-06 06:35:48 | Epoch: 1 | Step: 223220 | Dataset: 0-1882701 | Loss: 0.623 | 914 ms/step , 6883.12 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-06 06:35:57 | Epoch: 1 | Step: 223230 | Dataset: 0-1883021 | Loss: 0.696 | 914 ms/step , 6881.92 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 06:36:06 | Epoch: 1 | Step: 223240 | Dataset: 0-1883341 | Loss: 0.790 | 914 ms/step , 6882.62 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-06 06:36:15 | Epoch: 1 | Step: 223250 | Dataset: 0-1883661 | Loss: 0.726 | 912 ms/step , 6893.37 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 06:36:25 | Epoch: 1 | Step: 223260 | Dataset: 0-1883981 | Loss: 0.683 | 913 ms/step , 6891.07 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 06:36:34 | Epoch: 1 | Step: 223270 | Dataset: 0-1884301 | Loss: 0.656 | 913 ms/step , 6891.39 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 06:36:43 | Epoch: 1 | Step: 223280 | Dataset: 0-1884621 | Loss: 0.817 | 913 ms/step , 6889.08 GFLOP/s , 17941.1 tokens/s INFO:__main__:2024-11-06 06:36:52 | Epoch: 1 | Step: 223290 | Dataset: 0-1884941 | Loss: 0.759 | 914 ms/step , 6878.16 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 06:37:01 | Epoch: 1 | Step: 223300 | Dataset: 0-1885261 | Loss: 0.826 | 914 ms/step , 6883.82 GFLOP/s , 
17938.8 tokens/s INFO:__main__:2024-11-06 06:37:03 | Validation | Step: 223300 | Val_loss: 0.733 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:37:12 | Epoch: 1 | Step: 223310 | Dataset: 0-1885581 | Loss: 0.793 | 913 ms/step , 6888.50 GFLOP/s , 15279.3 tokens/s INFO:__main__:2024-11-06 06:37:21 | Epoch: 1 | Step: 223320 | Dataset: 0-1885901 | Loss: 0.658 | 913 ms/step , 6891.01 GFLOP/s , 17937.8 tokens/s INFO:__main__:2024-11-06 06:37:30 | Epoch: 1 | Step: 223330 | Dataset: 0-1886221 | Loss: 0.609 | 914 ms/step , 6883.99 GFLOP/s , 17935.6 tokens/s INFO:__main__:2024-11-06 06:37:39 | Epoch: 1 | Step: 223340 | Dataset: 0-1886541 | Loss: 0.769 | 913 ms/step , 6888.43 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 06:37:48 | Epoch: 1 | Step: 223350 | Dataset: 0-1886861 | Loss: 0.728 | 913 ms/step , 6890.69 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 06:37:58 | Epoch: 1 | Step: 223360 | Dataset: 0-1887181 | Loss: 0.807 | 913 ms/step , 6888.69 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 06:38:07 | Epoch: 1 | Step: 223370 | Dataset: 0-1887501 | Loss: 0.461 | 913 ms/step , 6891.94 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 06:38:16 | Epoch: 1 | Step: 223380 | Dataset: 0-1887821 | Loss: 0.677 | 912 ms/step , 6897.35 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 06:38:25 | Epoch: 1 | Step: 223390 | Dataset: 0-1888141 | Loss: 0.687 | 913 ms/step , 6889.38 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 06:38:34 | Epoch: 1 | Step: 223400 | Dataset: 0-1888461 | Loss: 0.732 | 913 ms/step , 6887.46 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 06:38:36 | Validation | Step: 223400 | Val_loss: 0.711 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:38:45 | Epoch: 1 | Step: 223410 | Dataset: 0-1888781 | Loss: 0.719 | 913 ms/step , 6891.18 GFLOP/s , 15274.6 tokens/s INFO:__main__:2024-11-06 06:38:54 | Epoch: 1 | Step: 223420 | Dataset: 0-1889101 | Loss: 0.678 | 913 ms/step , 6892.54 GFLOP/s , 17939.9 tokens/s 
INFO:__main__:2024-11-06 06:39:03 | Epoch: 1 | Step: 223430 | Dataset: 0-1889421 | Loss: 0.622 | 912 ms/step , 6894.08 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 06:39:12 | Epoch: 1 | Step: 223440 | Dataset: 0-1889741 | Loss: 0.776 | 915 ms/step , 6875.34 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 06:39:21 | Epoch: 1 | Step: 223450 | Dataset: 0-1890061 | Loss: 0.798 | 913 ms/step , 6891.16 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 06:39:30 | Epoch: 1 | Step: 223460 | Dataset: 0-1890381 | Loss: 0.800 | 914 ms/step , 6884.14 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 06:39:40 | Epoch: 1 | Step: 223470 | Dataset: 0-1890701 | Loss: 0.778 | 914 ms/step , 6883.82 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 06:39:49 | Epoch: 1 | Step: 223480 | Dataset: 0-1891021 | Loss: 0.629 | 913 ms/step , 6891.63 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 06:39:58 | Epoch: 1 | Step: 223490 | Dataset: 0-1891341 | Loss: 0.808 | 913 ms/step , 6889.94 GFLOP/s , 17939.6 tokens/s INFO:__main__:2024-11-06 06:40:07 | Epoch: 1 | Step: 223500 | Dataset: 0-1891661 | Loss: 0.843 | 913 ms/step , 6885.28 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 06:40:09 | Validation | Step: 223500 | Val_loss: 0.732 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:40:18 | Epoch: 1 | Step: 223510 | Dataset: 0-1891981 | Loss: 0.601 | 912 ms/step , 6897.26 GFLOP/s , 15278.0 tokens/s INFO:__main__:2024-11-06 06:40:27 | Epoch: 1 | Step: 223520 | Dataset: 0-1892301 | Loss: 0.790 | 913 ms/step , 6890.54 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 06:40:36 | Epoch: 1 | Step: 223530 | Dataset: 0-1892621 | Loss: 0.710 | 912 ms/step , 6892.71 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 06:40:45 | Epoch: 1 | Step: 223540 | Dataset: 0-1892941 | Loss: 0.661 | 912 ms/step , 6892.69 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 06:40:54 | Epoch: 1 | Step: 223550 | Dataset: 0-1893261 | Loss: 0.764 | 912 ms/step , 6892.86 GFLOP/s , 17934.0 
tokens/s INFO:__main__:2024-11-06 06:41:03 | Epoch: 1 | Step: 223560 | Dataset: 0-1893581 | Loss: 0.658 | 913 ms/step , 6891.19 GFLOP/s , 17939.4 tokens/s INFO:__main__:2024-11-06 06:41:13 | Epoch: 1 | Step: 223570 | Dataset: 0-1893901 | Loss: 0.837 | 914 ms/step , 6884.29 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 06:41:22 | Epoch: 1 | Step: 223580 | Dataset: 0-1894221 | Loss: 0.805 | 913 ms/step , 6890.65 GFLOP/s , 17939.3 tokens/s INFO:__main__:2024-11-06 06:41:31 | Epoch: 1 | Step: 223590 | Dataset: 0-1894541 | Loss: 0.823 | 913 ms/step , 6887.47 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 06:41:40 | Epoch: 1 | Step: 223600 | Dataset: 0-1894861 | Loss: 0.702 | 913 ms/step , 6890.86 GFLOP/s , 17933.0 tokens/s INFO:__main__:2024-11-06 06:41:42 | Validation | Step: 223600 | Val_loss: 0.790 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:41:51 | Epoch: 1 | Step: 223610 | Dataset: 0-1895181 | Loss: 0.564 | 911 ms/step , 6900.40 GFLOP/s , 15274.0 tokens/s INFO:__main__:2024-11-06 06:42:00 | Epoch: 1 | Step: 223620 | Dataset: 0-1895501 | Loss: 0.627 | 913 ms/step , 6891.47 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 06:42:09 | Epoch: 1 | Step: 223630 | Dataset: 0-1895821 | Loss: 0.693 | 914 ms/step , 6880.84 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 06:42:18 | Epoch: 1 | Step: 223640 | Dataset: 0-1896141 | Loss: 0.652 | 913 ms/step , 6887.09 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 06:42:27 | Epoch: 1 | Step: 223650 | Dataset: 0-1896461 | Loss: 0.714 | 913 ms/step , 6892.58 GFLOP/s , 17940.8 tokens/s INFO:__main__:2024-11-06 06:42:36 | Epoch: 1 | Step: 223660 | Dataset: 0-1896781 | Loss: 0.665 | 913 ms/step , 6887.22 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 06:42:45 | Epoch: 1 | Step: 223670 | Dataset: 0-1897101 | Loss: 0.715 | 915 ms/step , 6870.97 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 06:42:55 | Epoch: 1 | Step: 223680 | Dataset: 0-1897421 | Loss: 0.565 | 913 ms/step , 6890.42 GFLOP/s , 
17935.5 tokens/s INFO:__main__:2024-11-06 06:43:04 | Epoch: 1 | Step: 223690 | Dataset: 0-1897741 | Loss: 0.636 | 912 ms/step , 6897.74 GFLOP/s , 17933.1 tokens/s INFO:__main__:2024-11-06 06:43:13 | Epoch: 1 | Step: 223700 | Dataset: 0-1898061 | Loss: 0.709 | 914 ms/step , 6882.58 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 06:43:14 | Validation | Step: 223700 | Val_loss: 0.780 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:43:24 | Epoch: 1 | Step: 223710 | Dataset: 0-1898381 | Loss: 0.698 | 913 ms/step , 6885.20 GFLOP/s , 15271.2 tokens/s INFO:__main__:2024-11-06 06:43:33 | Epoch: 1 | Step: 223720 | Dataset: 0-1898701 | Loss: 0.641 | 914 ms/step , 6884.40 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 06:43:42 | Epoch: 1 | Step: 223730 | Dataset: 0-1899021 | Loss: 0.685 | 913 ms/step , 6885.99 GFLOP/s , 17935.3 tokens/s INFO:__main__:2024-11-06 06:43:51 | Epoch: 1 | Step: 223740 | Dataset: 0-1899341 | Loss: 0.671 | 913 ms/step , 6890.54 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 06:44:00 | Epoch: 1 | Step: 223750 | Dataset: 0-1899661 | Loss: 0.652 | 913 ms/step , 6886.07 GFLOP/s , 17931.6 tokens/s INFO:__main__:2024-11-06 06:44:09 | Epoch: 1 | Step: 223760 | Dataset: 0-1899981 | Loss: 0.611 | 914 ms/step , 6883.08 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 06:44:18 | Epoch: 1 | Step: 223770 | Dataset: 0-1900301 | Loss: 0.658 | 914 ms/step , 6882.63 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 06:44:28 | Epoch: 1 | Step: 223780 | Dataset: 0-1900621 | Loss: 0.683 | 914 ms/step , 6878.34 GFLOP/s , 17925.0 tokens/s INFO:__main__:2024-11-06 06:44:37 | Epoch: 1 | Step: 223790 | Dataset: 0-1900941 | Loss: 0.702 | 914 ms/step , 6883.58 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 06:44:46 | Epoch: 1 | Step: 223800 | Dataset: 0-1901261 | Loss: 0.759 | 914 ms/step , 6882.63 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 06:44:47 | Validation | Step: 223800 | Val_loss: 0.653 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 06:44:57 | Epoch: 1 | Step: 223810 | Dataset: 0-1901581 | Loss: 0.809 | 914 ms/step , 6881.12 GFLOP/s , 15271.3 tokens/s INFO:__main__:2024-11-06 06:45:06 | Epoch: 1 | Step: 223820 | Dataset: 0-1901901 | Loss: 0.630 | 913 ms/step , 6888.04 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 06:45:15 | Epoch: 1 | Step: 223830 | Dataset: 0-1902221 | Loss: 0.742 | 913 ms/step , 6886.12 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 06:45:24 | Epoch: 1 | Step: 223840 | Dataset: 0-1902541 | Loss: 0.665 | 913 ms/step , 6889.22 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 06:45:33 | Epoch: 1 | Step: 223850 | Dataset: 0-1902861 | Loss: 0.627 | 913 ms/step , 6888.34 GFLOP/s , 17923.9 tokens/s INFO:__main__:2024-11-06 06:45:42 | Epoch: 1 | Step: 223860 | Dataset: 0-1903181 | Loss: 0.620 | 913 ms/step , 6889.90 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 06:45:51 | Epoch: 1 | Step: 223870 | Dataset: 0-1903501 | Loss: 0.708 | 913 ms/step , 6887.52 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 06:46:01 | Epoch: 1 | Step: 223880 | Dataset: 0-1903821 | Loss: 0.715 | 914 ms/step , 6884.97 GFLOP/s , 17924.5 tokens/s INFO:__main__:2024-11-06 06:46:10 | Epoch: 1 | Step: 223890 | Dataset: 0-1904141 | Loss: 0.755 | 916 ms/step , 6868.48 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 06:46:19 | Epoch: 1 | Step: 223900 | Dataset: 0-1904461 | Loss: 0.649 | 912 ms/step , 6896.98 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 06:46:20 | Validation | Step: 223900 | Val_loss: 0.699 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:46:30 | Epoch: 1 | Step: 223910 | Dataset: 0-1904781 | Loss: 0.685 | 913 ms/step , 6891.68 GFLOP/s , 15275.6 tokens/s INFO:__main__:2024-11-06 06:46:39 | Epoch: 1 | Step: 223920 | Dataset: 0-1905101 | Loss: 0.697 | 913 ms/step , 6888.60 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 06:46:48 | Epoch: 1 | Step: 223930 | Dataset: 0-1905421 | Loss: 0.687 | 914 ms/step , 6883.59 GFLOP/s , 17931.4 
tokens/s INFO:__main__:2024-11-06 06:46:57 | Epoch: 1 | Step: 223940 | Dataset: 0-1905741 | Loss: 0.682 | 913 ms/step , 6886.57 GFLOP/s , 17934.1 tokens/s INFO:__main__:2024-11-06 06:47:06 | Epoch: 1 | Step: 223950 | Dataset: 0-1906061 | Loss: 0.663 | 912 ms/step , 6892.95 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 06:47:15 | Epoch: 1 | Step: 223960 | Dataset: 0-1906381 | Loss: 0.632 | 913 ms/step , 6888.20 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 06:47:24 | Epoch: 1 | Step: 223970 | Dataset: 0-1906701 | Loss: 0.728 | 913 ms/step , 6891.40 GFLOP/s , 17927.2 tokens/s INFO:__main__:2024-11-06 06:47:34 | Epoch: 1 | Step: 223980 | Dataset: 0-1907021 | Loss: 0.789 | 915 ms/step , 6876.03 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-06 06:47:43 | Epoch: 1 | Step: 223990 | Dataset: 0-1907341 | Loss: 0.656 | 913 ms/step , 6888.58 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 06:47:52 | Epoch: 1 | Step: 224000 | Dataset: 0-1907661 | Loss: 0.586 | 914 ms/step , 6883.98 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 06:47:53 | Validation | Step: 224000 | Val_loss: 0.680 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:47:53 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_064753_step_224000.pt` INFO:__main__:2024-11-06 06:48:04 | Epoch: 1 | Step: 224010 | Dataset: 0-1907981 | Loss: 0.625 | 914 ms/step , 6881.61 GFLOP/s , 13805.6 tokens/s INFO:__main__:2024-11-06 06:48:13 | Epoch: 1 | Step: 224020 | Dataset: 0-1908301 | Loss: 0.737 | 914 ms/step , 6881.30 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 06:48:22 | Epoch: 1 | Step: 224030 | Dataset: 0-1908621 | Loss: 0.738 | 913 ms/step , 6889.67 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 06:48:31 | Epoch: 1 | Step: 224040 | Dataset: 0-1908941 | Loss: 0.671 | 916 ms/step , 6869.17 GFLOP/s , 17896.2 tokens/s INFO:__main__:2024-11-06 06:48:40 | Epoch: 1 | Step: 224050 | Dataset: 0-1909261 | Loss: 0.634 | 915 ms/step , 6871.75 GFLOP/s , 17901.8 
tokens/s INFO:__main__:2024-11-06 06:48:49 | Epoch: 1 | Step: 224060 | Dataset: 0-1909581 | Loss: 0.702 | 914 ms/step , 6879.84 GFLOP/s , 17911.1 tokens/s INFO:__main__:2024-11-06 06:48:59 | Epoch: 1 | Step: 224070 | Dataset: 0-1909901 | Loss: 0.659 | 914 ms/step , 6884.76 GFLOP/s , 17920.6 tokens/s INFO:__main__:2024-11-06 06:49:08 | Epoch: 1 | Step: 224080 | Dataset: 0-1910221 | Loss: 0.720 | 914 ms/step , 6883.06 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 06:49:17 | Epoch: 1 | Step: 224090 | Dataset: 0-1910541 | Loss: 0.731 | 915 ms/step , 6873.23 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-06 06:49:26 | Epoch: 1 | Step: 224100 | Dataset: 0-1910861 | Loss: 0.645 | 913 ms/step , 6889.46 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 06:49:28 | Validation | Step: 224100 | Val_loss: 0.768 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:49:37 | Epoch: 1 | Step: 224110 | Dataset: 0-1911181 | Loss: 0.603 | 914 ms/step , 6878.35 GFLOP/s , 15280.9 tokens/s INFO:__main__:2024-11-06 06:49:46 | Epoch: 1 | Step: 224120 | Dataset: 0-1911501 | Loss: 0.696 | 913 ms/step , 6891.60 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 06:49:55 | Epoch: 1 | Step: 224130 | Dataset: 0-1911821 | Loss: 0.566 | 913 ms/step , 6890.91 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 06:50:04 | Epoch: 1 | Step: 224140 | Dataset: 0-1912141 | Loss: 0.782 | 913 ms/step , 6891.14 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 06:50:13 | Epoch: 1 | Step: 224150 | Dataset: 0-1912461 | Loss: 0.626 | 912 ms/step , 6894.39 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 06:50:22 | Epoch: 1 | Step: 224160 | Dataset: 0-1912781 | Loss: 0.674 | 913 ms/step , 6885.35 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 06:50:32 | Epoch: 1 | Step: 224170 | Dataset: 0-1913101 | Loss: 0.727 | 912 ms/step , 6894.09 GFLOP/s , 17925.6 tokens/s INFO:__main__:2024-11-06 06:50:41 | Epoch: 1 | Step: 224180 | Dataset: 0-1913421 | Loss: 0.665 | 914 ms/step , 6883.26 GFLOP/s , 
17924.7 tokens/s INFO:__main__:2024-11-06 06:50:50 | Epoch: 1 | Step: 224190 | Dataset: 0-1913741 | Loss: 0.652 | 912 ms/step , 6894.57 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 06:50:59 | Epoch: 1 | Step: 224200 | Dataset: 0-1914061 | Loss: 0.701 | 913 ms/step , 6886.46 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 06:51:01 | Validation | Step: 224200 | Val_loss: 0.697 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:51:10 | Epoch: 1 | Step: 224210 | Dataset: 0-1914381 | Loss: 0.706 | 914 ms/step , 6879.33 GFLOP/s , 15277.5 tokens/s INFO:__main__:2024-11-06 06:51:19 | Epoch: 1 | Step: 224220 | Dataset: 0-1914701 | Loss: 0.654 | 915 ms/step , 6871.73 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-06 06:51:28 | Epoch: 1 | Step: 224230 | Dataset: 0-1915021 | Loss: 0.622 | 913 ms/step , 6892.34 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 06:51:37 | Epoch: 1 | Step: 224240 | Dataset: 0-1915341 | Loss: 0.714 | 913 ms/step , 6885.29 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-06 06:51:46 | Epoch: 1 | Step: 224250 | Dataset: 0-1915661 | Loss: 0.721 | 914 ms/step , 6877.94 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 06:51:55 | Epoch: 1 | Step: 224260 | Dataset: 0-1915981 | Loss: 0.661 | 914 ms/step , 6884.22 GFLOP/s , 17935.8 tokens/s INFO:__main__:2024-11-06 06:52:05 | Epoch: 1 | Step: 224270 | Dataset: 0-1916301 | Loss: 0.717 | 914 ms/step , 6877.79 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 06:52:14 | Epoch: 1 | Step: 224280 | Dataset: 0-1916621 | Loss: 0.675 | 915 ms/step , 6876.56 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 06:52:23 | Epoch: 1 | Step: 224290 | Dataset: 0-1916941 | Loss: 0.739 | 913 ms/step , 6885.30 GFLOP/s , 17925.7 tokens/s INFO:__main__:2024-11-06 06:52:32 | Epoch: 1 | Step: 224300 | Dataset: 0-1917261 | Loss: 0.656 | 913 ms/step , 6890.61 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 06:52:34 | Validation | Step: 224300 | Val_loss: 0.727 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 06:52:43 | Epoch: 1 | Step: 224310 | Dataset: 0-1917581 | Loss: 0.791 | 913 ms/step , 6885.15 GFLOP/s , 15284.0 tokens/s INFO:__main__:2024-11-06 06:52:52 | Epoch: 1 | Step: 224320 | Dataset: 0-1917901 | Loss: 0.722 | 913 ms/step , 6887.23 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 06:53:01 | Epoch: 1 | Step: 224330 | Dataset: 0-1918221 | Loss: 0.623 | 913 ms/step , 6889.52 GFLOP/s , 17897.9 tokens/s INFO:__main__:2024-11-06 06:53:10 | Epoch: 1 | Step: 224340 | Dataset: 0-1918541 | Loss: 0.735 | 920 ms/step , 6840.12 GFLOP/s , 17907.8 tokens/s INFO:__main__:2024-11-06 06:53:19 | Epoch: 1 | Step: 224350 | Dataset: 0-1918861 | Loss: 0.626 | 912 ms/step , 6899.92 GFLOP/s , 17857.2 tokens/s INFO:__main__:2024-11-06 06:53:28 | Epoch: 1 | Step: 224360 | Dataset: 0-1919181 | Loss: 0.719 | 914 ms/step , 6882.52 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 06:53:38 | Epoch: 1 | Step: 224370 | Dataset: 0-1919501 | Loss: 0.637 | 914 ms/step , 6879.88 GFLOP/s , 17829.4 tokens/s INFO:__main__:2024-11-06 06:53:47 | Epoch: 1 | Step: 224380 | Dataset: 0-1919821 | Loss: 0.688 | 913 ms/step , 6885.88 GFLOP/s , 17933.7 tokens/s INFO:__main__:2024-11-06 06:53:56 | Epoch: 1 | Step: 224390 | Dataset: 0-1920141 | Loss: 0.733 | 914 ms/step , 6880.14 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 06:54:05 | Epoch: 1 | Step: 224400 | Dataset: 0-1920461 | Loss: 0.704 | 913 ms/step , 6886.82 GFLOP/s , 17931.7 tokens/s INFO:__main__:2024-11-06 06:54:07 | Validation | Step: 224400 | Val_loss: 0.754 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:54:16 | Epoch: 1 | Step: 224410 | Dataset: 0-1920781 | Loss: 0.638 | 917 ms/step , 6861.52 GFLOP/s , 15267.8 tokens/s INFO:__main__:2024-11-06 06:54:25 | Epoch: 1 | Step: 224420 | Dataset: 0-1921101 | Loss: 0.618 | 915 ms/step , 6875.73 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 06:54:34 | Epoch: 1 | Step: 224430 | Dataset: 0-1921421 | Loss: 0.629 | 912 ms/step , 6897.68 GFLOP/s , 17931.4 
tokens/s INFO:__main__:2024-11-06 06:54:43 | Epoch: 1 | Step: 224440 | Dataset: 0-1921741 | Loss: 0.743 | 914 ms/step , 6882.01 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 06:54:52 | Epoch: 1 | Step: 224450 | Dataset: 0-1922061 | Loss: 0.657 | 913 ms/step , 6886.41 GFLOP/s , 17936.8 tokens/s INFO:__main__:2024-11-06 06:55:01 | Epoch: 1 | Step: 224460 | Dataset: 0-1922381 | Loss: 0.657 | 913 ms/step , 6887.42 GFLOP/s , 17931.9 tokens/s INFO:__main__:2024-11-06 06:55:11 | Epoch: 1 | Step: 224470 | Dataset: 0-1922701 | Loss: 0.744 | 913 ms/step , 6891.15 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 06:55:20 | Epoch: 1 | Step: 224480 | Dataset: 0-1923021 | Loss: 0.644 | 913 ms/step , 6888.36 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 06:55:29 | Epoch: 1 | Step: 224490 | Dataset: 0-1923341 | Loss: 0.711 | 916 ms/step , 6869.12 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 06:55:38 | Epoch: 1 | Step: 224500 | Dataset: 0-1923661 | Loss: 0.785 | 912 ms/step , 6896.40 GFLOP/s , 17939.2 tokens/s INFO:__main__:2024-11-06 06:55:40 | Validation | Step: 224500 | Val_loss: 0.688 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:55:49 | Epoch: 1 | Step: 224510 | Dataset: 0-1923981 | Loss: 0.678 | 913 ms/step , 6888.32 GFLOP/s , 15270.3 tokens/s INFO:__main__:2024-11-06 06:55:58 | Epoch: 1 | Step: 224520 | Dataset: 0-1924301 | Loss: 0.701 | 913 ms/step , 6888.34 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-06 06:56:07 | Epoch: 1 | Step: 224530 | Dataset: 0-1924621 | Loss: 0.577 | 912 ms/step , 6893.43 GFLOP/s , 17929.2 tokens/s INFO:__main__:2024-11-06 06:56:16 | Epoch: 1 | Step: 224540 | Dataset: 0-1924941 | Loss: 0.638 | 913 ms/step , 6886.77 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 06:56:25 | Epoch: 1 | Step: 224550 | Dataset: 0-1925261 | Loss: 0.606 | 912 ms/step , 6892.95 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 06:56:34 | Epoch: 1 | Step: 224560 | Dataset: 0-1925581 | Loss: 0.730 | 913 ms/step , 6888.77 GFLOP/s , 
17936.2 tokens/s INFO:__main__:2024-11-06 06:56:44 | Epoch: 1 | Step: 224570 | Dataset: 0-1925901 | Loss: 0.740 | 912 ms/step , 6893.32 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 06:56:53 | Epoch: 1 | Step: 224580 | Dataset: 0-1926221 | Loss: 0.715 | 914 ms/step , 6878.62 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 06:57:02 | Epoch: 1 | Step: 224590 | Dataset: 0-1926541 | Loss: 0.668 | 913 ms/step , 6890.83 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 06:57:11 | Epoch: 1 | Step: 224600 | Dataset: 0-1926861 | Loss: 0.640 | 913 ms/step , 6887.34 GFLOP/s , 17925.1 tokens/s INFO:__main__:2024-11-06 06:57:13 | Validation | Step: 224600 | Val_loss: 0.699 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:57:22 | Epoch: 1 | Step: 224610 | Dataset: 0-1927181 | Loss: 0.735 | 914 ms/step , 6882.60 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-06 06:57:31 | Epoch: 1 | Step: 224620 | Dataset: 0-1927501 | Loss: 0.706 | 914 ms/step , 6881.71 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 06:57:40 | Epoch: 1 | Step: 224630 | Dataset: 0-1927821 | Loss: 0.647 | 914 ms/step , 6878.87 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-06 06:57:49 | Epoch: 1 | Step: 224640 | Dataset: 0-1928141 | Loss: 0.648 | 914 ms/step , 6884.53 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 06:57:58 | Epoch: 1 | Step: 224650 | Dataset: 0-1928461 | Loss: 0.717 | 915 ms/step , 6876.38 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-06 06:58:07 | Epoch: 1 | Step: 224660 | Dataset: 0-1928781 | Loss: 0.613 | 915 ms/step , 6877.21 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-06 06:58:17 | Epoch: 1 | Step: 224670 | Dataset: 0-1929101 | Loss: 0.704 | 913 ms/step , 6887.44 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 06:58:26 | Epoch: 1 | Step: 224680 | Dataset: 0-1929421 | Loss: 0.630 | 914 ms/step , 6880.79 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 06:58:35 | Epoch: 1 | Step: 224690 | Dataset: 0-1929741 | Loss: 0.744 | 914 ms/step , 6880.94 GFLOP/s 
, 17923.7 tokens/s INFO:__main__:2024-11-06 06:58:44 | Epoch: 1 | Step: 224700 | Dataset: 0-1930061 | Loss: 0.647 | 912 ms/step , 6892.92 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 06:58:46 | Validation | Step: 224700 | Val_loss: 0.718 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 06:58:55 | Epoch: 1 | Step: 224710 | Dataset: 0-1930381 | Loss: 0.779 | 913 ms/step , 6887.92 GFLOP/s , 15267.4 tokens/s INFO:__main__:2024-11-06 06:59:04 | Epoch: 1 | Step: 224720 | Dataset: 0-1930701 | Loss: 0.671 | 915 ms/step , 6876.05 GFLOP/s , 17924.8 tokens/s INFO:__main__:2024-11-06 06:59:13 | Epoch: 1 | Step: 224730 | Dataset: 0-1931021 | Loss: 0.685 | 912 ms/step , 6894.16 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 06:59:22 | Epoch: 1 | Step: 224740 | Dataset: 0-1931341 | Loss: 0.690 | 913 ms/step , 6886.21 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 06:59:31 | Epoch: 1 | Step: 224750 | Dataset: 0-1931661 | Loss: 0.686 | 914 ms/step , 6883.53 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 06:59:40 | Epoch: 1 | Step: 224760 | Dataset: 0-1931981 | Loss: 0.677 | 914 ms/step , 6881.91 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 06:59:50 | Epoch: 1 | Step: 224770 | Dataset: 0-1932301 | Loss: 0.694 | 914 ms/step , 6883.20 GFLOP/s , 17933.2 tokens/s INFO:__main__:2024-11-06 06:59:59 | Epoch: 1 | Step: 224780 | Dataset: 0-1932621 | Loss: 0.683 | 914 ms/step , 6884.55 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-06 07:00:08 | Epoch: 1 | Step: 224790 | Dataset: 0-1932941 | Loss: 0.584 | 913 ms/step , 6889.58 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 07:00:17 | Epoch: 1 | Step: 224800 | Dataset: 0-1933261 | Loss: 0.632 | 913 ms/step , 6892.12 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 07:00:18 | Validation | Step: 224800 | Val_loss: 0.665 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:00:28 | Epoch: 1 | Step: 224810 | Dataset: 0-1933581 | Loss: 0.679 | 914 ms/step , 6883.45 GFLOP/s , 15277.3 tokens/s 
INFO:__main__:2024-11-06 07:00:37 | Epoch: 1 | Step: 224820 | Dataset: 0-1933901 | Loss: 0.706 | 914 ms/step , 6877.54 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 07:00:46 | Epoch: 1 | Step: 224830 | Dataset: 0-1934221 | Loss: 0.768 | 913 ms/step , 6886.08 GFLOP/s , 17926.6 tokens/s INFO:__main__:2024-11-06 07:00:55 | Epoch: 1 | Step: 224840 | Dataset: 0-1934541 | Loss: 0.692 | 912 ms/step , 6894.75 GFLOP/s , 17930.2 tokens/s INFO:__main__:2024-11-06 07:01:04 | Epoch: 1 | Step: 224850 | Dataset: 0-1934861 | Loss: 0.656 | 913 ms/step , 6889.28 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 07:01:13 | Epoch: 1 | Step: 224860 | Dataset: 0-1935181 | Loss: 0.730 | 913 ms/step , 6886.44 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 07:01:22 | Epoch: 1 | Step: 224870 | Dataset: 0-1935501 | Loss: 0.704 | 915 ms/step , 6876.42 GFLOP/s , 17923.6 tokens/s INFO:__main__:2024-11-06 07:01:32 | Epoch: 1 | Step: 224880 | Dataset: 0-1935821 | Loss: 0.598 | 914 ms/step , 6882.61 GFLOP/s , 17935.5 tokens/s INFO:__main__:2024-11-06 07:01:41 | Epoch: 1 | Step: 224890 | Dataset: 0-1936141 | Loss: 0.635 | 913 ms/step , 6890.71 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 07:01:50 | Epoch: 1 | Step: 224900 | Dataset: 0-1936461 | Loss: 0.752 | 914 ms/step , 6884.74 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 07:01:51 | Validation | Step: 224900 | Val_loss: 0.679 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:02:01 | Epoch: 1 | Step: 224910 | Dataset: 0-1936781 | Loss: 0.615 | 913 ms/step , 6888.61 GFLOP/s , 15273.3 tokens/s INFO:__main__:2024-11-06 07:02:10 | Epoch: 1 | Step: 224920 | Dataset: 0-1937101 | Loss: 0.715 | 913 ms/step , 6888.59 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 07:02:19 | Epoch: 1 | Step: 224930 | Dataset: 0-1937421 | Loss: 0.735 | 915 ms/step , 6875.94 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-06 07:02:28 | Epoch: 1 | Step: 224940 | Dataset: 0-1937741 | Loss: 0.693 | 913 ms/step , 6892.59 GFLOP/s , 17925.3 
tokens/s INFO:__main__:2024-11-06 07:02:37 | Epoch: 1 | Step: 224950 | Dataset: 0-1938061 | Loss: 0.712 | 913 ms/step , 6891.43 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 07:02:46 | Epoch: 1 | Step: 224960 | Dataset: 0-1938381 | Loss: 0.751 | 914 ms/step , 6883.04 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-06 07:02:55 | Epoch: 1 | Step: 224970 | Dataset: 0-1938701 | Loss: 0.728 | 913 ms/step , 6885.40 GFLOP/s , 17912.3 tokens/s INFO:__main__:2024-11-06 07:03:05 | Epoch: 1 | Step: 224980 | Dataset: 0-1939021 | Loss: 0.699 | 915 ms/step , 6875.58 GFLOP/s , 17913.8 tokens/s INFO:__main__:2024-11-06 07:03:14 | Epoch: 1 | Step: 224990 | Dataset: 0-1939341 | Loss: 0.726 | 912 ms/step , 6893.11 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-06 07:03:23 | Epoch: 1 | Step: 225000 | Dataset: 0-1939661 | Loss: 0.721 | 915 ms/step , 6876.21 GFLOP/s , 17921.4 tokens/s INFO:__main__:2024-11-06 07:03:25 | Validation | Step: 225000 | Val_loss: 0.698 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:03:25 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_070325_step_225000.pt` INFO:__main__:2024-11-06 07:03:35 | Epoch: 1 | Step: 225010 | Dataset: 0-1939981 | Loss: 0.678 | 927 ms/step , 6781.79 GFLOP/s , 13794.4 tokens/s INFO:__main__:2024-11-06 07:03:44 | Epoch: 1 | Step: 225020 | Dataset: 0-1940301 | Loss: 0.734 | 914 ms/step , 6878.56 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-06 07:03:53 | Epoch: 1 | Step: 225030 | Dataset: 0-1940621 | Loss: 0.760 | 915 ms/step , 6875.87 GFLOP/s , 17915.6 tokens/s INFO:__main__:2024-11-06 07:04:02 | Epoch: 1 | Step: 225040 | Dataset: 0-1940941 | Loss: 0.784 | 914 ms/step , 6883.29 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-06 07:04:11 | Epoch: 1 | Step: 225050 | Dataset: 0-1941261 | Loss: 0.734 | 914 ms/step , 6877.53 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 07:04:21 | Epoch: 1 | Step: 225060 | Dataset: 0-1941581 | Loss: 0.746 | 915 ms/step , 6876.46 GFLOP/s , 17910.2 
tokens/s INFO:__main__:2024-11-06 07:04:30 | Epoch: 1 | Step: 225070 | Dataset: 0-1941901 | Loss: 0.760 | 914 ms/step , 6882.88 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-06 07:04:39 | Epoch: 1 | Step: 225080 | Dataset: 0-1942221 | Loss: 0.730 | 914 ms/step , 6877.67 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-06 07:04:48 | Epoch: 1 | Step: 225090 | Dataset: 0-1942541 | Loss: 0.723 | 915 ms/step , 6874.81 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-06 07:04:57 | Epoch: 1 | Step: 225100 | Dataset: 0-1942861 | Loss: 0.698 | 914 ms/step , 6883.24 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 07:04:59 | Validation | Step: 225100 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:05:08 | Epoch: 1 | Step: 225110 | Dataset: 0-1943181 | Loss: 0.689 | 913 ms/step , 6889.33 GFLOP/s , 15269.1 tokens/s INFO:__main__:2024-11-06 07:05:17 | Epoch: 1 | Step: 225120 | Dataset: 0-1943501 | Loss: 0.731 | 913 ms/step , 6886.27 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-06 07:05:26 | Epoch: 1 | Step: 225130 | Dataset: 0-1943821 | Loss: 0.711 | 913 ms/step , 6889.28 GFLOP/s , 17913.1 tokens/s INFO:__main__:2024-11-06 07:05:35 | Epoch: 1 | Step: 225140 | Dataset: 0-1944141 | Loss: 0.784 | 915 ms/step , 6875.70 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 07:05:44 | Epoch: 1 | Step: 225150 | Dataset: 0-1944461 | Loss: 0.713 | 914 ms/step , 6882.52 GFLOP/s , 17920.1 tokens/s INFO:__main__:2024-11-06 07:05:54 | Epoch: 1 | Step: 225160 | Dataset: 0-1944781 | Loss: 0.761 | 912 ms/step , 6893.67 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 07:06:03 | Epoch: 1 | Step: 225170 | Dataset: 0-1945101 | Loss: 0.728 | 912 ms/step , 6896.42 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-06 07:06:12 | Epoch: 1 | Step: 225180 | Dataset: 0-1945421 | Loss: 0.789 | 914 ms/step , 6881.70 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 07:06:21 | Epoch: 1 | Step: 225190 | Dataset: 0-1945741 | Loss: 0.661 | 912 ms/step , 6893.35 GFLOP/s , 
17921.0 tokens/s INFO:__main__:2024-11-06 07:06:30 | Epoch: 1 | Step: 225200 | Dataset: 0-1946061 | Loss: 0.780 | 915 ms/step , 6871.43 GFLOP/s , 17921.8 tokens/s INFO:__main__:2024-11-06 07:06:32 | Validation | Step: 225200 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:06:41 | Epoch: 1 | Step: 225210 | Dataset: 0-1946381 | Loss: 0.789 | 915 ms/step , 6874.98 GFLOP/s , 15261.6 tokens/s INFO:__main__:2024-11-06 07:06:50 | Epoch: 1 | Step: 225220 | Dataset: 0-1946701 | Loss: 0.643 | 913 ms/step , 6888.23 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-06 07:06:59 | Epoch: 1 | Step: 225230 | Dataset: 0-1947021 | Loss: 0.773 | 914 ms/step , 6877.60 GFLOP/s , 17917.6 tokens/s INFO:__main__:2024-11-06 07:07:08 | Epoch: 1 | Step: 225240 | Dataset: 0-1947341 | Loss: 0.636 | 914 ms/step , 6880.71 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 07:07:17 | Epoch: 1 | Step: 225250 | Dataset: 0-1947661 | Loss: 0.759 | 914 ms/step , 6879.87 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 07:07:27 | Epoch: 1 | Step: 225260 | Dataset: 0-1947981 | Loss: 0.658 | 914 ms/step , 6882.37 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-06 07:07:36 | Epoch: 1 | Step: 225270 | Dataset: 0-1948301 | Loss: 0.666 | 914 ms/step , 6884.45 GFLOP/s , 17920.2 tokens/s INFO:__main__:2024-11-06 07:07:45 | Epoch: 1 | Step: 225280 | Dataset: 0-1948621 | Loss: 0.667 | 914 ms/step , 6884.41 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 07:07:54 | Epoch: 1 | Step: 225290 | Dataset: 0-1948941 | Loss: 0.712 | 922 ms/step , 6818.29 GFLOP/s , 17834.1 tokens/s INFO:__main__:2024-11-06 07:08:03 | Epoch: 1 | Step: 225300 | Dataset: 0-1949261 | Loss: 0.715 | 914 ms/step , 6882.82 GFLOP/s , 17839.5 tokens/s INFO:__main__:2024-11-06 07:08:05 | Validation | Step: 225300 | Val_loss: 0.662 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:08:14 | Epoch: 1 | Step: 225310 | Dataset: 0-1949581 | Loss: 0.713 | 913 ms/step , 6889.23 GFLOP/s , 15258.8 tokens/s 
INFO:__main__:2024-11-06 07:08:23 | Epoch: 1 | Step: 225320 | Dataset: 0-1949901 | Loss: 0.735 | 916 ms/step , 6867.33 GFLOP/s , 17919.0 tokens/s INFO:__main__:2024-11-06 07:08:32 | Epoch: 1 | Step: 225330 | Dataset: 0-1950221 | Loss: 0.721 | 914 ms/step , 6884.19 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 07:08:41 | Epoch: 1 | Step: 225340 | Dataset: 0-1950541 | Loss: 0.734 | 913 ms/step , 6885.26 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 07:08:51 | Epoch: 1 | Step: 225350 | Dataset: 0-1950861 | Loss: 0.731 | 914 ms/step , 6881.85 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 07:09:00 | Epoch: 1 | Step: 225360 | Dataset: 0-1951181 | Loss: 0.832 | 914 ms/step , 6880.03 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 07:09:09 | Epoch: 1 | Step: 225370 | Dataset: 0-1951501 | Loss: 0.727 | 914 ms/step , 6883.44 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 07:09:18 | Epoch: 1 | Step: 225380 | Dataset: 0-1951821 | Loss: 0.667 | 917 ms/step , 6860.86 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 07:09:27 | Epoch: 1 | Step: 225390 | Dataset: 0-1952141 | Loss: 0.738 | 913 ms/step , 6888.84 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 07:09:36 | Epoch: 1 | Step: 225400 | Dataset: 0-1952461 | Loss: 0.801 | 914 ms/step , 6879.59 GFLOP/s , 17914.2 tokens/s INFO:__main__:2024-11-06 07:09:38 | Validation | Step: 225400 | Val_loss: 0.634 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:09:47 | Epoch: 1 | Step: 225410 | Dataset: 0-1952781 | Loss: 0.731 | 915 ms/step , 6875.37 GFLOP/s , 15265.9 tokens/s INFO:__main__:2024-11-06 07:09:56 | Epoch: 1 | Step: 225420 | Dataset: 0-1953101 | Loss: 0.746 | 914 ms/step , 6881.80 GFLOP/s , 17922.2 tokens/s INFO:__main__:2024-11-06 07:10:05 | Epoch: 1 | Step: 225430 | Dataset: 0-1953421 | Loss: 0.729 | 916 ms/step , 6867.78 GFLOP/s , 17916.8 tokens/s INFO:__main__:2024-11-06 07:10:14 | Epoch: 1 | Step: 225440 | Dataset: 0-1953741 | Loss: 0.760 | 914 ms/step , 6884.08 GFLOP/s , 17919.8 
tokens/s INFO:__main__:2024-11-06 07:10:24 | Epoch: 1 | Step: 225450 | Dataset: 0-1954061 | Loss: 0.706 | 915 ms/step , 6871.25 GFLOP/s , 17904.0 tokens/s INFO:__main__:2024-11-06 07:10:33 | Epoch: 1 | Step: 225460 | Dataset: 0-1954381 | Loss: 0.737 | 915 ms/step , 6870.30 GFLOP/s , 17906.2 tokens/s INFO:__main__:2024-11-06 07:10:42 | Epoch: 1 | Step: 225470 | Dataset: 0-1954701 | Loss: 0.718 | 914 ms/step , 6880.31 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-06 07:10:51 | Epoch: 1 | Step: 225480 | Dataset: 0-1955021 | Loss: 0.736 | 915 ms/step , 6876.60 GFLOP/s , 17918.9 tokens/s INFO:__main__:2024-11-06 07:11:00 | Epoch: 1 | Step: 225490 | Dataset: 0-1955341 | Loss: 0.746 | 913 ms/step , 6886.16 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 07:11:09 | Epoch: 1 | Step: 225500 | Dataset: 0-1955661 | Loss: 0.680 | 913 ms/step , 6887.67 GFLOP/s , 17922.3 tokens/s INFO:__main__:2024-11-06 07:11:11 | Validation | Step: 225500 | Val_loss: 0.667 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:11:20 | Epoch: 1 | Step: 225510 | Dataset: 0-1955981 | Loss: 0.693 | 913 ms/step , 6886.89 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-06 07:11:29 | Epoch: 1 | Step: 225520 | Dataset: 0-1956301 | Loss: 0.705 | 912 ms/step , 6894.44 GFLOP/s , 17922.8 tokens/s INFO:__main__:2024-11-06 07:11:38 | Epoch: 1 | Step: 225530 | Dataset: 0-1956621 | Loss: 0.689 | 913 ms/step , 6885.64 GFLOP/s , 17917.3 tokens/s INFO:__main__:2024-11-06 07:11:47 | Epoch: 1 | Step: 225540 | Dataset: 0-1956941 | Loss: 0.709 | 914 ms/step , 6883.23 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 07:11:57 | Epoch: 1 | Step: 225550 | Dataset: 0-1957261 | Loss: 0.662 | 912 ms/step , 6893.69 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-06 07:12:06 | Epoch: 1 | Step: 225560 | Dataset: 0-1957581 | Loss: 0.689 | 913 ms/step , 6885.62 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 07:12:15 | Epoch: 1 | Step: 225570 | Dataset: 0-1957901 | Loss: 0.825 | 914 ms/step , 6881.19 GFLOP/s , 
17918.1 tokens/s INFO:__main__:2024-11-06 07:12:24 | Epoch: 1 | Step: 225580 | Dataset: 0-1958221 | Loss: 0.726 | 916 ms/step , 6869.33 GFLOP/s , 17923.5 tokens/s INFO:__main__:2024-11-06 07:12:33 | Epoch: 1 | Step: 225590 | Dataset: 0-1958541 | Loss: 0.757 | 914 ms/step , 6879.54 GFLOP/s , 17915.8 tokens/s INFO:__main__:2024-11-06 07:12:42 | Epoch: 1 | Step: 225600 | Dataset: 0-1958861 | Loss: 0.731 | 914 ms/step , 6881.72 GFLOP/s , 17922.4 tokens/s INFO:__main__:2024-11-06 07:12:44 | Validation | Step: 225600 | Val_loss: 0.715 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:12:53 | Epoch: 1 | Step: 225610 | Dataset: 0-1959181 | Loss: 0.700 | 915 ms/step , 6875.90 GFLOP/s , 15266.0 tokens/s INFO:__main__:2024-11-06 07:13:02 | Epoch: 1 | Step: 225620 | Dataset: 0-1959501 | Loss: 0.724 | 914 ms/step , 6884.86 GFLOP/s , 17922.6 tokens/s INFO:__main__:2024-11-06 07:13:11 | Epoch: 1 | Step: 225630 | Dataset: 0-1959821 | Loss: 0.692 | 913 ms/step , 6887.71 GFLOP/s , 17918.4 tokens/s INFO:__main__:2024-11-06 07:13:20 | Epoch: 1 | Step: 225640 | Dataset: 0-1960141 | Loss: 0.695 | 914 ms/step , 6884.22 GFLOP/s , 17924.1 tokens/s INFO:__main__:2024-11-06 07:13:30 | Epoch: 1 | Step: 225650 | Dataset: 0-1960461 | Loss: 0.733 | 915 ms/step , 6876.07 GFLOP/s , 17918.2 tokens/s INFO:__main__:2024-11-06 07:13:39 | Epoch: 1 | Step: 225660 | Dataset: 0-1960781 | Loss: 0.564 | 913 ms/step , 6885.55 GFLOP/s , 17919.8 tokens/s INFO:__main__:2024-11-06 07:13:48 | Epoch: 1 | Step: 225670 | Dataset: 0-1961101 | Loss: 0.803 | 914 ms/step , 6879.91 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 07:13:57 | Epoch: 1 | Step: 225680 | Dataset: 0-1961421 | Loss: 0.807 | 914 ms/step , 6880.96 GFLOP/s , 17928.6 tokens/s INFO:__main__:2024-11-06 07:14:06 | Epoch: 1 | Step: 225690 | Dataset: 0-1961741 | Loss: 0.712 | 913 ms/step , 6889.25 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 07:14:15 | Epoch: 1 | Step: 225700 | Dataset: 0-1962061 | Loss: 0.821 | 913 ms/step , 6892.08 GFLOP/s 
, 17931.3 tokens/s INFO:__main__:2024-11-06 07:14:17 | Validation | Step: 225700 | Val_loss: 0.649 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:14:26 | Epoch: 1 | Step: 225710 | Dataset: 0-1962381 | Loss: 0.843 | 914 ms/step , 6883.06 GFLOP/s , 15275.2 tokens/s INFO:__main__:2024-11-06 07:14:35 | Epoch: 1 | Step: 225720 | Dataset: 0-1962701 | Loss: 0.790 | 913 ms/step , 6885.26 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 07:14:44 | Epoch: 1 | Step: 225730 | Dataset: 0-1963021 | Loss: 0.666 | 913 ms/step , 6889.69 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 07:14:53 | Epoch: 1 | Step: 225740 | Dataset: 0-1963341 | Loss: 0.871 | 913 ms/step , 6886.17 GFLOP/s , 17924.2 tokens/s INFO:__main__:2024-11-06 07:15:03 | Epoch: 1 | Step: 225750 | Dataset: 0-1963661 | Loss: 0.718 | 913 ms/step , 6891.70 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 07:15:12 | Epoch: 1 | Step: 225760 | Dataset: 0-1963981 | Loss: 0.784 | 914 ms/step , 6881.61 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 07:15:21 | Epoch: 1 | Step: 225770 | Dataset: 0-1964301 | Loss: 0.744 | 913 ms/step , 6891.45 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 07:15:30 | Epoch: 1 | Step: 225780 | Dataset: 0-1964621 | Loss: 0.757 | 914 ms/step , 6880.77 GFLOP/s , 17929.5 tokens/s INFO:__main__:2024-11-06 07:15:39 | Epoch: 1 | Step: 225790 | Dataset: 0-1964941 | Loss: 0.839 | 913 ms/step , 6889.60 GFLOP/s , 17935.0 tokens/s INFO:__main__:2024-11-06 07:15:48 | Epoch: 1 | Step: 225800 | Dataset: 0-1965261 | Loss: 0.800 | 914 ms/step , 6882.40 GFLOP/s , 17930.4 tokens/s INFO:__main__:2024-11-06 07:15:50 | Validation | Step: 225800 | Val_loss: 0.757 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:15:59 | Epoch: 1 | Step: 225810 | Dataset: 0-1965581 | Loss: 0.858 | 914 ms/step , 6881.67 GFLOP/s , 15278.3 tokens/s INFO:__main__:2024-11-06 07:16:08 | Epoch: 1 | Step: 225820 | Dataset: 0-1965901 | Loss: 0.871 | 916 ms/step , 6869.10 GFLOP/s , 17924.0 tokens/s 
INFO:__main__:2024-11-06 07:16:17 | Epoch: 1 | Step: 225830 | Dataset: 0-1966221 | Loss: 0.827 | 915 ms/step , 6875.29 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 07:16:26 | Epoch: 1 | Step: 225840 | Dataset: 0-1966541 | Loss: 0.813 | 914 ms/step , 6883.55 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 07:16:36 | Epoch: 1 | Step: 225850 | Dataset: 0-1966861 | Loss: 0.885 | 915 ms/step , 6875.76 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 07:16:45 | Epoch: 1 | Step: 225860 | Dataset: 0-1967181 | Loss: 0.642 | 912 ms/step , 6894.71 GFLOP/s , 17933.3 tokens/s INFO:__main__:2024-11-06 07:16:54 | Epoch: 1 | Step: 225870 | Dataset: 0-1967501 | Loss: 0.870 | 914 ms/step , 6878.76 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 07:17:03 | Epoch: 1 | Step: 225880 | Dataset: 0-1967821 | Loss: 0.772 | 914 ms/step , 6885.01 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 07:17:12 | Epoch: 1 | Step: 225890 | Dataset: 0-1968141 | Loss: 0.851 | 913 ms/step , 6885.88 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 07:17:21 | Epoch: 1 | Step: 225900 | Dataset: 0-1968461 | Loss: 0.902 | 915 ms/step , 6875.11 GFLOP/s , 17935.7 tokens/s INFO:__main__:2024-11-06 07:17:23 | Validation | Step: 225900 | Val_loss: 0.692 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:17:32 | Epoch: 1 | Step: 225910 | Dataset: 0-1968781 | Loss: 0.775 | 914 ms/step , 6879.18 GFLOP/s , 15274.1 tokens/s INFO:__main__:2024-11-06 07:17:41 | Epoch: 1 | Step: 225920 | Dataset: 0-1969101 | Loss: 0.935 | 914 ms/step , 6879.64 GFLOP/s , 17928.4 tokens/s INFO:__main__:2024-11-06 07:17:50 | Epoch: 1 | Step: 225930 | Dataset: 0-1969421 | Loss: 0.881 | 914 ms/step , 6879.23 GFLOP/s , 17928.5 tokens/s INFO:__main__:2024-11-06 07:17:59 | Epoch: 1 | Step: 225940 | Dataset: 0-1969741 | Loss: 0.634 | 913 ms/step , 6889.40 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 07:18:08 | Epoch: 1 | Step: 225950 | Dataset: 0-1970061 | Loss: 0.813 | 913 ms/step , 6891.38 GFLOP/s , 17932.5 
tokens/s INFO:__main__:2024-11-06 07:18:18 | Epoch: 1 | Step: 225960 | Dataset: 0-1970381 | Loss: 0.779 | 913 ms/step , 6890.38 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 07:18:27 | Epoch: 1 | Step: 225970 | Dataset: 0-1970701 | Loss: 0.816 | 914 ms/step , 6879.20 GFLOP/s , 17937.0 tokens/s INFO:__main__:2024-11-06 07:18:36 | Epoch: 1 | Step: 225980 | Dataset: 0-1971021 | Loss: 0.785 | 913 ms/step , 6886.54 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 07:18:45 | Epoch: 1 | Step: 225990 | Dataset: 0-1971341 | Loss: 0.775 | 913 ms/step , 6890.35 GFLOP/s , 17930.3 tokens/s INFO:__main__:2024-11-06 07:18:54 | Epoch: 1 | Step: 226000 | Dataset: 0-1971661 | Loss: 0.727 | 913 ms/step , 6888.99 GFLOP/s , 17932.8 tokens/s INFO:__main__:2024-11-06 07:18:56 | Validation | Step: 226000 | Val_loss: 0.722 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:18:56 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_071856_step_226000.pt` INFO:__main__:2024-11-06 07:19:06 | Epoch: 1 | Step: 226010 | Dataset: 0-1971981 | Loss: 0.606 | 914 ms/step , 6884.17 GFLOP/s , 13779.4 tokens/s INFO:__main__:2024-11-06 07:19:15 | Epoch: 1 | Step: 226020 | Dataset: 0-1972301 | Loss: 0.780 | 915 ms/step , 6875.35 GFLOP/s , 17924.0 tokens/s INFO:__main__:2024-11-06 07:19:24 | Epoch: 1 | Step: 226030 | Dataset: 0-1972621 | Loss: 0.756 | 913 ms/step , 6888.63 GFLOP/s , 17930.9 tokens/s INFO:__main__:2024-11-06 07:19:33 | Epoch: 1 | Step: 226040 | Dataset: 0-1972941 | Loss: 0.782 | 913 ms/step , 6891.19 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 07:19:43 | Epoch: 1 | Step: 226050 | Dataset: 0-1973261 | Loss: 0.879 | 912 ms/step , 6894.34 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 07:19:52 | Epoch: 1 | Step: 226060 | Dataset: 0-1973581 | Loss: 0.875 | 914 ms/step , 6884.16 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 07:20:01 | Epoch: 1 | Step: 226070 | Dataset: 0-1973901 | Loss: 0.668 | 915 ms/step , 6870.69 GFLOP/s , 17926.5 
tokens/s INFO:__main__:2024-11-06 07:20:10 | Epoch: 1 | Step: 226080 | Dataset: 0-1974221 | Loss: 0.722 | 913 ms/step , 6888.41 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 07:20:19 | Epoch: 1 | Step: 226090 | Dataset: 0-1974541 | Loss: 0.642 | 913 ms/step , 6890.68 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 07:20:28 | Epoch: 1 | Step: 226100 | Dataset: 0-1974861 | Loss: 0.634 | 913 ms/step , 6892.36 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 07:20:30 | Validation | Step: 226100 | Val_loss: 0.665 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:20:39 | Epoch: 1 | Step: 226110 | Dataset: 0-1975181 | Loss: 0.938 | 913 ms/step , 6889.92 GFLOP/s , 15277.1 tokens/s INFO:__main__:2024-11-06 07:20:48 | Epoch: 1 | Step: 226120 | Dataset: 0-1975501 | Loss: 0.796 | 913 ms/step , 6890.41 GFLOP/s , 17939.0 tokens/s INFO:__main__:2024-11-06 07:20:57 | Epoch: 1 | Step: 226130 | Dataset: 0-1975821 | Loss: 0.856 | 913 ms/step , 6888.63 GFLOP/s , 17927.5 tokens/s INFO:__main__:2024-11-06 07:21:06 | Epoch: 1 | Step: 226140 | Dataset: 0-1976141 | Loss: 0.884 | 916 ms/step , 6869.58 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 07:21:16 | Epoch: 1 | Step: 226150 | Dataset: 0-1976461 | Loss: 0.816 | 913 ms/step , 6891.42 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 07:21:25 | Epoch: 1 | Step: 226160 | Dataset: 0-1976781 | Loss: 0.774 | 913 ms/step , 6887.28 GFLOP/s , 17925.5 tokens/s INFO:__main__:2024-11-06 07:21:34 | Epoch: 1 | Step: 226170 | Dataset: 0-1977101 | Loss: 0.616 | 914 ms/step , 6883.55 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 07:21:43 | Epoch: 1 | Step: 226180 | Dataset: 0-1977421 | Loss: 0.501 | 912 ms/step , 6893.95 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 07:21:52 | Epoch: 1 | Step: 226190 | Dataset: 0-1977741 | Loss: 0.820 | 913 ms/step , 6890.89 GFLOP/s , 17938.4 tokens/s INFO:__main__:2024-11-06 07:22:01 | Epoch: 1 | Step: 226200 | Dataset: 0-1978061 | Loss: 0.738 | 914 ms/step , 6884.51 GFLOP/s , 
17931.2 tokens/s INFO:__main__:2024-11-06 07:22:03 | Validation | Step: 226200 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:22:12 | Epoch: 1 | Step: 226210 | Dataset: 0-1978381 | Loss: 0.799 | 916 ms/step , 6867.99 GFLOP/s , 15278.5 tokens/s INFO:__main__:2024-11-06 07:22:21 | Epoch: 1 | Step: 226220 | Dataset: 0-1978701 | Loss: 0.735 | 913 ms/step , 6888.92 GFLOP/s , 17927.8 tokens/s INFO:__main__:2024-11-06 07:22:30 | Epoch: 1 | Step: 226230 | Dataset: 0-1979021 | Loss: 0.806 | 914 ms/step , 6879.97 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-06 07:22:39 | Epoch: 1 | Step: 226240 | Dataset: 0-1979341 | Loss: 0.703 | 912 ms/step , 6897.23 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 07:22:49 | Epoch: 1 | Step: 226250 | Dataset: 0-1979661 | Loss: 0.707 | 914 ms/step , 6883.86 GFLOP/s , 17923.8 tokens/s INFO:__main__:2024-11-06 07:22:58 | Epoch: 1 | Step: 226260 | Dataset: 0-1979981 | Loss: 0.728 | 913 ms/step , 6889.84 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-06 07:23:07 | Epoch: 1 | Step: 226270 | Dataset: 0-1980301 | Loss: 0.830 | 912 ms/step , 6892.81 GFLOP/s , 17937.1 tokens/s INFO:__main__:2024-11-06 07:23:16 | Epoch: 1 | Step: 226280 | Dataset: 0-1980621 | Loss: 0.679 | 912 ms/step , 6892.96 GFLOP/s , 17932.3 tokens/s INFO:__main__:2024-11-06 07:23:25 | Epoch: 1 | Step: 226290 | Dataset: 0-1980941 | Loss: 0.784 | 912 ms/step , 6894.44 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 07:23:34 | Epoch: 1 | Step: 226300 | Dataset: 0-1981261 | Loss: 0.739 | 913 ms/step , 6888.95 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 07:23:36 | Validation | Step: 226300 | Val_loss: 0.710 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:23:45 | Epoch: 1 | Step: 226310 | Dataset: 0-1981581 | Loss: 0.944 | 913 ms/step , 6886.68 GFLOP/s , 15282.6 tokens/s INFO:__main__:2024-11-06 07:23:54 | Epoch: 1 | Step: 226320 | Dataset: 0-1981901 | Loss: 0.819 | 914 ms/step , 6882.03 GFLOP/s , 17928.7 tokens/s 
INFO:__main__:2024-11-06 07:24:03 | Epoch: 1 | Step: 226330 | Dataset: 0-1982221 | Loss: 0.591 | 912 ms/step , 6892.91 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-06 07:24:12 | Epoch: 1 | Step: 226340 | Dataset: 0-1982541 | Loss: 0.853 | 913 ms/step , 6886.14 GFLOP/s , 17938.2 tokens/s INFO:__main__:2024-11-06 07:24:21 | Epoch: 1 | Step: 226350 | Dataset: 0-1982861 | Loss: 0.794 | 912 ms/step , 6896.76 GFLOP/s , 17941.4 tokens/s INFO:__main__:2024-11-06 07:24:31 | Epoch: 1 | Step: 226360 | Dataset: 0-1983181 | Loss: 0.769 | 914 ms/step , 6878.32 GFLOP/s , 17928.1 tokens/s INFO:__main__:2024-11-06 07:24:40 | Epoch: 1 | Step: 226370 | Dataset: 0-1983501 | Loss: 0.654 | 912 ms/step , 6892.94 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-06 07:24:49 | Epoch: 1 | Step: 226380 | Dataset: 0-1983821 | Loss: 0.792 | 912 ms/step , 6896.55 GFLOP/s , 17931.4 tokens/s INFO:__main__:2024-11-06 07:24:58 | Epoch: 1 | Step: 226390 | Dataset: 0-1984141 | Loss: 0.907 | 914 ms/step , 6882.18 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 07:25:07 | Epoch: 1 | Step: 226400 | Dataset: 0-1984461 | Loss: 0.738 | 913 ms/step , 6891.39 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 07:25:09 | Validation | Step: 226400 | Val_loss: 0.658 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:25:18 | Epoch: 1 | Step: 226410 | Dataset: 0-1984781 | Loss: 0.694 | 912 ms/step , 6896.98 GFLOP/s , 15272.5 tokens/s INFO:__main__:2024-11-06 07:25:27 | Epoch: 1 | Step: 226420 | Dataset: 0-1985101 | Loss: 0.683 | 914 ms/step , 6879.62 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 07:25:36 | Epoch: 1 | Step: 226430 | Dataset: 0-1985421 | Loss: 0.771 | 915 ms/step , 6875.92 GFLOP/s , 17926.3 tokens/s INFO:__main__:2024-11-06 07:25:45 | Epoch: 1 | Step: 226440 | Dataset: 0-1985741 | Loss: 0.841 | 913 ms/step , 6885.11 GFLOP/s , 17928.7 tokens/s INFO:__main__:2024-11-06 07:25:54 | Epoch: 1 | Step: 226450 | Dataset: 0-1986061 | Loss: 0.719 | 913 ms/step , 6890.43 GFLOP/s , 17933.1 
tokens/s INFO:__main__:2024-11-06 07:26:04 | Epoch: 1 | Step: 226460 | Dataset: 0-1986381 | Loss: 0.786 | 914 ms/step , 6883.58 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 07:26:13 | Epoch: 1 | Step: 226470 | Dataset: 0-1986701 | Loss: 0.843 | 914 ms/step , 6884.07 GFLOP/s , 17928.0 tokens/s INFO:__main__:2024-11-06 07:26:22 | Epoch: 1 | Step: 226480 | Dataset: 0-1987021 | Loss: 0.784 | 912 ms/step , 6896.31 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 07:26:31 | Epoch: 1 | Step: 226490 | Dataset: 0-1987341 | Loss: 0.746 | 913 ms/step , 6887.03 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 07:26:40 | Epoch: 1 | Step: 226500 | Dataset: 0-1987661 | Loss: 0.789 | 913 ms/step , 6886.00 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 07:26:42 | Validation | Step: 226500 | Val_loss: 0.769 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:26:51 | Epoch: 1 | Step: 226510 | Dataset: 0-1987981 | Loss: 0.856 | 914 ms/step , 6884.85 GFLOP/s , 15276.2 tokens/s INFO:__main__:2024-11-06 07:27:00 | Epoch: 1 | Step: 226520 | Dataset: 0-1988301 | Loss: 0.745 | 914 ms/step , 6880.37 GFLOP/s , 17934.0 tokens/s INFO:__main__:2024-11-06 07:27:09 | Epoch: 1 | Step: 226530 | Dataset: 0-1988621 | Loss: 0.670 | 912 ms/step , 6899.85 GFLOP/s , 17932.0 tokens/s INFO:__main__:2024-11-06 07:27:18 | Epoch: 1 | Step: 226540 | Dataset: 0-1988941 | Loss: 0.730 | 912 ms/step , 6895.92 GFLOP/s , 17932.1 tokens/s INFO:__main__:2024-11-06 07:27:27 | Epoch: 1 | Step: 226550 | Dataset: 0-1989261 | Loss: 0.749 | 914 ms/step , 6881.29 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 07:27:37 | Epoch: 1 | Step: 226560 | Dataset: 0-1989581 | Loss: 0.795 | 912 ms/step , 6893.21 GFLOP/s , 17933.8 tokens/s INFO:__main__:2024-11-06 07:27:46 | Epoch: 1 | Step: 226570 | Dataset: 0-1989901 | Loss: 0.698 | 913 ms/step , 6888.86 GFLOP/s , 17934.7 tokens/s INFO:__main__:2024-11-06 07:27:55 | Epoch: 1 | Step: 226580 | Dataset: 0-1990221 | Loss: 0.635 | 914 ms/step , 6882.12 GFLOP/s , 
17927.9 tokens/s INFO:__main__:2024-11-06 07:28:04 | Epoch: 1 | Step: 226590 | Dataset: 0-1990541 | Loss: 0.876 | 912 ms/step , 6893.32 GFLOP/s , 17943.1 tokens/s INFO:__main__:2024-11-06 07:28:13 | Epoch: 1 | Step: 226600 | Dataset: 0-1990861 | Loss: 0.797 | 913 ms/step , 6888.54 GFLOP/s , 17938.6 tokens/s INFO:__main__:2024-11-06 07:28:15 | Validation | Step: 226600 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:28:24 | Epoch: 1 | Step: 226610 | Dataset: 0-1991181 | Loss: 0.613 | 912 ms/step , 6896.97 GFLOP/s , 15281.7 tokens/s INFO:__main__:2024-11-06 07:28:33 | Epoch: 1 | Step: 226620 | Dataset: 0-1991501 | Loss: 0.726 | 913 ms/step , 6890.53 GFLOP/s , 17936.5 tokens/s INFO:__main__:2024-11-06 07:28:42 | Epoch: 1 | Step: 226630 | Dataset: 0-1991821 | Loss: 0.760 | 914 ms/step , 6883.82 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 07:28:51 | Epoch: 1 | Step: 226640 | Dataset: 0-1992141 | Loss: 0.811 | 914 ms/step , 6883.54 GFLOP/s , 17931.2 tokens/s INFO:__main__:2024-11-06 07:29:00 | Epoch: 1 | Step: 226650 | Dataset: 0-1992461 | Loss: 0.777 | 914 ms/step , 6883.95 GFLOP/s , 17928.3 tokens/s INFO:__main__:2024-11-06 07:29:10 | Epoch: 1 | Step: 226660 | Dataset: 0-1992781 | Loss: 0.849 | 914 ms/step , 6883.66 GFLOP/s , 17931.1 tokens/s INFO:__main__:2024-11-06 07:29:19 | Epoch: 1 | Step: 226670 | Dataset: 0-1993101 | Loss: 0.811 | 913 ms/step , 6885.92 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 07:29:28 | Epoch: 1 | Step: 226680 | Dataset: 0-1993421 | Loss: 0.636 | 913 ms/step , 6890.67 GFLOP/s , 17935.2 tokens/s INFO:__main__:2024-11-06 07:29:37 | Epoch: 1 | Step: 226690 | Dataset: 0-1993741 | Loss: 0.779 | 913 ms/step , 6887.16 GFLOP/s , 17935.1 tokens/s INFO:__main__:2024-11-06 07:29:46 | Epoch: 1 | Step: 226700 | Dataset: 0-1994061 | Loss: 0.789 | 914 ms/step , 6883.84 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 07:29:48 | Validation | Step: 226700 | Val_loss: 0.707 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 07:29:57 | Epoch: 1 | Step: 226710 | Dataset: 0-1994381 | Loss: 0.836 | 913 ms/step , 6887.28 GFLOP/s , 15271.6 tokens/s INFO:__main__:2024-11-06 07:30:06 | Epoch: 1 | Step: 226720 | Dataset: 0-1994701 | Loss: 0.805 | 913 ms/step , 6885.79 GFLOP/s , 17929.3 tokens/s INFO:__main__:2024-11-06 07:30:15 | Epoch: 1 | Step: 226730 | Dataset: 0-1995021 | Loss: 0.624 | 912 ms/step , 6894.94 GFLOP/s , 17936.6 tokens/s INFO:__main__:2024-11-06 07:30:24 | Epoch: 1 | Step: 226740 | Dataset: 0-1995341 | Loss: 0.782 | 914 ms/step , 6881.73 GFLOP/s , 17932.2 tokens/s INFO:__main__:2024-11-06 07:30:33 | Epoch: 1 | Step: 226750 | Dataset: 0-1995661 | Loss: 0.831 | 913 ms/step , 6888.32 GFLOP/s , 17931.5 tokens/s INFO:__main__:2024-11-06 07:30:42 | Epoch: 1 | Step: 226760 | Dataset: 0-1995981 | Loss: 0.933 | 913 ms/step , 6888.10 GFLOP/s , 17926.8 tokens/s INFO:__main__:2024-11-06 07:30:52 | Epoch: 1 | Step: 226770 | Dataset: 0-1996301 | Loss: 0.622 | 913 ms/step , 6892.27 GFLOP/s , 17929.7 tokens/s INFO:__main__:2024-11-06 07:31:01 | Epoch: 1 | Step: 226780 | Dataset: 0-1996621 | Loss: 0.876 | 913 ms/step , 6890.00 GFLOP/s , 17932.4 tokens/s INFO:__main__:2024-11-06 07:31:10 | Epoch: 1 | Step: 226790 | Dataset: 0-1996941 | Loss: 0.798 | 913 ms/step , 6885.50 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 07:31:19 | Epoch: 1 | Step: 226800 | Dataset: 0-1997261 | Loss: 0.630 | 914 ms/step , 6878.21 GFLOP/s , 17925.4 tokens/s INFO:__main__:2024-11-06 07:31:21 | Validation | Step: 226800 | Val_loss: 0.730 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:31:30 | Epoch: 1 | Step: 226810 | Dataset: 0-1997581 | Loss: 0.777 | 915 ms/step , 6871.69 GFLOP/s , 15276.0 tokens/s INFO:__main__:2024-11-06 07:31:39 | Epoch: 1 | Step: 226820 | Dataset: 0-1997901 | Loss: 0.770 | 913 ms/step , 6885.17 GFLOP/s , 17934.5 tokens/s INFO:__main__:2024-11-06 07:31:48 | Epoch: 1 | Step: 226830 | Dataset: 0-1998221 | Loss: 0.714 | 913 ms/step , 6887.07 GFLOP/s , 17931.5 
tokens/s INFO:__main__:2024-11-06 07:31:57 | Epoch: 1 | Step: 226840 | Dataset: 0-1998541 | Loss: 0.860 | 913 ms/step , 6891.15 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 07:32:06 | Epoch: 1 | Step: 226850 | Dataset: 0-1998861 | Loss: 0.688 | 912 ms/step , 6899.27 GFLOP/s , 17937.2 tokens/s INFO:__main__:2024-11-06 07:32:15 | Epoch: 1 | Step: 226860 | Dataset: 0-1999181 | Loss: 0.833 | 914 ms/step , 6878.92 GFLOP/s , 17929.8 tokens/s INFO:__main__:2024-11-06 07:32:25 | Epoch: 1 | Step: 226870 | Dataset: 0-1999501 | Loss: 0.753 | 914 ms/step , 6880.75 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 07:32:34 | Epoch: 1 | Step: 226880 | Dataset: 0-1999821 | Loss: 0.814 | 914 ms/step , 6884.68 GFLOP/s , 17928.9 tokens/s INFO:__main__:2024-11-06 07:32:43 | Epoch: 1 | Step: 226890 | Dataset: 0-2000141 | Loss: 0.798 | 914 ms/step , 6884.55 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 07:32:52 | Epoch: 1 | Step: 226900 | Dataset: 0-2000461 | Loss: 0.847 | 914 ms/step , 6882.54 GFLOP/s , 17930.6 tokens/s INFO:__main__:2024-11-06 07:32:54 | Validation | Step: 226900 | Val_loss: 0.713 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:33:03 | Epoch: 1 | Step: 226910 | Dataset: 0-2000781 | Loss: 0.918 | 913 ms/step , 6891.51 GFLOP/s , 15272.8 tokens/s INFO:__main__:2024-11-06 07:33:12 | Epoch: 1 | Step: 226920 | Dataset: 0-2001101 | Loss: 0.742 | 913 ms/step , 6886.82 GFLOP/s , 17932.9 tokens/s INFO:__main__:2024-11-06 07:33:21 | Epoch: 1 | Step: 226930 | Dataset: 0-2001421 | Loss: 0.660 | 913 ms/step , 6888.62 GFLOP/s , 17934.6 tokens/s INFO:__main__:2024-11-06 07:33:30 | Epoch: 1 | Step: 226940 | Dataset: 0-2001741 | Loss: 0.859 | 913 ms/step , 6891.53 GFLOP/s , 17938.7 tokens/s INFO:__main__:2024-11-06 07:33:39 | Epoch: 1 | Step: 226950 | Dataset: 0-2002061 | Loss: 0.806 | 913 ms/step , 6889.03 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 07:33:48 | Epoch: 1 | Step: 226960 | Dataset: 0-2002381 | Loss: 0.866 | 914 ms/step , 6879.02 GFLOP/s , 
17930.6 tokens/s INFO:__main__:2024-11-06 07:33:58 | Epoch: 1 | Step: 226970 | Dataset: 0-2002701 | Loss: 0.825 | 912 ms/step , 6893.14 GFLOP/s , 17933.5 tokens/s INFO:__main__:2024-11-06 07:34:07 | Epoch: 1 | Step: 226980 | Dataset: 0-2003021 | Loss: 0.650 | 912 ms/step , 6894.58 GFLOP/s , 17926.5 tokens/s INFO:__main__:2024-11-06 07:34:16 | Epoch: 1 | Step: 226990 | Dataset: 0-2003341 | Loss: 0.726 | 913 ms/step , 6885.28 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 07:34:25 | Epoch: 1 | Step: 227000 | Dataset: 0-2003661 | Loss: 0.759 | 914 ms/step , 6884.51 GFLOP/s , 17935.9 tokens/s INFO:__main__:2024-11-06 07:34:27 | Validation | Step: 227000 | Val_loss: 0.763 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:34:27 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_073427_step_227000.pt` INFO:__main__:2024-11-06 07:34:37 | Epoch: 1 | Step: 227010 | Dataset: 0-2003981 | Loss: 0.803 | 912 ms/step , 6894.26 GFLOP/s , 13823.4 tokens/s INFO:__main__:2024-11-06 07:34:46 | Epoch: 1 | Step: 227020 | Dataset: 0-2004301 | Loss: 0.761 | 913 ms/step , 6889.18 GFLOP/s , 17935.4 tokens/s INFO:__main__:2024-11-06 07:34:55 | Epoch: 1 | Step: 227030 | Dataset: 0-2004621 | Loss: 0.785 | 913 ms/step , 6887.99 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 07:35:04 | Epoch: 1 | Step: 227040 | Dataset: 0-2004941 | Loss: 0.527 | 913 ms/step , 6887.05 GFLOP/s , 17891.0 tokens/s INFO:__main__:2024-11-06 07:35:13 | Epoch: 1 | Step: 227050 | Dataset: 0-2005261 | Loss: 0.865 | 913 ms/step , 6888.47 GFLOP/s , 17932.6 tokens/s INFO:__main__:2024-11-06 07:35:22 | Epoch: 1 | Step: 227060 | Dataset: 0-2005581 | Loss: 0.614 | 912 ms/step , 6895.83 GFLOP/s , 17937.4 tokens/s INFO:__main__:2024-11-06 07:35:32 | Epoch: 1 | Step: 227070 | Dataset: 0-2005901 | Loss: 0.919 | 913 ms/step , 6890.79 GFLOP/s , 17934.8 tokens/s INFO:__main__:2024-11-06 07:35:41 | Epoch: 1 | Step: 227080 | Dataset: 0-2006221 | Loss: 0.738 | 913 ms/step , 6885.30 GFLOP/s , 
17930.6 tokens/s INFO:__main__:2024-11-06 07:35:50 | Epoch: 1 | Step: 227090 | Dataset: 0-2006541 | Loss: 0.822 | 914 ms/step , 6880.66 GFLOP/s , 17929.0 tokens/s INFO:__main__:2024-11-06 07:35:59 | Epoch: 1 | Step: 227100 | Dataset: 0-2006861 | Loss: 0.818 | 913 ms/step , 6889.22 GFLOP/s , 17927.1 tokens/s INFO:__main__:2024-11-06 07:36:01 | Validation | Step: 227100 | Val_loss: 0.700 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:36:10 | Epoch: 1 | Step: 227110 | Dataset: 0-2007181 | Loss: 0.940 | 914 ms/step , 6884.04 GFLOP/s , 15272.4 tokens/s INFO:__main__:2024-11-06 07:36:19 | Epoch: 1 | Step: 227120 | Dataset: 0-2007501 | Loss: 0.848 | 913 ms/step , 6889.34 GFLOP/s , 17936.2 tokens/s INFO:__main__:2024-11-06 07:36:28 | Epoch: 1 | Step: 227130 | Dataset: 0-2007821 | Loss: 0.854 | 913 ms/step , 6886.53 GFLOP/s , 17930.0 tokens/s INFO:__main__:2024-11-06 07:36:37 | Epoch: 1 | Step: 227140 | Dataset: 0-2008141 | Loss: 0.748 | 912 ms/step , 6894.69 GFLOP/s , 17936.9 tokens/s INFO:__main__:2024-11-06 07:36:46 | Epoch: 1 | Step: 227150 | Dataset: 0-2008461 | Loss: 0.740 | 912 ms/step , 6892.75 GFLOP/s , 17926.9 tokens/s INFO:__main__:2024-11-06 07:36:55 | Epoch: 1 | Step: 227160 | Dataset: 0-2008781 | Loss: 0.759 | 911 ms/step , 6901.46 GFLOP/s , 17939.5 tokens/s INFO:__main__:2024-11-06 07:37:05 | Epoch: 1 | Step: 227170 | Dataset: 0-2009101 | Loss: 0.811 | 914 ms/step , 6881.96 GFLOP/s , 17927.6 tokens/s INFO:__main__:2024-11-06 07:37:14 | Epoch: 1 | Step: 227180 | Dataset: 0-2009421 | Loss: 0.737 | 914 ms/step , 6882.89 GFLOP/s , 17927.3 tokens/s INFO:__main__:2024-11-06 07:37:23 | Epoch: 1 | Step: 227190 | Dataset: 0-2009741 | Loss: 0.793 | 913 ms/step , 6885.54 GFLOP/s , 17931.8 tokens/s INFO:__main__:2024-11-06 07:37:32 | Epoch: 1 | Step: 227200 | Dataset: 0-2010061 | Loss: 0.716 | 914 ms/step , 6879.31 GFLOP/s , 17927.9 tokens/s INFO:__main__:2024-11-06 07:37:34 | Validation | Step: 227200 | Val_loss: 0.707 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 07:37:43 | Epoch: 1 | Step: 227210 | Dataset: 0-2010381 | Loss: 0.710 | 912 ms/step , 6894.81 GFLOP/s , 15279.3 tokens/s INFO:__main__:2024-11-06 07:37:52 | Epoch: 1 | Step: 227220 | Dataset: 0-2010701 | Loss: 0.727 | 912 ms/step , 6892.78 GFLOP/s , 17926.0 tokens/s INFO:__main__:2024-11-06 07:38:01 | Epoch: 1 | Step: 227230 | Dataset: 0-2011021 | Loss: 0.887 | 915 ms/step , 6875.60 GFLOP/s , 17918.1 tokens/s INFO:__main__:2024-11-06 07:38:10 | Epoch: 1 | Step: 227240 | Dataset: 0-2011341 | Loss: 0.787 | 914 ms/step , 6882.31 GFLOP/s , 17930.1 tokens/s INFO:__main__:2024-11-06 07:38:19 | Epoch: 1 | Step: 227250 | Dataset: 0-2011661 | Loss: 0.842 | 914 ms/step , 6882.81 GFLOP/s , 17928.8 tokens/s INFO:__main__:2024-11-06 07:38:28 | Epoch: 1 | Step: 227260 | Dataset: 0-2011981 | Loss: 0.724 | 914 ms/step , 6882.57 GFLOP/s , 17937.5 tokens/s INFO:__main__:2024-11-06 07:38:38 | Epoch: 1 | Step: 227270 | Dataset: 0-2012301 | Loss: 0.820 | 914 ms/step , 6881.89 GFLOP/s , 17932.7 tokens/s INFO:__main__:2024-11-06 07:38:47 | Epoch: 1 | Step: 227280 | Dataset: 0-2012621 | Loss: 0.713 | 913 ms/step , 6892.11 GFLOP/s , 17926.1 tokens/s INFO:__main__:2024-11-06 07:38:56 | Epoch: 1 | Step: 227290 | Dataset: 0-2012941 | Loss: 0.745 | 913 ms/step , 6889.35 GFLOP/s , 17932.5 tokens/s INFO:__main__:2024-11-06 07:39:05 | Epoch: 1 | Step: 227300 | Dataset: 0-2013261 | Loss: 0.895 | 913 ms/step , 6892.48 GFLOP/s , 17924.3 tokens/s INFO:__main__:2024-11-06 07:39:07 | Validation | Step: 227300 | Val_loss: 0.672 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:39:16 | Epoch: 1 | Step: 227310 | Dataset: 0-2013581 | Loss: 0.792 | 913 ms/step , 6891.78 GFLOP/s , 15274.5 tokens/s INFO:__main__:2024-11-06 07:39:25 | Epoch: 1 | Step: 227320 | Dataset: 0-2013901 | Loss: 0.837 | 913 ms/step , 6887.45 GFLOP/s , 17934.3 tokens/s INFO:__main__:2024-11-06 07:39:34 | Epoch: 1 | Step: 227330 | Dataset: 0-2014221 | Loss: 0.762 | 912 ms/step , 6893.52 GFLOP/s , 17930.1 
tokens/s INFO:__main__:2024-11-06 07:39:43 | Epoch: 1 | Step: 227340 | Dataset: 0-2014541 | Loss: 0.735 | 912 ms/step , 6893.16 GFLOP/s , 17941.3 tokens/s INFO:__main__:2024-11-06 07:39:52 | Epoch: 1 | Step: 227350 | Dataset: 0-2014861 | Loss: 0.912 | 914 ms/step , 6880.75 GFLOP/s , 17930.8 tokens/s INFO:__main__:2024-11-06 07:40:01 | Epoch: 1 | Step: 227360 | Dataset: 0-2015181 | Loss: 0.802 | 912 ms/step , 6897.11 GFLOP/s , 17938.1 tokens/s INFO:__main__:2024-11-06 07:40:10 | Epoch: 1 | Step: 227370 | Dataset: 0-2015501 | Loss: 0.846 | 913 ms/step , 6890.08 GFLOP/s , 17933.9 tokens/s INFO:__main__:2024-11-06 07:40:20 | Epoch: 1 | Step: 227380 | Dataset: 0-2015821 | Loss: 0.416 | 910 ms/step , 6912.77 GFLOP/s , 17971.0 tokens/s INFO:__main__:2024-11-06 07:40:29 | Epoch: 1 | Step: 227390 | Dataset: 0-2016141 | Loss: 0.320 | 911 ms/step , 6904.85 GFLOP/s , 17969.6 tokens/s INFO:__main__:2024-11-06 07:40:38 | Epoch: 1 | Step: 227400 | Dataset: 0-2016461 | Loss: 0.387 | 912 ms/step , 6899.43 GFLOP/s , 17974.6 tokens/s INFO:__main__:2024-11-06 07:40:39 | Validation | Step: 227400 | Val_loss: 0.684 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:40:49 | Epoch: 1 | Step: 227410 | Dataset: 0-2016781 | Loss: 0.451 | 912 ms/step , 6893.01 GFLOP/s , 15305.4 tokens/s INFO:__main__:2024-11-06 07:40:58 | Epoch: 1 | Step: 227420 | Dataset: 0-2017101 | Loss: 0.282 | 911 ms/step , 6907.13 GFLOP/s , 17973.4 tokens/s INFO:__main__:2024-11-06 07:41:07 | Epoch: 1 | Step: 227430 | Dataset: 0-2017421 | Loss: 0.194 | 911 ms/step , 6904.40 GFLOP/s , 17969.0 tokens/s INFO:__main__:2024-11-06 07:41:16 | Epoch: 1 | Step: 227440 | Dataset: 0-2017741 | Loss: 0.360 | 911 ms/step , 6907.07 GFLOP/s , 17968.3 tokens/s INFO:__main__:2024-11-06 07:41:25 | Epoch: 1 | Step: 227450 | Dataset: 0-2018061 | Loss: 0.293 | 911 ms/step , 6902.63 GFLOP/s , 17975.5 tokens/s INFO:__main__:2024-11-06 07:41:34 | Epoch: 1 | Step: 227460 | Dataset: 0-2018381 | Loss: 0.435 | 911 ms/step , 6902.07 GFLOP/s , 
17969.1 tokens/s INFO:__main__:2024-11-06 07:41:43 | Epoch: 1 | Step: 227470 | Dataset: 0-2018701 | Loss: 0.300 | 911 ms/step , 6905.03 GFLOP/s , 17969.4 tokens/s INFO:__main__:2024-11-06 07:41:52 | Epoch: 1 | Step: 227480 | Dataset: 0-2019021 | Loss: 0.473 | 913 ms/step , 6885.84 GFLOP/s , 17966.5 tokens/s INFO:__main__:2024-11-06 07:42:01 | Epoch: 1 | Step: 227490 | Dataset: 0-2019341 | Loss: 0.361 | 911 ms/step , 6903.66 GFLOP/s , 17966.2 tokens/s INFO:__main__:2024-11-06 07:42:11 | Epoch: 1 | Step: 227500 | Dataset: 0-2019661 | Loss: 0.432 | 911 ms/step , 6907.22 GFLOP/s , 17969.4 tokens/s INFO:__main__:2024-11-06 07:42:12 | Validation | Step: 227500 | Val_loss: 0.710 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:42:21 | Epoch: 1 | Step: 227510 | Dataset: 0-2019981 | Loss: 0.284 | 911 ms/step , 6903.13 GFLOP/s , 15301.3 tokens/s INFO:__main__:2024-11-06 07:42:30 | Epoch: 1 | Step: 227520 | Dataset: 0-2020301 | Loss: 0.148 | 911 ms/step , 6904.81 GFLOP/s , 17970.9 tokens/s INFO:__main__:2024-11-06 07:42:40 | Epoch: 1 | Step: 227530 | Dataset: 0-2020621 | Loss: 0.513 | 912 ms/step , 6898.47 GFLOP/s , 17970.4 tokens/s INFO:__main__:2024-11-06 07:42:49 | Epoch: 1 | Step: 227540 | Dataset: 0-2020941 | Loss: 0.133 | 910 ms/step , 6909.76 GFLOP/s , 17971.1 tokens/s INFO:__main__:2024-11-06 07:42:58 | Epoch: 1 | Step: 227550 | Dataset: 0-2021261 | Loss: 0.384 | 911 ms/step , 6903.46 GFLOP/s , 17965.0 tokens/s INFO:__main__:2024-11-06 07:43:07 | Epoch: 1 | Step: 227560 | Dataset: 0-2021581 | Loss: 0.354 | 911 ms/step , 6905.34 GFLOP/s , 17969.2 tokens/s INFO:__main__:2024-11-06 07:43:16 | Epoch: 1 | Step: 227570 | Dataset: 0-2021901 | Loss: 0.355 | 911 ms/step , 6906.95 GFLOP/s , 17968.7 tokens/s INFO:__main__:2024-11-06 07:43:25 | Epoch: 1 | Step: 227580 | Dataset: 0-2022221 | Loss: 0.378 | 911 ms/step , 6904.06 GFLOP/s , 17964.5 tokens/s INFO:__main__:2024-11-06 07:43:34 | Epoch: 1 | Step: 227590 | Dataset: 0-2022541 | Loss: 0.413 | 912 ms/step , 6897.99 GFLOP/s 
, 17957.9 tokens/s INFO:__main__:2024-11-06 07:43:43 | Epoch: 1 | Step: 227600 | Dataset: 0-2022861 | Loss: 0.374 | 912 ms/step , 6899.29 GFLOP/s , 17960.9 tokens/s INFO:__main__:2024-11-06 07:43:45 | Validation | Step: 227600 | Val_loss: 0.709 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:43:54 | Epoch: 1 | Step: 227610 | Dataset: 0-2023181 | Loss: 0.818 | 912 ms/step , 6899.82 GFLOP/s , 15296.5 tokens/s INFO:__main__:2024-11-06 07:44:03 | Epoch: 1 | Step: 227620 | Dataset: 0-2023501 | Loss: 0.919 | 912 ms/step , 6894.15 GFLOP/s , 17952.5 tokens/s INFO:__main__:2024-11-06 07:44:12 | Epoch: 1 | Step: 227630 | Dataset: 0-2023821 | Loss: 0.324 | 913 ms/step , 6889.76 GFLOP/s , 17961.1 tokens/s INFO:__main__:2024-11-06 07:44:21 | Epoch: 1 | Step: 227640 | Dataset: 0-2024141 | Loss: 0.859 | 914 ms/step , 6883.47 GFLOP/s , 17958.2 tokens/s INFO:__main__:2024-11-06 07:44:31 | Epoch: 1 | Step: 227650 | Dataset: 0-2024461 | Loss: 0.746 | 914 ms/step , 6882.10 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 07:44:40 | Epoch: 1 | Step: 227660 | Dataset: 0-2024781 | Loss: 0.814 | 914 ms/step , 6879.07 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-06 07:44:49 | Epoch: 1 | Step: 227670 | Dataset: 0-2025101 | Loss: 0.696 | 913 ms/step , 6886.54 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-06 07:44:58 | Epoch: 1 | Step: 227680 | Dataset: 0-2025421 | Loss: 0.733 | 914 ms/step , 6879.02 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-06 07:45:07 | Epoch: 1 | Step: 227690 | Dataset: 0-2025741 | Loss: 0.710 | 914 ms/step , 6879.88 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-06 07:45:16 | Epoch: 1 | Step: 227700 | Dataset: 0-2026061 | Loss: 0.729 | 914 ms/step , 6880.82 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 07:45:18 | Validation | Step: 227700 | Val_loss: 0.707 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:45:27 | Epoch: 1 | Step: 227710 | Dataset: 0-2026381 | Loss: 0.719 | 914 ms/step , 6881.72 GFLOP/s , 15288.6 tokens/s 
INFO:__main__:2024-11-06 07:45:36 | Epoch: 1 | Step: 227720 | Dataset: 0-2026701 | Loss: 0.725 | 915 ms/step , 6874.03 GFLOP/s , 17923.4 tokens/s INFO:__main__:2024-11-06 07:45:45 | Epoch: 1 | Step: 227730 | Dataset: 0-2027021 | Loss: 0.756 | 915 ms/step , 6876.80 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-06 07:45:54 | Epoch: 1 | Step: 227740 | Dataset: 0-2027341 | Loss: 0.674 | 914 ms/step , 6880.74 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-06 07:46:04 | Epoch: 1 | Step: 227750 | Dataset: 0-2027661 | Loss: 0.722 | 914 ms/step , 6882.18 GFLOP/s , 17926.7 tokens/s INFO:__main__:2024-11-06 07:46:13 | Epoch: 1 | Step: 227760 | Dataset: 0-2027981 | Loss: 0.672 | 913 ms/step , 6888.66 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 07:46:22 | Epoch: 1 | Step: 227770 | Dataset: 0-2028301 | Loss: 0.679 | 913 ms/step , 6885.72 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-06 07:46:31 | Epoch: 1 | Step: 227780 | Dataset: 0-2028621 | Loss: 0.765 | 915 ms/step , 6871.73 GFLOP/s , 17923.3 tokens/s INFO:__main__:2024-11-06 07:46:40 | Epoch: 1 | Step: 227790 | Dataset: 0-2028941 | Loss: 0.790 | 914 ms/step , 6879.24 GFLOP/s , 17916.3 tokens/s INFO:__main__:2024-11-06 07:46:49 | Epoch: 1 | Step: 227800 | Dataset: 0-2029261 | Loss: 0.685 | 913 ms/step , 6887.60 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-06 07:46:51 | Validation | Step: 227800 | Val_loss: 0.723 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:47:00 | Epoch: 1 | Step: 227810 | Dataset: 0-2029581 | Loss: 0.658 | 914 ms/step , 6882.38 GFLOP/s , 15259.9 tokens/s INFO:__main__:2024-11-06 07:47:09 | Epoch: 1 | Step: 227820 | Dataset: 0-2029901 | Loss: 0.651 | 916 ms/step , 6864.27 GFLOP/s , 17912.8 tokens/s INFO:__main__:2024-11-06 07:47:18 | Epoch: 1 | Step: 227830 | Dataset: 0-2030221 | Loss: 0.716 | 913 ms/step , 6885.81 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-06 07:47:28 | Epoch: 1 | Step: 227840 | Dataset: 0-2030541 | Loss: 0.717 | 915 ms/step , 6877.16 GFLOP/s , 17913.9 
tokens/s INFO:__main__:2024-11-06 07:47:37 | Epoch: 1 | Step: 227850 | Dataset: 0-2030861 | Loss: 0.721 | 915 ms/step , 6875.88 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-06 07:47:46 | Epoch: 1 | Step: 227860 | Dataset: 0-2031181 | Loss: 0.753 | 914 ms/step , 6882.69 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-06 07:47:55 | Epoch: 1 | Step: 227870 | Dataset: 0-2031501 | Loss: 0.770 | 914 ms/step , 6879.67 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 07:48:04 | Epoch: 1 | Step: 227880 | Dataset: 0-2031821 | Loss: 0.780 | 914 ms/step , 6880.48 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-06 07:48:13 | Epoch: 1 | Step: 227890 | Dataset: 0-2032141 | Loss: 0.709 | 914 ms/step , 6878.89 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-06 07:48:22 | Epoch: 1 | Step: 227900 | Dataset: 0-2032461 | Loss: 0.712 | 913 ms/step , 6887.15 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-06 07:48:24 | Validation | Step: 227900 | Val_loss: 0.721 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:48:33 | Epoch: 1 | Step: 227910 | Dataset: 0-2032781 | Loss: 0.741 | 914 ms/step , 6883.03 GFLOP/s , 15270.4 tokens/s INFO:__main__:2024-11-06 07:48:42 | Epoch: 1 | Step: 227920 | Dataset: 0-2033101 | Loss: 0.643 | 913 ms/step , 6891.09 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 07:48:51 | Epoch: 1 | Step: 227930 | Dataset: 0-2033421 | Loss: 0.736 | 914 ms/step , 6879.39 GFLOP/s , 17917.5 tokens/s INFO:__main__:2024-11-06 07:49:01 | Epoch: 1 | Step: 227940 | Dataset: 0-2033741 | Loss: 0.674 | 914 ms/step , 6881.57 GFLOP/s , 17915.0 tokens/s INFO:__main__:2024-11-06 07:49:10 | Epoch: 1 | Step: 227950 | Dataset: 0-2034061 | Loss: 0.723 | 914 ms/step , 6881.82 GFLOP/s , 17917.8 tokens/s INFO:__main__:2024-11-06 07:49:19 | Epoch: 1 | Step: 227960 | Dataset: 0-2034381 | Loss: 0.790 | 913 ms/step , 6890.59 GFLOP/s , 17920.0 tokens/s INFO:__main__:2024-11-06 07:49:28 | Epoch: 1 | Step: 227970 | Dataset: 0-2034701 | Loss: 0.656 | 912 ms/step , 6894.26 GFLOP/s , 
17927.2 tokens/s INFO:__main__:2024-11-06 07:49:37 | Epoch: 1 | Step: 227980 | Dataset: 0-2035021 | Loss: 0.765 | 912 ms/step , 6894.92 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 07:49:46 | Epoch: 1 | Step: 227990 | Dataset: 0-2035341 | Loss: 0.672 | 914 ms/step , 6882.97 GFLOP/s , 17923.0 tokens/s INFO:__main__:2024-11-06 07:49:55 | Epoch: 1 | Step: 228000 | Dataset: 0-2035661 | Loss: 0.701 | 914 ms/step , 6883.31 GFLOP/s , 17928.2 tokens/s INFO:__main__:2024-11-06 07:49:57 | Validation | Step: 228000 | Val_loss: 0.687 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:49:57 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_074957_step_228000.pt` INFO:__main__:2024-11-06 07:50:07 | Epoch: 1 | Step: 228010 | Dataset: 0-2035981 | Loss: 0.720 | 914 ms/step , 6883.25 GFLOP/s , 13790.5 tokens/s INFO:__main__:2024-11-06 07:50:16 | Epoch: 1 | Step: 228020 | Dataset: 0-2036301 | Loss: 0.657 | 916 ms/step , 6864.50 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-06 07:50:26 | Epoch: 1 | Step: 228030 | Dataset: 0-2036621 | Loss: 0.714 | 916 ms/step , 6869.25 GFLOP/s , 17921.1 tokens/s INFO:__main__:2024-11-06 07:50:35 | Epoch: 1 | Step: 228040 | Dataset: 0-2036941 | Loss: 0.798 | 913 ms/step , 6889.77 GFLOP/s , 17895.1 tokens/s INFO:__main__:2024-11-06 07:50:44 | Epoch: 1 | Step: 228050 | Dataset: 0-2037261 | Loss: 0.645 | 913 ms/step , 6885.29 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-06 07:50:53 | Epoch: 1 | Step: 228060 | Dataset: 0-2037581 | Loss: 0.794 | 915 ms/step , 6876.00 GFLOP/s , 17921.9 tokens/s INFO:__main__:2024-11-06 07:51:02 | Epoch: 1 | Step: 228070 | Dataset: 0-2037901 | Loss: 0.677 | 913 ms/step , 6887.80 GFLOP/s , 17917.4 tokens/s INFO:__main__:2024-11-06 07:51:11 | Epoch: 1 | Step: 228080 | Dataset: 0-2038221 | Loss: 0.675 | 915 ms/step , 6874.32 GFLOP/s , 17909.6 tokens/s INFO:__main__:2024-11-06 07:51:20 | Epoch: 1 | Step: 228090 | Dataset: 0-2038541 | Loss: 0.741 | 915 ms/step , 6875.23 GFLOP/s , 
17917.0 tokens/s INFO:__main__:2024-11-06 07:51:30 | Epoch: 1 | Step: 228100 | Dataset: 0-2038861 | Loss: 0.811 | 915 ms/step , 6876.46 GFLOP/s , 17913.7 tokens/s INFO:__main__:2024-11-06 07:51:31 | Validation | Step: 228100 | Val_loss: 0.692 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:51:40 | Epoch: 1 | Step: 228110 | Dataset: 0-2039181 | Loss: 0.742 | 914 ms/step , 6880.58 GFLOP/s , 15257.9 tokens/s INFO:__main__:2024-11-06 07:51:49 | Epoch: 1 | Step: 228120 | Dataset: 0-2039501 | Loss: 0.721 | 913 ms/step , 6887.95 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-06 07:51:59 | Epoch: 1 | Step: 228130 | Dataset: 0-2039821 | Loss: 0.697 | 914 ms/step , 6885.01 GFLOP/s , 17923.7 tokens/s INFO:__main__:2024-11-06 07:52:08 | Epoch: 1 | Step: 228140 | Dataset: 0-2040141 | Loss: 0.726 | 914 ms/step , 6878.58 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-06 07:52:17 | Epoch: 1 | Step: 228150 | Dataset: 0-2040461 | Loss: 0.635 | 913 ms/step , 6890.04 GFLOP/s , 17929.6 tokens/s INFO:__main__:2024-11-06 07:52:26 | Epoch: 1 | Step: 228160 | Dataset: 0-2040781 | Loss: 0.749 | 913 ms/step , 6886.10 GFLOP/s , 17922.1 tokens/s INFO:__main__:2024-11-06 07:52:35 | Epoch: 1 | Step: 228170 | Dataset: 0-2041101 | Loss: 0.734 | 915 ms/step , 6877.25 GFLOP/s , 17924.6 tokens/s INFO:__main__:2024-11-06 07:52:44 | Epoch: 1 | Step: 228180 | Dataset: 0-2041421 | Loss: 0.748 | 914 ms/step , 6883.43 GFLOP/s , 17919.4 tokens/s INFO:__main__:2024-11-06 07:52:53 | Epoch: 1 | Step: 228190 | Dataset: 0-2041741 | Loss: 0.725 | 913 ms/step , 6888.79 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 07:53:03 | Epoch: 1 | Step: 228200 | Dataset: 0-2042061 | Loss: 0.711 | 914 ms/step , 6883.68 GFLOP/s , 17865.3 tokens/s INFO:__main__:2024-11-06 07:53:04 | Validation | Step: 228200 | Val_loss: 0.715 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:53:13 | Epoch: 1 | Step: 228210 | Dataset: 0-2042381 | Loss: 0.746 | 914 ms/step , 6881.03 GFLOP/s , 15262.4 tokens/s 
INFO:__main__:2024-11-06 07:53:23 | Epoch: 1 | Step: 228220 | Dataset: 0-2042701 | Loss: 0.732 | 920 ms/step , 6832.69 GFLOP/s , 17900.8 tokens/s INFO:__main__:2024-11-06 07:53:32 | Epoch: 1 | Step: 228230 | Dataset: 0-2043021 | Loss: 0.777 | 914 ms/step , 6877.76 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-06 07:53:41 | Epoch: 1 | Step: 228240 | Dataset: 0-2043341 | Loss: 0.722 | 913 ms/step , 6887.88 GFLOP/s , 17916.9 tokens/s INFO:__main__:2024-11-06 07:53:50 | Epoch: 1 | Step: 228250 | Dataset: 0-2043661 | Loss: 0.751 | 914 ms/step , 6882.16 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 07:53:59 | Epoch: 1 | Step: 228260 | Dataset: 0-2043981 | Loss: 0.696 | 915 ms/step , 6877.20 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 07:54:08 | Epoch: 1 | Step: 228270 | Dataset: 0-2044301 | Loss: 0.731 | 914 ms/step , 6883.55 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 07:54:17 | Epoch: 1 | Step: 228280 | Dataset: 0-2044621 | Loss: 0.626 | 915 ms/step , 6874.30 GFLOP/s , 17910.8 tokens/s INFO:__main__:2024-11-06 07:54:27 | Epoch: 1 | Step: 228290 | Dataset: 0-2044941 | Loss: 0.747 | 914 ms/step , 6881.80 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 07:54:36 | Epoch: 1 | Step: 228300 | Dataset: 0-2045261 | Loss: 0.769 | 914 ms/step , 6883.58 GFLOP/s , 17912.7 tokens/s INFO:__main__:2024-11-06 07:54:37 | Validation | Step: 228300 | Val_loss: 0.722 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:54:46 | Epoch: 1 | Step: 228310 | Dataset: 0-2045581 | Loss: 0.708 | 913 ms/step , 6885.59 GFLOP/s , 15257.5 tokens/s INFO:__main__:2024-11-06 07:54:56 | Epoch: 1 | Step: 228320 | Dataset: 0-2045901 | Loss: 0.704 | 913 ms/step , 6889.18 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 07:55:05 | Epoch: 1 | Step: 228330 | Dataset: 0-2046221 | Loss: 0.758 | 914 ms/step , 6877.85 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-06 07:55:14 | Epoch: 1 | Step: 228340 | Dataset: 0-2046541 | Loss: 0.793 | 914 ms/step , 6882.16 GFLOP/s , 17919.4 
tokens/s INFO:__main__:2024-11-06 07:55:23 | Epoch: 1 | Step: 228350 | Dataset: 0-2046861 | Loss: 0.728 | 915 ms/step , 6875.59 GFLOP/s , 17916.1 tokens/s INFO:__main__:2024-11-06 07:55:32 | Epoch: 1 | Step: 228360 | Dataset: 0-2047181 | Loss: 0.692 | 912 ms/step , 6893.97 GFLOP/s , 17922.0 tokens/s INFO:__main__:2024-11-06 07:55:41 | Epoch: 1 | Step: 228370 | Dataset: 0-2047501 | Loss: 0.687 | 913 ms/step , 6885.58 GFLOP/s , 17923.2 tokens/s INFO:__main__:2024-11-06 07:55:50 | Epoch: 1 | Step: 228380 | Dataset: 0-2047821 | Loss: 0.734 | 913 ms/step , 6889.38 GFLOP/s , 17922.7 tokens/s INFO:__main__:2024-11-06 07:56:00 | Epoch: 1 | Step: 228390 | Dataset: 0-2048141 | Loss: 0.690 | 916 ms/step , 6866.98 GFLOP/s , 17919.3 tokens/s INFO:__main__:2024-11-06 07:56:09 | Epoch: 1 | Step: 228400 | Dataset: 0-2048461 | Loss: 0.752 | 914 ms/step , 6882.11 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 07:56:10 | Validation | Step: 228400 | Val_loss: 0.764 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:56:19 | Epoch: 1 | Step: 228410 | Dataset: 0-2048781 | Loss: 0.820 | 913 ms/step , 6889.88 GFLOP/s , 15265.6 tokens/s INFO:__main__:2024-11-06 07:56:29 | Epoch: 1 | Step: 228420 | Dataset: 0-2049101 | Loss: 0.719 | 914 ms/step , 6881.15 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 07:56:38 | Epoch: 1 | Step: 228430 | Dataset: 0-2049421 | Loss: 0.770 | 914 ms/step , 6882.54 GFLOP/s , 17913.2 tokens/s INFO:__main__:2024-11-06 07:56:47 | Epoch: 1 | Step: 228440 | Dataset: 0-2049741 | Loss: 0.673 | 914 ms/step , 6884.92 GFLOP/s , 17921.0 tokens/s INFO:__main__:2024-11-06 07:56:56 | Epoch: 1 | Step: 228450 | Dataset: 0-2050061 | Loss: 0.665 | 915 ms/step , 6873.82 GFLOP/s , 17920.8 tokens/s INFO:__main__:2024-11-06 07:57:05 | Epoch: 1 | Step: 228460 | Dataset: 0-2050381 | Loss: 0.696 | 916 ms/step , 6868.90 GFLOP/s , 17924.4 tokens/s INFO:__main__:2024-11-06 07:57:14 | Epoch: 1 | Step: 228470 | Dataset: 0-2050701 | Loss: 0.704 | 914 ms/step , 6878.39 GFLOP/s , 
17918.5 tokens/s INFO:__main__:2024-11-06 07:57:23 | Epoch: 1 | Step: 228480 | Dataset: 0-2051021 | Loss: 0.672 | 912 ms/step , 6894.97 GFLOP/s , 17915.3 tokens/s INFO:__main__:2024-11-06 07:57:33 | Epoch: 1 | Step: 228490 | Dataset: 0-2051341 | Loss: 0.746 | 915 ms/step , 6877.41 GFLOP/s , 17915.4 tokens/s INFO:__main__:2024-11-06 07:57:42 | Epoch: 1 | Step: 228500 | Dataset: 0-2051661 | Loss: 0.709 | 914 ms/step , 6880.41 GFLOP/s , 17917.9 tokens/s INFO:__main__:2024-11-06 07:57:43 | Validation | Step: 228500 | Val_loss: 0.688 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:57:52 | Epoch: 1 | Step: 228510 | Dataset: 0-2051981 | Loss: 0.692 | 914 ms/step , 6880.60 GFLOP/s , 15275.3 tokens/s INFO:__main__:2024-11-06 07:58:02 | Epoch: 1 | Step: 228520 | Dataset: 0-2052301 | Loss: 0.796 | 914 ms/step , 6882.84 GFLOP/s , 17925.3 tokens/s INFO:__main__:2024-11-06 07:58:11 | Epoch: 1 | Step: 228530 | Dataset: 0-2052621 | Loss: 0.689 | 913 ms/step , 6886.94 GFLOP/s , 17920.4 tokens/s INFO:__main__:2024-11-06 07:58:20 | Epoch: 1 | Step: 228540 | Dataset: 0-2052941 | Loss: 0.755 | 915 ms/step , 6872.78 GFLOP/s , 17919.1 tokens/s INFO:__main__:2024-11-06 07:58:29 | Epoch: 1 | Step: 228550 | Dataset: 0-2053261 | Loss: 0.688 | 915 ms/step , 6875.10 GFLOP/s , 17911.5 tokens/s INFO:__main__:2024-11-06 07:58:38 | Epoch: 1 | Step: 228560 | Dataset: 0-2053581 | Loss: 0.700 | 914 ms/step , 6879.43 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 07:58:47 | Epoch: 1 | Step: 228570 | Dataset: 0-2053901 | Loss: 0.687 | 913 ms/step , 6892.43 GFLOP/s , 17918.0 tokens/s INFO:__main__:2024-11-06 07:58:56 | Epoch: 1 | Step: 228580 | Dataset: 0-2054221 | Loss: 0.707 | 915 ms/step , 6877.19 GFLOP/s , 17913.9 tokens/s INFO:__main__:2024-11-06 07:59:06 | Epoch: 1 | Step: 228590 | Dataset: 0-2054541 | Loss: 0.709 | 914 ms/step , 6884.96 GFLOP/s , 17915.9 tokens/s INFO:__main__:2024-11-06 07:59:15 | Epoch: 1 | Step: 228600 | Dataset: 0-2054861 | Loss: 0.688 | 914 ms/step , 6881.78 GFLOP/s 
, 17921.9 tokens/s INFO:__main__:2024-11-06 07:59:16 | Validation | Step: 228600 | Val_loss: 0.591 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 07:59:25 | Epoch: 1 | Step: 228610 | Dataset: 0-2055181 | Loss: 0.735 | 914 ms/step , 6883.55 GFLOP/s , 15268.6 tokens/s INFO:__main__:2024-11-06 07:59:35 | Epoch: 1 | Step: 228620 | Dataset: 0-2055501 | Loss: 0.693 | 914 ms/step , 6884.00 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 07:59:44 | Epoch: 1 | Step: 228630 | Dataset: 0-2055821 | Loss: 0.737 | 914 ms/step , 6884.84 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 07:59:53 | Epoch: 1 | Step: 228640 | Dataset: 0-2056141 | Loss: 0.682 | 914 ms/step , 6881.77 GFLOP/s , 17916.2 tokens/s INFO:__main__:2024-11-06 08:00:02 | Epoch: 1 | Step: 228650 | Dataset: 0-2056461 | Loss: 0.724 | 913 ms/step , 6889.46 GFLOP/s , 17920.3 tokens/s INFO:__main__:2024-11-06 08:00:11 | Epoch: 1 | Step: 228660 | Dataset: 0-2056781 | Loss: 0.780 | 914 ms/step , 6877.84 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 08:00:20 | Epoch: 1 | Step: 228670 | Dataset: 0-2057101 | Loss: 0.711 | 914 ms/step , 6881.85 GFLOP/s , 17911.0 tokens/s INFO:__main__:2024-11-06 08:00:29 | Epoch: 1 | Step: 228680 | Dataset: 0-2057421 | Loss: 0.648 | 916 ms/step , 6867.84 GFLOP/s , 17918.6 tokens/s INFO:__main__:2024-11-06 08:00:39 | Epoch: 1 | Step: 228690 | Dataset: 0-2057741 | Loss: 0.758 | 913 ms/step , 6886.61 GFLOP/s , 17925.9 tokens/s INFO:__main__:2024-11-06 08:00:48 | Epoch: 1 | Step: 228700 | Dataset: 0-2058061 | Loss: 0.690 | 914 ms/step , 6880.32 GFLOP/s , 17917.0 tokens/s INFO:__main__:2024-11-06 08:00:49 | Validation | Step: 228700 | Val_loss: 0.644 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:00:58 | Epoch: 1 | Step: 228710 | Dataset: 0-2058381 | Loss: 0.735 | 913 ms/step , 6885.63 GFLOP/s , 15275.8 tokens/s INFO:__main__:2024-11-06 08:01:08 | Epoch: 1 | Step: 228720 | Dataset: 0-2058701 | Loss: 0.688 | 913 ms/step , 6887.72 GFLOP/s , 17930.7 tokens/s 
INFO:__main__:2024-11-06 08:01:17 | Epoch: 1 | Step: 228730 | Dataset: 0-2059021 | Loss: 0.697 | 913 ms/step , 6892.34 GFLOP/s , 17919.6 tokens/s INFO:__main__:2024-11-06 08:01:26 | Epoch: 1 | Step: 228740 | Dataset: 0-2059341 | Loss: 0.657 | 913 ms/step , 6885.69 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 08:01:35 | Epoch: 1 | Step: 228750 | Dataset: 0-2059661 | Loss: 0.754 | 915 ms/step , 6873.96 GFLOP/s , 17914.6 tokens/s INFO:__main__:2024-11-06 08:01:44 | Epoch: 1 | Step: 228760 | Dataset: 0-2059981 | Loss: 0.743 | 914 ms/step , 6878.50 GFLOP/s , 17925.2 tokens/s INFO:__main__:2024-11-06 08:01:53 | Epoch: 1 | Step: 228770 | Dataset: 0-2060301 | Loss: 0.763 | 915 ms/step , 6876.72 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-06 08:02:02 | Epoch: 1 | Step: 228780 | Dataset: 0-2060621 | Loss: 0.703 | 914 ms/step , 6882.36 GFLOP/s , 17925.8 tokens/s INFO:__main__:2024-11-06 08:02:12 | Epoch: 1 | Step: 228790 | Dataset: 0-2060941 | Loss: 0.733 | 915 ms/step , 6873.03 GFLOP/s , 17916.0 tokens/s INFO:__main__:2024-11-06 08:02:21 | Epoch: 1 | Step: 228800 | Dataset: 0-2061261 | Loss: 0.742 | 913 ms/step , 6890.05 GFLOP/s , 17914.4 tokens/s INFO:__main__:2024-11-06 08:02:22 | Validation | Step: 228800 | Val_loss: 0.696 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:02:32 | Epoch: 1 | Step: 228810 | Dataset: 0-2061581 | Loss: 0.701 | 914 ms/step , 6877.94 GFLOP/s , 15271.4 tokens/s INFO:__main__:2024-11-06 08:02:41 | Epoch: 1 | Step: 228820 | Dataset: 0-2061901 | Loss: 0.708 | 914 ms/step , 6879.32 GFLOP/s , 17916.5 tokens/s INFO:__main__:2024-11-06 08:02:50 | Epoch: 1 | Step: 228830 | Dataset: 0-2062221 | Loss: 0.742 | 914 ms/step , 6879.99 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 08:02:59 | Epoch: 1 | Step: 228840 | Dataset: 0-2062541 | Loss: 0.701 | 914 ms/step , 6883.55 GFLOP/s , 17918.8 tokens/s INFO:__main__:2024-11-06 08:03:08 | Epoch: 1 | Step: 228850 | Dataset: 0-2062861 | Loss: 0.654 | 914 ms/step , 6881.68 GFLOP/s , 17917.6 
tokens/s INFO:__main__:2024-11-06 08:03:17 | Epoch: 1 | Step: 228860 | Dataset: 0-2063181 | Loss: 0.751 | 915 ms/step , 6875.43 GFLOP/s , 17916.7 tokens/s INFO:__main__:2024-11-06 08:03:26 | Epoch: 1 | Step: 228870 | Dataset: 0-2063501 | Loss: 0.732 | 914 ms/step , 6882.34 GFLOP/s , 17920.9 tokens/s INFO:__main__:2024-11-06 08:03:36 | Epoch: 1 | Step: 228880 | Dataset: 0-2063821 | Loss: 0.714 | 912 ms/step , 6894.41 GFLOP/s , 17919.2 tokens/s INFO:__main__:2024-11-06 08:03:45 | Epoch: 1 | Step: 228890 | Dataset: 0-2064141 | Loss: 0.702 | 914 ms/step , 6880.14 GFLOP/s , 17911.2 tokens/s INFO:__main__:2024-11-06 08:03:54 | Epoch: 1 | Step: 228900 | Dataset: 0-2064461 | Loss: 0.712 | 914 ms/step , 6882.73 GFLOP/s , 17914.7 tokens/s INFO:__main__:2024-11-06 08:03:55 | Validation | Step: 228900 | Val_loss: 0.761 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:04:05 | Epoch: 1 | Step: 228910 | Dataset: 0-2064781 | Loss: 0.740 | 914 ms/step , 6880.41 GFLOP/s , 15264.2 tokens/s INFO:__main__:2024-11-06 08:04:14 | Epoch: 1 | Step: 228920 | Dataset: 0-2065101 | Loss: 0.807 | 914 ms/step , 6880.52 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 08:04:23 | Epoch: 1 | Step: 228930 | Dataset: 0-2065421 | Loss: 0.718 | 915 ms/step , 6873.83 GFLOP/s , 17917.7 tokens/s INFO:__main__:2024-11-06 08:04:32 | Epoch: 1 | Step: 228940 | Dataset: 0-2065741 | Loss: 0.704 | 913 ms/step , 6885.69 GFLOP/s , 17916.6 tokens/s INFO:__main__:2024-11-06 08:04:41 | Epoch: 1 | Step: 228950 | Dataset: 0-2066061 | Loss: 0.730 | 915 ms/step , 6874.38 GFLOP/s , 17918.3 tokens/s INFO:__main__:2024-11-06 08:04:50 | Epoch: 1 | Step: 228960 | Dataset: 0-2066381 | Loss: 0.742 | 914 ms/step , 6883.32 GFLOP/s , 17913.3 tokens/s INFO:__main__:2024-11-06 08:04:59 | Epoch: 1 | Step: 228970 | Dataset: 0-2066701 | Loss: 0.709 | 914 ms/step , 6883.20 GFLOP/s , 17921.5 tokens/s INFO:__main__:2024-11-06 08:05:09 | Epoch: 1 | Step: 228980 | Dataset: 0-2067021 | Loss: 0.739 | 915 ms/step , 6874.43 GFLOP/s , 
17921.4 tokens/s INFO:__main__:2024-11-06 08:05:18 | Epoch: 1 | Step: 228990 | Dataset: 0-2067341 | Loss: 0.693 | 913 ms/step , 6886.43 GFLOP/s , 17918.5 tokens/s INFO:__main__:2024-11-06 08:05:27 | Epoch: 1 | Step: 229000 | Dataset: 0-2067661 | Loss: 0.661 | 916 ms/step , 6869.42 GFLOP/s , 17921.6 tokens/s INFO:__main__:2024-11-06 08:05:28 | Validation | Step: 229000 | Val_loss: 0.647 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:05:28 | Saving full-param checkpoint to `/home/bd4sur/ai/Nano/checkpoint/checkpoint_20241106_080528_step_229000.pt` INFO:__main__:2024-11-06 08:05:39 | Epoch: 1 | Step: 229010 | Dataset: 0-2067981 | Loss: 0.764 | 916 ms/step , 6866.16 GFLOP/s , 13787.9 tokens/s INFO:__main__:2024-11-06 08:05:48 | Epoch: 1 | Step: 229020 | Dataset: 0-2068301 | Loss: 0.607 | 915 ms/step , 6871.96 GFLOP/s , 17888.8 tokens/s INFO:__main__:2024-11-06 08:05:57 | Epoch: 1 | Step: 229030 | Dataset: 0-2068621 | Loss: 0.770 | 916 ms/step , 6864.50 GFLOP/s , 17888.3 tokens/s INFO:__main__:2024-11-06 08:06:06 | Epoch: 1 | Step: 229040 | Dataset: 0-2068941 | Loss: 0.714 | 916 ms/step , 6865.13 GFLOP/s , 17898.7 tokens/s INFO:__main__:2024-11-06 08:06:15 | Epoch: 1 | Step: 229050 | Dataset: 0-2069261 | Loss: 0.739 | 914 ms/step , 6880.76 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-06 08:06:24 | Epoch: 1 | Step: 229060 | Dataset: 0-2069581 | Loss: 0.761 | 914 ms/step , 6881.94 GFLOP/s , 17913.4 tokens/s INFO:__main__:2024-11-06 08:06:34 | Epoch: 1 | Step: 229070 | Dataset: 0-2069901 | Loss: 0.775 | 914 ms/step , 6881.12 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-06 08:06:43 | Epoch: 1 | Step: 229080 | Dataset: 0-2070221 | Loss: 0.809 | 915 ms/step , 6873.72 GFLOP/s , 17919.9 tokens/s INFO:__main__:2024-11-06 08:06:52 | Epoch: 1 | Step: 229090 | Dataset: 0-2070541 | Loss: 0.790 | 914 ms/step , 6877.55 GFLOP/s , 17909.1 tokens/s INFO:__main__:2024-11-06 08:07:01 | Epoch: 1 | Step: 229100 | Dataset: 0-2070861 | Loss: 0.660 | 914 ms/step , 6884.47 GFLOP/s , 
17910.2 tokens/s INFO:__main__:2024-11-06 08:07:03 | Validation | Step: 229100 | Val_loss: 0.752 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:07:12 | Epoch: 1 | Step: 229110 | Dataset: 0-2071181 | Loss: 0.777 | 913 ms/step , 6885.29 GFLOP/s , 15261.7 tokens/s INFO:__main__:2024-11-06 08:07:21 | Epoch: 1 | Step: 229120 | Dataset: 0-2071501 | Loss: 0.669 | 913 ms/step , 6889.01 GFLOP/s , 17911.9 tokens/s INFO:__main__:2024-11-06 08:07:30 | Epoch: 1 | Step: 229130 | Dataset: 0-2071821 | Loss: 0.787 | 915 ms/step , 6876.13 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-06 08:07:39 | Epoch: 1 | Step: 229140 | Dataset: 0-2072141 | Loss: 0.723 | 915 ms/step , 6872.42 GFLOP/s , 17906.8 tokens/s INFO:__main__:2024-11-06 08:07:48 | Epoch: 1 | Step: 229150 | Dataset: 0-2072461 | Loss: 0.715 | 914 ms/step , 6881.32 GFLOP/s , 17911.3 tokens/s INFO:__main__:2024-11-06 08:07:58 | Epoch: 1 | Step: 229160 | Dataset: 0-2072781 | Loss: 0.747 | 914 ms/step , 6882.85 GFLOP/s , 17911.7 tokens/s INFO:__main__:2024-11-06 08:08:07 | Epoch: 1 | Step: 229170 | Dataset: 0-2073101 | Loss: 0.682 | 913 ms/step , 6885.64 GFLOP/s , 17908.6 tokens/s INFO:__main__:2024-11-06 08:08:16 | Epoch: 1 | Step: 229180 | Dataset: 0-2073421 | Loss: 0.703 | 913 ms/step , 6890.08 GFLOP/s , 17906.2 tokens/s INFO:__main__:2024-11-06 08:08:25 | Epoch: 1 | Step: 229190 | Dataset: 0-2073741 | Loss: 0.699 | 914 ms/step , 6881.48 GFLOP/s , 17915.5 tokens/s INFO:__main__:2024-11-06 08:08:34 | Epoch: 1 | Step: 229200 | Dataset: 0-2074061 | Loss: 0.700 | 914 ms/step , 6882.71 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-06 08:08:36 | Validation | Step: 229200 | Val_loss: 0.649 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:08:45 | Epoch: 1 | Step: 229210 | Dataset: 0-2074381 | Loss: 0.768 | 915 ms/step , 6876.31 GFLOP/s , 15255.0 tokens/s INFO:__main__:2024-11-06 08:08:54 | Epoch: 1 | Step: 229220 | Dataset: 0-2074701 | Loss: 0.696 | 915 ms/step , 6877.07 GFLOP/s , 17904.8 tokens/s 
INFO:__main__:2024-11-06 08:09:03 | Epoch: 1 | Step: 229230 | Dataset: 0-2075021 | Loss: 0.730 | 914 ms/step , 6879.14 GFLOP/s , 17903.3 tokens/s INFO:__main__:2024-11-06 08:09:12 | Epoch: 1 | Step: 229240 | Dataset: 0-2075341 | Loss: 0.731 | 916 ms/step , 6863.87 GFLOP/s , 17902.3 tokens/s INFO:__main__:2024-11-06 08:09:21 | Epoch: 1 | Step: 229250 | Dataset: 0-2075661 | Loss: 0.732 | 914 ms/step , 6880.49 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-06 08:09:31 | Epoch: 1 | Step: 229260 | Dataset: 0-2075981 | Loss: 0.723 | 914 ms/step , 6880.06 GFLOP/s , 17901.9 tokens/s INFO:__main__:2024-11-06 08:09:40 | Epoch: 1 | Step: 229270 | Dataset: 0-2076301 | Loss: 0.723 | 914 ms/step , 6883.55 GFLOP/s , 17915.2 tokens/s INFO:__main__:2024-11-06 08:09:49 | Epoch: 1 | Step: 229280 | Dataset: 0-2076621 | Loss: 0.687 | 915 ms/step , 6871.92 GFLOP/s , 17900.8 tokens/s INFO:__main__:2024-11-06 08:09:58 | Epoch: 1 | Step: 229290 | Dataset: 0-2076941 | Loss: 0.746 | 915 ms/step , 6872.49 GFLOP/s , 17909.9 tokens/s INFO:__main__:2024-11-06 08:10:07 | Epoch: 1 | Step: 229300 | Dataset: 0-2077261 | Loss: 0.687 | 915 ms/step , 6871.46 GFLOP/s , 17901.2 tokens/s INFO:__main__:2024-11-06 08:10:09 | Validation | Step: 229300 | Val_loss: 0.718 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:10:18 | Epoch: 1 | Step: 229310 | Dataset: 0-2077581 | Loss: 0.720 | 915 ms/step , 6876.82 GFLOP/s , 15256.6 tokens/s INFO:__main__:2024-11-06 08:10:27 | Epoch: 1 | Step: 229320 | Dataset: 0-2077901 | Loss: 0.624 | 913 ms/step , 6885.78 GFLOP/s , 17910.4 tokens/s INFO:__main__:2024-11-06 08:10:36 | Epoch: 1 | Step: 229330 | Dataset: 0-2078221 | Loss: 0.721 | 915 ms/step , 6873.07 GFLOP/s , 17905.8 tokens/s INFO:__main__:2024-11-06 08:10:45 | Epoch: 1 | Step: 229340 | Dataset: 0-2078541 | Loss: 0.749 | 914 ms/step , 6882.36 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-06 08:10:55 | Epoch: 1 | Step: 229350 | Dataset: 0-2078861 | Loss: 0.712 | 914 ms/step , 6883.81 GFLOP/s , 17909.3 
tokens/s INFO:__main__:2024-11-06 08:11:04 | Epoch: 1 | Step: 229360 | Dataset: 0-2079181 | Loss: 0.757 | 914 ms/step , 6879.71 GFLOP/s , 17909.2 tokens/s INFO:__main__:2024-11-06 08:11:13 | Epoch: 1 | Step: 229370 | Dataset: 0-2079501 | Loss: 0.743 | 914 ms/step , 6879.56 GFLOP/s , 17913.0 tokens/s INFO:__main__:2024-11-06 08:11:22 | Epoch: 1 | Step: 229380 | Dataset: 0-2079821 | Loss: 0.765 | 916 ms/step , 6868.76 GFLOP/s , 17910.4 tokens/s INFO:__main__:2024-11-06 08:11:31 | Epoch: 1 | Step: 229390 | Dataset: 0-2080141 | Loss: 0.680 | 913 ms/step , 6891.14 GFLOP/s , 17914.0 tokens/s INFO:__main__:2024-11-06 08:11:40 | Epoch: 1 | Step: 229400 | Dataset: 0-2080461 | Loss: 0.728 | 913 ms/step , 6885.44 GFLOP/s , 17911.6 tokens/s INFO:__main__:2024-11-06 08:11:42 | Validation | Step: 229400 | Val_loss: 0.729 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:11:51 | Epoch: 1 | Step: 229410 | Dataset: 0-2080781 | Loss: 0.683 | 914 ms/step , 6882.65 GFLOP/s , 15264.5 tokens/s INFO:__main__:2024-11-06 08:12:00 | Epoch: 1 | Step: 229420 | Dataset: 0-2081101 | Loss: 0.684 | 914 ms/step , 6883.65 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-06 08:12:09 | Epoch: 1 | Step: 229430 | Dataset: 0-2081421 | Loss: 0.651 | 915 ms/step , 6875.96 GFLOP/s , 17897.7 tokens/s INFO:__main__:2024-11-06 08:12:18 | Epoch: 1 | Step: 229440 | Dataset: 0-2081741 | Loss: 0.702 | 915 ms/step , 6875.86 GFLOP/s , 17905.4 tokens/s INFO:__main__:2024-11-06 08:12:28 | Epoch: 1 | Step: 229450 | Dataset: 0-2082061 | Loss: 0.680 | 913 ms/step , 6890.90 GFLOP/s , 17907.7 tokens/s INFO:__main__:2024-11-06 08:12:37 | Epoch: 1 | Step: 229460 | Dataset: 0-2082381 | Loss: 0.732 | 916 ms/step , 6864.77 GFLOP/s , 17900.6 tokens/s INFO:__main__:2024-11-06 08:12:46 | Epoch: 1 | Step: 229470 | Dataset: 0-2082701 | Loss: 0.723 | 916 ms/step , 6868.64 GFLOP/s , 17908.8 tokens/s INFO:__main__:2024-11-06 08:12:55 | Epoch: 1 | Step: 229480 | Dataset: 0-2083021 | Loss: 0.794 | 914 ms/step , 6880.20 GFLOP/s , 
17909.3 tokens/s INFO:__main__:2024-11-06 08:13:04 | Epoch: 1 | Step: 229490 | Dataset: 0-2083341 | Loss: 0.715 | 913 ms/step , 6888.56 GFLOP/s , 17904.0 tokens/s INFO:__main__:2024-11-06 08:13:13 | Epoch: 1 | Step: 229500 | Dataset: 0-2083661 | Loss: 0.715 | 914 ms/step , 6881.32 GFLOP/s , 17908.7 tokens/s INFO:__main__:2024-11-06 08:13:15 | Validation | Step: 229500 | Val_loss: 0.716 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:13:24 | Epoch: 1 | Step: 229510 | Dataset: 0-2083981 | Loss: 0.758 | 915 ms/step , 6873.91 GFLOP/s , 15252.4 tokens/s INFO:__main__:2024-11-06 08:13:33 | Epoch: 1 | Step: 229520 | Dataset: 0-2084301 | Loss: 0.704 | 913 ms/step , 6890.93 GFLOP/s , 17912.2 tokens/s INFO:__main__:2024-11-06 08:13:42 | Epoch: 1 | Step: 229530 | Dataset: 0-2084621 | Loss: 0.766 | 913 ms/step , 6885.13 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-06 08:13:52 | Epoch: 1 | Step: 229540 | Dataset: 0-2084941 | Loss: 0.773 | 917 ms/step , 6861.63 GFLOP/s , 17904.6 tokens/s INFO:__main__:2024-11-06 08:14:01 | Epoch: 1 | Step: 229550 | Dataset: 0-2085261 | Loss: 0.724 | 916 ms/step , 6862.94 GFLOP/s , 17909.0 tokens/s INFO:__main__:2024-11-06 08:14:10 | Epoch: 1 | Step: 229560 | Dataset: 0-2085581 | Loss: 0.721 | 914 ms/step , 6884.40 GFLOP/s , 17915.7 tokens/s INFO:__main__:2024-11-06 08:14:19 | Epoch: 1 | Step: 229570 | Dataset: 0-2085901 | Loss: 0.745 | 915 ms/step , 6877.33 GFLOP/s , 17901.7 tokens/s INFO:__main__:2024-11-06 08:14:28 | Epoch: 1 | Step: 229580 | Dataset: 0-2086221 | Loss: 0.790 | 914 ms/step , 6882.05 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-06 08:14:37 | Epoch: 1 | Step: 229590 | Dataset: 0-2086541 | Loss: 0.682 | 914 ms/step , 6878.12 GFLOP/s , 17911.8 tokens/s INFO:__main__:2024-11-06 08:14:46 | Epoch: 1 | Step: 229600 | Dataset: 0-2086861 | Loss: 0.730 | 914 ms/step , 6878.69 GFLOP/s , 17908.3 tokens/s INFO:__main__:2024-11-06 08:14:48 | Validation | Step: 229600 | Val_loss: 0.677 | Best_val_loss: 0.3249 
INFO:__main__:2024-11-06 08:14:57 | Epoch: 1 | Step: 229610 | Dataset: 0-2087181 | Loss: 0.693 | 914 ms/step , 6879.67 GFLOP/s , 15261.9 tokens/s INFO:__main__:2024-11-06 08:15:06 | Epoch: 1 | Step: 229620 | Dataset: 0-2087501 | Loss: 0.734 | 915 ms/step , 6875.49 GFLOP/s , 17913.6 tokens/s INFO:__main__:2024-11-06 08:15:16 | Epoch: 1 | Step: 229630 | Dataset: 0-2087821 | Loss: 0.768 | 915 ms/step , 6876.49 GFLOP/s , 17904.5 tokens/s INFO:__main__:2024-11-06 08:15:25 | Epoch: 1 | Step: 229640 | Dataset: 0-2088141 | Loss: 0.692 | 915 ms/step , 6872.88 GFLOP/s , 17910.5 tokens/s INFO:__main__:2024-11-06 08:15:34 | Epoch: 1 | Step: 229650 | Dataset: 0-2088461 | Loss: 0.686 | 913 ms/step , 6885.62 GFLOP/s , 17898.8 tokens/s INFO:__main__:2024-11-06 08:15:43 | Epoch: 1 | Step: 229660 | Dataset: 0-2088781 | Loss: 0.665 | 914 ms/step , 6881.35 GFLOP/s , 17901.9 tokens/s INFO:__main__:2024-11-06 08:15:52 | Epoch: 1 | Step: 229670 | Dataset: 0-2089101 | Loss: 0.710 | 916 ms/step , 6868.37 GFLOP/s , 17905.1 tokens/s INFO:__main__:2024-11-06 08:16:01 | Epoch: 1 | Step: 229680 | Dataset: 0-2089421 | Loss: 0.592 | 917 ms/step , 6855.10 GFLOP/s , 17908.2 tokens/s INFO:__main__:2024-11-06 08:16:10 | Epoch: 1 | Step: 229690 | Dataset: 0-2089741 | Loss: 0.693 | 913 ms/step , 6886.60 GFLOP/s , 17914.5 tokens/s INFO:__main__:2024-11-06 08:16:20 | Epoch: 1 | Step: 229700 | Dataset: 0-2090061 | Loss: 0.700 | 914 ms/step , 6884.37 GFLOP/s , 17912.0 tokens/s INFO:__main__:2024-11-06 08:16:21 | Validation | Step: 229700 | Val_loss: 0.746 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:16:30 | Epoch: 1 | Step: 229710 | Dataset: 0-2090381 | Loss: 0.770 | 914 ms/step , 6881.97 GFLOP/s , 15256.8 tokens/s INFO:__main__:2024-11-06 08:16:39 | Epoch: 1 | Step: 229720 | Dataset: 0-2090701 | Loss: 0.736 | 913 ms/step , 6885.47 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-06 08:16:49 | Epoch: 1 | Step: 229730 | Dataset: 0-2091021 | Loss: 0.753 | 914 ms/step , 6883.50 GFLOP/s , 17917.4 
tokens/s INFO:__main__:2024-11-06 08:16:58 | Epoch: 1 | Step: 229740 | Dataset: 0-2091341 | Loss: 0.754 | 915 ms/step , 6873.90 GFLOP/s , 17909.4 tokens/s INFO:__main__:2024-11-06 08:17:07 | Epoch: 1 | Step: 229750 | Dataset: 0-2091661 | Loss: 0.666 | 914 ms/step , 6882.50 GFLOP/s , 17903.2 tokens/s INFO:__main__:2024-11-06 08:17:16 | Epoch: 1 | Step: 229760 | Dataset: 0-2091981 | Loss: 0.722 | 916 ms/step , 6868.77 GFLOP/s , 17910.3 tokens/s INFO:__main__:2024-11-06 08:17:25 | Epoch: 1 | Step: 229770 | Dataset: 0-2092301 | Loss: 0.848 | 913 ms/step , 6892.45 GFLOP/s , 17911.4 tokens/s INFO:__main__:2024-11-06 08:17:34 | Epoch: 2 | Step: 229780 | Dataset: 0-218 | Loss: 0.719 | 914 ms/step , 6879.89 GFLOP/s , 17904.1 tokens/s INFO:__main__:2024-11-06 08:17:43 | Epoch: 2 | Step: 229790 | Dataset: 0-538 | Loss: 0.659 | 913 ms/step , 6886.79 GFLOP/s , 17919.7 tokens/s INFO:__main__:2024-11-06 08:17:53 | Epoch: 2 | Step: 229800 | Dataset: 0-858 | Loss: 0.732 | 914 ms/step , 6878.71 GFLOP/s , 17907.2 tokens/s INFO:__main__:2024-11-06 08:17:54 | Validation | Step: 229800 | Val_loss: 0.759 | Best_val_loss: 0.3249 INFO:__main__:2024-11-06 08:18:03 | Epoch: 2 | Step: 229810 | Dataset: 0-1178 | Loss: 0.721 | 915 ms/step , 6876.49 GFLOP/s , 15257.1 tokens/s INFO:__main__:2024-11-06 08:18:13 | Epoch: 2 | Step: 229820 | Dataset: 0-1498 | Loss: 0.722 | 915 ms/step , 6874.75 GFLOP/s , 17908.9 tokens/s INFO:__main__:2024-11-06 08:18:22 | Epoch: 2 | Step: 229830 | Dataset: 0-1818 | Loss: 0.684 | 914 ms/step , 6877.61 GFLOP/s , 17902.8 tokens/s INFO:__main__:2024-11-06 08:18:31 | Epoch: 2 | Step: 229840 | Dataset: 0-2138 | Loss: 0.669 | 915 ms/step , 6874.19 GFLOP/s , 17902.9 tokens/s INFO:__main__:2024-11-06 08:18:40 | Epoch: 2 | Step: 229850 | Dataset: 0-2458 | Loss: 0.721 | 913 ms/step , 6887.86 GFLOP/s , 17919.5 tokens/s INFO:__main__:2024-11-06 08:18:49 | Epoch: 2 | Step: 229860 | Dataset: 0-2778 | Loss: 0.679 | 914 ms/step , 6880.51 GFLOP/s , 17909.4 tokens/s 
INFO:__main__:2024-11-06 08:18:58 | Epoch: 2 | Step: 229870 | Dataset: 0-3098 | Loss: 0.802 | 915 ms/step , 6877.49 GFLOP/s , 17910.5 tokens/s