	[2024-02-27 15:55:39,803] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,622] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,640] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,658] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,683] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,774] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,793] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,859] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:50,879] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2024-02-27 15:55:54,396] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,397] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
	[2024-02-27 15:55:54,397] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,399] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,438] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,466] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,496] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,529] [INFO] [comm.py:637:init_distributed] cdb=None
	[2024-02-27 15:55:54,621] [INFO] [comm.py:637:init_distributed] cdb=None
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	**************************************************************************************************** DistributedType.DEEPSPEED
	[2024-02-27 15:56:09,579] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 273, num_elems = 6.91B
	Is prompt masked: True
	Is prompt masked: True
	Is prompt masked: True
	Is prompt masked: True
	Is prompt masked: True
	Is prompt masked: True
	Is prompt masked: True
	Is prompt masked: True
	************************************************** train_dataset
	[2024-02-27 15:56:34,622] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.11.0, git-hash=unknown, git-branch=unknown
	************************************************** train_dataset
	************************************************** train_dataset
	************************************************** train_dataset
	************************************************** train_dataset
	************************************************** train_dataset
	************************************************** train_dataset
	************************************************** train_dataset
	[2024-02-27 15:56:34,797] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
	[2024-02-27 15:56:34,799] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
	[2024-02-27 15:56:34,799] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
	[2024-02-27 15:56:34,811] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
	[2024-02-27 15:56:34,811] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
	[2024-02-27 15:56:34,811] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
	[2024-02-27 15:56:34,811] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
	[2024-02-27 15:56:34,975] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
	[2024-02-27 15:56:34,976] [INFO] [utils.py:803:see_memory_usage] MA 1.67 GB Max_MA 3.23 GB CA 4.47 GB Max_CA 4 GB
	[2024-02-27 15:56:34,977] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.68 GB, percent = 7.1%
	[2024-02-27 15:56:34,978] [INFO] [stage3.py:126:__init__] Reduce bucket size 16777216
	[2024-02-27 15:56:34,978] [INFO] [stage3.py:127:__init__] Prefetch bucket size 15099494
	[2024-02-27 15:56:35,119] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
	[2024-02-27 15:56:35,120] [INFO] [utils.py:803:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 4.47 GB Max_CA 4 GB
	[2024-02-27 15:56:35,120] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.68 GB, percent = 7.1%
	Parameter Offload: Total persistent parameters: 249856 in 61 params
	[2024-02-27 15:56:35,295] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
	[2024-02-27 15:56:35,296] [INFO] [utils.py:803:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 4.47 GB Max_CA 4 GB
	[2024-02-27 15:56:35,297] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.68 GB, percent = 7.1%
	[2024-02-27 15:56:35,427] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
	[2024-02-27 15:56:35,427] [INFO] [utils.py:803:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 4.47 GB Max_CA 4 GB
	[2024-02-27 15:56:35,428] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.68 GB, percent = 7.1%
	[2024-02-27 15:56:37,252] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
	[2024-02-27 15:56:37,253] [INFO] [utils.py:803:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 1.67 GB Max_CA 4 GB
	[2024-02-27 15:56:37,253] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 55.98 GB, percent = 8.9%
	[2024-02-27 15:56:37,404] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
	[2024-02-27 15:56:37,404] [INFO] [utils.py:803:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 1.67 GB Max_CA 2 GB
	[2024-02-27 15:56:37,405] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 55.98 GB, percent = 8.9%
	[2024-02-27 15:56:38,537] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
	[2024-02-27 15:56:38,538] [INFO] [utils.py:803:see_memory_usage] MA 4.89 GB Max_MA 6.5 GB CA 6.5 GB Max_CA 6 GB
	[2024-02-27 15:56:38,538] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 52.73 GB, percent = 8.4%
	[2024-02-27 15:56:39,212] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
	[2024-02-27 15:56:39,213] [INFO] [utils.py:803:see_memory_usage] MA 4.89 GB Max_MA 4.89 GB CA 6.5 GB Max_CA 6 GB
	[2024-02-27 15:56:39,213] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.68 GB, percent = 7.1%
	[2024-02-27 15:56:39,404] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
	[2024-02-27 15:56:39,405] [INFO] [utils.py:803:see_memory_usage] MA 11.32 GB Max_MA 20.98 GB CA 22.59 GB Max_CA 23 GB
	[2024-02-27 15:56:39,405] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.68 GB, percent = 7.1%
	[2024-02-27 15:56:39,406] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized
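The jumps in allocated memory (MA) above can be roughly cross-checked against the parameter count. The sketch below is a back-of-the-envelope estimate, not DeepSpeed's own accounting. It assumes the reported "GB" figures are binary gigabytes (GiB), that the 6.91B elements are sharded evenly across the 8 ranks, and that AdamW keeps two fp32 states per parameter. The helper name `shard_gib` is made up for illustration.

```python
# Rough per-rank memory estimate for ZeRO stage 3 sharding.
# Assumptions (not taken from DeepSpeed internals): even sharding across
# ranks, "GB" in the log means GiB, AdamW holds two fp32 states per param.

GIB = 2**30
NUM_ELEMS = 6.91e9   # "num_elems = 6.91B" from the log
WORLD_SIZE = 8       # "world_size ... 8" from the log


def shard_gib(bytes_per_elem: int) -> float:
    """Size in GiB of one rank's shard at the given bytes per element."""
    return NUM_ELEMS * bytes_per_elem / WORLD_SIZE / GIB


bf16_params = shard_gib(2)   # 16-bit parameter partition
fp32_master = shard_gib(4)   # fp32 master copy of the parameters
adam_states = shard_gib(8)   # exp_avg + exp_avg_sq, 4 bytes each

# MA deltas read off the see_memory_usage lines in the log.
observed_fp32_delta = 4.89 - 1.67    # before -> after fp32 partitions
observed_adam_delta = 11.32 - 4.89   # before -> after optimizer states

print(f"bf16 shard  ~ {bf16_params:.2f} GiB")
print(f"fp32 master ~ {fp32_master:.2f} GiB (log delta {observed_fp32_delta:.2f})")
print(f"Adam states ~ {adam_states:.2f} GiB (log delta {observed_adam_delta:.2f})")
```

Under these assumptions the estimated fp32 and optimizer-state shards land within a few hundredths of a GiB of the deltas the log reports, which suggests the memory growth here is dominated by the sharded master weights and AdamW states.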
	[2024-02-27 15:56:43,265] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
	[2024-02-27 15:56:43,265] [INFO] [utils.py:803:see_memory_usage] MA 12.96 GB Max_MA 14.53 GB CA 24.98 GB Max_CA 25 GB
	[2024-02-27 15:56:43,266] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.7 GB, percent = 7.1%
	[2024-02-27 15:56:43,266] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
	[2024-02-27 15:56:43,266] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
	[2024-02-27 15:56:43,266] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
	[2024-02-27 15:56:43,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
	[2024-02-27 15:56:43,267] [INFO] [config.py:968:print] DeepSpeedEngine configuration:
	[2024-02-27 15:56:43,268] [INFO] [config.py:972:print] activation_checkpointing_config {
	"partition_activations": false,
	"contiguous_memory_optimization": false,
	"cpu_checkpointing": false,
	"number_checkpoints": null,
	"synchronize_checkpoint_boundary": false,
	"profile": false
	}
	[2024-02-27 15:56:43,268] [INFO] [config.py:972:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
	[2024-02-27 15:56:43,268] [INFO] [config.py:972:print] amp_enabled .................. False
	[2024-02-27 15:56:43,268] [INFO] [config.py:972:print] amp_params ................... False
	[2024-02-27 15:56:43,268] [INFO] [config.py:972:print] autotuning_config ............ {
	"enabled": false,
	"start_step": null,
	"end_step": null,
	"metric_path": null,
	"arg_mappings": null,
	"metric": "throughput",
	"model_info": null,
	"results_dir": "autotuning_results",
	"exps_dir": "autotuning_exps",
	"overwrite": true,
	"fast": true,
	"start_profile_step": 3,
	"end_profile_step": 5,
	"tuner_type": "gridsearch",
	"tuner_early_stopping": 5,
	"tuner_num_trials": 50,
	"model_info_path": null,
	"mp_size": 1,
	"max_train_batch_size": null,
	"min_train_batch_size": 1,
	"max_train_micro_batch_size_per_gpu": 1.024000e+03,
	"min_train_micro_batch_size_per_gpu": 1,
	"num_tuning_micro_batch_sizes": 3
	}
	[2024-02-27 15:56:43,268] [INFO] [config.py:972:print] bfloat16_enabled ............. True
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] checkpoint_parallel_write_pipeline False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] checkpoint_tag_validation_enabled True
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] checkpoint_tag_validation_fail False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f15566e2920>
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] communication_data_type ...... None
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] curriculum_enabled_legacy .... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] curriculum_params_legacy ..... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] data_efficiency_enabled ...... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] dataloader_drop_last ......... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] disable_allgather ............ False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] dump_state ................... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] dynamic_loss_scale_args ...... None
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_enabled ........... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_gas_boundary_resolution 1
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_layer_name ........ bert.encoder.layer
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_layer_num ......... 0
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_max_iter .......... 100
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_stability ......... 1e-06
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_tol ............... 0.01
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] eigenvalue_verbose ........... False
	[2024-02-27 15:56:43,269] [INFO] [config.py:972:print] elasticity_enabled ........... False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] flops_profiler_config ........ {
	"enabled": false,
	"recompute_fwd_factor": 0.0,
	"profile_step": 1,
	"module_depth": -1,
	"top_modules": 1,
	"detailed": true,
	"output_file": null
	}
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] fp16_auto_cast ............... None
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] fp16_enabled ................. False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] fp16_master_weights_and_gradients False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] global_rank .................. 0
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] grad_accum_dtype ............. None
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] gradient_accumulation_steps .. 4
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] gradient_clipping ............ 1.0
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] gradient_predivide_factor .... 1.0
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] initial_dynamic_scale ........ 1
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] load_universal_checkpoint .... False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] loss_scale ................... 1.0
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] memory_breakdown ............. False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] mics_hierarchial_params_gather False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] mics_shard_size .............. -1
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] nebula_config ................ {
	"enabled": false,
	"persistent_storage_path": null,
	"persistent_time_interval": 100,
	"num_of_version_in_retention": 2,
	"enable_nebula_load": true,
	"load_path": null
	}
	[2024-02-27 15:56:43,270] [INFO] [config.py:972:print] optimizer_legacy_fusion ...... False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] optimizer_name ............... None
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] optimizer_params ............. None
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] pld_enabled .................. False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] pld_params ................... False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] prescale_gradients ........... False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] scheduler_name ............... None
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] scheduler_params ............. None
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] sparse_attention ............. None
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] sparse_gradients_enabled ..... False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] steps_per_print .............. inf
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] train_batch_size ............. 128
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] train_micro_batch_size_per_gpu 4
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] use_node_local_storage ....... False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] wall_clock_breakdown ......... False
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] weight_quantization_config ... None
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] world_size ................... 8
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] zero_allow_untested_optimizer True
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] zero_enabled ................. True
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] zero_force_ds_cpu_optimizer .. True
	[2024-02-27 15:56:43,271] [INFO] [config.py:972:print] zero_optimization_stage ...... 3
	[2024-02-27 15:56:43,272] [INFO] [config.py:958:print_user_config] json = {
	"bf16": {
	"enabled": true
	},
	"zero_optimization": {
	"stage": 3,
	"overlap_comm": true,
	"contiguous_gradients": true,
	"sub_group_size": 1.000000e+09,
	"reduce_bucket_size": 1.677722e+07,
	"stage3_prefetch_bucket_size": 1.509949e+07,
	"stage3_param_persistence_threshold": 4.096000e+04,
	"stage3_max_live_parameters": 1.000000e+09,
	"stage3_max_reuse_distance": 1.000000e+09,
	"stage3_gather_16bit_weights_on_model_save": true
	},
	"gradient_accumulation_steps": 4,
	"gradient_clipping": 1.0,
	"steps_per_print": inf,
	"train_batch_size": 128,
	"train_micro_batch_size_per_gpu": 4,
	"wall_clock_breakdown": false,
	"fp16": {
	"enabled": false
	},
	"zero_allow_untested_optimizer": true
	}
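The batch settings in the config above are mutually consistent: the global batch size equals the per-GPU micro-batch size times the gradient accumulation steps times the world size. A minimal sketch of that check, using the values printed in this log (the function name `expected_train_batch_size` is hypothetical):

```python
# Consistency check for the batch-size fields printed in the config dump.
# The values come from the log; the helper itself is illustrative only.

def expected_train_batch_size(micro_batch_per_gpu: int,
                              grad_accum_steps: int,
                              world_size: int) -> int:
    """Global batch size implied by the three per-step settings."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


logged = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "world_size": 8,
    "train_batch_size": 128,
}

implied = expected_train_batch_size(
    logged["train_micro_batch_size_per_gpu"],
    logged["gradient_accumulation_steps"],
    logged["world_size"],
)
print(implied == logged["train_batch_size"])
```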
	[2024-02-27 16:14:40,687] [WARNING] [stage3.py:1947:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-27 16:52:42,926] [WARNING] [stage3.py:1947:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-27 16:53:38,504] [WARNING] [stage3.py:1947:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-27 23:41:27,648] [WARNING] [stage3.py:1947:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-27 23:52:36,654] [WARNING] [stage3.py:1947:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-27 23:53:33,055] [WARNING] [stage3.py:1947:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-27 23:54:28,758] [WARNING] [stage3.py:1947:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-28 00:20:20,365] [WARNING] [stage3.py:1947:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
	[2024-02-28 00:24:03,965] [WARNING] [stage3.py:1947:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
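The warnings above suggest calling `get_accelerator().empty_cache()` in the training loop so that all ranks flush their allocator caches at the same step. The sketch below shows one way to do that on a fixed schedule. It is an assumption-laden outline, not the training script that produced this log: `train_step`, `batches`, and `flush_every` are hypothetical names, and the cache-flush function is passed in as a callable so the loop stays independent of any particular backend.

```python
# Periodic, rank-synchronous cache flush (illustrative sketch).
# Every rank runs the same loop and flushes on the same step numbers,
# which is the behaviour the DeepSpeed warning recommends.
from typing import Callable, Iterable


def train_with_periodic_flush(batches: Iterable,
                              train_step: Callable[[object], None],
                              empty_cache: Callable[[], None],
                              flush_every: int = 50) -> int:
    """Run train_step on each batch; call empty_cache every flush_every steps.

    Returns the number of flushes performed.
    """
    flushes = 0
    for step, batch in enumerate(batches, start=1):
        train_step(batch)
        if step % flush_every == 0:
            empty_cache()
            flushes += 1
    return flushes


# Stand-in callables so the sketch runs without a GPU or DeepSpeed installed.
calls = []
n_flushes = train_with_periodic_flush(
    batches=range(10),
    train_step=lambda batch: None,
    empty_cache=lambda: calls.append("flush"),
    flush_every=5,
)
print(n_flushes)
```

In a real DeepSpeed run, `empty_cache` would be `get_accelerator().empty_cache`, with `get_accelerator` imported from `deepspeed.accelerator`. Flushing too often costs throughput, so the interval is a tuning choice. With only nine warnings over roughly eight hours in this log, reducing memory pressure (for example a smaller micro-batch) may be the simpler fix.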