⚠️ Training Loss Drops to 0.0 and Validation Loss Becomes NaN When Fine-Tuning ModernBERT
#83 · opened by onurulu17
Hi everyone,
I’m fine-tuning ModernBERT-base on a custom text corpus and running into a serious issue where the training loss suddenly collapses to 0.0 and the validation loss turns into NaN after several thousand steps.
Here’s what I’m seeing in the logs:
| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| ...  | ...           | ...             |
| 4000 | 45.5973       | 43.5447         |
| 4500 | 35.7538       | 34.0837         |
| 5000 | 0.0000        | nan             |
| 5500 | 0.0000        | nan             |
| 6000 | 0.0000        | nan             |
| 6500 | 0.0000        | nan             |
| 7000 | 0.0000        | nan             |
After around step 5000, the loss flatlines at 0.0 and never recovers; the model stops updating entirely.
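In case it helps anyone reproduce or debug this, here is a minimal sketch of a `TrainerCallback` that halts training the moment the logged loss hits 0.0 or NaN. The class name `NanLossGuard` is just a placeholder of mine; the callback hooks themselves are the standard `transformers` API:

```python
import math

from transformers import TrainerCallback


class NanLossGuard(TrainerCallback):
    """Stop training as soon as the reported training loss is 0.0 or NaN."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            print(f"Suspicious loss {loss} at step {state.global_step}; stopping training.")
            control.should_training_stop = True
        return control


# Usage: pass it to the Trainer, e.g.
# trainer = Trainer(model=model, args=training_args, callbacks=[NanLossGuard()], ...)
```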
My training setup includes (reproduced as code right after this list):
- `bf16=True` (on an A100 GPU)
- `gradient_checkpointing=True`
- `lr_scheduler_type="cosine"`
- `learning_rate=3e-5`
- `per_device_train_batch_size=8`
- `gradient_accumulation_steps=9`
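For completeness, this is roughly how those arguments look in code; `output_dir` is a placeholder, and the logging/eval cadence of 500 steps is inferred from the log table above:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="modernbert-finetune",  # placeholder path
    bf16=True,                         # bfloat16 on an A100
    gradient_checkpointing=True,
    lr_scheduler_type="cosine",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=9,     # effective batch size of 72 per update
    logging_steps=500,                 # matches the 500-step intervals in the log
    eval_strategy="steps",             # `evaluation_strategy` on older transformers
    eval_steps=500,
)
```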
Does anyone know the cause or have a solution for this issue?