⚠️ Training Loss Drops to 0.0 and Validation Loss Becomes NaN When Fine-Tuning ModernBERT

#83 opened by onurulu17

Hi everyone,

I’m fine-tuning ModernBERT-base on a custom text corpus and running into a serious issue where the training loss suddenly collapses to 0.0 and the validation loss turns into NaN after several thousand steps.

Here’s what I’m seeing in the logs:

| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
| ...  | ...           | ...             |
| 4000 | 45.5973       | 43.5447         |
| 4500 | 35.7538       | 34.0837         |
| 5000 | 0.0000        | nan             |
| 5500 | 0.0000        | nan              |
| 6000 | 0.0000        | nan             |
| 6500 | 0.0000        | nan             |
| 7000 | 0.0000        | nan             |
After around step 5000, the loss drops to exactly 0.0 and never recovers; the model effectively stops updating.
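
To pinpoint the step where things go wrong, a small callback can halt the run as soon as the logged loss goes non-finite or hits exactly 0.0. This is a minimal sketch against the standard transformers `TrainerCallback` API; the class name and the exact stopping condition are my own choices, not part of my original script:

```python
import math

from transformers import TrainerCallback


class LossCollapseCallback(TrainerCallback):
    """Stop training as soon as the logged loss is NaN/inf or exactly 0.0."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` contains the running training loss under the "loss" key.
        if logs and "loss" in logs:
            loss = logs["loss"]
            if not math.isfinite(loss) or loss == 0.0:
                print(f"Loss collapsed to {loss} at step {state.global_step}; stopping.")
                control.should_training_stop = True
        return control
```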

My training setup includes the following (a runnable sketch follows the list):

  • bf16=True (on A100 GPU)
  • gradient_checkpointing=True
  • lr_scheduler_type="cosine"
  • learning_rate=3e-5
  • per_device_train_batch_size=8
  • gradient_accumulation_steps=9
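
For context, here is roughly how those settings map onto `TrainingArguments` and `Trainer`. This is a minimal sketch, not my exact script: I'm assuming an MLM objective, and the corpus files, output directory, and eval/logging intervals are placeholders:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)  # assuming an MLM head

# Placeholder corpus files; the real run uses my custom corpus.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="./modernbert-finetune",
    bf16=True,                       # A100
    gradient_checkpointing=True,
    lr_scheduler_type="cosine",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=9,   # effective batch size of 72
    eval_strategy="steps",           # `evaluation_strategy` on older transformers
    eval_steps=500,
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
    callbacks=[LossCollapseCallback()],  # diagnostic callback from the sketch above
)
trainer.train()
```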

Does anyone know the cause or have a solution for this issue?
