A little question about aux_loss

#4
by WaitHZ - opened

At line 334, why multiply by the number of experts instead of dividing by it?

fi = ce * self.n_routed_experts

When multiple experts are selected per token, it seems that the expected value of ce is several times that of Pi?

DeepSeek org

The computation is correct. fi is the ratio of the actual number of selections of expert i to the number expected under perfectly balanced routing: fi = n_actual / n_expected = n_actual / (n_total_tokens * top_k / n_routed_experts) = n_actual * n_routed_experts / (n_total_tokens * top_k).
ce computes n_actual / (n_total_tokens * top_k) (note that the first dimension of mask_ce is n_total_tokens * top_k), so ce needs to be multiplied by n_routed_experts to obtain fi.
You can generate a case to verify this computation.
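One way to build such a case is sketched below in plain Python rather than with the model's tensors. The helper `load_fractions`, the variable `topk_idx`, and the sizes used are illustrative assumptions and are not taken from the modeling file; only the formulas for ce and fi follow the explanation above. With perfectly balanced routing every fi should come out as 1, and for any routing the fi values should sum to n_routed_experts.

```python
import random


def load_fractions(topk_idx, n_routed_experts):
    """Return (ce, fi) for a list of per-token lists of selected expert ids.

    Hypothetical helper for illustration only.
    """
    # Flatten to one entry per (token, slot): length n_total_tokens * top_k.
    flat = [expert for token in topk_idx for expert in token]
    counts = [0] * n_routed_experts
    for expert in flat:
        counts[expert] += 1
    # ce = n_actual / (n_total_tokens * top_k)
    ce = [c / len(flat) for c in counts]
    # fi = n_actual / n_expected = ce * n_routed_experts
    fi = [c * n_routed_experts for c in ce]
    return ce, fi


n_routed_experts, top_k, n_total_tokens = 8, 2, 1000

# Case 1: perfectly balanced routing, each expert is chosen equally often.
balanced_idx = [
    [(top_k * t + j) % n_routed_experts for j in range(top_k)]
    for t in range(n_total_tokens)
]
ce_balanced, fi_balanced = load_fractions(balanced_idx, n_routed_experts)
print("balanced fi:", fi_balanced)

# Case 2: random routing, top_k distinct experts per token.
rng = random.Random(0)
random_idx = [
    rng.sample(range(n_routed_experts), top_k) for _ in range(n_total_tokens)
]
ce_random, fi_random = load_fractions(random_idx, n_routed_experts)
print("random fi:", [round(f, 3) for f in fi_random])
print("sum of ce:", sum(ce_random), "sum of fi:", sum(fi_random))
```

In this sketch ce sums to 1 over the experts regardless of top_k, because the denominator already includes top_k. Dividing by n_routed_experts instead would shrink fi toward 1 / n_routed_experts**2 in the balanced case, rather than normalizing it to 1.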

WaitHZ changed discussion status to closed
