A little question about aux_loss
#4 by WaitHZ - opened
At line 334, why is ce multiplied by the number of experts instead of divided by it?
fi = ce * self.n_routed_experts
When each token selects multiple experts, it seems that the average value of ce is several times that of Pi?
The computation is correct. fi is the ratio of the number of tokens an expert actually receives to the number it would receive under uniform routing: fi = n_actual / n_expected = n_actual / (n_total_tokens * top_k / n_routed_experts) = n_actual * n_routed_experts / (n_total_tokens * top_k).
ce computes n_actual / (n_total_tokens * top_k) (note that the first dimension of mask_ce is n_total_tokens * top_k), so ce needs to be multiplied by n_routed_experts to obtain fi.
You can generate a case to verify this computation.
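One possible way to build such a case is sketched below in plain Python. It does not use the model's own code: the helper `load_fractions`, the expert/token counts, and the three routing patterns are all hypothetical, chosen only to mirror the formula above (the flattened list plays the role of the first dimension of mask_ce).

```python
import random


def load_fractions(topk_idx, n_routed_experts):
    """Hypothetical helper: return (ce, fi) per expert.

    topk_idx is a list with one entry per token, each entry being the list
    of expert indices selected for that token.
    """
    # Flatten to n_total_tokens * top_k assignments.
    flat = [e for token in topk_idx for e in token]
    counts = [0] * n_routed_experts
    for e in flat:
        counts[e] += 1
    # ce = n_actual / (n_total_tokens * top_k)
    ce = [c / len(flat) for c in counts]
    # fi = n_actual / n_expected = ce * n_routed_experts
    fi = [c * n_routed_experts for c in ce]
    return ce, fi


# Assumed sizes for illustration only.
n_routed_experts, top_k, n_total_tokens = 8, 2, 1000

# Case 1: perfectly balanced routing, every expert gets the same load.
balanced = [
    [(t * top_k + j) % n_routed_experts for j in range(top_k)]
    for t in range(n_total_tokens)
]
ce_bal, fi_bal = load_fractions(balanced, n_routed_experts)
print(fi_bal)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Case 2: collapsed routing, every token goes to experts 0 and 1.
collapsed = [[0, 1] for _ in range(n_total_tokens)]
ce_col, fi_col = load_fractions(collapsed, n_routed_experts)
print(fi_col)  # [4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Case 3: random routing, each token picks top_k distinct experts.
rng = random.Random(0)
random_routing = [
    rng.sample(range(n_routed_experts), top_k) for _ in range(n_total_tokens)
]
ce_rand, fi_rand = load_fractions(random_routing, n_routed_experts)
print(sum(ce_rand), sum(fi_rand) / n_routed_experts)  # both close to 1
```

In this sketch ce always sums to 1 across experts, so multiplying by n_routed_experts gives an fi whose mean is 1, with fi = 1 for every expert under balanced routing. Dividing instead would push fi toward 1 / n_routed_experts^2, which would no longer read as a ratio of actual to expected load.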
thanks
WaitHZ changed discussion status to closed