A little question about aux_loss

by WaitHZ - opened Mar 7, 2024

Mar 7, 2024

Line 334: why is ce multiplied by the number of experts instead of divided by it?

fi = ce * self.n_routed_experts

When each token selects multiple experts, it seems that the mean of ce is several times that of Pi?

DeepSeekDDM

DeepSeek org Mar 19, 2024

The computation is correct. fi = n_actual / n_expected = n_actual / (n_total_tokens * top_k / n_routed_experts) = n_actual * n_routed_experts / (n_total_tokens * top_k).
ce computes n_actual / (n_total_tokens * top_k) (note that the first dimension of mask_ce is n_total_tokens * top_k), so ce needs to be multiplied by n_routed_experts to obtain fi. Because of that normalization, ce sums to 1 over the experts regardless of top_k.
You can generate a case to verify this computation.
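As one way to build such a case, here is a minimal pure-Python sketch (not the model's actual PyTorch code). The helper `load_stats` and the two toy routings are hypothetical names introduced only for illustration; the sketch assumes ce is the per-expert count divided by n_total_tokens * top_k, as described above.

```python
def load_stats(topk_idx, n_routed_experts):
    """Return (ce, fi) for per-token lists of selected expert indices."""
    # Flatten to n_total_tokens * top_k selections, like the first dimension of mask_ce.
    flat = [e for token in topk_idx for e in token]
    counts = [0] * n_routed_experts
    for e in flat:
        counts[e] += 1
    # ce = n_actual / (n_total_tokens * top_k)
    ce = [c / len(flat) for c in counts]
    # fi = ce * n_routed_experts = n_actual / n_expected
    fi = [c * n_routed_experts for c in ce]
    return ce, fi


n_routed_experts, top_k, n_total_tokens = 8, 2, 16

# Hypothetical balanced routing: every expert is selected equally often.
balanced = [
    [(2 * t) % n_routed_experts, (2 * t + 1) % n_routed_experts]
    for t in range(n_total_tokens)
]
ce_bal, fi_bal = load_stats(balanced, n_routed_experts)

# Hypothetical collapsed routing: every token selects experts 0 and 1.
skewed = [[0, 1] for _ in range(n_total_tokens)]
ce_skew, fi_skew = load_stats(skewed, n_routed_experts)

print(fi_bal)   # balanced load
print(fi_skew)  # collapsed load
```

With balanced routing every entry of fi is 1.0, and with the collapsed routing experts 0 and 1 get 4.0 while the rest get 0.0. Dividing by n_routed_experts instead would make the balanced value depend on the number of experts rather than normalizing it to 1.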

WaitHZ

Mar 21, 2024

thanks

WaitHZ changed discussion status to closed Mar 21, 2024
