fix-grid-limits #2
opened by 3outeille (HF Staff)
This fixes a crash for people who train with megablocks at `seqlen=4096`. That setup yields `Triton Error [CUDA]: invalid argument` at `_binned_copy[(num_experts, expert_capacity)]`, because `expert_capacity` sits in the second grid dimension, which CUDA caps at 65535 (as per the CUDA docs).

`expert_capacity` gets that large because `tokens_per_expert = top_k * tokens * world_size / num_experts`. We can't change `top_k` or `num_experts`, as most models have been trained with those specific values.

One simple fix is to swap the grid dims of the kernels, since the 1st dim has a hard limit of 2^31-1. `num_experts` then moves to the second dim, and it rarely gets anywhere near 65535 anyway.
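To make the arithmetic concrete, here is a minimal Python sketch that checks a launch grid against the CUDA grid limits before and after the swap. The helper names (`tokens_per_expert`, `grid_fits`) and the example values for batch size, `top_k`, `num_experts` and `world_size` are hypothetical and chosen only for illustration. The sketch also assumes `expert_capacity` equals `tokens_per_expert`, which may not match how megablocks derives it in every configuration.

```python
# CUDA grid limits: the x dimension allows up to 2**31 - 1 blocks,
# while the y and z dimensions allow at most 65535 each.
MAX_GRID_X = 2**31 - 1
MAX_GRID_YZ = 65535


def tokens_per_expert(top_k, tokens, world_size, num_experts):
    # Formula quoted in the PR description.
    return top_k * tokens * world_size // num_experts


def grid_fits(grid):
    # True if every grid dimension is within the CUDA limit for its axis.
    limits = (MAX_GRID_X, MAX_GRID_YZ, MAX_GRID_YZ)
    return all(1 <= dim <= limit for dim, limit in zip(grid, limits))


# Hypothetical training setup (illustrative values, not from the PR).
num_experts = 64
top_k = 8
tokens = 16 * 4096  # assumed batch of 16 sequences at seqlen=4096
world_size = 8

# Assumption: expert_capacity is taken to be tokens_per_expert.
expert_capacity = tokens_per_expert(top_k, tokens, world_size, num_experts)

original_grid = (num_experts, expert_capacity)  # capacity in the 65535-capped dim
swapped_grid = (expert_capacity, num_experts)   # capacity in the 2**31 - 1 dim

print(expert_capacity)
print(grid_fits(original_grid))
print(grid_fits(swapped_grid))
```

With these values `expert_capacity` lands one past the second-dimension cap, so only the swapped grid is launchable. Swapping the grid also means the kernel has to read the capacity index from `tl.program_id(0)` and the expert index from `tl.program_id(1)`.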
3outeille changed pull request status to open