Clarification on SwiGLU Implementation and intermediate_size Configuration
Hello! I'm working with the RNABert model and hit a dimension mismatch error when using the SwiGLU activation function. I believe the root cause is an ambiguity in how `intermediate_size` is defined in the config file when SwiGLU is enabled. Here's a detailed breakdown:
Problem Description
Model Structure
- `RnaBertIntermediate.dense`: `Linear(hidden_size=1280 -> intermediate_size=3392)`
- Activation: `SwiGLU` splits the output into two chunks (3392 → 1696 + 1696), processes them, and returns a tensor of shape `[..., 1696]`.
- `RnaBertOutput.dense`: Expects input dim `3392` but receives `1696`, causing a matrix multiplication error: `RuntimeError: mat1 and mat2 shapes cannot be multiplied (24576x1696 and 3392x1280)`
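To make the mismatch concrete, here is a minimal shape trace in plain Python. It only follows the last dimension through the three steps listed above. The helper names `swiglu_out_dim` and `trace_ffn` are made up for this illustration and are not part of the model code, and the sketch assumes SwiGLU halves its input by chunking, as described above.

```python
def swiglu_out_dim(in_dim: int) -> int:
    """Last-dim size after a chunk-style SwiGLU: split in half, gate, keep one half."""
    if in_dim % 2 != 0:
        raise ValueError("chunk-style SwiGLU needs an even input dimension")
    return in_dim // 2


def trace_ffn(hidden_size: int, dense_out: int, output_in: int) -> dict:
    """Follow the last dimension through dense -> SwiGLU -> output dense."""
    after_act = swiglu_out_dim(dense_out)
    return {
        "intermediate.dense": (hidden_size, dense_out),
        "after SwiGLU": after_act,
        "output.dense expects": output_in,
        # The output projection can only run if these two widths agree.
        "compatible": after_act == output_in,
    }


# Values taken from the error above: dense emits 3392, output.dense expects 3392.
report = trace_ffn(hidden_size=1280, dense_out=3392, output_in=3392)
print(report)
```

With the current values the trace reports 1696 after the activation against an expected 3392, which matches the two inner dimensions in the `RuntimeError`.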
Configuration Ambiguity
- The config parameter `intermediate_size=3392` seems to conflict with SwiGLU's design.
  - For SwiGLU, `intermediate_size` should represent the post-activation dimension (e.g., `1696`), while the pre-activation layer should output `2 * intermediate_size = 3392`.
Proposed Fixes
To resolve this, either:
- Option 1: Keep `intermediate_size=3392` in the config but modify `RnaBertIntermediate.dense` to output `2 * 3392 = 6784`, then split into `3392 + 3392` for SwiGLU.
- Option 2: Set `intermediate_size=1696` in the config, and let `RnaBertIntermediate.dense` output `2 * 1696 = 3392` (current code behavior), compatible with `RnaBertOutput.dense`.
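The two options can be compared with a small sketch that computes the resulting layer shapes. This is only an illustration under the assumption that `intermediate_size` names the post-activation width and the dense layer emits twice that. `FfnShapes` and `ffn_shapes` are hypothetical names, not classes from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FfnShapes:
    dense: tuple   # (in_features, out_features) of RnaBertIntermediate.dense
    gated: int     # last-dim size after SwiGLU halves the dense output
    output: tuple  # (in_features, out_features) of RnaBertOutput.dense

    def weight_count(self) -> int:
        # Weight matrices only; biases are left out to keep the comparison simple.
        return self.dense[0] * self.dense[1] + self.output[0] * self.output[1]


def ffn_shapes(hidden_size: int, intermediate_size: int) -> FfnShapes:
    """Shapes when intermediate_size names the post-activation width (assumed)."""
    return FfnShapes(
        dense=(hidden_size, 2 * intermediate_size),
        gated=intermediate_size,
        output=(intermediate_size, hidden_size),
    )


option_1 = ffn_shapes(hidden_size=1280, intermediate_size=3392)
option_2 = ffn_shapes(hidden_size=1280, intermediate_size=1696)

for name, shapes in (("Option 1", option_1), ("Option 2", option_2)):
    print(name, shapes, "weights:", shapes.weight_count())
```

Read this way, both options follow the same rule and differ only in the configured value, so Option 1 ends up with twice the feed-forward weights of Option 2. Only Option 2 keeps the pre-activation layer at its current `1280 -> 3392` shape.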
Question:
Could you clarify the intended definition of `intermediate_size` when using SwiGLU? Should it represent the pre-activation dimension (e.g., `3392`, requiring code adjustments) or the post-activation dimension (e.g., `1696`, requiring config adjustments)?
Thank you for your guidance!
P.S. Because I was afraid I could not describe the key point of the problem clearly myself, this question was written by DeepSeek-R1. Sorry about that.