Clarification on SwiGLU Implementation and intermediate_size Configuration

#2
by DualK - opened

Hello! I’m working with the RNABert model and hit a dimension mismatch error when using the SwiGLU activation function. I believe the root cause is that the config file does not make clear what intermediate_size means when SwiGLU is enabled. Here’s a detailed breakdown:

Problem Description

  1. Model Structure

    • RnaBertIntermediate.dense: Linear(hidden_size=1280 -> intermediate_size=3392)
    • Activation: SwiGLU splits the output into two chunks (3392 → 1696 + 1696), processes them, and returns a tensor of shape [..., 1696].
    • RnaBertOutput.dense: Expects input dim 3392 but receives 1696, causing a matrix multiplication error:
      RuntimeError: mat1 and mat2 shapes cannot be multiplied (24576x1696 and 3392x1280)  
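
The shape bookkeeping behind this error can be traced without loading the model. The snippet below is only a sketch in plain Python: the helper name swiglu_ffn_shapes is hypothetical (it is not part of the model code), and it assumes SwiGLU halves the last dimension as described above.

```python
def swiglu_ffn_shapes(rows, hidden_size, intermediate_size):
    """Trace 2-D shapes through dense -> SwiGLU -> output dense.

    Hypothetical helper for illustration; it does no tensor math,
    it only follows the shapes described in this post.
    """
    # RnaBertIntermediate.dense: hidden_size -> intermediate_size
    up_out = (rows, intermediate_size)
    # SwiGLU splits the last dim into two halves and returns one half-width tensor
    gated = (rows, intermediate_size // 2)
    # Weight shape as printed in the error message (mat2)
    down_weight = (intermediate_size, hidden_size)
    # A matmul needs gated's last dim to equal down_weight's first dim
    compatible = gated[1] == down_weight[0]
    return up_out, gated, down_weight, compatible


up_out, gated, down_weight, ok = swiglu_ffn_shapes(24576, 1280, 3392)
print(up_out, gated, down_weight, ok)
# (24576, 3392) (24576, 1696) (3392, 1280) False
```

The 24576x1696 and 3392x1280 shapes match mat1 and mat2 in the RuntimeError above.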
      
  2. Configuration Ambiguity

    • The config parameter intermediate_size=3392 seems to conflict with SwiGLU’s design.
    • For SwiGLU, intermediate_size should represent the post-activation dimension (e.g., 1696), while the pre-activation layer should output 2 * intermediate_size = 3392.

Proposed Fixes

To resolve this, either:

  • Option 1: Keep intermediate_size=3392 in the config but modify RnaBertIntermediate.dense to output 2 * 3392 = 6784, then split into 3392 + 3392 for SwiGLU.
  • Option 2: Set intermediate_size=1696 in the config, and let RnaBertIntermediate.dense output 2 * 1696 = 3392 (current code behavior), compatible with RnaBertOutput.dense.
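
To make the difference between the two options concrete, here is a rough sketch of the layer shapes each one implies. The function ffn_widths is a hypothetical helper, not code from the repository, and it assumes that in both options the first dense layer outputs twice the post-activation width; bias terms are ignored in the weight count.

```python
def ffn_widths(hidden_size, post_activation_width):
    """Layer shapes for a SwiGLU feed-forward block (illustrative only).

    post_activation_width is the feature size after SwiGLU, i.e. the
    input size of the output dense layer.
    """
    up = (hidden_size, 2 * post_activation_width)    # dense before SwiGLU (in, out)
    down = (post_activation_width, hidden_size)      # dense after SwiGLU (in, out)
    weights = up[0] * up[1] + down[0] * down[1]      # weight count, biases excluded
    return {"up": up, "down": down, "weights": weights}


option_1 = ffn_widths(1280, 3392)  # intermediate_size=3392 means post-activation width
option_2 = ffn_widths(1280, 1696)  # intermediate_size=1696 means post-activation width
print(option_1)
print(option_2)
```

Under these assumptions, Option 1 ends up with about twice as many feed-forward weights as Option 2, so the two choices are not interchangeable for an existing checkpoint.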

Question:
Could you clarify the intended definition of intermediate_size when using SwiGLU? Should it represent the pre-activation dimension (e.g., 3392, requiring code adjustments) or the post-activation dimension (e.g., 1696, requiring config adjustments)?

Thank you for your guidance!
P.S. I was worried that I could not describe the key point of the problem clearly myself, so this question was written by DeepSeek-R1. Sorry about that.
