What is the ratio of mLSTM to sLSTM blocks & nature of the recurrence of sLSTM
Is it 7:1 like in the original paper?
Also, the question that interests me the most (though it's a bit off-topic): is sLSTM recurrent in depth? The sentence
"To enable parallelization, the mLSTM abandons memory mixing, i.e., the hidden-hidden recurrent connections."
in the original paper seems to suggest it is. If correct, could someone comment on its exact nature, please?
This is a 1:0 model (mLSTM blocks only, no sLSTM), with a few further adaptations, namely a post-up-projection block and soft-capping.
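In case it helps, soft-capping is typically just a scaled tanh applied to logits or gate pre-activations so they saturate smoothly instead of growing unbounded. A minimal sketch (the cap value here is illustrative, not the one used in xLSTM-7b):

```python
import numpy as np

def soft_cap(x: np.ndarray, cap: float) -> np.ndarray:
    """Smoothly bound values to (-cap, cap) via a scaled tanh."""
    return cap * np.tanh(x / cap)

# e.g. applied to output logits; cap=30.0 is purely illustrative
logits = np.array([-100.0, -5.0, 0.0, 5.0, 100.0])
print(soft_cap(logits, cap=30.0))  # large magnitudes saturate near ±30
```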
I'm unsure what you mean by "recurrent in depth" for sLSTM; that usually means stacking multiple layers (as in GridLSTM), which is not the case here.
sLSTM (which is not part of xLSTM-7b) is like a plain single-layer LSTM with exponential gating and a multi-head option. The recurrence refers to the gate activations being influenced by the previous hidden state via the recurrent weight matrices, referred to as memory mixing in the original xLSTM paper.
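To make the memory mixing concrete, here is a minimal single-head sketch of one sLSTM step following the paper's equations, but omitting the stabilizer state and the block-diagonal multi-head structure; the `R` matrices are what tie every gate to `h_prev`:

```python
import numpy as np

def slstm_step(x, h_prev, c_prev, n_prev, W, R, b):
    """One simplified sLSTM step (no stabilizer state, single head).

    W, R, b hold per-gate parameters for z (cell input), i, f, o.
    The R matrices implement memory mixing: each gate pre-activation
    depends on the previous hidden state h_prev.
    """
    pre = {g: W[g] @ x + R[g] @ h_prev + b[g] for g in ("z", "i", "f", "o")}
    z = np.tanh(pre["z"])
    i = np.exp(pre["i"])                  # exponential input gate
    f = 1.0 / (1.0 + np.exp(-pre["f"]))   # sigmoid forget gate (paper also allows exp)
    o = 1.0 / (1.0 + np.exp(-pre["o"]))
    c = f * c_prev + i * z                # cell state
    n = f * n_prev + i                    # normalizer state
    h = o * (c / n)                       # hidden state
    return h, c, n

# toy usage with illustrative shapes and init
d = 4
rng = np.random.default_rng(0)
W = {g: 0.1 * rng.standard_normal((d, d)) for g in "zifo"}
R = {g: 0.1 * rng.standard_normal((d, d)) for g in "zifo"}
b = {g: np.zeros(d) for g in "zifo"}
h, c, n = np.zeros(d), np.zeros(d), np.zeros(d)
for t in range(3):
    h, c, n = slstm_step(rng.standard_normal(d), h, c, n, W, R, b)
```

In the mLSTM those `R @ h_prev` terms are dropped (the gates depend only on the current input), which is exactly what allows the parallel formulation.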
Thanks, and sorry, this wasn't clear because I was confused myself. I meant to ask whether the topology of the neural net is recurrent (layer-wise), or whether "recurrent" only refers to the way tokens are processed while the net itself is e.g. an MLP (like Mamba). Your explanation clears things up for me and made me realise my question was poorly thought out/formulated: memory mixing only refers to the sLSTM cell, not to the topology of the NN, as far as I understand.