why put MistralRotaryEmbedding in each attention layer instead of putting only once before the first attention layer?

#91
by liougehooa - opened

I noticed that Mistral and some other LLMs apply the positional embedding inside every attention layer (every Transformer block). In the original Transformer, the positional encoding is added only once, before the first attention layer (once for the encoder stack and once for the decoder stack). A rough sketch of the difference is below.
Why is the per-layer approach better?
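To make the question concrete, here is a minimal PyTorch sketch of the two placements. This is not the actual Hugging Face `MistralAttention`/`MistralRotaryEmbedding` code; `rotary_embed` and `RotaryAttention` are made-up names for illustration. Rotary embeddings rotate q and k inside every attention layer, whereas the original Transformer adds one absolute positional encoding to the token embeddings up front.

```python
# Minimal sketch (not the real Mistral implementation) contrasting the two placements.
import torch

def rotary_embed(x, base=10000.0):
    # Apply a rotary position embedding to x of shape (batch, seq_len, n_heads, head_dim).
    # Pairs of channels are rotated by a position-dependent angle instead of adding a vector.
    bsz, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,d->sd", pos, inv_freq)       # (seq_len, half)
    cos = angles.cos()[None, :, None, :]                  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class RotaryAttention(torch.nn.Module):
    # Rotary style: positions are injected into q and k inside *every* attention layer.
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = torch.nn.Linear(dim, dim, bias=False)
        self.k_proj = torch.nn.Linear(dim, dim, bias=False)
        self.v_proj = torch.nn.Linear(dim, dim, bias=False)
        self.o_proj = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, hidden):
        b, s, d = hidden.shape
        q = self.q_proj(hidden).view(b, s, self.n_heads, self.head_dim)
        k = self.k_proj(hidden).view(b, s, self.n_heads, self.head_dim)
        v = self.v_proj(hidden).view(b, s, self.n_heads, self.head_dim)
        q, k = rotary_embed(q), rotary_embed(k)           # positions enter here, per layer
        attn = torch.softmax(
            torch.einsum("bqhd,bkhd->bhqk", q, k) / self.head_dim ** 0.5, dim=-1
        )
        out = torch.einsum("bhqk,bkhd->bqhd", attn, v).reshape(b, s, d)
        return self.o_proj(out)

# Original-Transformer style: one absolute position encoding is added to the
# token embeddings a single time, before the first attention layer:
#   hidden = token_embeddings + position_embeddings
```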
