why put MistralRotaryEmbedding in each attention layer instead of putting only once before the first attention layer?

#91
by liougehooa - opened

I noticed that Mistral and some other LLMs apply the positional embedding inside every attention layer (every Transformer block). In the original Transformer, the positional encoding is added only once, before the first attention layer (once for the encoder stack and once for the decoder stack). A rough sketch of the difference is below.
Why is the per-layer approach better?
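To make the question concrete, here is a minimal PyTorch sketch of the two placements. This is not the actual Hugging Face `MistralAttention`/`MistralRotaryEmbedding` code; `rotary_embed` and `RotaryAttention` are made-up names for illustration. Rotary embeddings rotate q and k inside every attention layer, whereas the original Transformer adds one absolute positional encoding to the token embeddings up front.

```python
# Minimal sketch (not the real Mistral implementation) contrasting the two placements.
import torch

def rotary_embed(x, base=10000.0):
    # Apply a rotary position embedding to x of shape (batch, seq_len, n_heads, head_dim).
    # Pairs of channels are rotated by a position-dependent angle instead of adding a vector.
    bsz, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,d->sd", pos, inv_freq)       # (seq_len, half)
    cos = angles.cos()[None, :, None, :]                  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class RotaryAttention(torch.nn.Module):
    # Rotary style: positions are injected into q and k inside *every* attention layer.
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = torch.nn.Linear(dim, dim, bias=False)
        self.k_proj = torch.nn.Linear(dim, dim, bias=False)
        self.v_proj = torch.nn.Linear(dim, dim, bias=False)
        self.o_proj = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, hidden):
        b, s, d = hidden.shape
        q = self.q_proj(hidden).view(b, s, self.n_heads, self.head_dim)
        k = self.k_proj(hidden).view(b, s, self.n_heads, self.head_dim)
        v = self.v_proj(hidden).view(b, s, self.n_heads, self.head_dim)
        q, k = rotary_embed(q), rotary_embed(k)           # positions enter here, per layer
        attn = torch.softmax(
            torch.einsum("bqhd,bkhd->bhqk", q, k) / self.head_dim ** 0.5, dim=-1
        )
        out = torch.einsum("bhqk,bkhd->bqhd", attn, v).reshape(b, s, d)
        return self.o_proj(out)

# Original-Transformer style: one absolute position encoding is added to the
# token embeddings a single time, before the first attention layer:
#   hidden = token_embeddings + position_embeddings
```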
