Does Llama 4 have chunked attention in the generation phase?
Same as title.
I know the chunked attention mask is applied in the context (prefill) phase. But does Llama 4 apply the chunked attention mask in the generation (decode) phase too?
yes
Yes, Llama 4 applies chunked attention during the generation phase, but only on specific layers. This is part of the "iRoPE" architecture, which enables very long context lengths (up to 10 million tokens on Llama 4 Scout) while keeping memory usage manageable.
Here's how it works:
Interleaved architecture: The Llama 4 model uses two different types of attention layers in an alternating pattern:
RoPE layers: These layers use rotary position embeddings and apply a chunked attention mask, so each token can only attend to tokens within its own fixed-size chunk (8K tokens by default), not the whole history. During the generation phase, the KV cache for these layers therefore stays bounded: only the keys and values for the current chunk are needed (see the sketch after this list).
NoPE layers: These layers have no positional encoding and use a full causal mask, allowing them to access the entire context history. This is critical for long-range reasoning.
Memory efficiency: Because chunked attention is used on most layers, the KV cache for those layers stops growing with context length during generation. This avoids the memory blow-up that a full cache on every layer would cause and makes very long contexts feasible on commercially available GPUs (see the back-of-the-envelope estimate at the end of this post).
Balancing efficiency and performance: The interleaved design is a trade-off. The NoPE layers carry the long-range context, while the chunked RoPE layers provide cheap, high-fidelity local attention. Together they let the model handle extremely long sequences without a large increase in hardware requirements.
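To make the mask difference concrete, here is a minimal, self-contained sketch (not the actual Llama 4 / Hugging Face implementation) of a full causal mask versus a chunked causal mask. The 8K chunk size mentioned above is the default in the released configs; the tiny sizes in the example are only there so the masks are printable.

```python
import torch

def full_causal_mask(seq_len: int) -> torch.Tensor:
    """Standard lower-triangular causal mask (what the NoPE layers use)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Causal mask restricted to tokens in the same chunk (what the RoPE layers use)."""
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal

if __name__ == "__main__":
    seq_len, chunk = 10, 4  # tiny illustrative values; Llama 4 uses an 8K chunk
    print(chunked_causal_mask(seq_len, chunk).int())
    # During decode, the new token at position t only attends to keys at
    # positions p with p // chunk == t // chunk, so a RoPE layer's KV cache
    # never needs more than `chunk` entries. A NoPE layer keeps the full cache.
```

On a RoPE layer, once generation crosses a chunk boundary the keys/values from earlier chunks are never attended to again, which is why the per-layer cache can be capped at the chunk size; the NoPE layers are the ones that keep the full history.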
In summary, attention during generation in Llama 4 is not uniformly chunked. It uses chunked attention on most layers and full causal attention on the interleaved NoPE layers. This design is a key reason it can serve long-context workloads with relatively modest memory requirements.
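For a rough sense of what this buys during generation, here is a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head dimension below are illustrative placeholders, not the exact Llama 4 numbers; the point is only the scaling behaviour.

```python
# Back-of-the-envelope KV-cache size: full cache on every layer vs. a chunked
# cache on 3 of every 4 layers. Model dimensions are placeholders.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # factor of 2 for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

n_layers, n_kv_heads, head_dim = 48, 8, 128   # placeholder dimensions
chunk, context = 8192, 1_000_000              # 8K chunk, 1M-token context

full = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context)

# Interleaved: assume 1 in 4 layers is a NoPE layer with a full cache,
# and the rest are chunked layers whose cache is capped at the chunk size.
nope_layers = n_layers // 4
rope_layers = n_layers - nope_layers
interleaved = (kv_cache_bytes(nope_layers, n_kv_heads, head_dim, context)
               + kv_cache_bytes(rope_layers, n_kv_heads, head_dim, chunk))

print(f"full cache on all layers : {full / 2**30:.1f} GiB")
print(f"interleaved (1-in-4 NoPE): {interleaved / 2**30:.1f} GiB")
```

Even with placeholder numbers, the chunked layers' contribution stops growing past 8K tokens, so almost all of the long-context cache cost comes from the NoPE layers.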