-
Sequence Parallelism: Long Sequence Training from System Perspective
Paper • 2105.13120 • Published • 5 -
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 11 -
Striped Attention: Faster Ring Attention for Causal Transformers
Paper • 2311.09431 • Published • 4 -
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
Paper • 2309.14509 • Published • 18
Collections
Discover the best community collections!
Collections including paper arxiv:2311.09431
-
Sequence Parallelism: Long Sequence Training from System Perspective
Paper • 2105.13120 • Published • 5 -
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 11 -
Striped Attention: Faster Ring Attention for Causal Transformers
Paper • 2311.09431 • Published • 4 -
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 38
-
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper • 2402.10644 • Published • 81 -
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Paper • 2305.13245 • Published • 5 -
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Paper • 2402.15220 • Published • 21 -
Sequence Parallelism: Long Sequence Training from System Perspective
Paper • 2105.13120 • Published • 5
-
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
Paper • 2401.03462 • Published • 27 -
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Paper • 2305.07185 • Published • 9 -
YaRN: Efficient Context Window Extension of Large Language Models
Paper • 2309.00071 • Published • 68 -
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Paper • 2401.02669 • Published • 16