Collections including paper arxiv:2404.07143
- STaR: Bootstrapping Reasoning With Reasoning
  Paper • 2203.14465 • Published • 8
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 47
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 17
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 29

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 140
- Elucidating the Design Space of Diffusion-Based Generative Models
  Paper • 2206.00364 • Published • 15
- GLU Variants Improve Transformer
  Paper • 2002.05202 • Published • 2
- StarCoder 2 and The Stack v2: The Next Generation
  Paper • 2402.19173 • Published • 137

- LLoCO: Learning Long Contexts Offline
  Paper • 2404.07979 • Published • 21
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
  Paper • 2402.13753 • Published • 115
- LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
  Paper • 2402.11550 • Published • 17
- LongAlign: A Recipe for Long Context Alignment of Large Language Models
  Paper • 2401.18058 • Published • 21

- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
  Paper • 2404.08801 • Published • 66
- TransformerFAM: Feedback attention is working memory
  Paper • 2404.09173 • Published • 43
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
  Paper • 2404.07143 • Published • 105
- Block Transformer: Global-to-Local Language Modeling for Fast Inference
  Paper • 2406.02657 • Published • 38

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 608
- BitNet: Scaling 1-bit Transformers for Large Language Models
  Paper • 2310.11453 • Published • 96
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
  Paper • 2404.02258 • Published • 104
- TransformerFAM: Feedback attention is working memory
  Paper • 2404.09173 • Published • 43