EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test Paper • 2503.01840 • Published 10 days ago • 4
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model Paper • 2503.07703 • Published 3 days ago • 29
Identifying Sensitive Weights via Post-quantization Integral Paper • 2503.01901 • Published 14 days ago • 7
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting Paper • 2503.00784 • Published 12 days ago • 10
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs Paper • 2503.01743 • Published 10 days ago • 72
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling Paper • 2502.14856 • Published 21 days ago • 7
Iterative Value Function Optimization for Guided Decoding Paper • 2503.02368 • Published 10 days ago • 14
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization Paper • 2503.01328 • Published 11 days ago • 14
The Ultra-Scale Playbook 🌌 The ultimate guide to training LLMs on large GPU clusters
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference Paper • 2502.18137 • Published 17 days ago • 53
Rank1: Test-Time Compute for Reranking in Information Retrieval Paper • 2502.18418 • Published 16 days ago • 25
MoBA: Mixture of Block Attention for Long-Context LLMs Paper • 2502.13189 • Published 24 days ago • 14
LightThinker: Thinking Step-by-Step Compression Paper • 2502.15589 • Published 20 days ago • 26
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published 21 days ago • 12
Autellix: An Efficient Serving Engine for LLM Agents as General Programs Paper • 2502.13965 • Published 22 days ago • 18