Collections
Collections including paper arxiv:2312.11514
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
  Paper • 2312.11514 • Published • 259
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
  Paper • 2312.12456 • Published • 42
- Accelerating LLM Inference with Staged Speculative Decoding
  Paper • 2308.04623 • Published • 25
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  Paper • 2208.07339 • Published • 5