- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 7
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 32
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 118
Collections
Collections including paper arxiv:2311.01282
- Matryoshka Diffusion Models
  Paper • 2310.15111 • Published • 42
- Data Filtering Networks
  Paper • 2309.17425 • Published • 6
- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- E3 TTS: Easy End-to-End Diffusion-based Text to Speech
  Paper • 2311.00945 • Published • 16

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
  Paper • 2311.03285 • Published • 32
- Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
  Paper • 2311.06243 • Published • 22
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
  Paper • 2311.05908 • Published • 16

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  Paper • 2401.02669 • Published • 16
- Speculative Streaming: Fast LLM Inference without Auxiliary Models
  Paper • 2402.11131 • Published • 43

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 22
- When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
  Paper • 2402.17193 • Published • 25
- Training Language Models to Self-Correct via Reinforcement Learning
  Paper • 2409.12917 • Published • 138