Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models Paper • 2501.12370 • Published Jan 21, 2025
Decomposing and Editing Predictions by Modeling Model Computation Paper • 2404.11534 • Published Apr 17, 2024