arXiv:2407.17465

u-μP: The Unit-Scaled Maximal Update Parametrization

Published on Jul 24, 2024
Abstract

The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-μP, which improves upon μP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: μP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-μP models reaching a lower loss than comparable μP models and working out of the box in FP8.
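To make the unit-scaling idea concrete, below is a minimal sketch of a unit-scaled linear layer. This is an illustrative simplification under assumed conventions, not the authors' implementation or the paper's full u-μP scheme; the class name UnitScaledLinear is hypothetical. The weight is initialised with unit variance, and the size-dependent 1/sqrt(fan_in) factor is moved out of the initialisation into a static multiplier in the forward pass, so both weights and activations start training at scale roughly one regardless of width.

```python
import math
import torch
import torch.nn as nn

class UnitScaledLinear(nn.Module):
    """Hypothetical sketch of a unit-scaled linear layer (simplified).

    Weights start with unit variance; the 1/sqrt(fan_in) factor that a
    standard initialisation would bake into the weights is instead
    applied as a fixed multiplier at forward time. Activations therefore
    begin training at scale ~1 for any width, which is what makes
    low-precision (e.g. FP8) training straightforward.
    """

    def __init__(self, fan_in: int, fan_out: int) -> None:
        super().__init__()
        # Unit-variance init: weight values start at scale one,
        # independent of model width.
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))
        # Static multiplier restores the usual activation scale.
        self.scale = 1.0 / math.sqrt(fan_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.weight.t())


# Doubling the width leaves the initial activation scale unchanged.
for width in (256, 512, 1024):
    layer = UnitScaledLinear(width, width)
    y = layer(torch.randn(64, width))
    print(width, y.std().item())  # ~1.0 at every width
```

Keeping the scale as a separate multiplier, rather than folding it into the weights, is what lets weights, activations and gradients all sit near scale one at initialisation, mirroring the affinity between μP and Unit Scaling described in the abstract.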

