arxiv:2409.04585

CubicML: Automated ML for Distributed ML Systems Co-design with ML Prediction of Performance

Published on Sep 6, 2024

Authors:

Wei Wen

Abstract

Scaling up deep learning models has proven effective at improving the intelligence of machine learning (ML) models, especially industry recommendation models and large language models. The co-design of distributed ML systems and algorithms (to maximize training performance) plays a pivotal role in this success. As models and systems scale, the number of co-design hyper-parameters grows rapidly, which makes it challenging to find the optimal setup that maximizes system performance. In this paper, we propose CubicML, which uses ML to automatically optimize the training performance of distributed ML systems. In CubicML, we use an ML model as a proxy to predict training performance, for search efficiency and performance-modeling flexibility. We demonstrate that CubicML can effectively optimize the training speed of in-house ads recommendation models and large language models at Meta.
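The idea of using a learned proxy to guide the search can be illustrated with a small sketch. It is not the CubicML implementation: the search space, the synthetic `measure_throughput` function, and the k-nearest-neighbour predictor are all assumptions chosen to keep the example self-contained. The abstract only states that an ML model predicts training performance; the kind of model is not specified there. The loop measures a few configurations, fits the proxy on those measurements, and spends later measurements only on the configurations the proxy ranks highest.

```python
import random
from itertools import product

# Hypothetical co-design search space (names and values are illustrative only).
SPACE = {
    "batch_size": [256, 512, 1024, 2048],
    "shard_degree": [1, 2, 4, 8],
    "prefetch_depth": [1, 2, 4],
}

def measure_throughput(cfg):
    """Synthetic stand-in for launching a real training job and timing it."""
    comm_penalty = abs(cfg["shard_degree"] - 4) * 0.1
    return (cfg["batch_size"] ** 0.5) * (1.0 - comm_penalty) + 2.0 * cfg["prefetch_depth"]

def distance(a, b):
    """Number of hyper-parameters on which two configurations differ."""
    return sum(a[k] != b[k] for k in SPACE)

def predict(cfg, history, k=3):
    """Proxy model: mean measured throughput of the k most similar measured configs."""
    nearest = sorted(history, key=lambda item: distance(cfg, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

def search(rounds=4, per_round=3, seed=0):
    rng = random.Random(seed)
    keys = list(SPACE)
    pool = [dict(zip(keys, values)) for values in product(*SPACE.values())]
    rng.shuffle(pool)
    # Bootstrap the history with a few randomly chosen, actually measured configs.
    history = [(cfg, measure_throughput(cfg)) for cfg in pool[:per_round]]
    untested = pool[per_round:]
    for _ in range(rounds):
        # Rank untested configs by predicted throughput and measure only the top few.
        untested.sort(key=lambda cfg: predict(cfg, history), reverse=True)
        chosen, untested = untested[:per_round], untested[per_round:]
        history += [(cfg, measure_throughput(cfg)) for cfg in chosen]
    best = max(history, key=lambda item: item[1])
    return best, history

best, history = search()
print("best config:", best[0], "throughput:", round(best[1], 2))
print("measured", len(history), "of", 4 * 4 * 3, "configurations")
```

With these defaults, only 15 of the 48 configurations are ever measured; the rest are scored by the proxy alone. In a real system each measurement is an expensive training run, which is why ranking candidates with a cheap predictor matters for search efficiency.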
