Abstract
Virtual Width Networks (VWN) enhance model efficiency by expanding representational width without increasing computational cost, accelerating optimization and improving loss reduction.
We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
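The abstract's core idea, decoupling a wide embedding space from a narrower backbone, can be illustrated with a minimal numpy sketch. The exact wiring of VWN is not given in the abstract, so the projection scheme below (a learned down-projection into the backbone and an up-projection back into the wide, tied output space) is an assumption for illustration; all dimension names and the 8× factor merely mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)

d_backbone = 64            # hidden size of the transformer backbone (assumed value)
expansion = 8              # virtual-width expansion factor, as in the 8x experiment
d_virtual = expansion * d_backbone
vocab, seq = 1000, 16      # toy vocabulary and sequence length

# Wide embedding table: representational width is d_virtual, not d_backbone.
embed = rng.normal(0.0, 0.02, (vocab, d_virtual))

# Hypothetical down/up projections linking the wide space to the narrow backbone.
w_down = rng.normal(0.0, 0.02, (d_virtual, d_backbone))
w_up = rng.normal(0.0, 0.02, (d_backbone, d_virtual))

tokens = rng.integers(0, vocab, seq)
x_wide = embed[tokens]     # (seq, d_virtual): wide token representations
x = x_wide @ w_down        # (seq, d_backbone): backbone compute stays at d_backbone
# ... attention/FFN layers would operate on x here, at the original width ...
h_wide = x @ w_up          # (seq, d_virtual): back into the wide space
logits = h_wide @ embed.T  # (seq, vocab): output head tied to the wide embedding

print(x.shape, h_wide.shape, logits.shape)
```

Because the backbone only ever sees width `d_backbone`, its (roughly quadratic-in-width) cost is unchanged; only the cheap embedding-side projections scale with the virtual width.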
Community
VWN decouples representational width from backbone width to expand embedding space with near-constant backbone compute, achieving faster convergence and a log-linear relation between virtual width and loss reduction.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Muon: Training and Trade-offs with Latent Attention and MoE (2025)
- Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? (2025)
- Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference (2025)
- On the expressivity of sparse maxout networks (2025)
- CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning (2025)
- On residual network depth (2025)
- SpecAttn: Speculating Sparse Attention (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
The ByteDance Seed team constantly plagiarizes our HyperZZW work. The link below lists the plagiarism evidence:
https://x.com/hyperzzw/status/1990520228238049362?s=46&t=BsqYoGA8vIHGcXwORlMk7w