Lifting the Curse of Capacity Gap in Distilling Language Models
Abstract
Pretrained language models (LMs) have shown compelling performance on various downstream tasks, but unfortunately they require a tremendous amount of inference compute. Knowledge distillation offers a path to compressing LMs into smaller ones via a teacher-student paradigm. However, when the capacity gap between the teacher and the student is large, a curse of capacity gap appears, leading to a deficiency in distilling LMs. While a few studies have been carried out to fill the gap, the curse is not yet well tackled. In this paper, we aim to lift the curse of capacity gap by enlarging the capacity of the student without notably increasing the inference compute. Largely motivated by the sparse activation regime of mixture of experts (MoE), we propose a mixture of minimal experts (MiniMoE), which adds extra parameters to the student but introduces almost no additional inference compute. Experimental results on GLUE and CoNLL demonstrate that the curse of capacity gap is lifted by MiniMoE to a large extent. MiniMoE also achieves state-of-the-art performance at small FLOPs compared with a range of competitive baselines. With a compression rate of as much as ~50×, MiniMoE preserves ~95% of the teacher's GLUE score.
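To make the sparse activation idea behind MiniMoE concrete, below is a minimal sketch of a top-1 routed mixture of small feed-forward experts in PyTorch. All names, layer sizes, and the routing scheme here are illustrative assumptions, not the paper's exact architecture; the sketch only demonstrates how total parameters can grow with the number of experts while per-token inference compute stays close to that of a single minimal expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MiniExpertFFN(nn.Module):
    """A deliberately small feed-forward expert (sizes are illustrative)."""

    def __init__(self, hidden_size: int, expert_size: int):
        super().__init__()
        self.up = nn.Linear(hidden_size, expert_size)
        self.down = nn.Linear(expert_size, hidden_size)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class SparseMoELayer(nn.Module):
    """Top-1 routed mixture of minimal experts (hypothetical sketch).

    Total parameters scale with num_experts, but each token passes through
    only one small expert, so per-token inference FLOPs remain close to
    those of a single minimal FFN.
    """

    def __init__(self, hidden_size: int = 256, expert_size: int = 64, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            [MiniExpertFFN(hidden_size, expert_size) for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        probs = self.router(x).softmax(dim=-1)      # (num_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)        # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the selected expert runs for these tokens.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: capacity (parameters) grows with num_experts; active compute does not.
layer = SparseMoELayer()
tokens = torch.randn(10, 256)
print(layer(tokens).shape)  # torch.Size([10, 256])
```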