arxiv:2112.08723

Distilled Dual-Encoder Model for Vision-Language Understanding

Published on Dec 16, 2021
Authors:

Abstract

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
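
To make the distillation objective concrete, the sketch below illustrates in PyTorch how cross-modal attention distillation could look: the dual-encoder student's text-to-image and image-to-text attention distributions are pushed toward those of the fusion-encoder teacher with a KL-divergence loss. The function names, tensor shapes, and the specific choice of KL divergence here are illustrative assumptions, not the authors' exact implementation; the linked repository contains the official code.

```python
import torch.nn.functional as F

def cross_modal_log_attention(queries, keys):
    """Log of softmax-normalized attention from `queries` to `keys`.

    queries: (batch, n_q, dim), keys: (batch, n_k, dim)
    returns: (batch, n_q, n_k) log attention distributions
    """
    scores = queries @ keys.transpose(-2, -1) / queries.size(-1) ** 0.5
    return F.log_softmax(scores, dim=-1)

def cross_modal_attention_distillation_loss(text_repr, image_repr,
                                            teacher_t2i, teacher_i2t):
    """KL divergence between teacher and student cross-modal attention (hypothetical helper).

    text_repr:   (batch, n_text, dim)   text token representations from the dual encoder
    image_repr:  (batch, n_image, dim)  image patch representations from the dual encoder
    teacher_t2i: (batch, n_text, n_image)  text-to-image attention from the fusion-encoder teacher
    teacher_i2t: (batch, n_image, n_text)  image-to-text attention from the fusion-encoder teacher
    """
    student_t2i = cross_modal_log_attention(text_repr, image_repr)   # text-to-image
    student_i2t = cross_modal_log_attention(image_repr, text_repr)   # image-to-text
    loss_t2i = F.kl_div(student_t2i, teacher_t2i, reduction="batchmean")
    loss_i2t = F.kl_div(student_i2t, teacher_i2t, reduction="batchmean")
    return loss_t2i + loss_i2t
```

Per the abstract, a loss of this form would be added to the standard training objectives during both pre-training and fine-tuning, which is where the reported further improvements come from.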
