arxiv:2004.05160

On the Language Neutrality of Pre-trained Multilingual Representations

Published on Apr 9, 2020

Authors:

Abstract

Multilingual contextual embeddings, such as multilingual BERT and XLM-RoBERTa, have proved useful for many multi-lingual tasks. Previous work probed the cross-linguality of the representations indirectly using zero-shot transfer learning on morphological and syntactic tasks. We instead investigate the language-neutrality of multilingual <PRE_TAG>contextual embeddings</POST_TAG> directly and with respect to lexical semantics. Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings, which are explicitly trained for language neutrality. Contextual embeddings are still only moderately language-neutral by default, so we propose two simple methods for achieving stronger language neutrality: first, by unsupervised centering of the representation for each language and second, by fitting an explicit projection on small parallel data. Besides, we show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences without using parallel data.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

No model linking this paper

Cite arxiv.org/abs/2004.05160 in a model README.md to link it from this page.

No dataset linking this paper

Cite arxiv.org/abs/2004.05160 in a dataset README.md to link it from this page.

No Space linking this paper

Cite arxiv.org/abs/2004.05160 in a Space README.md to link it from this page.

No Collection including this paper

Add this paper to a collection to link it from this page.