arxiv:2306.12907

xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Published on Jun 22, 2023

Abstract

We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages and subsequently training NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
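
To make the idea of a similarity-based proxy concrete, here is a minimal sketch of a retrieval error rate computed over sentence embeddings, with and without extra hard negatives in the candidate pool. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring function, so plain cosine similarity is assumed here, and the function names, the toy vectors, and the `extra_negatives` parameter are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_error_rate(
    src: Sequence[Vector],
    tgt: Sequence[Vector],
    extra_negatives: Sequence[Vector] = (),
) -> float:
    """Fraction of source sentences whose nearest candidate is not the aligned target.

    Assumes src[i] is the translation of tgt[i]. Any extra_negatives are
    appended to the candidate pool, standing in for synthetic hard examples.
    """
    candidates = list(tgt) + list(extra_negatives)
    errors = 0
    for i, s in enumerate(src):
        # Index of the candidate most similar to this source sentence.
        best = max(range(len(candidates)), key=lambda j: cosine(s, candidates[j]))
        if best != i:
            errors += 1
    return errors / len(src)


# Toy 2-d "embeddings" (made up for illustration only).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
# A hypothetical perturbed sentence that lands very close to src[0].
hard_negatives = [[1.0, 0.01]]

plain = retrieval_error_rate(src, tgt)
augmented = retrieval_error_rate(src, tgt, hard_negatives)
print(plain, augmented)
```

In this toy setup the plain pool yields no errors, while the single hard negative draws one of the two source sentences away from its true translation. That is the effect the augmented evaluation set is meant to expose: a harder candidate pool separates encoders that look identical on an easy one.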
