arxiv:2306.12907

xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Published on Jun 22, 2023

Abstract

We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
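
As a rough illustration of the idea in the abstract, the sketch below scores a toy evaluation set by nearest-neighbor retrieval in an embedding space, then repeats the scoring after adding a synthetic hard negative to the English side. The abstract does not specify the similarity function or how the score is computed, so cosine similarity, the error-rate definition, the function names, and the toy vectors here are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_error_rate(src_vecs, tgt_vecs, gold):
    """Fraction of source sentences whose most similar target
    is not the gold-aligned one (hypothetical helper)."""
    errors = 0
    for i, s in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        if best != gold[i]:
            errors += 1
    return errors / len(src_vecs)


# Toy embeddings (made up): two source sentences and their English translations.
src = [[1.0, 0.0], [0.0, 1.0]]
eng = [[0.9, 0.1], [0.1, 0.9]]
gold = [0, 1]  # gold[i] is the index of the correct English sentence for src[i]

# Plain evaluation set: every source retrieves its own translation.
base_error = retrieval_error_rate(src, eng, gold)

# Augmented evaluation set: one synthetic English sentence that sits very
# close to the first translation, standing in for a rule-based perturbation.
hard_negatives = [[1.0, 0.05]]
augmented_error = retrieval_error_rate(src, eng + hard_negatives, gold)

print(base_error, augmented_error)
```

In this toy setup the plain set gives no errors, while the augmented set exposes one: the hard negative outranks the true translation of the first source sentence. This is the kind of confusion the abstract says the added examples are meant to surface.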
