arxiv:2411.11770

CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task

Published on Nov 18, 2024

Abstract

Converting Hanyu Pinyin abbreviations to Chinese characters is an important branch of Chinese Spelling Correction (CSC) that underlies many downstream applications. The task is typically one of text-length alignment and appears easy to solve; however, because pinyin abbreviations carry very little information, accurate conversion is challenging. In this paper, we treat the problem as a Fill-Mask task and propose CNMBert, which stands for zh-CN Pinyin Multi-mask Bert Model, as a solution. CNMBert outperforms fine-tuned GPT models, achieving a 60.56 MRR score and 51.09 accuracy on a 10,229-sample pinyin abbreviation test dataset.
