arxiv:2303.04715

Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results

Published on Mar 8, 2023
Abstract

In this paper we present BLOOM-zh, a multilingual language model featuring enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models released by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM by an additional 7.4 billion tokens of Traditional Chinese and English text covering a variety of domains such as news articles, books, encyclopedias, educational materials, and spoken language. To characterize BLOOM-zh, we evaluate its performance on both existing and newly created benchmarks. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.
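The core method here is continued (extended) pre-training: loading a released checkpoint and resuming causal language modeling on new in-domain text. Below is a minimal sketch of that workflow using the Hugging Face `transformers` Trainer. The base checkpoint (`bigscience/bloom-560m`), the corpus file name, and all hyperparameters are illustrative assumptions for a small-scale run, not the authors' actual training setup.

```python
# Minimal sketch of continued causal-LM pre-training from a released BLOOM
# checkpoint. Model size, data file, and hyperparameters are assumptions,
# not the configuration used in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "bigscience/bloom-560m"  # smallest released BLOOM variant
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder corpus; the paper continues training on 7.4B tokens of
# Traditional Chinese and English (news, books, encyclopedias, etc.).
raw = load_dataset("text", data_files={"train": "zh_tw_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="bloom-zh-continued",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    learning_rate=1e-5,  # low LR helps preserve the original capabilities
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    # mlm=False gives standard next-token (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

A low learning rate and mixed English/Chinese data are the usual levers for avoiding catastrophic forgetting in this setting, which is consistent with the paper's finding that English capability is maintained.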
