Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results
Abstract
In this paper we present the multilingual language model BLOOM-zh, which features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models released by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM by an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains such as news articles, books, encyclopedias, and educational materials, as well as spoken language. To demonstrate the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmarks. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.