Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results
Abstract
In this paper we present the multilingual language model BLOOM-zh, which features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models released by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM by an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains such as news articles, books, encyclopedias, and educational materials, as well as spoken language. To demonstrate the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmarks. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.