---
license: cc
language:
- ve
- ts
- zu
- xh
- nso
- tn
library_name: transformers
tags:
- tshivenda
- low-resource
- masked-language-model
- south africa
---

# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages

Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset comprising various subsets of Bantu languages spoken in South Africa. These models are inspired by the work done on AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work was to use pre-trained language models to advance NLP applications in Tshivenda and to serve as a benchmark for future work covering Bantu languages.

## Model Overview

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:

- Zabantu-VEN: A monolingual language model trained on 73k raw sentences in Tshivenda
- Zabantu-NSO: A monolingual language model trained on 179k raw sentences in Sepedi
- Zabantu-NSO+VEN: A bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
- Zabantu-SOT+VEN: A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
- Zabantu-BANTU: A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages

## Model Details

- **Model Name:** Zabantu-XLMR
- **Model Version:** 1.0.0
- **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80-250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.

## Intended Use

The Zabantu models are intended for various NLP tasks involving Tshivenda and related South African languages. They can also be fine-tuned on a variety of downstream tasks (a fine-tuning sketch appears at the end of this card), such as:

- Text classification and sentiment analysis in Tshivenda and related languages.
- Named Entity Recognition (NER) for identifying entities in Tshivenda text.
- Machine Translation between Tshivenda and other South African languages.
- Cross-lingual document retrieval and question answering.
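Since all Zabantu checkpoints are masked language models, a quick way to exercise one before fine-tuning is the `fill-mask` pipeline. The snippet below is a minimal sketch: the checkpoint name `dsfsi/zabantu-xlm-roberta` and the Tshivenda prompt are placeholders, not confirmed identifiers, so substitute the actual Zabantu checkpoint from the Hub and your own sentence.

```python
from transformers import pipeline

# Placeholder checkpoint name -- substitute the actual Zabantu checkpoint
# published on the Hugging Face Hub.
unmasker = pipeline("fill-mask", model="dsfsi/zabantu-xlm-roberta")

# XLM-RoBERTa-style models use "<mask>" as the mask token; replace the
# placeholder prompt with any Tshivenda (or other supported-language) sentence.
for prediction in unmasker("Ndi khou <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction is a dict with the filled token (`token_str`), its probability (`score`), and the completed sentence (`sequence`).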
## Performance and Limitations

- **Performance:** The Zabantu models demonstrate promising performance on various NLP tasks, including news topic classification, with competitive results compared to similar pre-trained cross-lingual models such as [AfriBERTa](https://huggingface.co/castorini/afriberta_base) and [AfroXLMR](https://huggingface.co/Davlan/afro-xlmr-base).

**Monolingual test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta-large | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------------|----------|----------------|----------------|---------------|
| nso             | 71.4            | 71.6     | 74.3           | 69             | 70.6          |
| ven             | 74.3            | 74.1     | 77             | 76             | 75.6          |

**Few-shot (50 shots) test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------|----------|----------------|----------------|---------------|
| ven             | 60        | 62       | 66             | 69             | 55            |

- **Limitations:** Although efforts have been made to include a wide range of South African languages, the models' coverage may still be limited for certain dialects. We note that the training set was largely dominated by Setswana and isiXhosa. We also acknowledge the potential to further improve the models by training on more data, including additional domains and topics. As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.

## Training Data

The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data
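For downstream tasks like the news topic classification reported above, the checkpoints can be fine-tuned with the standard `transformers` Trainer. The sketch below is not the authors' training script: the checkpoint name, the CSV files, and the hyperparameters are assumptions for illustration; only the metric (weighted F1) mirrors what the tables report.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder names: substitute a real Zabantu checkpoint and a labelled
# news-topic dataset (here, local CSVs with "text" and "label" columns).
CHECKPOINT = "dsfsi/zabantu-xlm-roberta"
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)
num_labels = len(set(dataset["train"]["label"]))

model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=num_labels
)

def compute_metrics(eval_pred):
    # Weighted F1, matching the metric used in the tables above.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"weighted_f1": f1_score(labels, preds, average="weighted")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="zabantu-news-topic",
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        num_train_epochs=3,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```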