|
--- |
|
license: cc |
|
language: |
|
- ve |
|
- ts |
|
- zu |
|
- xh |
|
- nso |
|
- tn |
|
library_name: transformers |
|
tags: |
|
- tshivenda |
|
- low-resource
|
- masked-language-model |
|
- south africa |
|
--- |
|
# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages
|
|
|
Zabantu( "Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models that have been trained from scratch using a compact dataset comprising various subsets of Bantu languages spoken in South Africa. These models are inspired by the work done on AfriBERTa, which demonstrated the effectiveness of training on XLM-R architecture using a smaller dataset. The focus of this work was to use LLMs to advance NLP applications in Tshivenda and also to serve as a benchmark for future works covering Bantu languages. |
|
|
|
## Model Overview |
|
|
|
This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include the following (a usage sketch is given after the list):
|
|
|
- Zabantu-VEN: A monolingual language model trained on 73k raw sentences in Tshivenda |
|
- Zabantu-NSO: A monolingual language model trained on 179k raw sentences in Sepedi |
|
- Zabantu-NSO+VEN: A bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda |
|
- Zabantu-SOT+VEN: A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda |
|
- Zabantu-BANTU: A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages |
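
The snippet below is a minimal sketch of how one of these checkpoints can be queried with the `transformers` library; the model id is a placeholder (the actual Hub repository names may differ), and the input sentence should be replaced with real Tshivenda or Sepedi text.

```python
from transformers import pipeline

# Placeholder id: replace with the Hub repository id of the Zabantu
# checkpoint you want to use (e.g. the NSO+VEN variant).
model_id = "zabantu-nso-ven"

# The checkpoints are XLM-R style masked language models, so they can be
# probed directly with the fill-mask pipeline using the <mask> token.
unmasker = pipeline("fill-mask", model=model_id)

# Replace the string below with a Tshivenda (or Sepedi) sentence that
# contains exactly one <mask> token.
predictions = unmasker("Replace this with a Tshivenda sentence containing a <mask> token.")
for p in predictions:
    print(p["token_str"], p["score"])
```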
|
|
|
## Model Details |
|
|
|
- **Model Name:** Zabantu-XLMR |
|
- **Model Version:** 1.0.0 |
|
- **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116) |
|
- **Model Size:** 80 - 250 million parameters |
|
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga. |
|
|
|
## Intended Use |
|
|
|
The Zabantu models are intended for various NLP tasks involving Tshivenda and related South African languages. In addition, they can be fine-tuned on a variety of downstream tasks (a fine-tuning sketch follows the list), such as:
|
|
|
- Text classification and sentiment analysis in Tshivenda and related languages. |
|
- Named Entity Recognition (NER) for identifying entities in Tshivenda text. |
|
- Machine Translation between Tshivenda and other South African languages. |
|
- Cross-lingual document retrieval and question answering. |
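
The sketch below illustrates one way to fine-tune a Zabantu checkpoint for text classification with the `transformers` `Trainer`. It is not the authors' exact setup: the model id and the tiny in-memory dataset are placeholders to be swapped for a real checkpoint and a labelled corpus.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder Hub id: substitute the Zabantu variant you want to fine-tune.
model_id = "zabantu-nso-ven"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy two-example dataset standing in for a real labelled corpus
# (e.g. news-topic or sentiment data in Tshivenda).
train_ds = Dataset.from_dict({
    "text": ["<Tshivenda sentence, class 0>", "<Tshivenda sentence, class 1>"],
    "label": [0, 1],
})

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="zabantu-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_ds,
)
trainer.train()
```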
|
|
|
## Performance and Limitations |
|
|
|
- **Performance:** The Zabantu models demonstrate promising performance on various NLP tasks, including news topic classification, where they achieve competitive results compared to similar pre-trained cross-lingual models such as [AfriBERTa](https://huggingface.co/castorini/afriberta_base) and [AfroXLMR](https://huggingface.co/Davlan/afro-xlmr-base).
|
|
|
**Monolingual test F1 scores on News Topic Classification** |
|
|
|
| Weighted F1 [%] | Afriberta-large | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu | |
|
|-----------------|-----------------|----------|----------------|----------------|---------------| |
|
| nso | 71.4 | 71.6 | 74.3 | 69 | 70.6 | |
|
| ven | 74.3 | 74.1 | 77 | 76 | 75.6 | |
|
|
|
**Few-shot (50 shots) test F1 scores on News Topic Classification**
|
|
|
| Weighted F1 [%] | Afriberta | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu | |
|
|-----------------|-----------|----------|----------------|----------------|---------------| |
|
| ven | 60 | 62 | 66 | 69 | 55 | |
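
For clarity, the weighted F1 figures above follow the standard class-frequency-weighted average of per-class F1 scores; a minimal illustration with made-up labels (not the actual evaluation data):

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted topic labels, purely for illustration.
y_true = ["sports", "politics", "health", "sports", "politics"]
y_pred = ["sports", "politics", "sports", "sports", "health"]

# average="weighted" weights each class's F1 by its support, matching the
# "Weighted F1 [%]" metric in the tables above (multiply by 100 for percent).
print(100 * f1_score(y_true, y_pred, average="weighted"))
```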
|
|
|
- **Limitations:** |
|
|
|
Although efforts have been made to include a wide range of South African languages, the model's coverage may still be limited for certain languages and dialects. We note that the training set was largely dominated by Setswana and IsiXhosa.
|
|
|
We also acknowledge the potential to further improve the model by training it on more data, including additional domains and topics. |
|
|
|
As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity. |
|
|
|
## Training Data |
|
|
|
The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data