Ndamulelo Nemakhavhani committed 804387b (parent: 5962540): Adding model card

More information on the model is available in this presentation on YouTube:
# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages
Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset covering several Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to use language models to advance NLP applications in Tshivenda, and to serve as a benchmark for future work covering Bantu languages.
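
Masked language modelling, the pre-training objective behind XLM-R-style models like these, hides a fraction of input tokens and trains the model to recover them. The toy sketch below illustrates only the corruption step; the 15% rate follows the BERT/XLM-R convention, and whitespace splitting stands in for real subword tokenisation:

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Toy MLM corruption: hide roughly mask_prob of the tokens and
    return the corrupted sequence plus the labels to be recovered."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)    # scored position: model must predict this
        else:
            corrupted.append(tok)
            labels.append(None)   # ignored in the loss
    return corrupted, labels

# Naive whitespace tokens, purely for illustration
corrupted, labels = mask_tokens("the model learns to fill in hidden words".split(), seed=1)
```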
## Model Overview
This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:
- Zabantu-VEN: A monolingual language model trained on 73k raw sentences in Tshivenda
- Zabantu-NSO: A monolingual language model trained on 179k raw sentences in Sepedi
- Zabantu-NSO+VEN: A bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
- Zabantu-SOT+VEN: A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
- Zabantu-BANTU: A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages
## Model Details
- **Model Name:** Zabantu-XLMR
- **Model Version:** 1.0.0
- **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80 - 250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.
## Intended Use
The Zabantu models are intended for various NLP tasks involving Tshivenda and related South African languages. They can be fine-tuned for a variety of downstream tasks, such as:
- Text classification and sentiment analysis in Tshivenda and related languages.
- Named Entity Recognition (NER) for identifying entities in Tshivenda text.
- Machine Translation between Tshivenda and other South African languages.
- Cross-lingual document retrieval and question answering.
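
As masked language models, the checkpoints can be queried directly for fill-mask inference with the Hugging Face `transformers` library. A minimal sketch; note that the model identifier below is a placeholder, not a published checkpoint name:

```python
def top_predictions(text: str, model_id: str, k: int = 5):
    """Return the k most likely fillers for the <mask> token in `text`.

    `model_id` is a placeholder; substitute the actual Zabantu checkpoint
    identifier. Assumes the `transformers` package is installed.
    """
    from transformers import pipeline
    unmasker = pipeline("fill-mask", model=model_id, top_k=k)
    return [(p["token_str"], p["score"]) for p in unmasker(text)]

# Example call (requires a real checkpoint identifier):
# top_predictions("Some sentence with a <mask> token.", "path/to/zabantu-checkpoint")
```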
## Performance and Limitations
- **Performance:** The Zabantu models show promising performance on downstream NLP tasks such as news topic classification, where they are competitive with similar pre-trained cross-lingual models such as [AfriBERTa](https://huggingface.co/castorini/afriberta_base) and [AfroXLMR](https://huggingface.co/Davlan/afro-xlmr-base).
**Monolingual test F1 scores on News Topic Classification**
| Weighted F1 [%] | Afriberta-large | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------------|----------|----------------|----------------|---------------|
| nso | 71.4 | 71.6 | 74.3 | 69 | 70.6 |
| ven | 74.3 | 74.1 | 77 | 76 | 75.6 |
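
For reference, weighted F1 averages the per-class F1 scores using each class's share of the test examples as the weight. A minimal self-contained sketch with made-up class scores (not the actual evaluation data):

```python
def weighted_f1(per_class):
    """per_class: list of (f1, support) pairs; returns support-weighted F1."""
    total = sum(n for _, n in per_class)
    return sum(f1 * n for f1, n in per_class) / total

# Illustrative three-class example (invented numbers, for demonstration only)
score = weighted_f1([(0.80, 50), (0.60, 30), (0.70, 20)])
print(round(100 * score, 1))  # prints 72.0
```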
**Few-shot (50 shots) test F1 scores on News Topic Classification**
| Weighted F1 [%] | Afriberta | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------|----------|----------------|----------------|---------------|
| ven | 60 | 62 | 66 | 69 | 55 |
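
"50 shots" means the classifier is fine-tuned on only a small number of labelled examples. A common protocol, assumed here rather than confirmed by this card, is to sample a fixed number of training examples per class:

```python
import random

def sample_few_shot(dataset, shots_per_class=50, seed=0):
    """Illustrative few-shot split: keep `shots_per_class` labelled
    examples per class; `dataset` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in dataset:
        by_label.setdefault(label, []).append(text)
    subset = []
    for label, texts in by_label.items():
        chosen = rng.sample(texts, min(shots_per_class, len(texts)))
        subset.extend((t, label) for t in chosen)
    return subset

# Toy example: 2 classes, 2 shots each -> 4 training examples
data = [(f"doc{i}", i % 2) for i in range(10)]
print(len(sample_few_shot(data, shots_per_class=2)))  # prints 4
```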
- **Limitations:**
Although efforts have been made to include a wide range of South African languages, the models' coverage may still be limited for certain languages and dialects. We note that the training set was largely dominated by Setswana and isiXhosa.
We also acknowledge the potential to further improve the model by training it on more data, including additional domains and topics.
As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.
## Training Data
The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data