Ndamulelo Nemakhavhani committed c0d385d (parent: 667902f): Update README.md

README.md after this commit:

tags:
- masked-language-model
- south africa
---

# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages

> Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset comprising various subsets of the Bantu languages spoken in South Africa. These models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to use these language models to advance NLP applications in Tshivenda and to serve as a benchmark for future work covering Bantu languages.

# Model Details

- **Model Name:** Zabantu-XLM-Roberta
- **Model Version:** 0.0.1
- **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80-250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.
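
Because the checkpoints use the XLM-RoBERTa architecture, they can also be loaded directly for feature extraction. Here is a minimal sketch (not from the original card); the model ID is taken from the variant list below, and the Sepedi input sentence is only a hypothetical placeholder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model ID taken from the variant list below; any of the checkpoints
# loads the same way.
model_name = "dsfsi/zabantu-nso-80m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# "Dumela lefase!" is a hypothetical Sepedi input; substitute your own text.
inputs = tokenizer("Dumela lefase!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states into one sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```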

## Model Variants

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:

- [Zabantu-VEN](https://huggingface.co/dsfsi/zabantu-ven-120m): A monolingual language model trained on 73k raw sentences in Tshivenda
- [Zabantu-NSO](https://huggingface.co/dsfsi/zabantu-nso-80m): A monolingual language model trained on 179k raw sentences in Sepedi
- [Zabantu-NSO+VEN](https://huggingface.co/dsfsi/zabantu-nso-ven-170m): A bilingual language model trained on 179k raw sentences in Sepedi and 73k raw sentences in Tshivenda
- [Zabantu-SOT+VEN](https://huggingface.co/dsfsi/zabantu-sot-ven-170m): A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
- [Zabantu-BANTU](https://huggingface.co/dsfsi/zabantu-bantu-250m): A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages
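
All of these variants are hosted on the Hugging Face Hub, so they can be queried with the standard `transformers` fill-mask pipeline. The snippet below is a minimal sketch (not from the original card); the model ID comes from the list above, while the Tshivenda example sentence is a hypothetical placeholder:

```python
from transformers import pipeline

# Load one of the Zabantu checkpoints from the Hub; any of the model IDs
# listed above works the same way.
unmasker = pipeline("fill-mask", model="dsfsi/zabantu-ven-120m")

# XLM-RoBERTa-style models use "<mask>" as the mask token. The sentence
# below is a hypothetical placeholder; substitute a real sentence in the
# variant's target language.
predictions = unmasker("Muthu u fanela u <mask> vhathu vhanwe.")

for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```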

## Intended Use

The Zabantu models are intended to be used for various NLP tasks involving Tshivenda and related South African languages. In addition, the models can be fine-tuned on a variety of downstream tasks, such as:

[...]

As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.
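
For the fine-tuning use case mentioned above, here is a hedged sketch (not from the original card) of adapting one of the checkpoints to a text-classification task with the `transformers` `Trainer`. The model ID comes from the variant list; the CSV files, label count, and hyperparameters are hypothetical:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Any of the Zabantu variants can serve as the base model.
model_name = "dsfsi/zabantu-sot-ven-170m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=3 is a hypothetical label count for an illustrative task.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical labelled data with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="zabantu-finetuned", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```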

# Training Data

The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data covers a wide range of topics and domains, notably religion, politics, academia, and health (mostly Covid-19).

<hr/>

# Closing Remarks

The Zabantu models provide a valuable resource for advancing Tshivenda NLP coverage and promoting cross-lingual learning techniques for South African languages. They have the potential to enhance various NLP applications, foster linguistic diversity, and contribute to the development of language technologies in the South African context.