Ndamulelo Nemakhavhani committed
Commit · 667902f
Parent(s): 804387b
Update README.md

README.md CHANGED
@@ -1,3 +1,19 @@
+---
+license: cc
+language:
+- ve
+- ts
+- zu
+- xh
+- nso
+- tn
+library_name: transformers
+tags:
+- tshivenda
+- low-resource
+- masked-language-model
+- south africa
+---
 # Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages
 
 Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset spanning several Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated that the XLM-R architecture can be trained effectively on a much smaller dataset. The focus of this work is to advance NLP applications in Tshivenda and to serve as a benchmark for future work on Bantu languages.
@@ -56,4 +72,4 @@ As with any language model, the generated output should be carefully reviewed an
 
 ## Training Data
 
-The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), [Leipzig Corpora](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php) and various South African government websites. The training data
+The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), [Leipzig Corpora](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php) and various South African government websites. The training data
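
Since the new front matter declares `library_name: transformers` and tags the checkpoints as masked language models, a fill-mask query is the natural smoke test. Below is a minimal sketch; the model ID `dsfsi/zabantu-xlm-roberta` and the Tshivenda prompt are illustrative placeholders not confirmed by this commit, so substitute the actual checkpoint name from the repository.

```python
# Minimal fill-mask sketch for a Zabantu checkpoint.
# NOTE: "dsfsi/zabantu-xlm-roberta" is a hypothetical model ID used for
# illustration only -- replace it with the real checkpoint name.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="dsfsi/zabantu-xlm-roberta")

# Read the mask token from the tokenizer rather than hard-coding it;
# XLM-R style tokenizers use "<mask>", but this keeps the sketch robust.
mask = unmasker.tokenizer.mask_token

# Illustrative Tshivenda input ("Ndo livhuwa nga maanda" ~ "thank you very much").
for pred in unmasker(f"Ndo {mask} nga maanda."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```

Each prediction carries the filled-in token and its probability, which gives a quick sanity check that a given checkpoint actually covers the language being queried.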