Ndamulelo Nemakhavhani committed 02a7e31 (parent: c0d385d): Update README.md
- tn
library_name: transformers
tags:
- low-resource
- masked-language-model
- south africa
- tshivenda
---

# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages

- **Model Name:** Zabantu-XLM-Roberta
- **Model Version:** 0.0.1
- **Model Architecture:** [XLM-RoBERTa](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80 - 250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.

## Model Variants

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:

## Intended Use

Like any [Masked Language Model (MLM)](https://huggingface.co/docs/transformers/tasks/masked_language_modeling), Zabantu models can be adapted to a variety of semantic tasks such as:

- Text Classification/Categorization: Assigning categories or labels to a whole document, or sections of a document, based on its content.
- Sentiment Analysis: Determining the sentiment of a text, such as whether the opinion is positive, negative, or neutral.
- Named Entity Recognition (NER): Identifying and classifying key information (entities) in text into predefined categories such as names of people, organizations, locations, expressions of time, quantities, monetary values, and percentages.
- Part-of-Speech Tagging (POS): Assigning a word type (noun, verb, adjective, etc.) to each word, based on both its definition and its context.
- Semantic Text Similarity: Measuring how similar two pieces of text are, which is useful in applications such as information retrieval, document clustering, and duplicate detection.
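The tasks above all start from the model's fill-mask pretraining objective. As a minimal sketch of querying a checkpoint with the `transformers` fill-mask pipeline (the checkpoint id `your-org/zabantu-xlm-roberta` and the example sentence below are placeholders, not the published names):

```python
from transformers import pipeline

# Placeholder id -- substitute the actual Zabantu checkpoint name from the Hub.
MODEL_ID = "your-org/zabantu-xlm-roberta"

def top_fill_mask(text: str, k: int = 3):
    """Return the top-k predictions for the single <mask> token in `text`.

    XLM-RoBERTa tokenizers use "<mask>" as their mask token.
    """
    unmasker = pipeline("fill-mask", model=MODEL_ID)
    return [(p["token_str"], round(p["score"], 3)) for p in unmasker(text)[:k]]

if __name__ == "__main__":
    # Illustrative masked sentence -- replace with your own text
    # in any of the supported languages.
    print(top_fill_mask("Tshivenda ndi luambo lwa <mask>."))
```

For the classification, NER, and POS tasks listed above, the same checkpoint can instead be loaded with `AutoModelForSequenceClassification` or `AutoModelForTokenClassification` and fine-tuned on labelled data.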

## Performance and Limitations

- **Limitations:**

  - Although efforts have been made to include a wide range of South African languages, the model's coverage may still be limited for certain dialects. We note that the training set was largely dominated by Setswana and IsiXhosa.
  - We also acknowledge the potential to further improve the model by training it on more data, including additional domains and topics.
  - As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.

# Training Data

# Closing Remarks

The Zabantu models provide a valuable resource for advancing Tshivenda NLP coverage and promoting cross-lingual learning techniques for South African languages. They have the potential to enhance various NLP applications, foster linguistic diversity, and contribute to the development of language technologies in the South African context.