## Model Details

The banT5 model is a Bangla adaptation of the T5 (Text-to-Text Transfer Transformer) model, originally introduced by researchers at Google. T5 is a unified language model designed to frame all natural language processing (NLP) tasks as text-to-text problems, which allows it to handle a variety of tasks by simply altering the input and output formats.

banT5 is specifically trained on a curated Bangla text corpus to deliver state-of-the-art performance on tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Question Answering, and Paraphrase Identification.
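To make the text-to-text framing concrete, here is an illustrative sketch of how different tasks reduce to string-in, string-out pairs. The task prefixes and tag formats below are hypothetical; banT5's actual prompt conventions are not documented in this excerpt.

```python
# Hypothetical input/output pairs under the text-to-text paradigm.
# Prefixes like "ner:" and "question:" are illustrative, not banT5's API.
text_to_text_examples = {
    # NER: the model generates the entity-annotated string as output text
    "ner: রবীন্দ্রনাথ ঠাকুর কলকাতায় জন্মগ্রহণ করেন":
        "রবীন্দ্রনাথ ঠাকুর [PER] কলকাতায় [LOC] জন্মগ্রহণ করেন",
    # Question answering: the answer is generated as output text
    "question: বাংলাদেশের রাজধানী কী?": "ঢাকা",
}
```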
## Training Data

The banT5 model was pre-trained on a large-scale Bangla text dataset, amounting to **27 GB** of raw data. After cleaning and normalization, the processed dataset increased to **36 GB**. Below is an overview of the data cardinalities:

| **Metric**          | **Count**                    |
|---------------------|------------------------------|
| **Total words**     | 1,646,252,743 (1.65 billion) |
| **Unique words**    | 15,223,848 (15.23 million)   |
| **Total sentences** | 131,412,177 (131.4 million)  |
| **Total documents** | 7,670,661 (7.67 million)     |
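For a quick sense of scale, the reported counts imply the following averages (simple arithmetic on the numbers above):

```python
# Averages derived from the corpus statistics in the table above.
total_words = 1_646_252_743
total_sentences = 131_412_177
total_documents = 7_670_661

print(round(total_words / total_sentences, 1))      # ~12.5 words per sentence
print(round(total_sentences / total_documents, 1))  # ~17.1 sentences per document
print(round(total_words / total_documents))         # ~215 words per document
```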
## Results

The banT5 model demonstrated strong performance on downstream tasks, as summarized below:

| **Task**                           | **Metric** | **Value** |
|------------------------------------|------------|-----------|
| **Named Entity Recognition (NER)** | Precision  | 0.8882    |
|                                    | Recall     | 0.8563    |
|                                    | Macro F1   | 0.8686    |
| **Part-of-Speech (POS) Tagging**   | Precision  | 0.8813    |
|                                    | Recall     | 0.8813    |
|                                    | Macro F1   | 0.8791    |
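For reference, metrics like these are typically computed by comparing predicted and gold tag sequences token by token. The README does not name the evaluation tooling, so the scikit-learn call below is an assumption, for illustration only:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold and predicted NER tags, flattened across all tokens.
y_true = ["B-PER", "I-PER", "O", "B-LOC", "O"]
y_pred = ["B-PER", "I-PER", "O", "O", "O"]

# Macro averaging computes per-label scores first, then averages them,
# matching the "Macro F1" rows in the table above.
precision, recall, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} macro_f1={macro_f1:.4f}")
```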
## Using this model in `transformers`

```bash
# Install the Hugging Face transformers library (the original command is
# truncated in this excerpt; the standard install is assumed)
pip install transformers
```
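A minimal loading-and-generation sketch might look like the following. The checkpoint identifier and the Bangla prompt are placeholders, since the exact Hub model ID and prompt format are not shown in this excerpt:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder ID: substitute the actual banT5 checkpoint name on the Hub.
model_name = "banT5"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5-style models are text-to-text: the task is encoded in the input string
# and the answer comes back as generated text.
input_text = "বাংলাদেশের রাজধানী কী?"  # illustrative Bangla prompt
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```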