# Model Card: banT5

## Model Details
The banT5 model is a Bangla adaptation of the T5 (Text-To-Text Transfer Transformer) model, originally introduced by researchers at Google. T5 is a unified language model designed to frame all natural language processing (NLP) tasks as text-to-text problems. This allows the model to handle a variety of tasks by simply altering the input and output formats.
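
To make the text-to-text framing concrete, here is a minimal sketch of how different tasks can be reduced to (input string, output string) pairs. The task prefixes (`ner`, `pos`) and the target label format are hypothetical and chosen only for illustration; this card does not specify the prefixes or output formats banT5 was trained with.

```python
from dataclasses import dataclass


@dataclass
class TextToTextExample:
    source: str  # string fed to the encoder
    target: str  # string the decoder is trained to produce


def make_example(task_prefix: str, text: str, target: str) -> TextToTextExample:
    """Prepend a task prefix so a single model can tell tasks apart."""
    return TextToTextExample(source=f"{task_prefix}: {text}", target=target)


# Hypothetical prefixes and label strings, not taken from the banT5 card.
sentence = "ঢাকা বাংলাদেশের রাজধানী"
ner_example = make_example("ner", sentence, "ঢাকা=LOC বাংলাদেশের=LOC রাজধানী=O")
pos_example = make_example("pos", sentence, "ঢাকা=NNP বাংলাদেশের=NNP রাজধানী=NN")

for example in (ner_example, pos_example):
    print(example.source, "->", example.target)
```

Because both tasks share the same string-in, string-out interface, switching tasks only changes the prefix and the expected target string, not the model architecture.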

banT5 is trained on a curated Bangla text corpus to deliver state-of-the-art performance on tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Question Answering, and Paraphrase Identification.

## Training Data
The banT5 model was pre-trained on a large-scale Bangla text dataset comprising **27 GB** of raw data. After cleaning and normalization, the processed dataset increased to **36 GB**. The corpus statistics are summarized below:

| **Metric**           | **Count**                       |
|-----------------------|---------------------------------|
| **Total words**       | 1,646,252,743 (1.65 billion)   |
| **Unique words**      | 15,223,848 (15.23 million)     |
| **Total sentences**   | 131,412,177 (131.4 million)    |
| **Total documents**   | 7,670,661 (7.67 million)       |
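
A few derived ratios help put these counts in perspective. The sketch below computes them from the table; the ratios themselves are not reported in this card, and the variable names are ours.

```python
# Counts copied from the table above.
total_words = 1_646_252_743
unique_words = 15_223_848
total_sentences = 131_412_177
total_documents = 7_670_661

# Derived ratios (not stated in the card, computed here for illustration).
words_per_sentence = total_words / total_sentences
sentences_per_document = total_sentences / total_documents
words_per_document = total_words / total_documents
type_token_ratio = unique_words / total_words

print(f"Average words per sentence:     {words_per_sentence:.2f}")
print(f"Average sentences per document: {sentences_per_document:.2f}")
print(f"Average words per document:     {words_per_document:.2f}")
print(f"Unique-to-total word ratio:     {type_token_ratio:.4f}")
```
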

## Results
On downstream evaluation, banT5 achieved the following scores:

| Task                           | Precision | Recall  | F1     |
|--------------------------------|-----------|---------|--------|
| Named Entity Recognition (NER) | 0.8882    | 0.8563  | 0.8686 |
| Part-of-Speech (POS) Tagging   | 0.8813    | 0.8813  | 0.8791 |
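
The card does not state how these scores were averaged across classes. One thing worth noting when reading the table is that the reported F1 values are not the harmonic mean of the reported precision and recall, which would be consistent with (though it does not prove) scores being averaged per class before reporting. The check below is a small sketch; the helper name `f1_from` is ours.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "NER": (0.8882, 0.8563, 0.8686),
    "POS": (0.8813, 0.8813, 0.8791),
}

for task, (p, r, f1) in reported.items():
    print(f"{task}: harmonic mean = {f1_from(p, r):.4f}, reported F1 = {f1:.4f}")
```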

## Using this model in `transformers`

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model weights from the Hugging Face Hub
model_name = "banglagov/banT5-Base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example input text
input_text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"

# Tokenize the input into a PyTorch tensor of token IDs
input_ids = tokenizer.encode(input_text, return_tensors="pt")

print("input_ids :", input_ids)
```