Model Card: banT5

Model Details

The banT5 model is a Bangla adaptation of the T5 (Text-To-Text Transfer Transformer) model, originally introduced by researchers at Google. T5 is a unified language model designed to frame all natural language processing (NLP) tasks as text-to-text problems. This allows the model to handle a variety of tasks by simply altering the input and output formats.
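
To make the text-to-text framing concrete, here is a minimal sketch that casts two different tasks as plain input/target string pairs. The task prefixes (`ner`, `qa`), the `to_text_pair` helper, and the label format are hypothetical and chosen only for illustration; this card does not document the prompts used with banT5.

```python
from dataclasses import dataclass


@dataclass
class TextPair:
    """One example in text-to-text form."""
    source: str  # the string the model reads
    target: str  # the string the model is asked to produce


def to_text_pair(task: str, text: str, answer: str) -> TextPair:
    # Hypothetical convention: prepend the task name so a single model
    # can tell tasks apart from the input string alone.
    return TextPair(source=f"{task}: {text}", target=answer)


# Two different tasks share the same string-in / string-out interface.
ner_example = to_text_pair(
    "ner",
    "ঢাকা বাংলাদেশের রাজধানী",
    "ঢাকা=LOC বাংলাদেশের=LOC রাজধানী=O",  # illustrative label format
)
qa_example = to_text_pair(
    "qa",
    "question: বাংলাদেশের রাজধানী কী? context: ঢাকা বাংলাদেশের রাজধানী",
    "ঢাকা",
)

for example in (ner_example, qa_example):
    print(example.source, "->", example.target)
```

Only the strings change between tasks; the model architecture and the training objective stay the same.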

banT5 is trained on a curated Bangla text corpus with the aim of delivering state-of-the-art performance on tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Question Answering, and Paraphrase Identification.

Training Data

The banT5 model was pre-trained on a large-scale Bangla text dataset amounting to 27 GB of raw data. After cleaning and normalization, the processed dataset grew to 36 GB. The corpus statistics are summarized below:

Metric	Count
Total words	1,646,252,743 (1.65 billion)
Unique words	15,223,848 (15.22 million)
Total sentences	131,412,177 (131.4 million)
Total documents	7,670,661 (7.67 million)
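
A few derived ratios help put these counts in perspective. The sketch below computes them directly from the table above; the variable names are illustrative, and the ratios are simple averages over the whole corpus rather than figures reported by the authors. They work out to roughly 12.5 words per sentence and about 17 sentences per document.

```python
# Counts copied from the table above.
total_words = 1_646_252_743
unique_words = 15_223_848
total_sentences = 131_412_177
total_documents = 7_670_661

# Corpus-wide averages (not per-document distributions).
words_per_sentence = total_words / total_sentences
sentences_per_document = total_sentences / total_documents
words_per_document = total_words / total_documents

# Share of running words that are distinct word forms (type/token ratio).
type_token_ratio = unique_words / total_words

print(f"words per sentence:     {words_per_sentence:.2f}")
print(f"sentences per document: {sentences_per_document:.2f}")
print(f"words per document:     {words_per_document:.2f}")
print(f"type/token ratio:       {type_token_ratio:.4%}")
```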

Results

On downstream tasks, banT5 reached F1 scores of 0.8686 for NER and 0.8791 for POS tagging. Full results are summarized below:

Task	Precision	Recall	F1
Named Entity Recognition (NER)	0.8882	0.8563	0.8686
Part-of-Speech (POS) Tagging	0.8813	0.8813	0.8791
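
F1 is the harmonic mean of precision and recall. The sketch below applies that formula to the precision and recall values in the table. The results differ slightly from the reported F1 column (about 0.872 versus 0.8686 for NER). That would be consistent with the scores having been averaged per class before aggregation, but the card does not state the averaging scheme, so this is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above.
ner_f1 = f1_score(0.8882, 0.8563)
pos_f1 = f1_score(0.8813, 0.8813)

print(f"NER F1 from P and R: {ner_f1:.4f}")
print(f"POS F1 from P and R: {pos_f1:.4f}")
```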

Using this model in `transformers`

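from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model weights from the Hugging Face Hub.
# T5Tokenizer requires the `sentencepiece` package to be installed.
model_name = "banglagov/banT5-Base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example input text (roughly: "As a result, the European Union fears rising
# unemployment and an economic recession next year.")
input_text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"

# Tokenize the text into a PyTorch tensor of token IDs, shape (1, sequence_length).
input_ids = tokenizer.encode(input_text, return_tensors="pt")

print("input_ids :", input_ids)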
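
The snippet above stops at tokenization. As a possible next step, the sketch below wraps encoding, generation, and decoding in one helper. The `generate_text` function and its `max_new_tokens` default are illustrative choices, not part of the model's documented usage. It assumes `torch`, `transformers`, and `sentencepiece` are installed.

```python
def generate_text(text: str,
                  model_name: str = "banglagov/banT5-Base",
                  max_new_tokens: int = 32) -> str:
    """Encode `text`, run generation, and decode the output to a string."""
    # Imported lazily so that defining the helper needs no heavy dependencies.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    # Returns a dict with `input_ids` and `attention_mask` tensors.
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():  # no gradients needed for inference
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop padding and end-of-sequence tokens from the decoded string.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text(input_text)` downloads the checkpoint on first use. A checkpoint that has only been pre-trained will usually need task-specific fine-tuning before its generations are useful for a task such as NER or question answering.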
