Model Card: banT5

Model Details

The banT5 model is a Bangla adaptation of T5 (Text-To-Text Transfer Transformer), originally introduced by researchers at Google. T5 frames every natural language processing (NLP) task as a text-to-text problem: the model reads a string and produces a string, so one model can handle many tasks by changing only the input and output formats.

banT5 is trained on a curated Bangla text corpus to deliver state-of-the-art performance on tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Question Answering, and Paraphrase Identification.
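To make the text-to-text framing concrete, the sketch below casts labelled examples into (source, target) string pairs. The task prefixes (`ner`, `pos`) and the target formats are illustrative assumptions; the card does not document the prefixes or output formats used when fine-tuning banT5.

```python
def to_text_to_text(task: str, text: str, target: str) -> dict:
    """Cast one labelled example into a (source, target) pair of strings."""
    # The "<task>: <text>" prefix convention is an assumption,
    # not banT5's documented input format.
    return {"source": f"{task}: {text}", "target": target}


# Hypothetical examples: every task becomes "string in, string out".
examples = [
    to_text_to_text("ner", "ঢাকা বাংলাদেশের রাজধানী ।", "ঢাকা=LOC বাংলাদেশের=LOC"),
    to_text_to_text("pos", "আমি ভাত খাই ।", "PRON NOUN VERB PUNCT"),
]

for example in examples:
    print(example["source"], "->", example["target"])
```

Because both sides are plain strings, the same tokenizer and the same sequence-to-sequence model can be reused across tasks; only the data preparation changes.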

Training Data

The banT5 model was pre-trained on a large-scale Bangla text corpus of 27 GB of raw data. After cleaning and normalization, the processed dataset grew to 36 GB. The corpus statistics are summarized below:

| Metric | Count |
| --- | --- |
| Total words | 1,646,252,743 (1.65 billion) |
| Unique words | 15,223,848 (15.23 million) |
| Total sentences | 131,412,177 (131.4 million) |
| Total documents | 7,670,661 (7.67 million) |
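The card does not say how words and sentences were counted. As a rough sketch of how such statistics could be computed, the function below assumes whitespace tokenization and sentence boundaries at the Bangla danda (`।`), `?`, or `!`. The name `corpus_stats` and both conventions are assumptions, not the authors' pipeline.

```python
import re
from collections import Counter

# Assumed sentence terminators: Bangla danda, question mark, exclamation mark.
SENTENCE_END = re.compile(r"[।?!]+")


def corpus_stats(documents):
    """Count words, unique words, sentences, and documents in a list of strings."""
    vocab = Counter()
    n_sentences = 0
    for doc in documents:
        # Split on terminators and drop empty fragments.
        sentences = [s.strip() for s in SENTENCE_END.split(doc) if s.strip()]
        n_sentences += len(sentences)
        for sentence in sentences:
            vocab.update(sentence.split())  # whitespace tokenization (assumption)
    return {
        "total_words": sum(vocab.values()),
        "unique_words": len(vocab),
        "total_sentences": n_sentences,
        "total_documents": len(documents),
    }


print(corpus_stats(["আমি ভাত খাই। আমি বই পড়ি।", "সে বই পড়ে।"]))
```

Dividing the reported counts gives roughly 12.5 words per sentence and about 17.1 sentences per document.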

Results

The banT5 model demonstrated strong performance on downstream tasks, as summarized below:

| Task | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Named Entity Recognition (NER) | 0.8882 | 0.8563 | 0.8686 |
| Part-of-Speech (POS) Tagging | 0.8813 | 0.8813 | 0.8791 |
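The card does not state how the scores were averaged or whether NER was scored per token or per entity. The reported F1 values are not the harmonic mean of the reported precision and recall (for NER that would be about 0.872), which is consistent with averaging per-class scores, though this is an inference. The sketch below shows token-level, macro-averaged precision, recall, and F1 under that assumption; `macro_prf` is a hypothetical helper, not the authors' evaluation code.

```python
from collections import Counter


def macro_prf(gold, pred):
    """Token-level macro-averaged precision, recall, and F1 over all labels."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g

    def safe_div(a, b):
        return a / b if b else 0.0

    per_label = []
    for label in labels:
        precision = safe_div(tp[label], tp[label] + fp[label])
        recall = safe_div(tp[label], tp[label] + fn[label])
        f1 = safe_div(2 * precision * recall, precision + recall)
        per_label.append((precision, recall, f1))

    n = len(labels)
    # Average each metric separately across labels.
    return tuple(sum(scores[i] for scores in per_label) / n for i in range(3))


gold = ["NOUN", "NOUN", "VERB", "VERB"]
pred = ["NOUN", "VERB", "VERB", "VERB"]
print(macro_prf(gold, pred))
```

With macro averaging, F1 is the mean of the per-label F1 scores, so it generally differs from the harmonic mean of the averaged precision and recall.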

Using this model in transformers

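# Requires the transformers, sentencepiece, and torch packages.
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and the model weights from the Hugging Face Hub.
model_name = "banglagov/banT5-Base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)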

# Example input text
input_text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"

# Encode the text as a PyTorch tensor of token IDs with shape (1, sequence_length).
input_ids = tokenizer.encode(input_text, return_tensors="pt")

print("input_ids:", input_ids)
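The snippet above loads the model but stops at tokenization. A minimal sketch of the remaining step, passing the token IDs through `generate` and decoding the result, could look like the following. The helper name `generate_text` and the `max_new_tokens` value are assumptions. The card also does not say whether this checkpoint is fine-tuned for any task, so the generated text may not be meaningful without task-specific fine-tuning.

```python
def generate_text(tokenizer, model, text: str, max_new_tokens: int = 32) -> str:
    """Encode `text`, run generation, and decode the first returned sequence."""
    # Calling the tokenizer returns an encoding whose `input_ids` is a batch of size 1.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # `generate` returns a batch of output token-ID sequences.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Skip padding and end-of-sequence markers when converting IDs back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage with the objects created in the snippet above (downloads the checkpoint):
# print(generate_text(tokenizer, model, input_text))
```

The helper takes the tokenizer and model as arguments so that it can be reused with a fine-tuned checkpoint loaded the same way.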