# Model Card: banT5

## Model Details

The banT5 model is a Bangla adaptation of the T5 (Text-To-Text Transfer Transformer) model, originally introduced by researchers at Google. T5 is a unified language model designed to frame all natural language processing (NLP) tasks as text-to-text problems, which lets a single model handle a variety of tasks by simply altering the input and output formats. banT5 is specifically trained on a curated Bangla text corpus to deliver state-of-the-art performance on tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Question Answering, and Paraphrase Identification.

## Training Data

The banT5 model was pre-trained on a large-scale Bangla text dataset, amounting to **27 GB** of raw data; after cleaning and normalization, the processed dataset grew to **36 GB**. Below is an overview of the data cardinalities:

| **Metric**          | **Count**                    |
|---------------------|------------------------------|
| **Total words**     | 1,646,252,743 (1.65 billion) |
| **Unique words**    | 15,223,848 (15.23 million)   |
| **Total sentences** | 131,412,177 (131.4 million)  |
| **Total documents** | 7,670,661 (7.67 million)     |

## Results

The banT5 model demonstrated strong performance on downstream tasks, as summarized below:

| Task                           | Precision | Recall | F1     |
|--------------------------------|-----------|--------|--------|
| Named Entity Recognition (NER) | 0.8882    | 0.8563 | 0.8686 |
| Part-of-Speech (POS) Tagging   | 0.8813    | 0.8813 | 0.8791 |

## Using this model in `transformers`

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "banglagov/banT5-Base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example input text (Bangla): "As a result, the European Union fears a rise
# in the unemployment rate and an economic recession next year."
input_text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
print("input_ids :", input_ids)
```
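To turn the encoded input into generated text, the checkpoint can be run through the standard `generate` API of `transformers`. The snippet below is a minimal sketch continuing from the example above; since this card describes the pretrained base checkpoint rather than a task fine-tuned one, the raw output reflects the pretraining objective and is not task-specific.

```python
# Minimal sketch, continuing from the snippet above (reuses `model`,
# `tokenizer`, and `input_ids`). Assumption: this is the base pretrained
# checkpoint, so raw generations are not tuned to any downstream task.
output_ids = model.generate(input_ids, max_new_tokens=50)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("generated :", generated_text)
```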
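Because T5 frames every task as text-to-text, fine-tuning data for the tasks listed in this card is prepared as plain input/target string pairs rather than task-specific label structures. The sketch below illustrates that convention for POS tagging; note that the `pos:` task prefix and the tag labels are hypothetical placeholders for illustration, not a format documented for banT5.

```python
from transformers import T5Tokenizer

# Hypothetical illustration of the text-to-text convention: a task prefix
# plus the input sentence maps to a plain-text label sequence. The "pos:"
# prefix and the tag set are assumptions for illustration only.
tokenizer = T5Tokenizer.from_pretrained("banglagov/banT5-Base")

source = "pos: আমি ভাত খাই ।"     # "I eat rice ." with a task prefix
target = "PRON NOUN VERB PUNCT"   # labels expressed as output text

batch = tokenizer(source, text_target=target, return_tensors="pt")
print(batch["input_ids"].shape)   # encoder input ids
print(batch["labels"].shape)      # decoder target ids for fine-tuning
```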