banglat5_banglaparaphrase

This repository contains the BanglaT5 checkpoint finetuned on the BanglaParaphrase dataset. BanglaT5 is a sequence-to-sequence transformer model pretrained with the "Span Corruption" objective. The finetuned model achieves competitive results on the dataset; see the Benchmarks section for scores.

For finetuning and inference, refer to the scripts in the official GitHub repository of BanglaNLG.

Note: This model was pretrained using a specific normalization pipeline, available at https://github.com/csebuetnlp/normalizer. All finetuning scripts in the official GitHub repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized with this pipeline before tokenizing to get the best results. A basic example is given below:

Using this model in transformers

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from normalizer import normalize # pip install git+https://github.com/csebuetnlp/normalizer

model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_banglaparaphrase")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_banglaparaphrase", use_fast=False)

input_sentence = ""  # replace with the Bangla sentence to paraphrase
input_ids = tokenizer(normalize(input_sentence), return_tensors="pt").input_ids
generated_tokens = model.generate(input_ids)
decoded_tokens = tokenizer.batch_decode(generated_tokens)[0]

print(decoded_tokens)
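
The example above handles a single sentence. For several sentences at once, a minimal sketch could look like the following. The helper name `paraphrase_batch` and its defaults (`num_beams=5`, `max_length=128`) are illustrative assumptions, not settings taken from the official scripts; the `model`, `tokenizer`, and `normalize` objects are expected to be the ones loaded above.

```python
def paraphrase_batch(sentences, model, tokenizer, normalize_fn=None,
                     num_beams=5, max_length=128):
    """Return one generated paraphrase per input sentence.

    Hypothetical helper: the beam width and length limit are assumed
    defaults, not values from the official BanglaNLG scripts.
    """
    # Apply the same normalization the model saw during pretraining.
    if normalize_fn is not None:
        sentences = [normalize_fn(s) for s in sentences]

    # Pad so that sentences of different lengths fit into one batch.
    inputs = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )

    # Beam search generation over the whole batch.
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=num_beams,
        max_length=max_length,
    )

    # Drop padding and end-of-sequence tokens from the decoded text.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the objects from the example above, calling `paraphrase_batch([sentence_1, sentence_2], model, tokenizer, normalize_fn=normalize)` would return a list of two strings.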

Benchmarks

  • Supervised fine-tuning

Test Set          Model        sacreBLEU  ROUGE-L  PINC   BERTScore  BERT-iBLEU
BanglaParaphrase  BanglaT5     32.8       63.58    74.40  94.80      92.18
BanglaParaphrase  IndicBART    5.60       35.61    80.26  91.50      91.16
BanglaParaphrase  IndicBARTSS  4.90       33.66    82.10  91.10      90.95
IndicParaphrase   BanglaT5     11.0       19.99    74.50  94.80      87.738
IndicParaphrase   IndicBART    12.0       21.58    76.83  93.30      90.65
IndicParaphrase   IndicBARTSS  10.7       20.59    77.60  93.10      90.54
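
Of the metrics above, PINC is the least common. It measures how many n-grams of a paraphrase do not appear in the source, so higher values mean more lexical novelty. As usually defined, it averages 1 - |ngrams(source) ∩ ngrams(candidate)| / |ngrams(candidate)| over n-gram orders 1 to N. The sketch below illustrates that formula only; whitespace tokenization, `max_n=4`, and skipping orders longer than the candidate are assumptions, and it is not the evaluation script behind the numbers in the table (which appear to be reported on a 0 to 100 scale).

```python
def ngram_set(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """PINC score in [0, 1]; higher means fewer n-grams copied from source.

    Assumptions: whitespace tokenization, and n-gram orders longer than
    the candidate are skipped rather than counted as zero.
    """
    src_tokens = source.split()
    cand_tokens = candidate.split()

    per_order = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_set(cand_tokens, n)
        if not cand_ngrams:
            continue  # candidate is shorter than n tokens
        overlap = len(cand_ngrams & ngram_set(src_tokens, n))
        per_order.append(1.0 - overlap / len(cand_ngrams))

    return sum(per_order) / len(per_order) if per_order else 0.0
```

An exact copy of the source scores 0.0 and a candidate sharing no tokens with the source scores 1.0, which is why PINC is read alongside a semantic metric such as BERTScore instead of on its own.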

The dataset can be found in the link below:

Citation

If you use this model, please cite the following paper:

@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}