|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- HuggingFaceFW/fineweb-2 |
|
language: |
|
- tr |
|
tags: |
|
- turkish |
|
- ul2 |
|
- t5 |
|
--- |
|
|
|
# BERT5urk |
|
|
|
 |
|
|
|
This repository hosts BERT5urk, a new Turkish T5 model with 1.42B parameters.
|
|
|
BERT5urk is part of the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) family and pretrained using the awesome |
|
[T5X](https://github.com/google-research/t5x) library with the [UL2](https://arxiv.org/abs/2205.05131) objective. |
|
|
|
Inspired by the great [Finnish T5 and UL2 models](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) from the [Finnish NLP](https://huggingface.co/Finnish-NLP) |
|
group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the ["Scale Efficiently"](https://arxiv.org/abs/2109.10686) paper. Many thanks
|
to the [Finnish NLP](https://huggingface.co/Finnish-NLP) group for open-sourcing the pretraining code and models! |
|
|
|
# Pretraining Data |
|
|
|
BERT5urk uses the Turkish part of the amazing [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) corpus. |
|
Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
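
This selection step can be reproduced with the `datasets` library. The following is a minimal sketch, assuming the public FineWeb2 layout (the `tur_Latn` subset name and the `language_score` field):

```python
# Minimal sketch of the document selection step: stream the Turkish part of
# FineWeb2 and keep only documents with a language score above 0.99. The
# subset name "tur_Latn" and the "language_score" field are assumptions
# based on the public FineWeb2 dataset layout.
from datasets import load_dataset

fineweb_tr = load_dataset(
    "HuggingFaceFW/fineweb-2", name="tur_Latn", split="train", streaming=True
)

high_quality = fineweb_tr.filter(lambda doc: doc["language_score"] > 0.99)

# Peek at the first few selected documents.
for doc in high_quality.take(3):
    print(doc["text"][:100])
```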
|
|
|
We train a SentencePiece (SPM) vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
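
A minimal sketch of this vocabulary training step with the `sentencepiece` library; the vocabulary size and model type are assumptions, as the card only states that an SPM-based vocabulary was trained:

```python
# Minimal sketch of training a SentencePiece vocabulary on the 3GB sample.
# The vocabulary size (32k) and unigram model type are assumptions; T5-style
# models typically use a unigram SPM model.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_corpus.txt",      # hypothetical path to the 3GB sample
    model_prefix="bert5urk_spm",
    vocab_size=32_000,               # assumption, not stated in the card
    model_type="unigram",
    input_sentence_size=10_000_000,  # subsample lines to keep training tractable
    shuffle_input_sentence=True,
)
```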
|
|
|
# Pretraining |
|
|
|
BERT5urk was pretrained with the awesome [T5X](https://github.com/google-research/t5x) library. Some pretraining highlights: |
|
|
|
* One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
|
* The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128
|
* The resulting model has 1.42B parameters |
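
The model can be loaded with the Transformers library. Here is a minimal usage sketch, assuming the repository hosts a Transformers-compatible checkpoint; the UL2 mode prefix `[NLG]` is an assumption carried over from other UL2 models and may not apply to this checkpoint:

```python
# Minimal usage sketch; the "[NLG]" mode prefix is an assumption carried over
# from other UL2 models (e.g. the Finnish ones) and may not apply here.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = T5ForConditionalGeneration.from_pretrained("stefan-it/bert5urk")

inputs = tokenizer("[NLG] Türkiye'nin en büyük şehri", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```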
|
|
|
# Evaluation |
|
|
|
Detailed evaluations can be found in the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) repository. For comparison, we also fine-tuned
[TURNA](https://huggingface.co/boun-tabi-LMG/TURNA), another Turkish T5 model with 1.14B parameters.
|
|
|
## Encoder-only Results |
|
|
|
For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we use the awesome [Flair](https://github.com/flairNLP/flair) library and fine-tune only the encoders of BERT5urk and TURNA (see the sketch after the table).
|
The overall performance can be seen in the following table: |
|
|
|
| Model Name | Overall Development | Overall Test | |
|
|-----------------------------------------------------------------------------------------------------------|--------------------:|-------------:| |
|
| [BERTurk (cased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased) | 89.72 | 90.05 | |
|
| [BERTurk (uncased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) | 89.25 | 89.95 | |
|
| [BERTurk (cased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-cased) | 88.98 | 89.49 | |
|
| [BERTurk (uncased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-uncased) | 89.28 | 89.67 | |
|
| [ConvBERTurk (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-cased) | **90.06** | 90.27 | |
|
| [ConvBERTurk mC4 (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-cased) | 90.03 | 90.09 | |
|
| [ConvBERTurk mC4 (uncased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-uncased) | 89.76 | 89.97 | |
|
| [DistilBERTurk (cased)](https://huggingface.co/dbmdz/distilbert-base-turkish-cased) | 87.95 | 88.16 | |
|
| [ELECTRA Base (cased)](https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator) | 89.08 | 89.91 | |
|
| [ELECTRA Base mC4 (cased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-discriminator) | 89.24 | 90.03 | |
|
| [ELECTRA Base mC4 (uncased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-discriminator) | 89.09 | 89.62 | |
|
| [ELECTRA Small (cased)](https://huggingface.co/dbmdz/electra-small-turkish-cased-discriminator) | 87.27 | 88.28 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 89.96 | 90.26 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) | 88.81 | 89.36 | |
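
As a rough illustration of the encoder-only setup, a Flair fine-tuning run could look like the following sketch; the corpus path, column format, and hyperparameters are placeholders, not the exact configuration behind the numbers above:

```python
# Rough sketch of encoder-only fine-tuning with Flair; corpus, column format
# and hyperparameters are placeholders. Flair uses only the encoder stack
# of T5-style checkpoints for embeddings.
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical CoNLL-style NER corpus; substitute the actual dataset.
corpus = ColumnCorpus("data/turkish-ner", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

embeddings = TransformerWordEmbeddings("stefan-it/bert5urk", fine_tune=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("resources/bert5urk-ner", learning_rate=5e-5, mini_batch_size=16)
```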
|
|
|
## Encoder-decoder Results |
|
|
|
We tried to replicate the results from the [TURNA](https://arxiv.org/abs/2401.14373) paper using the [turkish-lm-tuner](https://github.com/boun-tabi-LMG/turkish-lm-tuner) fine-tuning library.
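
As a library-agnostic illustration of this seq2seq fine-tuning setup, the following sketch uses plain Transformers rather than the turkish-lm-tuner API; the dataset columns, example sentences, and hyperparameters are placeholders:

```python
# Library-agnostic sketch of seq2seq fine-tuning (plain Transformers, not the
# turkish-lm-tuner API); dataset columns and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = T5ForConditionalGeneration.from_pretrained("stefan-it/bert5urk")

def preprocess(batch):
    # "source"/"target" are hypothetical column names of a paraphrase dataset.
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Toy stand-in for the real task data.
train = Dataset.from_dict(
    {"source": ["Bu bir örnek cümledir."], "target": ["Bu, örnek bir cümledir."]}
).map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="bert5urk-paraphrase",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```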
|
|
|
### Paraphrasing - Tatoeba |
|
|
|
We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average scores. The scores from the TURNA paper
are also shown in the following table:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 90.22 | 80.23 | 88.95 | 71.14 | 87.56 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 90.36 | 80.50 | 89.10 | 71.48 | 87.63 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 90.47 | 80.78 | 89.21 | 71.89 | 87.74 | |
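
The column names correspond to the standard ROUGE, BLEU, and METEOR metrics. A minimal sketch of computing them with the Hugging Face `evaluate` library (which may differ from the exact evaluation code used here):

```python
# Minimal sketch of computing the reported metric families with the Hugging
# Face evaluate library; may differ from the exact evaluation code used.
import evaluate

predictions = ["Bu, örnek bir cümledir."]
references = ["Bu bir örnek cümledir."]

for name in ("rouge", "bleu", "meteor"):
    metric = evaluate.load(name)
    print(name, metric.compute(predictions=predictions, references=references))
```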
|
|
|
### Paraphrasing - OpenSubtitles |
|
|
|
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 78.43 | 63.58 | 76.81 | 51.47 | 74.79 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 78.36 | 63.42 | 76.71 | 51.39 | 74.94 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 78.56 | 63.80 | 76.95 | 51.74 | 75.07 | |
|
|
|
### Title Generation - TrNews
|
|
|
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 36.47 | 22.88 | 35.47 | 12.64 | 23.62 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 41.65 | 27.60 | 36.77 | 18.60 | 34.55 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.79 | 27.77 | 37.00 | 19.08 | 34.69 | |
|
|
|
### Summarization - TrNews |
|
|
|
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 41.77 | 27.81 | 36.99 | 19.05 | 34.61 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 40.75 | 26.82 | 35.88 | 18.00 | 33.91 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.00 | 27.08 | 36.24 | 18.78 | 23.96 | |
|
|
|
# Acknowledgments |
|
|
|
Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC). |
|
Many thanks for providing access to the TPUs over many years ❤️
|
|
|
Made from the Bavarian Oberland with ❤️ and 🥨.