|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- HuggingFaceFW/fineweb-2 |
|
language: |
|
- tr |
|
tags: |
|
- turkish |
|
- ul2 |
|
- t5 |
|
--- |
|
|
|
# BERT5urk |
|
|
|
 |
|
|
|
This repository hosts BERT5urk, a new Turkish T5 model with 1.42B parameters.
|
|
|
BERT5urk is part of the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) family and pretrained using the awesome |
|
[T5X](https://github.com/google-research/t5x) library with the [UL2](https://arxiv.org/abs/2205.05131) objective. |
|
|
|
Inspired by the great [Finnish T5 and UL2 models](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) from the [Finnish NLP](https://huggingface.co/Finnish-NLP) |
|
group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the ["Scale Efficiently"](https://arxiv.org/abs/2109.10686) paper. Many thanks
|
to the [Finnish NLP](https://huggingface.co/Finnish-NLP) group for open-sourcing the pretraining code and models! |
|
|
|
# Pretraining Data |
|
|
|
BERT5urk uses the Turkish part of the amazing [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) corpus. |
|
Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
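
This selection step can be reproduced with the `datasets` library. The following is a minimal sketch, assuming the public FineWeb2 layout (the `tur_Latn` subset name and the `language_score` field):

```python
# Minimal sketch of the document selection step: stream the Turkish part of
# FineWeb2 and keep only documents with a language score above 0.99. The
# subset name "tur_Latn" and the "language_score" field are assumptions
# based on the public FineWeb2 dataset layout.
from datasets import load_dataset

fineweb_tr = load_dataset(
    "HuggingFaceFW/fineweb-2", name="tur_Latn", split="train", streaming=True
)

high_quality = fineweb_tr.filter(lambda doc: doc["language_score"] > 0.99)

# Peek at the first few selected documents.
for doc in high_quality.take(3):
    print(doc["text"][:100])
```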
|
|
|
We train a SentencePiece (SPM) vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
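
A minimal sketch of this vocabulary training step with the `sentencepiece` library; the vocabulary size and model type are assumptions, as the card only states that an SPM-based vocabulary was trained:

```python
# Minimal sketch of training a SentencePiece vocabulary on the 3GB sample.
# The vocabulary size (32k) and unigram model type are assumptions; T5-style
# models typically use a unigram SPM model.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_corpus.txt",      # hypothetical path to the 3GB sample
    model_prefix="bert5urk_spm",
    vocab_size=32_000,               # assumption, not stated in the card
    model_type="unigram",
    input_sentence_size=10_000_000,  # subsample lines to keep training tractable
    shuffle_input_sentence=True,
)
```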
|
|
|
# Pretraining |
|
|
|
BERT5urk was pretrained with the awesome [T5X](https://github.com/google-research/t5x) library. Some pretraining highlights: |
|
|
|
* One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
|
* The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128
|
* The resulting model has 1.42B parameters |
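
The model can be loaded with the Transformers library. Here is a minimal usage sketch, assuming the repository hosts a Transformers-compatible checkpoint; the UL2 mode prefix `[NLG]` is an assumption carried over from other UL2 models and may not apply to this checkpoint:

```python
# Minimal usage sketch; the "[NLG]" mode prefix is an assumption carried over
# from other UL2 models (e.g. the Finnish ones) and may not apply here.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = T5ForConditionalGeneration.from_pretrained("stefan-it/bert5urk")

inputs = tokenizer("[NLG] Türkiye'nin en büyük şehri", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```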
|
|
|
# Evaluation |
|
|
|
Detailed evaluations can be found in the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) repository. For comparison, we also fine-tuned
[TURNA](https://huggingface.co/boun-tabi-LMG/TURNA), another Turkish T5 model with 1.14B parameters.
|
|
|
## Encoder-only Results |
|
|
|
For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we use the awesome [Flair](https://github.com/flairNLP/flair) library and fine-tune only the encoders of BERT5urk and TURNA (see the sketch after the table).
|
The overall performance can be seen in the following table: |
|
|
|
| Model Name | Overall Development | Overall Test | |
|
|-----------------------------------------------------------------------------------------------------------|--------------------:|-------------:| |
|
| [BERTurk (cased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased) | 89.72 | 90.05 | |
|
| [BERTurk (uncased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) | 89.25 | 89.95 | |
|
| [BERTurk (cased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-cased) | 88.98 | 89.49 | |
|
| [BERTurk (uncased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-uncased) | 89.28 | 89.67 | |
|
| [ConvBERTurk (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-cased) | **90.06** | 90.27 | |
|
| [ConvBERTurk mC4 (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-cased) | 90.03 | 90.09 | |
|
| [ConvBERTurk mC4 (uncased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-uncased) | 89.76 | 89.97 | |
|
| [DistilBERTurk (cased)](https://huggingface.co/dbmdz/distilbert-base-turkish-cased) | 87.95 | 88.16 | |
|
| [ELECTRA Base (cased)](https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator) | 89.08 | 89.91 | |
|
| [ELECTRA Base mC4 (cased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-discriminator) | 89.24 | 90.03 | |
|
| [ELECTRA Base mC4 (uncased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-discriminator) | 89.09 | 89.62 | |
|
| [ELECTRA Small (cased)](https://huggingface.co/dbmdz/electra-small-turkish-cased-discriminator) | 87.27 | 88.28 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 89.96 | 90.26 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) | 88.81 | 89.36 | |
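
As a rough illustration of the encoder-only setup, a Flair fine-tuning run could look like the following sketch; the corpus path, column format, and hyperparameters are placeholders, not the exact configuration behind the numbers above:

```python
# Rough sketch of encoder-only fine-tuning with Flair; corpus, column format
# and hyperparameters are placeholders. Flair uses only the encoder stack
# of T5-style checkpoints for embeddings.
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical CoNLL-style NER corpus; substitute the actual dataset.
corpus = ColumnCorpus("data/turkish-ner", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

embeddings = TransformerWordEmbeddings("stefan-it/bert5urk", fine_tune=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("resources/bert5urk-ner", learning_rate=5e-5, mini_batch_size=16)
```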
|
|
|
## Encoder-decoder Results |
|
|
|
We tried to replicate the results from the [TURNA](https://arxiv.org/abs/2401.14373) paper using the [turkish-lm-tuner](https://github.com/boun-tabi-LMG/turkish-lm-tuner) fine-tuning library.
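
As a library-agnostic illustration of this seq2seq fine-tuning setup, the following sketch uses plain Transformers rather than the turkish-lm-tuner API; the dataset columns, example sentences, and hyperparameters are placeholders:

```python
# Library-agnostic sketch of seq2seq fine-tuning (plain Transformers, not the
# turkish-lm-tuner API); dataset columns and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = T5ForConditionalGeneration.from_pretrained("stefan-it/bert5urk")

def preprocess(batch):
    # "source"/"target" are hypothetical column names of a paraphrase dataset.
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Toy stand-in for the real task data.
train = Dataset.from_dict(
    {"source": ["Bu bir örnek cümledir."], "target": ["Bu, örnek bir cümledir."]}
).map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="bert5urk-paraphrase",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```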
|
|
|
### Paraphrasing - Tatoeba |
|
|
|
We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average scores. The scores from the TURNA paper
are also shown in the following table:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 90.22 | 80.23 | 88.95 | 71.14 | 87.56 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 90.36 | 80.50 | 89.10 | 71.48 | 87.63 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 90.47 | 80.78 | 89.21 | 71.89 | 87.74 | |
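
The column names correspond to the standard ROUGE, BLEU, and METEOR metrics. A minimal sketch of computing them with the Hugging Face `evaluate` library (which may differ from the exact evaluation code used here):

```python
# Minimal sketch of computing the reported metric families with the Hugging
# Face evaluate library; may differ from the exact evaluation code used.
import evaluate

predictions = ["Bu, örnek bir cümledir."]
references = ["Bu bir örnek cümledir."]

for name in ("rouge", "bleu", "meteor"):
    metric = evaluate.load(name)
    print(name, metric.compute(predictions=predictions, references=references))
```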
|
|
|
### Paraphrasing - OpenSubtitles |
|
|
|
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 78.43 | 63.58 | 76.81 | 51.47 | 74.79 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 78.36 | 63.42 | 76.71 | 51.39 | 74.94 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 78.56 | 63.80 | 76.95 | 51.74 | 75.07 | |
|
|
|
### Title Generation - TrNews
|
|
|
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 36.47 | 22.88 | 35.47 | 12.64 | 23.62 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 41.65 | 27.60 | 36.77 | 18.60 | 34.55 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.79 | 27.77 | 37.00 | 19.08 | 34.69 | |
|
|
|
### Summarization - TrNews |
|
|
|
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
|
|
|
| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor | |
|
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:| |
|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 41.77 | 27.81 | 36.99 | 19.05 | 34.61 | |
|
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 40.75 | 26.82 | 35.88 | 18.00 | 33.91 | |
|
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.00 | 27.08 | 36.24 | 18.78 | 23.96 | |
|
|
|
# Acknowledgments |
|
|
|
Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC). |
|
Many thanks for providing access to the TPUs over many years ❤️
|
|
|
Made from the Bavarian Oberland with ❤️ and 🥨.