## Models

There are two multilingual models currently available. We do not plan to release
more single-language models, but we may release `BERT-Large` versions of these
two in the future:

*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
    102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
    Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
    parameters

**The `Multilingual Cased (New)` model also fixes normalization issues in many
languages, so it is recommended in languages with non-Latin alphabets (and is
often better for most languages with Latin alphabets). When using this model,
make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
scripts.**
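
If you drive the tokenizer directly from Python rather than through the
command-line scripts, the same setting applies via the `do_lower_case`
argument. A minimal sketch (the vocab path is a placeholder for wherever you
unpacked the checkpoint):

```python
import tokenization  # tokenization.py from this repository

# The Cased model ships a cased vocabulary, so lower casing and accent
# stripping must stay disabled -- the Python-side equivalent of passing
# --do_lower_case=false to the scripts.
tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=False)
```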

See the [list of languages](#list-of-languages) that the Multilingual model
supports. The Multilingual model does include Chinese (and English), but if your
fine-tuning data is Chinese-only, then the Chinese model will likely produce
better results.

## Results

To evaluate these systems, we use the
[XNLI dataset](https://github.com/facebookresearch/XNLI), which is a version of
[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) where the dev and test
sets have been translated (by humans) into 15 languages. Note that the training
set was *machine* translated (we used the translations provided by XNLI, not
Google NMT). For clarity, we only report on 6 languages below:

<!-- mdformat off(no table) -->

| System                          | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
| XNLI Baseline - Translate Test  | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
| BERT - Translate Train Cased    | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6     |
| BERT - Translate Train Uncased  | 81.4     | 74.2     | 77.3     | 75.2     | 70.5     | 61.7     |
| BERT - Translate Test Uncased   | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
| BERT - Zero Shot Uncased        | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |

<!-- mdformat on -->

The first two rows are baselines from the XNLI paper and the last four rows are
our results with BERT.

**Translate Train** means that the MultiNLI training set was machine translated
from English into the foreign language. So training and evaluation were both
done in the foreign language. Unfortunately, training was done on
machine-translated data, so it is impossible to quantify how much of the lower
accuracy (compared to English) is due to the quality of the machine translation
vs. the quality of the pre-trained model.

**Translate Test** means that the XNLI test set was machine translated from the
foreign language into English. So training and evaluation were both done on
English. However, test evaluation was done on machine-translated English, so the
accuracy depends on the quality of the machine translation system.

**Zero Shot** means that the Multilingual BERT system was fine-tuned on English
MultiNLI, and then evaluated on the foreign language XNLI test. In this case,
machine translation was not involved at all in either the pre-training or
fine-tuning.

Note that the English result is worse than the 84.2 MultiNLI baseline because
this training used Multilingual BERT rather than English-only BERT. This implies
that for high-resource languages, the Multilingual model is somewhat worse than
a single-language model. However, it is not feasible for us to train and
maintain dozens of single-language models. Therefore, if your goal is to
maximize performance with a language other than English or Chinese, you might
find it beneficial to run pre-training for additional steps starting from our
Multilingual model on data from your language of interest.

Here is a comparison of training Chinese models with the Multilingual
`BERT-Base` and Chinese-only `BERT-Base`:

System                  | Chinese
----------------------- | -------
XNLI Baseline           | 67.0
BERT Multilingual Model | 74.2
BERT Chinese-only Model | 77.2

Similar to English, the single-language model does 3% better than the
Multilingual model.

## Fine-tuning Example

The multilingual model does **not** require any special consideration or API
changes. We did update the implementation of `BasicTokenizer` in
`tokenization.py` to support Chinese character tokenization, so please update if
you forked it. However, we did not change the tokenization API.
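
Concretely, the same tokenizer call that works for English works unchanged on
Chinese text. A rough sketch (the vocab path is a placeholder, and the exact
WordPieces you get back depend on the vocabulary you load):

```python
import tokenization  # tokenization.py from this repository

# Placeholder path; point this at the vocab.txt of the checkpoint you use,
# with the matching do_lower_case setting.
tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False)

# BasicTokenizer splits CJK characters into individual tokens, so a Chinese
# sentence with no whitespace still tokenizes without any special handling.
print(tokenizer.tokenize(u"机器学习"))
```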

To test the new models, we did modify `run_classifier.py` to add support for the
[XNLI dataset](https://github.com/facebookresearch/XNLI). This is a 15-language
version of MultiNLI where the dev/test sets have been human-translated, and the
training set has been machine-translated.

To run the fine-tuning code, please download the
[XNLI dev/test set](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip) and the
[XNLI machine-translated training set](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
and then unpack both .zip files into some directory `$XNLI_DIR`.

To run fine-tuning on XNLI, note that the language is hard-coded into
`run_classifier.py` (Chinese by default), so please modify `XnliProcessor` if
you want to run on another language.

This is a large dataset, so training will take a few hours on a GPU (or about 30
minutes on a Cloud TPU). To run an experiment quickly for debugging, just set
`num_train_epochs` to a small value like `0.1`.

```shell
export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12 # or multilingual_L-12_H-768_A-12
export XNLI_DIR=/path/to/xnli

python run_classifier.py \
  --task_name=XNLI \
  --do_train=true \
  --do_eval=true \
  --data_dir=$XNLI_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=/tmp/xnli_output/
```

With the Chinese-only model, the results should look something like this:

```
 ***** Eval results *****
eval_accuracy = 0.774116
eval_loss = 0.83554
global_step = 24543
loss = 0.74603
```

## Details

### Data Source and Sampling

The languages chosen were the
[top 100 languages with the largest Wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).
The entire Wikipedia dump for each language (excluding user and talk pages) was
taken as the training data.

However, the size of the Wikipedia for a given language varies greatly, and
therefore low-resource languages may be "under-represented" in terms of the
neural network model (under the assumption that languages are "competing" for
limited model capacity to some extent). At the same time, we also don't want to
overfit the model by performing thousands of epochs over a tiny Wikipedia for a
particular language.

To balance these two factors, we performed exponentially smoothed weighting of
the data during pre-training data creation (and WordPiece vocab creation). In
other words, let's say that the probability of a language is *P(L)*, e.g.,
*P(English) = 0.21* means that after concatenating all of the Wikipedias
together, 21% of our data is English. We exponentiate each probability by some
factor *S* and then re-normalize, and sample from that distribution. In our case
we use *S=0.7*. So, high-resource languages like English will be under-sampled,
and low-resource languages like Icelandic will be over-sampled. E.g., in the
original distribution English would be sampled 1000x more than Icelandic, but
after smoothing it's only sampled 100x more.
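
As a quick illustration of the exponent-and-renormalize step, here is a small
sketch. The language shares below are made up for the example, not the real
corpus statistics:

```python
def smoothed_probs(lang_probs, s=0.7):
  """Raise each language's share of the data to the power s, then re-normalize."""
  exponentiated = {lang: p ** s for lang, p in lang_probs.items()}
  total = sum(exponentiated.values())
  return {lang: p / total for lang, p in exponentiated.items()}

# Hypothetical shares: English is 1000x larger than Icelandic before smoothing.
probs = smoothed_probs({"en": 0.21, "is": 0.00021, "rest": 0.78979})
# After smoothing with S=0.7 the gap shrinks to about 1000**0.7 ~= 126x,
# i.e. roughly the "only sampled 100x more" figure mentioned above.
print(probs["en"] / probs["is"])
```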

### Tokenization

For tokenization, we use a 110k shared WordPiece vocabulary. The word counts are
weighted the same way as the data, so low-resource languages are upweighted by
some factor. We intentionally do *not* use any marker to denote the input
language (so that zero-shot training can work).

Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace
characters, we add spaces around every character in the
[CJK Unicode range](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_\(Unicode_block\))
before applying WordPiece. This means that Chinese is effectively
character-tokenized. Note that the CJK Unicode block only includes
Chinese-origin characters and does *not* include Hangul Korean or
Katakana/Hiragana Japanese, which are tokenized with whitespace+WordPiece like
all other languages.
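
The character-splitting step looks roughly like the sketch below. This is a
simplification: the real `BasicTokenizer` in `tokenization.py` checks several
CJK code-point ranges and edge cases, so treat it as illustrative rather than
the actual implementation.

```python
def add_spaces_around_cjk(text):
  """Simplified sketch: put spaces around CJK ideographs before WordPiece."""
  output = []
  for char in text:
    cp = ord(char)
    # Main CJK Unified Ideographs block only; the real check covers more ranges.
    if 0x4E00 <= cp <= 0x9FFF:
      output.append(" " + char + " ")
    else:
      output.append(char)
  return "".join(output)

# Each ideograph becomes its own whitespace-delimited token; Latin text is untouched.
print(add_spaces_around_cjk(u"BERT是一个语言模型"))
```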

For all other languages, we apply the
[same recipe as English](https://github.com/google-research/bert#tokenization):
(a) lower casing+accent removal, (b) punctuation splitting, (c) whitespace
tokenization. We understand that accent markers have substantial meaning in some
languages, but felt that the benefits of reducing the effective vocabulary make
up for this. Generally the strong contextual models of BERT should make up for
any ambiguity introduced by stripping accent markers.
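
Step (a) can be sketched with standard Unicode normalization, in the spirit of
what `BasicTokenizer` does when `do_lower_case=True`. This is illustrative
only, not a copy of the repository code:

```python
import unicodedata

def lower_and_strip_accents(text):
  """Lower-case, then drop combining accent marks via NFD decomposition."""
  text = text.lower()
  decomposed = unicodedata.normalize("NFD", text)
  # Characters in Unicode category "Mn" are the combining accent marks.
  return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(lower_and_strip_accents(u"Émile über São Paulo"))  # -> "emile uber sao paulo"
```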

### List of Languages

The multilingual model supports the following languages. These languages were
chosen because they are the top 100 languages with the largest Wikipedias:

*   Afrikaans
*   Albanian
*   Arabic
*   Aragonese
*   Armenian
*   Asturian
*   Azerbaijani
*   Bashkir
*   Basque
*   Bavarian
*   Belarusian
*   Bengali
*   Bishnupriya Manipuri
*   Bosnian
*   Breton
*   Bulgarian
*   Burmese
*   Catalan
*   Cebuano
*   Chechen
*   Chinese (Simplified)
*   Chinese (Traditional)
*   Chuvash
*   Croatian
*   Czech
*   Danish
*   Dutch
*   English
*   Estonian
*   Finnish
*   French
*   Galician
*   Georgian
*   German
*   Greek
*   Gujarati
*   Haitian
*   Hebrew
*   Hindi
*   Hungarian
*   Icelandic
*   Ido
*   Indonesian
*   Irish
*   Italian
*   Japanese
*   Javanese
*   Kannada
*   Kazakh
*   Kirghiz
*   Korean
*   Latin
*   Latvian
*   Lithuanian
*   Lombard
*   Low Saxon
*   Luxembourgish
*   Macedonian
*   Malagasy
*   Malay
*   Malayalam
*   Marathi
*   Minangkabau
*   Nepali
*   Newar
*   Norwegian (Bokmal)
*   Norwegian (Nynorsk)
*   Occitan
*   Persian (Farsi)
*   Piedmontese
*   Polish
*   Portuguese
*   Punjabi
*   Romanian
*   Russian
*   Scots
*   Serbian
*   Serbo-Croatian
*   Sicilian
*   Slovak
*   Slovenian
*   South Azerbaijani
*   Spanish
*   Sundanese
*   Swahili
*   Swedish
*   Tagalog
*   Tajik
*   Tamil
*   Tatar
*   Telugu
*   Turkish
*   Ukrainian
*   Urdu
*   Uzbek
*   Vietnamese
*   Volapük
*   Waray-Waray
*   Welsh
*   West Frisian
*   Western Punjabi
*   Yoruba

The **Multilingual Cased (New)** release additionally contains **Thai** and
**Mongolian**, which were not included in the original release.