## Models

There are two multilingual models currently available. We do not plan to release
more single-language models, but we may release `BERT-Large` versions of these
two in the future:

*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
    102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
    Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
    parameters

**The `Multilingual Cased (New)` model also fixes normalization issues in many
languages, so it is recommended in languages with non-Latin alphabets (and is
often better for most languages with Latin alphabets). When using this model,
make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
scripts.**
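
If you drive the tokenizer directly from Python rather than through the
command-line scripts, the same setting applies via the `do_lower_case`
argument. A minimal sketch (the vocab path is a placeholder for wherever you
unpacked the checkpoint):

```python
import tokenization  # tokenization.py from this repository

# The Cased model ships a cased vocabulary, so lower casing and accent
# stripping must stay disabled -- the Python-side equivalent of passing
# --do_lower_case=false to the scripts.
tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=False)
```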

See the [list of languages](#list-of-languages) that the Multilingual model
supports. The Multilingual model does include Chinese (and English), but if your
fine-tuning data is Chinese-only, then the Chinese model will likely produce
better results.

## Results

To evaluate these systems, we use the
[XNLI dataset](https://github.com/facebookresearch/XNLI), which is a version of
[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) where the dev and test
sets have been translated (by humans) into 15 languages. Note that the training
set was *machine* translated (we used the translations provided by XNLI, not
Google NMT). For clarity, we only report on 6 languages below:

<!-- mdformat off(no table) -->

| System                          | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
| XNLI Baseline - Translate Test  | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
| BERT - Translate Train Cased    | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6     |
| BERT - Translate Train Uncased  | 81.4     | 74.2     | 77.3     | 75.2     | 70.5     | 61.7     |
| BERT - Translate Test Uncased   | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
| BERT - Zero Shot Uncased        | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |

<!-- mdformat on -->

The first two rows are baselines from the XNLI paper and the last four rows are
our results with BERT.

**Translate Train** means that the MultiNLI training set was machine translated
from English into the foreign language. So training and evaluation were both
done in the foreign language. Unfortunately, training was done on
machine-translated data, so it is impossible to quantify how much of the lower
accuracy (compared to English) is due to the quality of the machine translation
vs. the quality of the pre-trained model.

**Translate Test** means that the XNLI test set was machine translated from the
foreign language into English. So training and evaluation were both done on
English. However, test evaluation was done on machine-translated English, so the
accuracy depends on the quality of the machine translation system.

**Zero Shot** means that the Multilingual BERT system was fine-tuned on English
MultiNLI, and then evaluated on the foreign language XNLI test. In this case,
machine translation was not involved at all in either the pre-training or
fine-tuning.

Note that the English result is worse than the 84.2 MultiNLI baseline because
this training used Multilingual BERT rather than English-only BERT. This implies
that for high-resource languages, the Multilingual model is somewhat worse than
a single-language model. However, it is not feasible for us to train and
maintain dozens of single-language models. Therefore, if your goal is to
maximize performance with a language other than English or Chinese, you might
find it beneficial to run pre-training for additional steps starting from our
Multilingual model on data from your language of interest.

Here is a comparison of training Chinese models with the Multilingual
`BERT-Base` and Chinese-only `BERT-Base`:

System                  | Chinese
----------------------- | -------
XNLI Baseline           | 67.0
BERT Multilingual Model | 74.2
BERT Chinese-only Model | 77.2

Similar to English, the single-language model does 3% better than the
Multilingual model.

## Fine-tuning Example

The multilingual model does **not** require any special consideration or API
changes. We did update the implementation of `BasicTokenizer` in
`tokenization.py` to support Chinese character tokenization, so please update if
you forked it. However, we did not change the tokenization API.
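
Concretely, the same tokenizer call that works for English works unchanged on
Chinese text. A rough sketch (the vocab path is a placeholder, and the exact
WordPieces you get back depend on the vocabulary you load):

```python
import tokenization  # tokenization.py from this repository

# Placeholder path; point this at the vocab.txt of the checkpoint you use,
# with the matching do_lower_case setting.
tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False)

# BasicTokenizer splits CJK characters into individual tokens, so a Chinese
# sentence with no whitespace still tokenizes without any special handling.
print(tokenizer.tokenize(u"机器学习"))
```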

To test the new models, we did modify `run_classifier.py` to add support for the
[XNLI dataset](https://github.com/facebookresearch/XNLI). This is a 15-language
version of MultiNLI where the dev/test sets have been human-translated, and the
training set has been machine-translated.

To run the fine-tuning code, please download the
[XNLI dev/test set](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip) and the
[XNLI machine-translated training set](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
and then unpack both .zip files into some directory `$XNLI_DIR`.

To run fine-tuning on XNLI, note that the language is hard-coded into
`run_classifier.py` (Chinese by default), so please modify `XnliProcessor` if
you want to run on another language.

This is a large dataset, so training will take a few hours on a GPU (or about 30
minutes on a Cloud TPU). To run an experiment quickly for debugging, just set
`num_train_epochs` to a small value like `0.1`.

```shell
export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12 # or multilingual_L-12_H-768_A-12
export XNLI_DIR=/path/to/xnli

python run_classifier.py \
  --task_name=XNLI \
  --do_train=true \
  --do_eval=true \
  --data_dir=$XNLI_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=/tmp/xnli_output/
```

With the Chinese-only model, the results should look something like this:

```
 ***** Eval results *****
eval_accuracy = 0.774116
eval_loss = 0.83554
global_step = 24543
loss = 0.74603
```

## Details

### Data Source and Sampling

The languages chosen were the
[top 100 languages with the largest Wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).
The entire Wikipedia dump for each language (excluding user and talk pages) was
taken as the training data.

However, the size of the Wikipedia for a given language varies greatly, and
therefore low-resource languages may be "under-represented" in terms of the
neural network model (under the assumption that languages are "competing" for
limited model capacity to some extent). At the same time, we also don't want to
overfit the model by performing thousands of epochs over a tiny Wikipedia for a
particular language.

To balance these two factors, we performed exponentially smoothed weighting of
the data during pre-training data creation (and WordPiece vocab creation). In
other words, let's say that the probability of a language is *P(L)*, e.g.,
*P(English) = 0.21* means that after concatenating all of the Wikipedias
together, 21% of our data is English. We exponentiate each probability by some
factor *S* and then re-normalize, and sample from that distribution. In our case
we use *S=0.7*. So, high-resource languages like English will be under-sampled,
and low-resource languages like Icelandic will be over-sampled. E.g., in the
original distribution English would be sampled 1000x more than Icelandic, but
after smoothing it's only sampled 100x more.
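
As a quick illustration of the exponent-and-renormalize step, here is a small
sketch. The language shares below are made up for the example, not the real
corpus statistics:

```python
def smoothed_probs(lang_probs, s=0.7):
  """Raise each language's share of the data to the power s, then re-normalize."""
  exponentiated = {lang: p ** s for lang, p in lang_probs.items()}
  total = sum(exponentiated.values())
  return {lang: p / total for lang, p in exponentiated.items()}

# Hypothetical shares: English is 1000x larger than Icelandic before smoothing.
probs = smoothed_probs({"en": 0.21, "is": 0.00021, "rest": 0.78979})
# After smoothing with S=0.7 the gap shrinks to about 1000**0.7 ~= 126x,
# i.e. roughly the "only sampled 100x more" figure mentioned above.
print(probs["en"] / probs["is"])
```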

### Tokenization

For tokenization, we use a 110k shared WordPiece vocabulary. The word counts are
weighted the same way as the data, so low-resource languages are upweighted by
some factor. We intentionally do *not* use any marker to denote the input
language (so that zero-shot training can work).

Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace
characters, we add spaces around every character in the
[CJK Unicode range](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_\(Unicode_block\))
before applying WordPiece. This means that Chinese is effectively
character-tokenized. Note that the CJK Unicode block only includes
Chinese-origin characters and does *not* include Hangul Korean or
Katakana/Hiragana Japanese, which are tokenized with whitespace+WordPiece like
all other languages.
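
The character-splitting step looks roughly like the sketch below. This is a
simplification: the real `BasicTokenizer` in `tokenization.py` checks several
CJK code-point ranges and edge cases, so treat it as illustrative rather than
the actual implementation.

```python
def add_spaces_around_cjk(text):
  """Simplified sketch: put spaces around CJK ideographs before WordPiece."""
  output = []
  for char in text:
    cp = ord(char)
    # Main CJK Unified Ideographs block only; the real check covers more ranges.
    if 0x4E00 <= cp <= 0x9FFF:
      output.append(" " + char + " ")
    else:
      output.append(char)
  return "".join(output)

# Each ideograph becomes its own whitespace-delimited token; Latin text is untouched.
print(add_spaces_around_cjk(u"BERT是一个语言模型"))
```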

For all other languages, we apply the
[same recipe as English](https://github.com/google-research/bert#tokenization):
(a) lower casing+accent removal, (b) punctuation splitting, (c) whitespace
tokenization. We understand that accent markers have substantial meaning in some
languages, but felt that the benefits of reducing the effective vocabulary make
up for this. Generally the strong contextual models of BERT should make up for
any ambiguity introduced by stripping accent markers.
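
Step (a) can be sketched with standard Unicode normalization, in the spirit of
what `BasicTokenizer` does when `do_lower_case=True`. This is illustrative
only, not a copy of the repository code:

```python
import unicodedata

def lower_and_strip_accents(text):
  """Lower-case, then drop combining accent marks via NFD decomposition."""
  text = text.lower()
  decomposed = unicodedata.normalize("NFD", text)
  # Characters in Unicode category "Mn" are the combining accent marks.
  return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(lower_and_strip_accents(u"Émile über São Paulo"))  # -> "emile uber sao paulo"
```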

### List of Languages

The multilingual model supports the following languages. These languages were
chosen because they are the top 100 languages with the largest Wikipedias:

*   Afrikaans
*   Albanian
*   Arabic
*   Aragonese
*   Armenian
*   Asturian
*   Azerbaijani
*   Bashkir
*   Basque
*   Bavarian
*   Belarusian
*   Bengali
*   Bishnupriya Manipuri
*   Bosnian
*   Breton
*   Bulgarian
*   Burmese
*   Catalan
*   Cebuano
*   Chechen
*   Chinese (Simplified)
*   Chinese (Traditional)
*   Chuvash
*   Croatian
*   Czech
*   Danish
*   Dutch
*   English
*   Estonian
*   Finnish
*   French
*   Galician
*   Georgian
*   German
*   Greek
*   Gujarati
*   Haitian
*   Hebrew
*   Hindi
*   Hungarian
*   Icelandic
*   Ido
*   Indonesian
*   Irish
*   Italian
*   Japanese
*   Javanese
*   Kannada
*   Kazakh
*   Kirghiz
*   Korean
*   Latin
*   Latvian
*   Lithuanian
*   Lombard
*   Low Saxon
*   Luxembourgish
*   Macedonian
*   Malagasy
*   Malay
*   Malayalam
*   Marathi
*   Minangkabau
*   Nepali
*   Newar
*   Norwegian (Bokmal)
*   Norwegian (Nynorsk)
*   Occitan
*   Persian (Farsi)
*   Piedmontese
*   Polish
*   Portuguese
*   Punjabi
*   Romanian
*   Russian
*   Scots
*   Serbian
*   Serbo-Croatian
*   Sicilian
*   Slovak
*   Slovenian
*   South Azerbaijani
*   Spanish
*   Sundanese
*   Swahili
*   Swedish
*   Tagalog
*   Tajik
*   Tamil
*   Tatar
*   Telugu
*   Turkish
*   Ukrainian
*   Urdu
*   Uzbek
*   Vietnamese
*   Volapük
*   Waray-Waray
*   Welsh
*   West Frisian
*   Western Punjabi
*   Yoruba

The **Multilingual Cased (New)** release additionally contains **Thai** and
**Mongolian**, which were not included in the original release.