|
# BERT |
|
|
|
**\*\*\*\*\* New March 11th, 2020: Smaller BERT Models \*\*\*\*\*** |
|
|
|
This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). |
|
|
|
We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. |
|
|
|
Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity. |
|
|
|
You can download all 24 from [here][all], or individually from the table below: |
|
|
|
| |H=128|H=256|H=512|H=768| |
|
|---|:---:|:---:|:---:|:---:| |
|
| **L=2** |[**2/128 (BERT-Tiny)**][2_128]|[2/256][2_256]|[2/512][2_512]|[2/768][2_768]| |
|
| **L=4** |[4/128][4_128]|[**4/256 (BERT-Mini)**][4_256]|[**4/512 (BERT-Small)**][4_512]|[4/768][4_768]| |
|
| **L=6** |[6/128][6_128]|[6/256][6_256]|[6/512][6_512]|[6/768][6_768]| |
|
| **L=8** |[8/128][8_128]|[8/256][8_256]|[**8/512 (BERT-Medium)**][8_512]|[8/768][8_768]| |
|
| **L=10** |[10/128][10_128]|[10/256][10_256]|[10/512][10_512]|[10/768][10_768]| |
|
| **L=12** |[12/128][12_128]|[12/256][12_256]|[12/512][12_512]|[**12/768 (BERT-Base)**][12_768]| |
|
|
|
Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model. |
|
|
|
Here are the corresponding GLUE scores on the test set: |
|
|
|
|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX| |
|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0| |
|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1| |
|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6| |
|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5| |
|
|
|
For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs: |
|
- batch sizes: 8, 16, 32, 64, 128 |
|
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5 |
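
For illustration, here is a minimal sketch of that grid search; `finetune_and_eval` is a hypothetical callable (not part of this repository) that wraps one `run_classifier.py`-style fine-tuning run and returns a dev-set score:

```python
import itertools

# Hyperparameter grid used to pick the best fine-tuning configuration per task.
BATCH_SIZES = [8, 16, 32, 64, 128]
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 3e-5]
NUM_EPOCHS = 4

def select_best_config(finetune_and_eval):
  """Tries every (batch size, learning rate) pair and keeps the best dev score.

  `finetune_and_eval` is a hypothetical callable that fine-tunes one model and
  returns a dev-set metric; it is not part of this repository.
  """
  best_score, best_config = float("-inf"), None
  for batch_size, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
    score = finetune_and_eval(batch_size=batch_size, learning_rate=lr,
                              num_train_epochs=NUM_EPOCHS)
    if score > best_score:
      best_score, best_config = score, (batch_size, lr)
  return best_config, best_score
```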
|
|
|
If you use these models, please cite the following paper: |
|
|
|
``` |
|
@article{turc2019, |
|
title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models}, |
|
author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, |
|
  journal={arXiv preprint arXiv:1908.08962v2},
|
year={2019} |
|
} |
|
``` |
|
|
|
[2_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip |
|
[2_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-256_A-4.zip |
|
[2_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-512_A-8.zip |
|
[2_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-768_A-12.zip |
|
[4_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-128_A-2.zip |
|
[4_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-256_A-4.zip |
|
[4_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip |
|
[4_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-768_A-12.zip |
|
[6_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-128_A-2.zip |
|
[6_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-256_A-4.zip |
|
[6_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-512_A-8.zip |
|
[6_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-768_A-12.zip |
|
[8_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-128_A-2.zip |
|
[8_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-256_A-4.zip |
|
[8_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-512_A-8.zip |
|
[8_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-768_A-12.zip |
|
[10_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-128_A-2.zip |
|
[10_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-256_A-4.zip |
|
[10_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-512_A-8.zip |
|
[10_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-768_A-12.zip |
|
[12_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-128_A-2.zip |
|
[12_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-256_A-4.zip |
|
[12_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-512_A-8.zip |
|
[12_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip |
|
[all]: https://storage.googleapis.com/bert_models/2020_02_20/all_bert_models.zip |
|
|
|
**\*\*\*\*\* New May 31st, 2019: Whole Word Masking Models \*\*\*\*\*** |
|
|
|
This is a release of several new models which were the result of an improvement in
|
the pre-processing code. |
|
|
|
In the original pre-processing code, we randomly select WordPiece tokens to |
|
mask. For example: |
|
|
|
`Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head` |
|
`Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil |
|
[MASK] ##mon ' s head` |
|
|
|
The new technique is called Whole Word Masking. In this case, we always mask |
|
*all* of the tokens corresponding to a word at once. The overall masking
|
rate remains the same. |
|
|
|
`Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] |
|
[MASK] ' s head` |
|
|
|
The training is identical -- we still predict each masked WordPiece token |
|
independently. The improvement comes from the fact that the original prediction |
|
task was too 'easy' for words that had been split into multiple WordPieces. |
|
|
|
This can be enabled during data generation by passing the flag |
|
`--do_whole_word_mask=True` to `create_pretraining_data.py`. |
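
For intuition only, here is a toy sketch of the difference (this is not the repository's implementation): WordPieces beginning with `##` are grouped with the preceding token, and each selected word has all of its pieces masked together.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
  """Toy whole-word masking over an already WordPiece-tokenized sequence."""
  rng = random.Random(seed)
  # Group WordPiece indices into whole words: a piece starting with "##"
  # belongs to the same word as the piece before it.
  words = []
  for i, tok in enumerate(tokens):
    if tok.startswith("##") and words:
      words[-1].append(i)
    else:
      words.append([i])
  output = list(tokens)
  for word in words:
    if rng.random() < mask_prob:
      for i in word:  # mask *all* pieces of the chosen word
        output[i] = "[MASK]"
  return output

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(whole_word_mask(tokens, mask_prob=0.3))
```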
|
|
|
Pre-trained models with Whole Word Masking are linked below. The data and |
|
training were otherwise identical, and the models have identical structure and |
|
vocab to the original models. We only include BERT-Large models. When using |
|
these models, please make it clear in the paper that you are using the Whole |
|
Word Masking variant of BERT-Large. |
|
|
|
* **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)**: |
|
24-layer, 1024-hidden, 16-heads, 340M parameters |
|
|
|
* **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip)**: |
|
24-layer, 1024-hidden, 16-heads, 340M parameters |
|
|
|
Model | SQUAD 1.1 F1/EM | Multi NLI Accuracy |
|
---------------------------------------- | :-------------: | :----------------: |
|
BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05 |
|
BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7 | 87.07 |
|
BERT-Large, Cased (Original) | 91.5/84.8 | 86.09 |
|
BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46 |
|
|
|
**\*\*\*\*\* New February 7th, 2019: TfHub Module \*\*\*\*\*** |
|
|
|
BERT has been uploaded to [TensorFlow Hub](https://tfhub.dev). See |
|
`run_classifier_with_tfhub.py` for an example of how to use the TF Hub module, |
|
or run an example in the browser on |
|
[Colab](https://colab.sandbox.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb). |
|
|
|
**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai + |
|
Mongolian \*\*\*\*\*** |
|
|
|
We uploaded a new multilingual model which does *not* perform any normalization |
|
on the input (no lower casing, accent stripping, or Unicode normalization), and |
|
additionally includes Thai and Mongolian.
|
|
|
**It is recommended to use this version for developing multilingual models, |
|
especially on languages with non-Latin alphabets.** |
|
|
|
This does not require any code changes, and can be downloaded here: |
|
|
|
* **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**: |
|
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
|
|
|
**\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\*** |
|
|
|
We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is |
|
currently 1st place on the leaderboard by 3%. See the SQuAD 2.0 section of the |
|
README for details. |
|
|
|
**\*\*\*\*\* New November 5th, 2018: Third-party PyTorch and Chainer versions of |
|
BERT available \*\*\*\*\*** |
|
|
|
NLP researchers from HuggingFace made a |
|
[PyTorch version of BERT available](https://github.com/huggingface/pytorch-pretrained-BERT) |
|
which is compatible with our pre-trained checkpoints and is able to reproduce |
|
our results. Sosuke Kobayashi also made a |
|
[Chainer version of BERT available](https://github.com/soskek/bert-chainer) |
|
(Thanks!) We were not involved in the creation or maintenance of the PyTorch |
|
implementation so please direct any questions towards the authors of that |
|
repository. |
|
|
|
**\*\*\*\*\* New November 3rd, 2018: Multilingual and Chinese models available |
|
\*\*\*\*\*** |
|
|
|
We have made two new BERT models available: |
|
|
|
* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) |
|
(Not recommended, use `Multilingual Cased` instead)**: 102 languages, |
|
12-layer, 768-hidden, 12-heads, 110M parameters |
|
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**: |
|
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M |
|
parameters |
|
|
|
We use character-based tokenization for Chinese, and WordPiece tokenization for |
|
all other languages. Both models should work out-of-the-box without any code |
|
changes. We did update the implementation of `BasicTokenizer` in |
|
`tokenization.py` to support Chinese character tokenization, so please update if |
|
you forked it. However, we did not change the tokenization API. |
|
|
|
For more, see the |
|
[Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md). |
|
|
|
**\*\*\*\*\* End new information \*\*\*\*\*** |
|
|
|
## Introduction |
|
|
|
**BERT**, or **B**idirectional **E**ncoder **R**epresentations from |
|
**T**ransformers, is a new method of pre-training language representations which |
|
obtains state-of-the-art results on a wide array of Natural Language Processing |
|
(NLP) tasks. |
|
|
|
Our academic paper which describes BERT in detail and provides full results on a |
|
number of tasks can be found here: |
|
[https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805). |
|
|
|
To give a few numbers, here are the results on the |
|
[SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) question answering |
|
task: |
|
|
|
SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 |
|
------------------------------------- | :------: | :------: |
|
1st Place Ensemble - BERT | **87.4** | **93.2** |
|
2nd Place Ensemble - nlnet | 86.0 | 91.7 |
|
1st Place Single Model - BERT | **85.1** | **91.8** |
|
2nd Place Single Model - nlnet | 83.5 | 90.1 |
|
|
|
And several natural language inference tasks: |
|
|
|
System | MultiNLI | Question NLI | SWAG |
|
----------------------- | :------: | :----------: | :------: |
|
BERT | **86.7** | **91.1** | **86.3** |
|
OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 |
|
|
|
Plus many other tasks. |
|
|
|
Moreover, these results were all obtained with almost no task-specific neural |
|
network architecture design. |
|
|
|
If you already know what BERT is and you just want to get started, you can |
|
[download the pre-trained models](#pre-trained-models) and |
|
[run a state-of-the-art fine-tuning](#fine-tuning-with-bert) in only a few |
|
minutes. |
|
|
|
## What is BERT? |
|
|
|
BERT is a method of pre-training language representations, meaning that we train |
|
a general-purpose "language understanding" model on a large text corpus (like |
|
Wikipedia), and then use that model for downstream NLP tasks that we care about |
|
(like question answering). BERT outperforms previous methods because it is the |
|
first *unsupervised*, *deeply bidirectional* system for pre-training NLP. |
|
|
|
*Unsupervised* means that BERT was trained using only a plain text corpus, which |
|
is important because an enormous amount of plain text data is publicly available |
|
on the web in many languages. |
|
|
|
Pre-trained representations can also either be *context-free* or *contextual*, |
|
and contextual representations can further be *unidirectional* or |
|
*bidirectional*. Context-free models such as |
|
[word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) or |
|
[GloVe](https://nlp.stanford.edu/projects/glove/) generate a single "word |
|
embedding" representation for each word in the vocabulary, so `bank` would have |
|
the same representation in `bank deposit` and `river bank`. Contextual models |
|
instead generate a representation of each word that is based on the other words |
|
in the sentence. |
|
|
|
BERT was built upon recent work in pre-training contextual representations — |
|
including [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), |
|
[Generative Pre-Training](https://blog.openai.com/language-unsupervised/), |
|
[ELMo](https://allennlp.org/elmo), and |
|
[ULMFit](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html) |
|
— but crucially these models are all *unidirectional* or *shallowly |
|
bidirectional*. This means that each word is only contextualized using the words |
|
to its left (or right). For example, in the sentence `I made a bank deposit` the |
|
unidirectional representation of `bank` is only based on `I made a` but not |
|
`deposit`. Some previous work does combine the representations from separate |
|
left-context and right-context models, but only in a "shallow" manner. BERT |
|
represents "bank" using both its left and right context — `I made a ... deposit` |
|
— starting from the very bottom of a deep neural network, so it is *deeply |
|
bidirectional*. |
|
|
|
BERT uses a simple approach for this: We mask out 15% of the words in the input, |
|
run the entire sequence through a deep bidirectional |
|
[Transformer](https://arxiv.org/abs/1706.03762) encoder, and then predict only |
|
the masked words. For example: |
|
|
|
``` |
|
Input: the man went to the [MASK1] . he bought a [MASK2] of milk. |
|
Labels: [MASK1] = store; [MASK2] = gallon |
|
``` |
|
|
|
In order to learn relationships between sentences, we also train on a simple |
|
task which can be generated from any monolingual corpus: Given two sentences `A` |
|
and `B`, is `B` the actual next sentence that comes after `A`, or just a random |
|
sentence from the corpus? |
|
|
|
``` |
|
Sentence A: the man went to the store . |
|
Sentence B: he bought a gallon of milk . |
|
Label: IsNextSentence |
|
``` |
|
|
|
``` |
|
Sentence A: the man went to the store . |
|
Sentence B: penguins are flightless . |
|
Label: NotNextSentence |
|
``` |
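
As a toy illustration of how such sentence pairs can be generated from an ordered list of sentences (this is not the logic used by `create_pretraining_data.py`):

```python
import random

def make_nsp_examples(sentences, seed=0):
  """Builds (sentence_a, sentence_b, label) triples for next-sentence prediction."""
  rng = random.Random(seed)
  examples = []
  for i in range(len(sentences) - 1):
    if rng.random() < 0.5:
      examples.append((sentences[i], sentences[i + 1], "IsNextSentence"))
    else:
      # A real implementation would also avoid sampling the true next sentence.
      random_b = sentences[rng.randrange(len(sentences))]
      examples.append((sentences[i], random_b, "NotNextSentence"))
  return examples

corpus = ["the man went to the store .",
          "he bought a gallon of milk .",
          "penguins are flightless ."]
for a, b, label in make_nsp_examples(corpus):
  print(a, "|", b, "->", label)
```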
|
|
|
We then train a large model (12-layer to 24-layer Transformer) on a large corpus |
|
(Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long time (1M |
|
update steps), and that's BERT. |
|
|
|
Using BERT has two stages: *Pre-training* and *fine-tuning*. |
|
|
|
**Pre-training** is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a |
|
one-time procedure for each language (current models are English-only, but |
|
multilingual models will be released in the near future). We are releasing a |
|
number of pre-trained models from the paper which were pre-trained at Google. |
|
Most NLP researchers will never need to pre-train their own model from scratch. |
|
|
|
**Fine-tuning** is inexpensive. All of the results in the paper can be |
|
replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, |
|
starting from the exact same pre-trained model. SQuAD, for example, can be |
|
trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of |
|
91.0%, which is the single system state-of-the-art. |
|
|
|
The other important aspect of BERT is that it can be adapted to many types of |
|
NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on |
|
sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level |
|
(e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific |
|
modifications. |
|
|
|
## What has been released in this repository? |
|
|
|
We are releasing the following: |
|
|
|
* TensorFlow code for the BERT model architecture (which is mostly a standard |
|
[Transformer](https://arxiv.org/abs/1706.03762) architecture). |
|
* Pre-trained checkpoints for both the lowercase and cased version of |
|
`BERT-Base` and `BERT-Large` from the paper. |
|
* TensorFlow code for push-button replication of the most important |
|
fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC. |
|
|
|
All of the code in this repository works out-of-the-box with CPU, GPU, and Cloud |
|
TPU. |
|
|
|
## Pre-trained models |
|
|
|
We are releasing the `BERT-Base` and `BERT-Large` models from the paper. |
|
`Uncased` means that the text has been lowercased before WordPiece tokenization, |
|
e.g., `John Smith` becomes `john smith`. The `Uncased` model also strips out any |
|
accent markers. `Cased` means that the true case and accent markers are |
|
preserved. Typically, the `Uncased` model is better unless you know that case |
|
information is important for your task (e.g., Named Entity Recognition or |
|
Part-of-Speech tagging). |
|
|
|
These models are all released under the same license as the source code (Apache |
|
2.0). |
|
|
|
For information about the Multilingual and Chinese model, see the |
|
[Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md). |
|
|
|
**When using a cased model, make sure to pass `--do_lower_case=False` to the training
|
scripts. (Or pass `do_lower_case=False` directly to `FullTokenizer` if you're |
|
using your own script.)** |
|
|
|
The links to the models are here (right-click, 'Save link as...' on the name): |
|
|
|
* **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)**: |
|
24-layer, 1024-hidden, 16-heads, 340M parameters |
|
* **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip)**: |
|
24-layer, 1024-hidden, 16-heads, 340M parameters |
|
* **[`BERT-Base, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)**: |
|
12-layer, 768-hidden, 12-heads, 110M parameters |
|
* **[`BERT-Large, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)**: |
|
24-layer, 1024-hidden, 16-heads, 340M parameters |
|
* **[`BERT-Base, Cased`](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip)**: |
|
12-layer, 768-hidden, 12-heads , 110M parameters |
|
* **[`BERT-Large, Cased`](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip)**: |
|
24-layer, 1024-hidden, 16-heads, 340M parameters |
|
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**: |
|
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
|
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) |
|
(Not recommended, use `Multilingual Cased` instead)**: 102 languages, |
|
12-layer, 768-hidden, 12-heads, 110M parameters |
|
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**: |
|
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M |
|
parameters |
|
|
|
Each .zip file contains three items: |
|
|
|
* A TensorFlow checkpoint (`bert_model.ckpt`) containing the pre-trained |
|
weights (which is actually 3 files). |
|
* A vocab file (`vocab.txt`) to map WordPiece to word id. |
|
* A config file (`bert_config.json`) which specifies the hyperparameters of |
|
the model. |
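
As a quick sanity check after unzipping, the sketch below inspects all three files; it assumes the archive was extracted to `uncased_L-12_H-768_A-12/` and that TensorFlow 1.x (which this repository targets) is installed.

```python
import json
import os

import tensorflow as tf

model_dir = "uncased_L-12_H-768_A-12"

# The config file holds the architecture hyperparameters.
with open(os.path.join(model_dir, "bert_config.json")) as f:
  config = json.load(f)
print("layers:", config["num_hidden_layers"], "hidden size:", config["hidden_size"])

# The vocab file has one WordPiece per line; the line number is the token id.
with open(os.path.join(model_dir, "vocab.txt"), encoding="utf-8") as f:
  vocab = [line.strip() for line in f]
print("vocab size:", len(vocab), "(should match config['vocab_size'])")

# The "checkpoint" is really three files sharing the bert_model.ckpt prefix.
for name, shape in tf.train.list_variables(os.path.join(model_dir, "bert_model.ckpt")):
  if name.startswith("bert/embeddings"):
    print(name, shape)
```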
|
|
|
## Fine-tuning with BERT |
|
|
|
**Important**: All results on the paper were fine-tuned on a single Cloud TPU, |
|
which has 64GB of RAM. It is currently not possible to re-produce most of the |
|
`BERT-Large` results on the paper using a GPU with 12GB - 16GB of RAM, because |
|
the maximum batch size that can fit in memory is too small. We are working on |
|
adding code to this repository which allows for much larger effective batch size |
|
on the GPU. See the section on [out-of-memory issues](#out-of-memory-issues) for |
|
more details. |
|
|
|
This code was tested with TensorFlow 1.11.0. It was tested with Python2 and |
|
Python3 (but more thoroughly with Python2, since this is what's used internally |
|
in Google). |
|
|
|
The fine-tuning examples which use `BERT-Base` should be able to run on a GPU |
|
that has at least 12GB of RAM using the hyperparameters given. |
|
|
|
### Fine-tuning with Cloud TPUs |
|
|
|
Most of the examples below assume that you will be running training/evaluation
|
on your local machine, using a GPU like a Titan X or GTX 1080. |
|
|
|
However, if you have access to a Cloud TPU that you want to train on, just add |
|
the following flags to `run_classifier.py` or `run_squad.py`: |
|
|
|
``` |
|
--use_tpu=True \ |
|
--tpu_name=$TPU_NAME |
|
``` |
|
|
|
Please see the |
|
[Google Cloud TPU tutorial](https://cloud.google.com/tpu/docs/tutorials/mnist) |
|
for how to use Cloud TPUs. Alternatively, you can use the Google Colab notebook |
|
"[BERT FineTuning with Cloud TPUs](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)". |
|
|
|
On Cloud TPUs, the pretrained model and the output directory will need to be on |
|
Google Cloud Storage. For example, if you have a bucket named `some_bucket`, you |
|
might use the following flags instead: |
|
|
|
``` |
|
--output_dir=gs://some_bucket/my_output_dir/ |
|
``` |
|
|
|
The unzipped pre-trained model files can also be found in the Google Cloud |
|
Storage folder `gs://bert_models/2018_10_18`. For example: |
|
|
|
``` |
|
export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12 |
|
``` |
|
|
|
### Sentence (and sentence-pair) classification tasks |
|
|
|
Before running this example you must download the |
|
[GLUE data](https://gluebenchmark.com/tasks) by running |
|
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) |
|
and unpack it to some directory `$GLUE_DIR`. Next, download the `BERT-Base` |
|
checkpoint and unzip it to some directory `$BERT_BASE_DIR`. |
|
|
|
This example code fine-tunes `BERT-Base` on the Microsoft Research Paraphrase |
|
Corpus (MRPC), which only contains 3,600 examples and can fine-tune in a
|
few minutes on most GPUs. |
|
|
|
```shell |
|
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 |
|
export GLUE_DIR=/path/to/glue |
|
|
|
python run_classifier.py \ |
|
--task_name=MRPC \ |
|
--do_train=true \ |
|
--do_eval=true \ |
|
--data_dir=$GLUE_DIR/MRPC \ |
|
--vocab_file=$BERT_BASE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_BASE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ |
|
--max_seq_length=128 \ |
|
--train_batch_size=32 \ |
|
--learning_rate=2e-5 \ |
|
--num_train_epochs=3.0 \ |
|
--output_dir=/tmp/mrpc_output/ |
|
``` |
|
|
|
You should see output like this: |
|
|
|
``` |
|
***** Eval results ***** |
|
eval_accuracy = 0.845588 |
|
eval_loss = 0.505248 |
|
global_step = 343 |
|
loss = 0.505248 |
|
``` |
|
|
|
This means that the Dev set accuracy was 84.55%. Small sets like MRPC have a |
|
high variance in the Dev set accuracy, even when starting from the same |
|
pre-training checkpoint. If you re-run multiple times (making sure to point to |
|
different `output_dir`), you should see results between 84% and 88%. |
|
|
|
A few other tasks are supported off-the-shelf in
|
`run_classifier.py`, so it should be straightforward to follow those examples to |
|
use BERT for any single-sentence or sentence-pair classification task. |
|
|
|
Note: You might see a message `Running train on CPU`. This really just means |
|
that it's running on something other than a Cloud TPU, which includes a GPU. |
|
|
|
#### Prediction from classifier |
|
|
|
Once you have trained your classifier, you can use it in inference mode by passing
the `--do_predict=true` flag. You need to have a file named `test.tsv` in the
input folder. Output will be written to a file called `test_results.tsv` in the
output folder. Each line contains the output for one sample, and the columns are
the class probabilities.
|
|
|
```shell |
|
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 |
|
export GLUE_DIR=/path/to/glue |
|
export TRAINED_CLASSIFIER=/path/to/fine/tuned/classifier |
|
|
|
python run_classifier.py \ |
|
--task_name=MRPC \ |
|
--do_predict=true \ |
|
--data_dir=$GLUE_DIR/MRPC \ |
|
--vocab_file=$BERT_BASE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_BASE_DIR/bert_config.json \ |
|
--init_checkpoint=$TRAINED_CLASSIFIER \ |
|
--max_seq_length=128 \ |
|
--output_dir=/tmp/mrpc_output/ |
|
``` |
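
A small sketch of reading the resulting `test_results.tsv`; it assumes one tab-separated row of class probabilities per input example, in the same order as `test.tsv`.

```python
import csv

# Each row of test_results.tsv holds the class probabilities for one test example.
with open("/tmp/mrpc_output/test_results.tsv") as f:
  for i, row in enumerate(csv.reader(f, delimiter="\t")):
    probs = [float(p) for p in row]
    predicted_class = max(range(len(probs)), key=lambda j: probs[j])
    print("example %d -> class %d (p=%.3f)" % (i, predicted_class, probs[predicted_class]))
```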
|
|
|
### SQuAD 1.1 |
|
|
|
The Stanford Question Answering Dataset (SQuAD) is a popular question answering |
|
benchmark dataset. BERT (at the time of the release) obtains state-of-the-art |
|
results on SQuAD with almost no task-specific network architecture modifications |
|
or data augmentation. However, it does require semi-complex data pre-processing |
|
and post-processing to deal with (a) the variable-length nature of SQuAD context |
|
paragraphs, and (b) the character-level answer annotations which are used for |
|
SQuAD training. This processing is implemented and documented in `run_squad.py`. |
|
|
|
To run on SQuAD, you will first need to download the dataset. The |
|
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) does not seem to |
|
link to the v1.1 datasets any longer, but the necessary files can be found here: |
|
|
|
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) |
|
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) |
|
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) |
|
|
|
Download these to some directory `$SQUAD_DIR`. |
|
|
|
The state-of-the-art SQuAD results from the paper currently cannot be reproduced |
|
on a 12GB-16GB GPU due to memory constraints (in fact, even batch size 1 does |
|
not seem to fit on a 12GB GPU using `BERT-Large`). However, a reasonably strong |
|
`BERT-Base` model can be trained on the GPU with these hyperparameters: |
|
|
|
```shell |
|
python run_squad.py \ |
|
--vocab_file=$BERT_BASE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_BASE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ |
|
--do_train=True \ |
|
--train_file=$SQUAD_DIR/train-v1.1.json \ |
|
--do_predict=True \ |
|
--predict_file=$SQUAD_DIR/dev-v1.1.json \ |
|
--train_batch_size=12 \ |
|
--learning_rate=3e-5 \ |
|
--num_train_epochs=2.0 \ |
|
--max_seq_length=384 \ |
|
--doc_stride=128 \ |
|
--output_dir=/tmp/squad_base/ |
|
``` |
|
|
|
The dev set predictions will be saved into a file called `predictions.json` in |
|
the `output_dir`: |
|
|
|
```shell |
|
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/squad_base/predictions.json
|
``` |
|
|
|
This should produce an output like this:
|
|
|
```shell |
|
{"f1": 88.41249612335034, "exact_match": 81.2488174077578} |
|
``` |
|
|
|
You should see a result similar to the 88.5% reported in the paper for |
|
`BERT-Base`. |
|
|
|
If you have access to a Cloud TPU, you can train with `BERT-Large`. Here is a |
|
set of hyperparameters (slightly different than the paper) which consistently |
|
obtain around 90.5%-91.0% F1 single-system trained only on SQuAD: |
|
|
|
```shell |
|
python run_squad.py \ |
|
--vocab_file=$BERT_LARGE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_LARGE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ |
|
--do_train=True \ |
|
--train_file=$SQUAD_DIR/train-v1.1.json \ |
|
--do_predict=True \ |
|
--predict_file=$SQUAD_DIR/dev-v1.1.json \ |
|
--train_batch_size=24 \ |
|
--learning_rate=3e-5 \ |
|
--num_train_epochs=2.0 \ |
|
--max_seq_length=384 \ |
|
--doc_stride=128 \ |
|
--output_dir=gs://some_bucket/squad_large/ \ |
|
--use_tpu=True \ |
|
--tpu_name=$TPU_NAME |
|
``` |
|
|
|
For example, one random run with these parameters produces the following Dev |
|
scores: |
|
|
|
```shell |
|
{"f1": 90.87081895814865, "exact_match": 84.38978240302744} |
|
``` |
|
|
|
If you fine-tune for one epoch on |
|
[TriviaQA](http://nlp.cs.washington.edu/triviaqa/) before this the results will |
|
be even better, but you will need to convert TriviaQA into the SQuAD json |
|
format. |
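
For reference, here is a minimal sketch of the SQuAD v1.1 JSON layout that such a conversion would need to emit (the title, id, and output path are made up for illustration; `answer_start` is a character offset into `context`):

```python
import json

squad_style = {
    "version": "1.1",
    "data": [{
        "title": "Example document",
        "paragraphs": [{
            "context": "Jim Henson was a puppeteer who created the Muppets.",
            "qas": [{
                "id": "example-0001",
                "question": "Who was Jim Henson?",
                # answer_start is the character offset of the answer in context.
                "answers": [{"text": "a puppeteer", "answer_start": 15}],
            }],
        }],
    }],
}

with open("/tmp/triviaqa_as_squad.json", "w") as f:
  json.dump(squad_style, f)
```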
|
|
|
### SQuAD 2.0 |
|
|
|
This model is also implemented and documented in `run_squad.py`. |
|
|
|
To run on SQuAD 2.0, you will first need to download the dataset. The necessary |
|
files can be found here: |
|
|
|
* [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) |
|
* [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) |
|
* [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) |
|
|
|
Download these to some directory `$SQUAD_DIR`. |
|
|
|
On Cloud TPU you can run with BERT-Large as follows: |
|
|
|
```shell |
|
python run_squad.py \ |
|
--vocab_file=$BERT_LARGE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_LARGE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ |
|
--do_train=True \ |
|
--train_file=$SQUAD_DIR/train-v2.0.json \ |
|
--do_predict=True \ |
|
--predict_file=$SQUAD_DIR/dev-v2.0.json \ |
|
--train_batch_size=24 \ |
|
--learning_rate=3e-5 \ |
|
--num_train_epochs=2.0 \ |
|
--max_seq_length=384 \ |
|
--doc_stride=128 \ |
|
--output_dir=gs://some_bucket/squad_large/ \ |
|
--use_tpu=True \ |
|
--tpu_name=$TPU_NAME \ |
|
--version_2_with_negative=True |
|
``` |
|
|
|
We assume you have copied everything from the output directory to a local
directory called `./squad/`. The initial dev set predictions will be at
`./squad/predictions.json` and the differences between the score of no answer ("")
and the best non-null answer for each question will be in the file
`./squad/null_odds.json`.
|
|
|
Run this script to tune a threshold for predicting null versus non-null answers: |
|
|
|
```shell
python $SQUAD_DIR/evaluate-v2.0.py $SQUAD_DIR/dev-v2.0.json \
  ./squad/predictions.json --na-prob-file ./squad/null_odds.json
```
|
|
|
Assume the script outputs `best_f1_thresh` THRESH. (Typical values are between
|
-1.0 and -5.0). You can now re-run the model to generate predictions with the |
|
derived threshold or alternatively you can extract the appropriate answers from |
|
`./squad/nbest_predictions.json`.
|
|
|
```shell |
|
python run_squad.py \ |
|
--vocab_file=$BERT_LARGE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_LARGE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ |
|
--do_train=False \ |
|
--train_file=$SQUAD_DIR/train-v2.0.json \ |
|
--do_predict=True \ |
|
--predict_file=$SQUAD_DIR/dev-v2.0.json \ |
|
--train_batch_size=24 \ |
|
--learning_rate=3e-5 \ |
|
--num_train_epochs=2.0 \ |
|
--max_seq_length=384 \ |
|
--doc_stride=128 \ |
|
--output_dir=gs://some_bucket/squad_large/ \ |
|
--use_tpu=True \ |
|
--tpu_name=$TPU_NAME \ |
|
--version_2_with_negative=True \ |
|
--null_score_diff_threshold=$THRESH |
|
``` |
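
For clarity, here is a small sketch of what the threshold does, assuming you kept `./squad/predictions.json` and `./squad/null_odds.json` locally: a question is predicted as unanswerable when its null-score difference exceeds the threshold.

```python
import json

THRESH = -2.5  # example value; use the best_f1_thresh reported by evaluate-v2.0.py

with open("./squad/predictions.json") as f:
  predictions = json.load(f)   # question id -> best non-null answer text
with open("./squad/null_odds.json") as f:
  null_odds = json.load(f)     # question id -> score(null) - score(best answer)

thresholded = {
    qid: ("" if null_odds.get(qid, 0.0) > THRESH else answer)
    for qid, answer in predictions.items()
}

with open("./squad/predictions_thresholded.json", "w") as f:
  json.dump(thresholded, f)
```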
|
|
|
### Out-of-memory issues |
|
|
|
All experiments in the paper were fine-tuned on a Cloud TPU, which has 64GB of |
|
device RAM. Therefore, when using a GPU with 12GB - 16GB of RAM, you are likely |
|
to encounter out-of-memory issues if you use the same hyperparameters described |
|
in the paper. |
|
|
|
The factors that affect memory usage are: |
|
|
|
* **`max_seq_length`**: The released models were trained with sequence lengths |
|
up to 512, but you can fine-tune with a shorter max sequence length to save |
|
substantial memory. This is controlled by the `max_seq_length` flag in our |
|
example code. |
|
|
|
* **`train_batch_size`**: The memory usage is also directly proportional to |
|
the batch size. |
|
|
|
* **Model type, `BERT-Base` vs. `BERT-Large`**: The `BERT-Large` model |
|
requires significantly more memory than `BERT-Base`. |
|
|
|
* **Optimizer**: The default optimizer for BERT is Adam, which requires a lot |
|
of extra memory to store the `m` and `v` vectors. Switching to a more memory |
|
efficient optimizer can reduce memory usage, but can also affect the |
|
results. We have not experimented with other optimizers for fine-tuning. |
|
|
|
Using the default training scripts (`run_classifier.py` and `run_squad.py`), we |
|
benchmarked the maximum batch size on single Titan X GPU (12GB RAM) with |
|
TensorFlow 1.11.0: |
|
|
|
System | Seq Length | Max Batch Size |
|
------------ | ---------- | -------------- |
|
`BERT-Base` | 64 | 64 |
|
... | 128 | 32 |
|
... | 256 | 16 |
|
... | 320 | 14 |
|
... | 384 | 12 |
|
... | 512 | 6 |
|
`BERT-Large` | 64 | 12 |
|
... | 128 | 6 |
|
... | 256 | 2 |
|
... | 320 | 1 |
|
... | 384 | 0 |
|
... | 512 | 0 |
|
|
|
Unfortunately, these max batch sizes for `BERT-Large` are so small that they |
|
will actually harm the model accuracy, regardless of the learning rate used. We |
|
are working on adding code to this repository which will allow much larger |
|
effective batch sizes to be used on the GPU. The code will be based on one (or |
|
both) of the following techniques: |
|
|
|
* **Gradient accumulation**: The samples in a minibatch are typically |
|
independent with respect to gradient computation (excluding batch |
|
normalization, which is not used here). This means that the gradients of |
|
multiple smaller minibatches can be accumulated before performing the weight |
|
update, and this will be exactly equivalent to a single larger update. |
|
|
|
* [**Gradient checkpointing**](https://github.com/openai/gradient-checkpointing): |
|
The major use of GPU/TPU memory during DNN training is caching the |
|
intermediate activations in the forward pass that are necessary for |
|
efficient computation in the backward pass. "Gradient checkpointing" trades |
|
memory for compute time by re-computing the activations in an intelligent |
|
way. |
|
|
|
**However, this is not implemented in the current release.** |
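
For illustration only (none of this is code from this repository), here is a toy numpy sketch showing why gradient accumulation is exactly equivalent to a single larger update for a simple least-squares model:

```python
import numpy as np

rng = np.random.RandomState(0)
X, y = rng.randn(64, 8), rng.randn(64)
w = np.zeros(8)
lr, num_micro_batches = 0.1, 4

def grad(w, xb, yb):
  # Gradient of the mean squared error 0.5 * ||xb @ w - yb||^2 / len(yb).
  return xb.T @ (xb @ w - yb) / len(yb)

# Accumulate gradients over micro-batches, then do ONE update with their mean.
accum = np.zeros_like(w)
for xb, yb in zip(np.split(X, num_micro_batches), np.split(y, num_micro_batches)):
  accum += grad(w, xb, yb)
w_accumulated = w - lr * accum / num_micro_batches

# Equivalent single large-batch update.
w_large_batch = w - lr * grad(w, X, y)
print(np.allclose(w_accumulated, w_large_batch))  # True
```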
|
|
|
## Using BERT to extract fixed feature vectors (like ELMo) |
|
|
|
In certain cases, rather than fine-tuning the entire pre-trained model |
|
end-to-end, it can be beneficial to obtain *pre-trained contextual
|
embeddings*, which are fixed contextual representations of each input token |
|
generated from the hidden layers of the pre-trained model. This should also |
|
mitigate most of the out-of-memory issues. |
|
|
|
As an example, we include the script `extract_features.py` which can be used |
|
like this: |
|
|
|
```shell |
|
# Sentence A and Sentence B are separated by the ||| delimiter for sentence |
|
# pair tasks like question answering and entailment. |
|
# For single sentence inputs, put one sentence per line and DON'T use the |
|
# delimiter. |
|
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt |
|
|
|
python extract_features.py \ |
|
--input_file=/tmp/input.txt \ |
|
--output_file=/tmp/output.jsonl \ |
|
--vocab_file=$BERT_BASE_DIR/vocab.txt \ |
|
--bert_config_file=$BERT_BASE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ |
|
--layers=-1,-2,-3,-4 \ |
|
--max_seq_length=128 \ |
|
--batch_size=8 |
|
``` |
|
|
|
This will create a JSON file (one line per line of input) containing the BERT |
|
activations from each Transformer layer specified by `layers` (-1 is the final |
|
hidden layer of the Transformer, etc.) |
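
A small sketch of consuming that file is shown below; the per-line layout in the comments is an assumption based on one release of `extract_features.py`, so check a line of your own output if the field names differ.

```python
import json

import numpy as np

# Read the JSON-lines output of extract_features.py. Assumed layout per line:
#   {"features": [{"token": "...",
#                  "layers": [{"index": -1, "values": [...]}, ...]}, ...]}
with open("/tmp/output.jsonl") as f:
  for line in f:
    example = json.loads(line)
    for feature in example["features"]:
      # Stack the requested layers (here -1..-4) into one matrix per token.
      vectors = np.array([layer["values"] for layer in feature["layers"]])
      print(feature["token"], vectors.shape)  # e.g. (4, 768) for BERT-Base
```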
|
|
|
Note that this script will produce very large output files (by default, around |
|
15kb for every input token). |
|
|
|
If you need to maintain alignment between the original and tokenized words (for |
|
projecting training labels), see the [Tokenization](#tokenization) section |
|
below. |
|
|
|
**Note:** You may see a message like `Could not find trained model in model_dir: |
|
/tmp/tmpuB5g5c, running initialization to predict.` This message is expected; it
|
just means that we are using the `init_from_checkpoint()` API rather than the |
|
saved model API. If you don't specify a checkpoint or specify an invalid |
|
checkpoint, this script will complain. |
|
|
|
## Tokenization |
|
|
|
For sentence-level (or sentence-pair) tasks, tokenization is very simple.
|
Just follow the example code in `run_classifier.py` and `extract_features.py`. |
|
The basic procedure for sentence-level tasks is: |
|
|
|
1. Instantiate the tokenizer: `tokenizer = tokenization.FullTokenizer`
|
|
|
2. Tokenize the raw text with `tokens = tokenizer.tokenize(raw_text)`. |
|
|
|
3. Truncate to the maximum sequence length. (You can use up to 512, but you |
|
probably want to use shorter if possible for memory and speed reasons.) |
|
|
|
4. Add the `[CLS]` and `[SEP]` tokens in the right place. |
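
Putting those four steps together, here is a minimal sketch (assuming `BERT_BASE_DIR` points at an unzipped uncased checkpoint and that `tokenization.py` from this repository is importable):

```python
import os

import tokenization  # tokenization.py from this repository

vocab_file = os.path.join(os.environ["BERT_BASE_DIR"], "vocab.txt")
max_seq_length = 128

# 1. Instantiate the tokenizer.
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

# 2. Tokenize the raw text.
tokens = tokenizer.tokenize("John Johanson's house")

# 3. Truncate to the maximum sequence length, leaving room for [CLS] and [SEP].
tokens = tokens[:max_seq_length - 2]

# 4. Add the special tokens, then map to vocabulary ids.
tokens = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, input_ids)
```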
|
|
|
Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since |
|
you need to maintain alignment between your input text and output text so that |
|
you can project your training labels. SQuAD is a particularly complex example |
|
because the input labels are *character*-based, and SQuAD paragraphs are often |
|
longer than our maximum sequence length. See the code in `run_squad.py` to show |
|
how we handle this. |
|
|
|
Before we describe the general recipe for handling word-level tasks, it's |
|
important to understand what exactly our tokenizer is doing. It has three main |
|
steps: |
|
|
|
1. **Text normalization**: Convert all whitespace characters to spaces, and |
|
(for the `Uncased` model) lowercase the input and strip out accent markers. |
|
E.g., `John Johanson's, → john johanson's,`. |
|
|
|
2. **Punctuation splitting**: Split *all* punctuation characters on both sides |
|
(i.e., add whitespace around all punctuation characters). Punctuation |
|
characters are defined as (a) Anything with a `P*` Unicode class, (b) any |
|
non-letter/number/space ASCII character (e.g., characters like `$` which are |
|
technically not punctuation). E.g., `john johanson's, → john johanson ' s ,` |
|
|
|
3. **WordPiece tokenization**: Apply whitespace tokenization to the output of |
|
the above procedure, and apply |
|
[WordPiece](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py) |
|
tokenization to each token separately. (Our implementation is directly based |
|
on the one from `tensor2tensor`, which is linked). E.g., `john johanson ' s |
|
, → john johan ##son ' s ,` |
|
|
|
The advantage of this scheme is that it is "compatible" with most existing |
|
English tokenizers. For example, imagine that you have a part-of-speech tagging |
|
task which looks like this: |
|
|
|
``` |
|
Input: John Johanson 's house |
|
Labels: NNP NNP POS NN |
|
``` |
|
|
|
The tokenized output will look like this: |
|
|
|
``` |
|
Tokens: john johan ##son ' s house |
|
``` |
|
|
|
Crucially, this would be the same output as if the raw text were `John |
|
Johanson's house` (with no space before the `'s`). |
|
|
|
If you have a pre-tokenized representation with word-level annotations, you can |
|
simply tokenize each input word independently, and deterministically maintain an |
|
original-to-tokenized alignment: |
|
|
|
```python |
|
### Input |
|
orig_tokens = ["John", "Johanson", "'s", "house"] |
|
labels = ["NNP", "NNP", "POS", "NN"] |
|
|
|
### Output |
|
bert_tokens = [] |
|
|
|
# Token map will be an int -> int mapping between the `orig_tokens` index and |
|
# the `bert_tokens` index. |
|
orig_to_tok_map = [] |
|
|
|
tokenizer = tokenization.FullTokenizer( |
|
vocab_file=vocab_file, do_lower_case=True) |
|
|
|
bert_tokens.append("[CLS]") |
|
for orig_token in orig_tokens: |
|
orig_to_tok_map.append(len(bert_tokens)) |
|
bert_tokens.extend(tokenizer.tokenize(orig_token)) |
|
bert_tokens.append("[SEP]") |
|
|
|
# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"] |
|
# orig_to_tok_map == [1, 2, 4, 6] |
|
``` |
|
|
|
Now `orig_to_tok_map` can be used to project `labels` to the tokenized |
|
representation. |
|
|
|
There are common English tokenization schemes which will cause a slight mismatch |
|
between your tokenization and how BERT was pre-trained. For example, if your input tokenization splits
|
off contractions like `do n't`, this will cause a mismatch. If it is possible to |
|
do so, you should pre-process your data to convert these back to raw-looking |
|
text, but if it's not possible, this mismatch is likely not a big deal. |
|
|
|
## Pre-training with BERT |
|
|
|
We are releasing code to do "masked LM" and "next sentence prediction" on an |
|
arbitrary text corpus. Note that this is *not* the exact code that was used for |
|
the paper (the original code was written in C++, and had some additional |
|
complexity), but this code does generate pre-training data as described in the |
|
paper. |
|
|
|
Here's how to run the data generation. The input is a plain text file, with one |
|
sentence per line. (It is important that these be actual sentences for the "next |
|
sentence prediction" task). Documents are delimited by empty lines. The output |
|
is a set of `tf.train.Example`s serialized into `TFRecord` file format. |
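
For concreteness, here is a tiny sketch that writes a toy corpus file in this format (one sentence per line, documents separated by a blank line):

```python
documents = [
    ["the man went to the store .", "he bought a gallon of milk ."],
    ["penguins are flightless birds .", "they live in the southern hemisphere ."],
]

# One sentence per line; documents separated by an empty line.
with open("/tmp/pretraining_corpus.txt", "w") as f:
  for doc in documents:
    for sentence in doc:
      f.write(sentence + "\n")
    f.write("\n")
```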
|
|
|
You can perform sentence segmentation with an off-the-shelf NLP toolkit such as |
|
[spaCy](https://spacy.io/). The `create_pretraining_data.py` script will |
|
concatenate segments until they reach the maximum sequence length to minimize |
|
computational waste from padding (see the script for more details). However, you |
|
may want to intentionally add a slight amount of noise to your input data (e.g., |
|
randomly truncate 2% of input segments) to make it more robust to non-sentential |
|
input during fine-tuning. |
|
|
|
This script stores all of the examples for the entire input file in memory, so |
|
for large data files you should shard the input file and call the script |
|
multiple times. (You can pass in a file glob to `run_pretraining.py`, e.g., |
|
`tf_examples.tf_record*`.) |
|
|
|
The `max_predictions_per_seq` flag is the maximum number of masked LM predictions per
|
sequence. You should set this to around `max_seq_length` * `masked_lm_prob` (the |
|
script doesn't do that automatically because the exact value needs to be passed |
|
to both scripts). |
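
For example, with `max_seq_length=128` and `masked_lm_prob=0.15`, 128 × 0.15 = 19.2, which rounds up to the `max_predictions_per_seq=20` used below.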
|
|
|
```shell |
|
python create_pretraining_data.py \ |
|
--input_file=./sample_text.txt \ |
|
--output_file=/tmp/tf_examples.tfrecord \ |
|
--vocab_file=$BERT_BASE_DIR/vocab.txt \ |
|
--do_lower_case=True \ |
|
--max_seq_length=128 \ |
|
--max_predictions_per_seq=20 \ |
|
--masked_lm_prob=0.15 \ |
|
--random_seed=12345 \ |
|
--dupe_factor=5 |
|
``` |
|
|
|
Here's how to run the pre-training. Do not include `init_checkpoint` if you are |
|
pre-training from scratch. The model configuration (including vocab size) is |
|
specified in `bert_config_file`. This demo code only pre-trains for a small |
|
number of steps (20), but in practice you will probably want to set |
|
`num_train_steps` to 10000 steps or more. The `max_seq_length` and |
|
`max_predictions_per_seq` parameters passed to `run_pretraining.py` must be the |
|
same as `create_pretraining_data.py`. |
|
|
|
```shell |
|
python run_pretraining.py \ |
|
--input_file=/tmp/tf_examples.tfrecord \ |
|
--output_dir=/tmp/pretraining_output \ |
|
--do_train=True \ |
|
--do_eval=True \ |
|
--bert_config_file=$BERT_BASE_DIR/bert_config.json \ |
|
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ |
|
--train_batch_size=32 \ |
|
--max_seq_length=128 \ |
|
--max_predictions_per_seq=20 \ |
|
--num_train_steps=20 \ |
|
--num_warmup_steps=10 \ |
|
--learning_rate=2e-5 |
|
``` |
|
|
|
This will produce an output like this: |
|
|
|
``` |
|
***** Eval results ***** |
|
global_step = 20 |
|
loss = 0.0979674 |
|
masked_lm_accuracy = 0.985479 |
|
masked_lm_loss = 0.0979328 |
|
next_sentence_accuracy = 1.0 |
|
next_sentence_loss = 3.45724e-05 |
|
``` |
|
|
|
Note that since our `sample_text.txt` file is very small, this example training |
|
will overfit that data in only a few steps and produce unrealistically high |
|
accuracy numbers. |
|
|
|
### Pre-training tips and caveats |
|
|
|
* **If using your own vocabulary, make sure to change `vocab_size` in |
|
`bert_config.json`. If you use a larger vocabulary without changing this, |
|
you will likely get NaNs when training on GPU or TPU due to unchecked |
|
out-of-bounds access.** |
|
* If your task has a large domain-specific corpus available (e.g., "movie |
|
reviews" or "scientific papers"), it will likely be beneficial to run |
|
additional steps of pre-training on your corpus, starting from the BERT |
|
checkpoint. |
|
* The learning rate we used in the paper was 1e-4. However, if you are doing |
|
additional steps of pre-training starting from an existing BERT checkpoint, |
|
you should use a smaller learning rate (e.g., 2e-5). |
|
* Current BERT models are English-only, but we do plan to release a |
|
multilingual model which has been pre-trained on a lot of languages in the |
|
near future (hopefully by the end of November 2018). |
|
* Longer sequences are disproportionately expensive because attention is |
|
quadratic to the sequence length. In other words, a batch of 64 sequences of |
|
length 512 is much more expensive than a batch of 256 sequences of |
|
length 128. The fully-connected/convolutional cost is the same, but the |
|
attention cost is far greater for the 512-length sequences. Therefore, one |
|
good recipe is to pre-train for, say, 90,000 steps with a sequence length of |
|
128 and then for 10,000 additional steps with a sequence length of 512. The |
|
very long sequences are mostly needed to learn positional embeddings, which |
|
can be learned fairly quickly. Note that this does require generating the |
|
data twice with different values of `max_seq_length`. |
|
* If you are pre-training from scratch, be prepared that pre-training is |
|
computationally expensive, especially on GPUs. If you are pre-training from |
|
scratch, our recommended recipe is to pre-train a `BERT-Base` on a single |
|
[preemptible Cloud TPU v2](https://cloud.google.com/tpu/docs/pricing), which |
|
takes about 2 weeks at a cost of about $500 USD (based on the pricing in |
|
October 2018). You will have to scale down the batch size when only training |
|
on a single Cloud TPU, compared to what was used in the paper. It is |
|
recommended to use the largest batch size that fits into TPU memory. |
|
|
|
### Pre-training data |
|
|
|
We will **not** be able to release the pre-processed datasets used in the paper. |
|
For Wikipedia, the recommended pre-processing is to download |
|
[the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), |
|
extract the text with |
|
[`WikiExtractor.py`](https://github.com/attardi/wikiextractor), and then apply |
|
any necessary cleanup to convert it into plain text. |
|
|
|
Unfortunately the researchers who collected the |
|
[BookCorpus](http://yknzhu.wixsite.com/mbweb) no longer have it available for |
|
public download. The |
|
[Project Gutenberg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html)
|
is a somewhat smaller (200M word) collection of older books that are public |
|
domain. |
|
|
|
[Common Crawl](http://commoncrawl.org/) is another very large collection of |
|
text, but you will likely have to do substantial pre-processing and cleanup to |
|
extract a usable corpus for pre-training BERT. |
|
|
|
### Learning a new WordPiece vocabulary |
|
|
|
This repository does not include code for *learning* a new WordPiece vocabulary. |
|
The reason is that the code used in the paper was implemented in C++ with |
|
dependencies on Google's internal libraries. For English, it is almost always |
|
better to just start with our vocabulary and pre-trained models. For learning |
|
vocabularies of other languages, there are a number of open source options |
|
available. However, keep in mind that these are not compatible with our |
|
`tokenization.py` library: |
|
|
|
* [Google's SentencePiece library](https://github.com/google/sentencepiece) |
|
|
|
* [tensor2tensor's WordPiece generation script](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder_build_subword.py) |
|
|
|
* [Rico Sennrich's Byte Pair Encoding library](https://github.com/rsennrich/subword-nmt) |
|
|
|
## Using BERT in Colab |
|
|
|
If you want to use BERT with [Colab](https://colab.research.google.com), you can |
|
get started with the notebook |
|
"[BERT FineTuning with Cloud TPUs](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)". |
|
**At the time of this writing (October 31st, 2018), Colab users can access a |
|
Cloud TPU completely for free.** Note: One per user, availability limited, |
|
requires a Google Cloud Platform account with storage (although storage may be |
|
purchased with free credit for signing up with GCP), and this capability may no
|
longer be available in the future. Click on the BERT Colab that was just linked |
|
for more information. |
|
|
|
## FAQ |
|
|
|
#### Is this code compatible with Cloud TPUs? What about GPUs? |
|
|
|
Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and |
|
Cloud TPU. However, GPU training is single-GPU only. |
|
|
|
#### I am getting out-of-memory errors, what is wrong? |
|
|
|
See the section on [out-of-memory issues](#out-of-memory-issues) for more |
|
information. |
|
|
|
#### Is there a PyTorch version available? |
|
|
|
There is no official PyTorch implementation. However, NLP researchers from |
|
HuggingFace made a |
|
[PyTorch version of BERT available](https://github.com/huggingface/pytorch-pretrained-BERT) |
|
which is compatible with our pre-trained checkpoints and is able to reproduce |
|
our results. We were not involved in the creation or maintenance of the PyTorch |
|
implementation so please direct any questions towards the authors of that |
|
repository. |
|
|
|
#### Is there a Chainer version available? |
|
|
|
There is no official Chainer implementation. However, Sosuke Kobayashi made a |
|
[Chainer version of BERT available](https://github.com/soskek/bert-chainer) |
|
which is compatible with our pre-trained checkpoints and is able to reproduce |
|
our results. We were not involved in the creation or maintenance of the Chainer |
|
implementation so please direct any questions towards the authors of that |
|
repository. |
|
|
|
#### Will models in other languages be released? |
|
|
|
Yes, we plan to release a multi-lingual BERT model in the near future. We cannot |
|
make promises about exactly which languages will be included, but it will likely |
|
be a single model which includes *most* of the languages which have a |
|
significantly-sized Wikipedia. |
|
|
|
#### Will models larger than `BERT-Large` be released? |
|
|
|
So far we have not attempted to train anything larger than `BERT-Large`. It is |
|
possible that we will release larger models if we are able to obtain significant |
|
improvements. |
|
|
|
#### What license is this library released under? |
|
|
|
All code *and* models are released under the Apache 2.0 license. See the |
|
`LICENSE` file for more information. |
|
|
|
#### How do I cite BERT? |
|
|
|
For now, cite [the Arxiv paper](https://arxiv.org/abs/1810.04805): |
|
|
|
``` |
|
@article{devlin2018bert, |
|
title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding}, |
|
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, |
|
journal={arXiv preprint arXiv:1810.04805}, |
|
year={2018} |
|
} |
|
``` |
|
|
|
If we submit the paper to a conference or journal, we will update the BibTeX. |
|
|
|
## Disclaimer |
|
|
|
This is not an official Google product. |
|
|
|
## Contact information |
|
|
|
For help or issues using BERT, please submit a GitHub issue. |
|
|
|
For personal communication related to BERT, please contact Jacob Devlin |
|
(`[email protected]`), Ming-Wei Chang (`[email protected]`), or |
|
Kenton Lee (`[email protected]`). |
|
|