# Distil*

Author: @VictorSanh
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.

**January 20, 2020 - Bug fixing** We have recently discovered and fixed [a bug](https://github.com/huggingface/transformers/commit/48cbf267c988b56c71a2380f748a3e6092ccaed3) in the evaluation of our `run_*.py` scripts that caused the reported metrics to be over-estimated on average. We have updated all the metrics with the latest runs.

**December 6, 2019 - Update** We release **DistilmBERT**: 92% of `bert-base-multilingual-cased` on XNLI. The model supports 104 different languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

**November 19, 2019 - Update** We release German **DistilBERT**: 98.8% of `bert-base-german-dbmdz-cased` on NER tasks.

**October 23, 2019 - Update** We release **DistilRoBERTa**: 95% of `RoBERTa-base`'s performance on GLUE, twice as fast as RoBERTa while being 35% smaller.

**October 3, 2019 - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performances. Please use the paper as a reference when comparing/reporting results on DistilBERT.**

**September 19, 2019 - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 99% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on the SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
## What is Distil*

Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model, called the teacher, into a smaller model, called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scale pretrained Transformer models into production.

We have applied the same method to other Transformer architectures and released the weights:

- GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 16.3 compared to 21.1 for **DistilGPT2** (after fine-tuning on the train set).
- RoBERTa: **DistilRoBERTa** reaches 95% of `RoBERTa-base`'s performance on GLUE while being twice as fast and 35% smaller.
- German BERT: **German DistilBERT** reaches 99% of `bert-base-german-dbmdz-cased`'s performance on German NER (CoNLL-2003).
- Multilingual BERT: **DistilmBERT** reaches 92% of Multilingual BERT's performance on XNLI while being twice as fast and 25% smaller. The model supports the 104 languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108).

Here are the results on the dev sets of GLUE:
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BERT-base-uncased | **79.5** | 56.3 | 84.7 | 88.6 | 91.8 | 89.6 | 69.3 | 92.7 | 89.0 | 53.5 |
| DistilBERT-base-uncased | **77.0** | 51.3 | 82.1 | 87.5 | 89.2 | 88.5 | 59.9 | 91.3 | 86.9 | 56.3 |
| BERT-base-cased | **78.2** | 58.2 | 83.9 | 87.8 | 91.0 | 89.2 | 66.1 | 91.7 | 89.2 | 46.5 |
| DistilBERT-base-cased | **75.9** | 47.2 | 81.5 | 85.6 | 88.2 | 87.8 | 60.6 | 90.4 | 85.5 | 56.3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RoBERTa-base (reported) | **83.2**/**86.4**<sup>2</sup> | 63.6 | 87.6 | 90.2 | 92.8 | 91.9 | 78.7 | 94.8 | 91.2 | 57.7<sup>3</sup> |
| DistilRoBERTa<sup>1</sup> | **79.0**/**82.3**<sup>2</sup> | 59.3 | 84.0 | 86.6 | 90.8 | 89.4 | 67.9 | 92.5 | 88.3 | 52.1 |

<sup>1</sup> We did not use the MNLI checkpoint for fine-tuning but directly performed transfer learning on the pre-trained DistilRoBERTa.

<sup>2</sup> Macro-score computed without WNLI.

<sup>3</sup> We computed this score ourselves for completeness.
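As a quick sanity check of the table above, the macro-score is simply the plain average of the per-task scores (footnote 2 gives the average computed without the WNLI column for the RoBERTa rows):

```python
# Quick check of the macro-scores reported above (scores copied from the table rows).
distilbert_uncased = [51.3, 82.1, 87.5, 89.2, 88.5, 59.9, 91.3, 86.9, 56.3]
print(sum(distilbert_uncased) / len(distilbert_uncased))    # ~77.0

distilroberta = [59.3, 84.0, 86.6, 90.8, 89.4, 67.9, 92.5, 88.3, 52.1]
print(sum(distilroberta) / len(distilroberta))              # ~79.0 (with WNLI)
print(sum(distilroberta[:-1]) / len(distilroberta[:-1]))    # ~82.3 (without WNLI, cf. footnote 2)
```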
Here are the results on the *test* sets for 6 of the languages available in XNLI. The results are computed in the zero-shot setting (trained on the English portion and evaluated on the target language portion):

| Model | English | Spanish | Chinese | German | Arabic | Urdu |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| mBERT base cased (computed) | 82.1 | 74.6 | 69.1 | 72.3 | 66.4 | 58.5 |
| mBERT base uncased (reported) | 81.4 | 74.3 | 63.8 | 70.5 | 62.1 | 58.3 |
| DistilmBERT | 78.2 | 69.1 | 64.0 | 66.3 | 59.1 | 54.7 |
## Setup

This part of the library has only been tested with Python 3.6+. There are a few specific dependencies to install before launching a distillation; you can install them with the command `pip install -r requirements.txt`.

**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
## How to use DistilBERT

Transformers includes the following pre-trained Distil* models, covering English, German and a multilingual version (the parameter counts quoted for each model can be checked with the short snippet after the list):
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, a hidden dimension of 768 and 12 heads, for a total of 66M parameters.
- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD 1.1. This model reaches an F1 score of 86.9 on the dev set (for comparison, `bert-base-uncased` reaches an F1 score of 88.5).
- `distilbert-base-cased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-cased` version of BERT. The model has 6 layers, a hidden dimension of 768 and 12 heads, for a total of 65M parameters.
- `distilbert-base-cased-distilled-squad`: A version of `distilbert-base-cased` finetuned using (a second step of) knowledge distillation on SQuAD 1.1. This model reaches an F1 score of 87.1 on the dev set (for comparison, `bert-base-cased` reaches an F1 score of 88.7).
- `distilbert-base-german-cased`: DistilBERT German language model pretrained on half of the data used to pretrain BERT, using distillation with the supervision of the `bert-base-german-dbmdz-cased` version of German DBMDZ BERT. For NER tasks, the model reaches an F1 score of 83.49 on the CoNLL-2003 test set (for comparison, `bert-base-german-dbmdz-cased` reaches an F1 score of 84.52), and an F1 score of 85.23 on the GermEval 2014 test set (`bert-base-german-dbmdz-cased` reaches an F1 score of 86.89).
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, a hidden dimension of 768 and 12 heads, for a total of 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
- `distilroberta-base`: DistilRoBERTa English language model pretrained with the supervision of `roberta-base` solely on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (roughly 4 times less training data than the teacher RoBERTa). The model has 6 layers, a hidden dimension of 768 and 12 heads, for a total of 82M parameters (compared to 125M parameters for RoBERTa-base). On average, DistilRoBERTa is twice as fast as RoBERTa-base.
- `distilbert-base-multilingual-cased`: DistilmBERT multilingual model pretrained with the supervision of `bert-base-multilingual-cased` on the concatenation of Wikipedia in 104 different languages. The model supports the 104 languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages). The model has 6 layers, a hidden dimension of 768 and 12 heads, for a total of 134M parameters (compared to 177M parameters for mBERT-base). On average, DistilmBERT is twice as fast as mBERT-base.
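As a quick, optional check of the parameter counts quoted in the list above, the following snippet (which only assumes `transformers` and PyTorch are installed) compares a student with its teacher:

```python
from transformers import BertModel, DistilBertModel

# Compare the number of parameters of a student and its teacher.
teacher = BertModel.from_pretrained('bert-base-uncased')
student = DistilBertModel.from_pretrained('distilbert-base-uncased')

count_parameters = lambda model: sum(p.numel() for p in model.parameters())
print(f"bert-base-uncased:       {count_parameters(teacher) / 1e6:.0f}M parameters")  # ~110M
print(f"distilbert-base-uncased: {count_parameters(student) / 1e6:.0f}M parameters")  # ~66M
```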
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to keep the naming consistent across the library's models.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertModel.from_pretrained('distilbert-base-cased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # batch of size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
```
Similarly, using the other Distil* models simply amounts to calling the corresponding base classes with a different pretrained checkpoint:

- DistilBERT uncased: `model = DistilBertModel.from_pretrained('distilbert-base-uncased')`
- DistilGPT2: `model = GPT2Model.from_pretrained('distilgpt2')`
- DistilRoBERTa: `model = RobertaModel.from_pretrained('distilroberta-base')`
- DistilmBERT: `model = DistilBertModel.from_pretrained('distilbert-base-multilingual-cased')`
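As an example of using one of the SQuAD-distilled checkpoints listed above, here is a minimal question-answering sketch with the `pipeline` API (assuming your installed version of the library exposes pipelines):

```python
from transformers import pipeline

# Extractive question answering with the checkpoint distilled (a second time) on SQuAD.
question_answerer = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    tokenizer="distilbert-base-cased-distilled-squad",
)
result = question_answerer(
    question="How many layers does DistilBERT have?",
    context="DistilBERT is a distilled version of BERT with 6 layers, a hidden dimension of 768 and 12 heads.",
)
print(result["answer"], result["score"])
```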
## How to train Distil*

In the following, we will explain how you can train DistilBERT.

### A. Preparing the data

The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as the English version of BERT).

To avoid processing the data several times, we do it once and for all before the training. From now on, we will suppose that you have a text file `dump.txt` containing one sequence per line (a sequence being composed of one or several coherent sentences).

First, we will binarize the data, i.e. tokenize the data and convert each token into an index in our model's vocabulary.
```bash
python scripts/binarized_data.py \
    --file_path data/dump.txt \
    --tokenizer_type bert \
    --tokenizer_name bert-base-uncased \
    --dump_file data/binarized_text
```
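Conceptually, this step boils down to something like the following simplified sketch (an illustration of what "binarizing" means, not the actual implementation of `scripts/binarized_data.py`):

```python
import pickle
from transformers import BertTokenizer

# Simplified illustration: each line of dump.txt becomes a list of vocabulary indices.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

with open('data/dump.txt', encoding='utf-8') as f:
    sequences = [tokenizer.encode(line.strip()) for line in f if line.strip()]

with open('data/binarized_text.bert-base-uncased.pickle', 'wb') as f:
    pickle.dump(sequences, f)
```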
Our implementation of the masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s and smooths the masking probability with a factor that puts more emphasis on rare words. We thus count the occurrences of each token in the data:
```bash
python scripts/token_counts.py \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts_dump data/token_counts.bert-base-uncased.pickle \
    --vocab_size 30522
```
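During training, these counts are turned into masking probabilities: roughly, a token's probability of being selected for masking is proportional to its count raised to a negative smoothing exponent, so rare tokens are masked more often. A minimal sketch of the idea (the exponent value below is illustrative; the actual smoothing factor is a hyperparameter of `train.py`):

```python
import pickle

import numpy as np

# Sketch: turn raw token counts into (normalized) masking probabilities.
# Rare tokens get a higher probability of being masked than frequent ones.
with open('data/token_counts.bert-base-uncased.pickle', 'rb') as f:
    counts = pickle.load(f)

smoothing = 0.7  # illustrative value of the smoothing exponent
token_probs = np.maximum(counts, 1) ** -smoothing
token_probs = token_probs / token_probs.sum()
```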
### B. Training

Training with distillation is really simple once you have pre-processed the data:
```bash
python train.py \
    --student_type distilbert \
    --student_config training_configs/distilbert-base-uncased.json \
    --teacher_type bert \
    --teacher_name bert-base-uncased \
    --alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --alpha_clm 0.0 --mlm \
    --freeze_pos_embs \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
    --force # overwrites the `dump_path` if it already exists.
```
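For reference, the `--alpha_*` flags weight the different terms of the training objective: `alpha_ce` weights the soft-target loss between the student and teacher output distributions (computed with a temperature), `alpha_mlm` the standard masked-language-modeling loss, `alpha_clm` the causal-language-modeling loss (used for DistilGPT2), and `alpha_cos` a cosine loss aligning the student and teacher hidden states. Below is a rough, self-contained sketch of the MLM objective; it is illustrative only and not the actual `Distiller` implementation:

```python
import torch.nn as nn
import torch.nn.functional as F


# Rough sketch of the combined objective controlled by the --alpha_* flags.
def distillation_loss(s_logits, t_logits, labels, s_hidden, t_hidden,
                      alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0, temperature=2.0):
    # Soft-target loss: the student matches the teacher's temperature-scaled distribution.
    loss_ce = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
    ) * (temperature ** 2)
    # Standard MLM loss against the ground-truth tokens (non-predicted positions carry the ignore index).
    loss_mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)), labels.view(-1), ignore_index=-100)
    # Cosine loss aligning the student and teacher hidden states.
    cos_target = s_hidden.new_ones(s_hidden.size(0) * s_hidden.size(1))
    loss_cos = nn.CosineEmbeddingLoss()(
        s_hidden.view(-1, s_hidden.size(-1)),
        t_hidden.view(-1, t_hidden.size(-1)),
        cos_target,
    )
    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```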
By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line; please look in `train.py` or run `python train.py --help` to list them.

We highly encourage you to use distributed training for training DistilBERT as the training corpus is quite large. Here's an example that runs a distributed training on a single node with 4 GPUs:
```bash
export NODE_RANK=0
export N_NODES=1
export N_GPU_NODE=4
export WORLD_SIZE=4
export MASTER_PORT=<AN_OPEN_PORT>
export MASTER_ADDR=<I.P.>

pkill -f 'python -u train.py'

python -m torch.distributed.launch \
    --nproc_per_node=$N_GPU_NODE \
    --nnodes=$N_NODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py \
        --force \
        --n_gpu $WORLD_SIZE \
        --student_type distilbert \
        --student_config training_configs/distilbert-base-uncased.json \
        --teacher_type bert \
        --teacher_name bert-base-uncased \
        --alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --alpha_clm 0.0 --mlm \
        --freeze_pos_embs \
        --dump_path serialization_dir/my_first_training \
        --data_file data/binarized_text.bert-base-uncased.pickle \
        --token_counts data/token_counts.bert-base-uncased.pickle
```
**Tips:** Starting the distillation with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint, and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
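To give an idea of what such an initialization looks like, here is a heavily simplified sketch: keep the teacher's embeddings and a subset of its Transformer layers, and save the result to be passed via `--student_pretrained_weights`. The layer selection and output path below are illustrative; the real extraction (including renaming the parameters to the student's module names) is done by the `extract*.py` scripts.

```python
import torch
from transformers import BertForMaskedLM

# Heavily simplified sketch of initializing a student from a subset of the teacher's layers.
teacher = BertForMaskedLM.from_pretrained('bert-base-uncased')
state_dict = teacher.state_dict()

kept_layers = [0, 2, 4, 6, 8, 10]  # one possible choice: every other teacher layer

compressed_sd = {}
for name, tensor in state_dict.items():
    if '.encoder.layer.' in name:
        layer_idx = int(name.split('.encoder.layer.')[1].split('.')[0])
        if layer_idx in kept_layers:
            compressed_sd[name] = tensor
    else:
        compressed_sd[name] = tensor  # embeddings, LM head, etc.

torch.save(compressed_sd, 'serialization_dir/student_initialization.pth')  # example path
```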
Happy distillation!

## Citation

If you find the resource useful, you should cite the following paper:
```
@inproceedings{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  booktitle={NeurIPS EMC^2 Workshop},
  year={2019}
}
```